LLM Training — Learning Strategy & Portfolio Projects

For: ML Engineer at Meta, 13yr SWE. Goal: deep technical LLM training knowledge + visible portfolio → apply broadly across Bay Area AI companies.
Pushback on "learn LLM training stuff": you already have 13 years of SWE and ML engineering experience at Meta, so you are not starting from zero. The trap is doing beginner projects (fine-tune Llama on your journal, run a LoRA on Colab) that signal "hobbyist" rather than "serious engineer." Every project below is scoped to signal "I understand this at the level required to do it professionally at a frontier lab," not "I followed a tutorial." The bar: could you explain every design choice in a 45-minute interview deep-dive?

1. What actually matters (ordered by leverage)

Not everything in "LLM training" is equally valuable. Here's what the companies on your list actually hire for, ranked by how many doors it opens:

# | Skill area | Opens doors at | Your current gap
1 | Post-training: RLHF / DPO / GRPO | Anthropic, OpenAI, Periodic Labs, Thinking Machines, Mechanize, Harmonic | Medium: you understand RL conceptually but likely haven't run reward-model training or PPO on a language model
2 | Inference systems: serving, batching, KV cache, speculative decoding | Together AI, Fireworks, Anthropic, OpenAI, Groq, Cerebras, any company deploying models | Low-medium: your Meta infra skills transfer; you need LLM-specific serving patterns
3 | Pretraining: data pipelines, distributed training, scaling laws | Anthropic, OpenAI, Thinking Machines, SSI, Lila Sciences | Medium-high: pretraining at scale requires specific distributed training patterns (FSDP, tensor/pipeline parallelism) you may not have done
4 | Kernels & low-level optimization: Triton, CUDA, FlashAttention | Together AI, Fireworks, Etched, MatX, Cerebras, any hardware co | High: kernel engineering is a distinct skill from ML engineering
5 | Evaluation & benchmarking | LMArena, Guide Labs, all frontier labs (everyone needs eval) | Low: you can build this quickly
6 | Data: curation, filtering, decontamination, synthetic data | Scale AI, Snorkel, Anthropic, OpenAI | Medium: less glamorous but critically important

The meta-strategy: pick 3 projects that cover skills #1, #2, and #3. Each project produces a GitHub repo + a blog post. That's your portfolio. Don't spread across 6 — depth on 3 beats breadth on 6 every time.

2. The three projects

Project 1: Train a small LLM from scratch → post-train it with GRPO

Time: ~4 weeks part-time | Skills: Pretraining, Post-training, RL | Relevant to: Anthropic, Periodic Labs, Harmonic, Mechanize, Together AI

What: Pretrain a ~125M-350M parameter transformer (GPT-2 architecture) on a clean subset of FineWeb or RedPajama. Then post-train it with GRPO (Group Relative Policy Optimization — the method behind DeepSeek-R1) on a verifiable task like math (GSM8K) or code (MBPP). Document the entire pipeline end-to-end.

Why this rather than just fine-tuning an existing model: Anyone can call trl.SFTTrainer. Training from scratch forces you to understand tokenizer construction, data pipeline architecture, learning rate schedules, loss curves, and the entire data→pretrain→SFT→RL pipeline. GRPO is one of the most visible post-training methods right now: DeepSeek introduced it in DeepSeekMath and used it for DeepSeek-R1, and RL on verifiable rewards is an area many labs are actively iterating on. Showing you've implemented it from the paper is a strong signal.

Scope (not more, not less):

  1. Build a BPE tokenizer from scratch (don't use tiktoken — build one, then benchmark against tiktoken)
  2. Pretrain a 125M-350M model on 10-50B tokens using PyTorch FSDP on 2-8 GPUs (rent from Lambda/Together/Modal). Log loss curves, gradient norms, learning rate.
  3. SFT on a small instruction dataset (UltraChat or OpenHermes subset)
  4. Implement GRPO as described in the DeepSeekMath and DeepSeek-R1 papers. The reward comes from a verifier (math correctness or code execution), not a learned reward model. Train for ~1000 steps.
  5. Evaluate: show the model improves on the verifiable task after GRPO. Compare to SFT-only baseline.
  6. Write a blog post: "I trained a 125M model from scratch and post-trained it with GRPO. Here's what I learned."
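
The core of step 4 can be sketched in a few functions. This is a minimal sketch, assuming a binary verifier reward and the group-normalized advantage, clipped ratio, and KL estimator described in the DeepSeekMath paper. The function names are illustrative and not taken from any library, and a real implementation would apply these per token over batched log-probabilities in PyTorch.

```python
import math
import re
from statistics import mean, pstdev


def verifier_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the last number in the completion equals the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped * advantage)


def kl_penalty(logp_new, logp_ref):
    """Non-negative per-token KL estimate against a frozen reference policy."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1


# Hypothetical group of 4 sampled completions for one math prompt (gold answer "42").
completions = ["so the answer is 42.", "I get 41", "no idea", "6 * 7 = 42"]
rewards = [verifier_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

When every sample in a group gets the same reward, the advantages are all zero and that prompt contributes no gradient, which is one reason task difficulty has to match model ability. The per-token training signal in this sketch would be `clipped_objective(...) - beta * kl_penalty(...)`, where `beta` is a hyperparameter you choose.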

What you'll be able to talk about in interviews:

Cost: ~$200-500 in GPU rental (a few hundred H100-hours for a 125M model). This is cheap.
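
A quick way to sanity-check a budget like this is the common approximation that training costs about 6 FLOPs per parameter per token. The sketch below is a rough estimate under stated assumptions: the peak throughput (dense BF16 on an H100 SXM), the 35% utilization, and the hourly price are all assumed values you should replace with your own measurements and quotes.

```python
def training_gpu_hours(n_params, n_tokens, peak_flops=989e12, mfu=0.35):
    """Estimate GPU-hours for pretraining using FLOPs ~= 6 * N * D.

    peak_flops and mfu (model FLOPs utilization) are assumptions, not measurements.
    """
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600


PRICE_PER_GPU_HOUR = 2.50  # assumed rental price in USD

for n_params, n_tokens in [(125e6, 10e9), (350e6, 50e9)]:
    hours = training_gpu_hours(n_params, n_tokens)
    print(f"{n_params / 1e6:.0f}M params, {n_tokens / 1e9:.0f}B tokens: "
          f"{hours:.1f} GPU-hours, ~${hours * PRICE_PER_GPU_HOUR:.0f}")
```

Under these assumed numbers, pretraining alone comes to roughly 6 GPU-hours for 125M parameters on 10B tokens and roughly 84 for 350M on 50B. Tokenizer experiments, SFT, GRPO rollouts, debugging, and failed runs come on top of that, and small models often achieve lower utilization than assumed here.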

Project 2: Build an LLM serving system with continuous batching + speculative decoding

Time: ~3 weeks part-time | Skills: Inference, Systems, GPU optimization | Relevant to: Together AI, Fireworks, Anthropic, Groq, Anyscale, Modal

What: Build a toy but functional LLM inference server from scratch in Python that implements: (a) continuous batching, (b) PagedAttention-style KV cache management, (c) speculative decoding with a draft model. Serve a small open model (Llama-3.2-1B or Qwen2.5-1.5B) and benchmark throughput + latency against naive generation.

Why: vLLM has 40K+ GitHub stars but very few people actually understand how it works inside. Building a simplified version from scratch demonstrates you understand the core ideas — not that you can pip-install a library. Every inference-focused company (Together, Fireworks, Anthropic's serving team) interviews on these concepts.

Scope:

  1. Naive baseline: serve one request at a time, full attention recomputation each step. Benchmark tokens/sec.
  2. Add KV caching: cache key/value tensors, only compute attention for the new token. Measure speedup.
  3. Add continuous batching: multiple requests in-flight, new requests join mid-generation (no wait for batch boundary). Measure throughput vs latency tradeoff.
  4. Add paged KV cache: allocate KV memory in fixed blocks (PagedAttention concept). Show memory utilization improvement.
  5. Add speculative decoding: use a smaller draft model to propose N tokens, verify with the large model in one forward pass. Measure speedup.
  6. Write a blog post: "Building an LLM serving system from scratch — what vLLM actually does under the hood."
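
The bookkeeping behind step 4 can be sketched without any GPU code. This is a toy illustration of the block-table idea from the PagedAttention paper, not vLLM's actual implementation: the class name, its methods, and the sequence IDs are hypothetical, and the real system also handles copy-on-write sharing and preemption.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps logical positions to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV blocks; caller should queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def utilization(self):
        """Fraction of allocated slots that hold real tokens."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return sum(self.seq_lens.values()) / allocated if allocated else 1.0


alloc = PagedKVAllocator(num_blocks=4, block_size=4)
slots = [alloc.append_token("req-a") for _ in range(5)]
print(slots, alloc.utilization())
```

Waste is bounded by one partially filled block per sequence, whereas preallocating a contiguous buffer for the maximum sequence length wastes everything past the actual length. Measuring that difference on a realistic length distribution is what the memory utilization comparison in step 4 would show.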

What you'll talk about in interviews:

Cost: ~$50-100 in GPU rental. You can do most of this on a single A100.
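
The accept/verify logic of speculative decoding (step 5 of the scope) is easy to get subtly wrong, so it helps to prototype it on toy models first. The sketch below assumes greedy decoding, where a draft token is accepted only if it equals the target model's argmax; the sampling variant in the literature uses a rejection-sampling rule instead. `draft_next` and `target_next` are hypothetical stand-ins for real models, and the loop over prefixes stands in for what a real server would compute in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))
    # 2. The target model predicts the next token at each of the k+1 positions.
    #    (A real server gets all of these from a single forward pass.)
    target_preds = [target_next(prefix + proposal[:i]) for i in range(k + 1)]
    # 3. Accept the longest agreeing prefix, then append the target's own token.
    accepted = []
    for i in range(k):
        if proposal[i] != target_preds[i]:
            break
        accepted.append(proposal[i])
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(prefix, draft_next, target_next, n_tokens, k=4):
    """Repeat speculative steps until n_tokens new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft_next, target_next, k)
    return out[len(prefix):len(prefix) + n_tokens]


# Toy "models": the target counts upward mod 10; the draft errs after token 4.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next))
print(generate([1], draft_next, target_next, n_tokens=12))
```

Every round yields at least one token from the target model, so the output matches plain greedy decoding with the target alone. The speedup depends on how often the draft agrees and on how cheap the draft is relative to the target.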

Project 3: Distributed pretraining with FSDP + a scaling-law experiment

Time: ~3 weeks part-time | Skills: Distributed training, Scaling laws, Infrastructure | Relevant to: Anthropic, OpenAI, Thinking Machines, Lila Sciences, Physical Intelligence, Together AI

What: Train 4-5 models at different scales (10M, 30M, 100M, 300M, and optionally 1B parameters) on the same data, with matched compute budgets where possible. Plot the scaling curves (loss vs compute, loss vs parameters, loss vs data). Compare your empirical curves to the Kaplan and Chinchilla predictions. Do this on a multi-GPU FSDP setup (4-8 GPUs across 1-2 nodes; use 2 nodes if you want genuine multi-node experience).

Why: Scaling laws are the intellectual foundation of why frontier labs exist. Researchers at Anthropic/OpenAI/DeepMind are expected to understand them intuitively. Most candidates can recite the Chinchilla result but have never reproduced it. Actually fitting the curves on your own models, and finding where they break, is a conversation piece that separates you from most other applicants.

Scope:

  1. Set up a multi-GPU training pipeline using PyTorch FSDP (or DeepSpeed ZeRO-3). Document the setup — this is itself a useful artifact.
  2. Train 5 models: 10M, 30M, 100M, 300M, 1B params. Same architecture family (decoder-only transformer), same data (FineWeb subset), same tokenizer.
  3. For each model, log: final loss, loss vs step, loss vs FLOPs, throughput (tokens/sec/GPU).
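  4. Plot: loss vs compute, loss vs N (parameter scaling), loss vs D (data scaling). For IsoFLOP curves (loss vs N at a fixed compute budget), you need several model sizes trained to the same FLOP count.
  5. Fit the power law L(N,D) = E + A/N^α + B/D^β. The fit has five free parameters, so feed it more (N, D) points than the five final losses, for example intermediate checkpoints or extra runs at different token counts. Report your fitted α, β and compare to Chinchilla (α≈0.34, β≈0.28).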
  6. Write a blog post: "I reproduced scaling laws on a $1000 GPU budget. Here's what matched and what didn't."
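
Before fitting the full L(N,D) surface in step 5, it can help to check the fitting code on a one-variable slice. The sketch below assumes the irreducible loss E is already known and fits loss = E + A·N^(-α) by linear regression in log space. The data points are synthetic, generated from a known power law rather than measured, and the function names are illustrative. A full fit would estimate E, A, B, α, β jointly with a nonlinear optimizer.

```python
import math


def fit_power_law(xs, losses, irreducible):
    """Fit loss = irreducible + A * x**(-alpha) by least squares in log-log space.

    Requires every loss to be strictly greater than `irreducible`.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(loss - irreducible) for loss in losses]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (A, alpha)


def compute_optimal_exponents(alpha, beta):
    """Exponents (a, b) in N_opt ~ C**a and D_opt ~ C**b implied by L(N, D) with C ~ 6ND."""
    return beta / (alpha + beta), alpha / (alpha + beta)


# Synthetic example: sizes from the plan, losses generated from an assumed law.
sizes = [10e6, 30e6, 100e6, 300e6, 1e9]
synthetic_losses = [1.7 + 400.0 * n ** -0.34 for n in sizes]

A_fit, alpha_fit = fit_power_law(sizes, synthetic_losses, irreducible=1.7)
a, b = compute_optimal_exponents(0.34, 0.28)
print(A_fit, alpha_fit, a, b)
```

With the Chinchilla exponents quoted above, `compute_optimal_exponents` gives roughly 0.45 and 0.55, meaning parameters and tokens should grow at similar rates as compute increases. Comparing your own fitted exponents against that split is a concrete way to say what matched and what did not.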

What you'll talk about in interviews:

Cost: ~$500-1500 in GPU rental. The 1B model is the expensive part; if budget-constrained, do 10M-300M and skip 1B (still excellent).

3. What NOT to do (common mistakes)

4. Required reading (do this BEFORE the projects)

Don't just read these — take notes on what surprised you. The notes become your blog post framing.

Paper / Resource | Why | Time
Attention Is All You Need (Vaswani et al, 2017) | The architecture. You probably already know this; skim to confirm. | 1 hr
GPT-2 / GPT-3 papers (Radford et al, Brown et al) | Decoder-only design choices, in-context learning discovery. | 2 hrs
Chinchilla (Hoffmann et al, 2022) | Scaling laws. This is the intellectual foundation for Project 3. | 2 hrs
Llama 2 + Llama 3 papers (Touvron/Grattafiori et al) | The most detailed public training recipes. Data, SFT, RLHF details. | 3 hrs
DeepSeek-R1 paper | GRPO explained. The post-training method you'll implement in Project 1. | 2 hrs
FlashAttention 1 + 2 (Tri Dao) | The kernel that changed everything. Understand IO-awareness. | 2 hrs
vLLM / PagedAttention paper (Kwon et al) | How LLM serving actually works. Foundation for Project 2. | 1.5 hrs
RLHF paper (Ouyang et al, "InstructGPT") | The original RLHF recipe. Understand the reward model → PPO pipeline. | 2 hrs
DPO paper (Rafailov et al) | Why you can skip the reward model. Compare to PPO and GRPO. | 1.5 hrs
Andrej Karpathy's "Let's build GPT from scratch" (YouTube) | Fast warmup if you want a hands-on refresher. Watch at 2x. | 1.5 hrs
Your own LLM book chapters (the ones you've already written at /docs/llm-learning/) | You've literally been writing about this. Re-read your own work; it's targeted study material. | 2 hrs

5. Timeline: 10-12 weeks while working full-time at Meta

Weeks 1-2: Reading sprint
Weeks 3-6: Project 1 (pretrain + GRPO)
Weeks 7-9: Project 2 (inference serving system)
Weeks 10-12: Project 3 (scaling laws)
Week 13+: Apply

6. How each project maps to your top companies

Company | Project 1 (GRPO) | Project 2 (Inference) | Project 3 (Scaling)
Anthropic | ✅ Core: they hire heavily for post-training/RL | ✅ Their serving infra is a major team | ✅ Pretraining team uses scaling laws daily
Periodic Labs | ✅ RL is their methodology | | ✅ Scaling on scientific data
Together AI | - (they're infra, not post-training) | ✅ Core business | ✅ Training clusters are the product
Physical Intelligence | ✅ RL for robot foundation models | | ✅ Scaling for robotics models
Harmonic | ✅ RL + verifiable rewards (math is their domain) | | ✅ Scaling for reasoning
Mechanize | ✅ RL environments directly | |
World Labs | - (more pre/multimodal) | | ✅ Foundation model scaling
insitro | - (bio-specific) | | ✅ ML training methodology transfers
Fireworks AI | | ✅ Core business |
Guide Labs | ✅ Interpretable models need post-training | | ✅ Scaling interpretable architectures
OpenAI | ✅ o-series is RL-heavy | ✅ Serving at massive scale | ✅ They defined scaling laws

Project 1 (GRPO) is the highest-leverage single project. It covers the most in-demand skill (post-training RL), applies to most of the companies you care about (Anthropic, Periodic Labs, Harmonic, Mechanize, Physical Intelligence, OpenAI), and its end-to-end pipeline (data → pretrain → SFT → RL → eval) demonstrates breadth and depth at once. If you only do one project, do this one.

7. Where to host and how to present

8. The application strategy

  1. Batch your applications. Don't trickle-apply over 6 months. Finish all 3 projects first, then apply to 8-10 companies in a 2-week window. Competing offers create leverage.
  2. For each company, pick ONE role. Don't shotgun 5 applications at Anthropic. Pick the best-fit role and apply once, with a referral if possible.
  3. The referral email template: "Hi [Name], I'm a senior ML engineer at Meta exploring my next move in LLM training/post-training. I recently [1-sentence summary of best project + link]. I'm interested in [specific role] at [company] — would you be open to a brief chat or a referral? [Your name]"
  4. Start referral outreach at week 6 (after Project 1 ships). Don't wait until all 3 are done — the first project is enough to start conversations.
  5. Interview prep is separate from project work. Weeks 13-14 should be dedicated to: system design mocks (LLM serving scenarios), coding practice (Python concurrency, data structures), and values/culture prep (Anthropic-specific reading).

9. Honest assessment of this plan

What this plan does well:
What this plan does NOT do:

10. What to do this week

  1. Set up a Lambda Cloud or Modal account. Run torchrun --nproc_per_node=2 on a hello-world FSDP training script. Confirm multi-GPU works. (1 hour)
  2. Read the DeepSeek-R1 paper. Take notes on GRPO specifically — reward function, KL penalty, group sampling. (2 hours)
  3. Read the vLLM/PagedAttention paper. Draw the block-table diagram by hand. (1.5 hours)
  4. Decide: do all 3 projects, or just Project 1 first? I'd recommend doing Project 1 first, shipping it, then deciding if you need 2 and 3 based on where your applications land.