LLM Training — Learning Strategy & Portfolio Projects

Pushback on "learn LLM training stuff." You already have 13 years of SWE + ML engineering at Meta. You are not starting from zero. The trap is doing beginner projects (fine-tune Llama on your journal, run a LoRA on Colab) that signal "hobbyist" not "serious engineer." Every project below is scoped to signal "I understand this at the level required to do it professionally at a frontier lab" — not "I followed a tutorial." The bar is: could you explain every design choice in a 45-minute interview deep-dive?

1. What actually matters (ordered by leverage)

Not everything in "LLM training" is equally valuable. Here's what the companies on your list actually hire for, ranked by how many doors it opens:

#	Skill area	Opens doors at	Your current gap
1	Post-training: RLHF / DPO / GRPO	Anthropic, OpenAI, Periodic Labs, Thinking Machines, Mechanize, Harmonic	Medium — you understand RL conceptually but likely haven't run reward-model training or PPO on a language model
2	Inference systems: serving, batching, KV cache, speculative decoding	Together AI, Fireworks, Anthropic, OpenAI, Groq, Cerebras, any company deploying models	Low-medium — your Meta infra skills transfer; you need LLM-specific serving patterns
3	Pretraining: data pipelines, distributed training, scaling laws	Anthropic, OpenAI, Thinking Machines, SSI, Lila Sciences	Medium-high — pretraining at scale requires specific distributed training patterns (FSDP, tensor/pipeline parallelism) you may not have done
4	Kernels & low-level optimization: Triton, CUDA, FlashAttention	Together AI, Fireworks, Etched, MatX, Cerebras, any hardware co	High — kernel engineering is a distinct skill from ML engineering
5	Evaluation & benchmarking	LMArena, Guide Labs, all frontier labs (everyone needs eval)	Low — you can build this quickly
6	Data: curation, filtering, decontamination, synthetic data	Scale AI, Snorkel, Anthropic, OpenAI	Medium — less glamorous but critically important

The meta-strategy: pick 3 projects that cover skills #1, #2, and #3. Each project produces a GitHub repo + a blog post. That's your portfolio. Don't spread across 6 — depth on 3 beats breadth on 6 every time.

2. The three projects

Project 1: Train a small LLM from scratch → post-train it with GRPO

~4 weeks part-time | Skills: Pretraining, Post-training, RL | Companies: Anthropic, Periodic Labs, Harmonic, Mechanize, Together AI

What: Pretrain a ~125M-350M parameter transformer (GPT-2 architecture) on a clean subset of FineWeb or RedPajama. Then post-train it with GRPO (Group Relative Policy Optimization — the method behind DeepSeek-R1) on a verifiable task like math (GSM8K) or code (MBPP). Document the entire pipeline end-to-end.

Why this and not just fine-tune an existing model: Anyone can call trl.SFTTrainer. Training from scratch forces you to understand tokenizer construction, data pipeline architecture, learning rate schedules, loss curves, and the entire data→pretrain→SFT→RL pipeline. GRPO is one of the most visible post-training methods right now: DeepSeek introduced it in the DeepSeekMath paper and used it for DeepSeek-R1, and open-source libraries such as TRL now ship implementations. What other frontier labs run internally is mostly unpublished, so treat GRPO as a representative RL-with-verifiable-rewards method rather than a confirmed industry standard. Showing you've implemented it from the paper is a strong signal.

Scope (not more, not less):

Build a BPE tokenizer from scratch (don't use tiktoken — build one, then benchmark against tiktoken)
Pretrain a 125M-350M model on 10-50B tokens using PyTorch FSDP on 2-8 GPUs (rent from Lambda/Together/Modal). Log loss curves, gradient norms, learning rate.
SFT on a small instruction dataset (UltraChat or OpenHermes subset)
Implement GRPO from the DeepSeekMath paper (where it was introduced) and the DeepSeek-R1 paper. The reward comes from a verifier (math correctness or code execution), not a learned reward model. Train for ~1000 steps.
Evaluate: show the model improves on the verifiable task after GRPO. Compare to SFT-only baseline.
Write a blog post: "I trained a 125M model from scratch and post-trained it with GRPO. Here's what I learned."
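
The tokenizer step in the scope above can be sketched as follows. This is a minimal character-level BPE trainer in pure Python, not how tiktoken is implemented: production tokenizers work on bytes, apply a regex pre-tokenizer, and use much faster data structures. The function names (`merge_symbols`, `train_bpe`, `encode_word`) are illustrative, and the tie-breaking rule between equally frequent pairs is an arbitrary choice made here for determinism.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    # Each distinct word is stored once as a tuple of symbols, with its frequency.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(merge_symbols(symbols, best))] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Tokenize one word by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols
```

On the toy corpus "low low low lower lowest", three merges learn ('o', 'w'), ('l', 'ow'), and ('low', 'e'), so "lowest" encodes to lowe, s, t. Benchmarking against tiktoken then comes down to comparing tokens per byte and encode speed on held-out text.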

What you'll be able to talk about in interviews:

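Why GRPO works without a separate value model (critic), why a verifier can stand in for a learned reward model, and when either shortcut breaks down
The KL penalty term and how it limits drift from the reference policy (which mitigates reward hacking but does not eliminate it)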
How data quality at pretraining affects post-training ceiling
Practical issues: gradient explosion in early RL steps, learning rate sensitivity, sample efficiency
Comparison: GRPO vs PPO vs DPO — when each is appropriate
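
A minimal sketch of the two pieces of GRPO that the talking points above refer to: the group-relative advantage, which takes the place of a learned value model, and the per-token clipped objective with a KL penalty toward the reference policy. It follows the form of the objective in the DeepSeekMath paper as I understand it. The function names, the use of the population standard deviation, and the default `clip_eps` and `kl_coef` values are assumptions for illustration, and a real implementation works on batched log-probability tensors rather than Python floats.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance (assumption)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Per-token loss to minimize: -(clipped surrogate - kl_coef * KL estimate)."""
    # Importance ratio between the current policy and the policy that sampled the data.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative KL estimator: r - log(r) - 1 with r = pi_ref / pi_new.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return -(surrogate - kl_coef * kl)


# Four sampled answers to one prompt, scored 1.0 if the verifier accepts them.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

A group in which every sample receives the same reward yields all-zero advantages, so that prompt contributes no learning signal. This is one practical failure mode on tasks that are too easy or too hard for the current policy.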

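Cost: ~$200-500 in GPU rental. Assuming roughly $2-3 per H100-hour, that buys on the order of 100-200 H100-hours, which is ample for a 125M model including debugging and failed runs. This is cheap.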
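
To sanity-check a GPU budget like the one above, a common back-of-envelope estimate is C ≈ 6·N·D training FLOPs for N parameters and D tokens. The sketch below applies it. The peak throughput, utilization (MFU), and hourly price are placeholder inputs to replace with your provider's numbers, and the estimate covers pretraining only (not SFT, RL rollouts, evaluation, or failed runs).

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def gpu_hours(flops, peak_flops_per_sec, mfu):
    """GPU-hours needed at a given peak throughput and model FLOPs utilization."""
    return flops / (peak_flops_per_sec * mfu) / 3600.0


def rental_cost(hours, price_per_gpu_hour):
    """Rental cost in the same currency as `price_per_gpu_hour`."""
    return hours * price_per_gpu_hour


# Placeholder inputs (assumptions, not measured values).
flops = training_flops(n_params=125e6, n_tokens=10e9)
hours = gpu_hours(flops, peak_flops_per_sec=1e15, mfu=0.3)
cost = rental_cost(hours, price_per_gpu_hour=3.0)
```

For 125M parameters and 10B tokens the rule gives 7.5e18 FLOPs. At a placeholder 1e15 FLOP/s peak and 30% utilization that is about 7 GPU-hours, so the final pretraining run is likely a small part of the budget; larger token counts, RL rollouts, and reruns account for the rest.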

Project 2: Build an LLM serving system with continuous batching + speculative decoding

~3 weeks part-time | Skills: Inference, Systems, GPU optimization | Companies: Together AI, Fireworks, Anthropic, Groq, Anyscale, Modal

What: Build a toy but functional LLM inference server from scratch in Python that implements: (a) continuous batching, (b) PagedAttention-style KV cache management, (c) speculative decoding with a draft model. Serve a small open model (Llama-3.2-1B or Qwen2.5-1.5B) and benchmark throughput + latency against naive generation.

Why: vLLM has 40K+ GitHub stars but very few people actually understand how it works inside. Building a simplified version from scratch demonstrates you understand the core ideas — not that you can pip-install a library. Every inference-focused company (Together, Fireworks, Anthropic's serving team) interviews on these concepts.

Scope:

Naive baseline: serve one request at a time, full attention recomputation each step. Benchmark tokens/sec.
Add KV caching: cache key/value tensors, only compute attention for the new token. Measure speedup.
Add continuous batching: multiple requests in-flight, new requests join mid-generation (no wait for batch boundary). Measure throughput vs latency tradeoff.
Add paged KV cache: allocate KV memory in fixed blocks (PagedAttention concept). Show memory utilization improvement.
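Add speculative decoding: use a smaller draft model that shares the target model's tokenizer (for example, Qwen2.5-0.5B drafting for Qwen2.5-1.5B) to propose N tokens, then verify them with the target model in one forward pass. Measure speedup and acceptance rate.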
Write a blog post: "Building an LLM serving system from scratch — what vLLM actually does under the hood."
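
The throughput difference between static and continuous batching shows up even in a toy simulation. The sketch below counts decode steps only, assumes all requests are already queued and each needs at least one token, and ignores prefill cost and memory limits. `static_batching_steps` and `continuous_batching_steps` are illustrative names, not part of vLLM or any other library.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Decode steps when a finished request frees its slot immediately."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into free slots before every step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step produces one token for every in-flight request.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# One long request followed by seven single-token requests, two slots.
lengths = [8, 1, 1, 1, 1, 1, 1, 1]
static_steps = static_batching_steps(lengths, max_batch=2)
continuous_steps = continuous_batching_steps(lengths, max_batch=2)
```

With these inputs the static scheduler needs 11 steps and the continuous one needs 8, because the short requests reuse the second slot while the long request is still decoding. A real benchmark should also report per-request latency, since admitting new requests mid-generation adds prefill work to the running batch.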

What you'll talk about in interviews:

Why continuous batching beats static batching (request-level vs batch-level scheduling)
PagedAttention: memory fragmentation problem, block table, copy-on-write for beam search
Speculative decoding: why verification is cheap (parallel forward pass), acceptance rate, when it helps vs hurts
Prefill vs decode phases and why they have different compute profiles
Real-world constraints: maximum sequence length, GPU memory budget, SLO guarantees
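
The block-table idea behind PagedAttention can be sketched without any GPU code. Below, KV memory is a pool of fixed-size blocks, each sequence maps logical blocks to physical block ids, and forked sequences (as in beam search or parallel sampling) share blocks by reference count until one of them writes to a shared, partially filled block. The class and method names are hypothetical, and the actual copying of KV tensors is only marked by a comment.

```python
class BlockAllocator:
    """Pool of fixed-size physical KV blocks with reference counting."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """One request's block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # Last block is full (or there is none yet): take a fresh block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the partially filled block is shared, so this
                # sequence gets a private copy (tensor copy not modeled here).
                private = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = private
        self.num_tokens += 1

    def fork(self):
        """Create a child that shares all existing blocks with this sequence."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.ref_count[block] += 1
        return child

    def release(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

The memory argument follows from the structure: a sequence wastes at most `block_size - 1` token slots (in its last block), instead of reserving a contiguous region sized for the maximum sequence length.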

Cost: ~$50-100 in GPU rental. You can do most of this on a single A100.
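
The speculative decoding step from the scope above can be sketched in its simplest, greedy form. The draft model proposes `k` tokens, the target model checks them, and the longest agreeing prefix is kept plus one token from the target. This is an assumption-laden toy: `draft_next` and `target_next` are hypothetical callables that return the argmax next token for a context, sampling-based acceptance (the rejection-sampling rule from the speculative decoding papers) is not shown, and a real system scores all `k` positions in a single batched target forward pass instead of a Python loop.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap but sequential).
    proposal = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2. The target model checks each proposed position. In practice this is one
    #    forward pass over prefix + proposal; the loop here is for clarity only.
    accepted = []
    context = list(prefix)
    for token in proposal:
        target_token = target_next(context)
        if target_token != token:
            # First disagreement: keep the target's token and stop.
            accepted.append(target_token)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every proposal matched: the same target pass also yields a bonus token.
    accepted.append(target_next(context))
    return accepted


# Toy models over digits: the target counts upward, the draft errs after a 3.
def toy_target(context):
    return (context[-1] + 1) % 10


def toy_draft(context):
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


new_tokens = speculative_step([0], toy_draft, toy_target, k=4)
```

Each round emits between 1 and `k + 1` tokens per target forward pass, and in this greedy form the output is identical to decoding with the target model alone. The speedup therefore depends on the acceptance rate and on how much cheaper the draft model is.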

Project 3: Distributed pretraining with FSDP + a scaling-law experiment

~3 weeks part-time | Skills: Distributed training, Scaling laws, Infrastructure | Companies: Anthropic, OpenAI, Thinking Machines, Lila Sciences, Physical Intelligence, Together AI

What: Train 4-5 models at different scales (10M, 30M, 100M, 300M, and optionally 1B parameters) on the same data, with matched compute budgets where possible. Plot the scaling curves (loss vs compute, loss vs parameters, loss vs data). Compare your empirical curves to the Kaplan et al. and Chinchilla predictions. Do this on a multi-GPU FSDP setup (4-8 GPUs on one node, or across two nodes if you want real multi-node experience).

Why: Scaling laws are the intellectual foundation of why frontier labs exist. Everyone at Anthropic/OpenAI/DeepMind understands them intuitively. Most candidates can recite the Chinchilla result but have never reproduced it. Actually fitting the curves on your own models — and finding where they break — is a conversation piece that instantly separates you from every other applicant.

Scope:

Set up a multi-GPU training pipeline using PyTorch FSDP (or DeepSpeed ZeRO-3). Document the setup — this is itself a useful artifact.
Train 5 models: 10M, 30M, 100M, 300M, 1B params. Same architecture family (decoder-only transformer), same data (FineWeb subset), same tokenizer.
For each model, log: final loss, loss vs step, loss vs FLOPs, throughput (tokens/sec/GPU).
Plot: loss vs compute, loss vs N (parameter scaling), loss vs D (data scaling). If you train several model sizes at the same FLOP budget, also plot IsoFLOP curves (loss vs N at fixed compute).
Fit the power law L(N,D) = E + A/N^α + B/D^β. Report your fitted α, β and compare to Chinchilla (α≈0.34, β≈0.28).
Write a blog post: "I reproduced scaling laws on a $1000 GPU budget. Here's what matched and what didn't."
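
For the curve-fitting step, a reasonable first pass is a pure power law fitted by linear regression in log-log space, for example final loss against training compute. The sketch below does only that. It does not fit the full three-term form L(N,D) = E + A/N^α + B/D^β, which has an irreducible term E and needs a nonlinear optimizer (for example `scipy.optimize.curve_fit`). The function name and the synthetic data are illustrative, not results.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic check: data generated from a known power law should be recovered.
compute = [1e15, 1e16, 1e17, 1e18, 1e19]
loss = [5.0 * c ** -0.05 for c in compute]
a_hat, b_hat = fit_power_law(compute, loss)
```

On real runs, expect the log-log plot to bend as loss approaches the irreducible term. That bend is the signal to move from this straight-line fit to the full parametric form.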

What you'll talk about in interviews:

Compute-optimal training: why Chinchilla says "use more data, fewer params" (roughly 20 training tokens per parameter at the optimum) vs the Llama school of "overtrain smaller models for inference efficiency"
FSDP vs tensor parallelism vs pipeline parallelism: when each, why
Practical distributed training: communication bottlenecks, gradient accumulation, mixed precision, activation checkpointing
Where your curves deviated from theory — this is the most interesting part and proves you actually ran the experiments
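
The compute-optimal claim above can be derived from the parametric loss in a few lines. This is a worked sketch under the usual approximation C ≈ 6ND (an assumption, not an exact FLOP count), using the same symbols E, A, B, α, β as the fitted form L(N,D) = E + A/N^α + B/D^β; G is a shorthand introduced here.

```latex
\begin{aligned}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND \\
L(N) &= E + A N^{-\alpha} + B \left(\frac{6N}{C}\right)^{\beta}
  \quad \text{(substituting } D = C / (6N)\text{)} \\
0 = \frac{\partial L}{\partial N}
  &= -\alpha A N^{-\alpha - 1} + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta - 1} \\
N_{\text{opt}}(C) &= G \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha + \beta}}, \qquad
D_{\text{opt}}(C) = G^{-1} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha + \beta}}, \qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}}
\end{aligned}
```

With α≈0.34 and β≈0.28, the exponents are about 0.45 for parameters and 0.55 for data, so both grow at close to the square root of compute. Plugging your own fitted α and β into the same expressions gives a direct way to compare your runs with the published exponents.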

Cost: ~$500-1500 in GPU rental. The 1B model is the expensive part; if budget-constrained, do 10M-300M and skip 1B (still excellent).

3. What NOT to do (common mistakes)

Don't fine-tune Llama on a niche dataset and call it a project. Every ML hobbyist does this. It demonstrates you can run trl.SFTTrainer, not that you understand training. Zero signal for frontier labs.
Don't build a RAG chatbot. This is application-layer work. It tells a frontier lab nothing about whether you understand model training.
Don't do 6 shallow projects. Three deep ones with blog posts beats six GitHub repos with no write-up. The blog post IS the resume item — it proves you can explain what you did and why.
Don't spend time on Kaggle competitions. Different signal, different audience. Frontier labs don't care about your leaderboard rank.
Don't skip the write-up. A GitHub repo without a blog post is invisible. The blog post gets shared on Twitter/LinkedIn, gets indexed by Google, and is what a recruiter or hiring manager can actually read in 5 minutes.

4. Required reading (do this BEFORE the projects)

Don't just read these — take notes on what surprised you. The notes become your blog post framing.

Paper / Resource	Why	Time
Attention Is All You Need (Vaswani et al, 2017)	The architecture. You probably already know this; skim to confirm.	1 hr
GPT-2 / GPT-3 papers (Radford et al, Brown et al)	Decoder-only design choices, in-context learning discovery.	2 hrs
Chinchilla (Hoffmann et al, 2022)	Scaling laws. This is the intellectual foundation for Project 3.	2 hrs
Llama 2 + Llama 3 papers (Touvron/Grattafiori et al)	The most detailed public training recipes. Data, SFT, RLHF details.	3 hrs
DeepSeek-R1 paper	GRPO explained. The post-training method you'll implement in Project 1.	2 hrs
FlashAttention 1 + 2 (Tri Dao)	The kernel that changed everything. Understand IO-awareness.	2 hrs
vLLM / PagedAttention paper (Kwon et al)	How LLM serving actually works. Foundation for Project 2.	1.5 hrs
RLHF paper (Ouyang et al, "InstructGPT")	The original RLHF recipe. Understand the reward model → PPO pipeline.	2 hrs
DPO paper (Rafailov et al)	Why you can skip the reward model. Compare to PPO and GRPO.	1.5 hrs
Andrej Karpathy's "Let's build GPT from scratch" (YouTube)	Fast warmup if you want a hands-on refresher. Watch at 2x.	1.5 hrs
Your own LLM book chapters (the ones you've already written at `/docs/llm-learning/`)	You've literally been writing about this. Re-read your own work — it's targeted study material.	2 hrs

5. Timeline: 10-12 weeks while working full-time at Meta

Weeks 1-2: Reading sprint

Read all papers above. Take handwritten notes (not typed — forces you to actually think).
Set up cloud GPU access (Lambda Cloud, Together GPU Clusters, or Modal). Run a "hello world" multi-GPU training job to confirm your setup works.
Buy the compute budget upfront (~$1000-2000 for all three projects). Treat it as a career investment: the cost is small relative to the compensation difference between jobs.

Weeks 3-6: Project 1 (pretrain + GRPO)

Week 3: Build tokenizer + data pipeline. Start pretraining 125M model.
Week 4: Complete pretraining. Run SFT. Begin GRPO implementation.
Week 5: Debug GRPO (this WILL be hard — RL training is unstable). Get it working on GSM8K or MBPP.
Week 6: Run final eval. Write blog post. Push to GitHub.

Weeks 7-9: Project 2 (inference serving system)

Week 7: Naive baseline + KV caching. Benchmark.
Week 8: Continuous batching + paged KV cache. Benchmark.
Week 9: Speculative decoding. Final benchmarks. Blog post. Push to GitHub.

Weeks 10-12: Project 3 (scaling laws)

Week 10: Set up FSDP pipeline. Train 10M and 30M models.
Week 11: Train 100M and 300M models. Start plotting curves.
Week 12: (Optional) train 1B. Fit power law. Blog post. Push to GitHub.

Week 13+: Apply

Your GitHub now has 3 repos with READMEs and blog posts.
Your personal site (ravikant.dev) links to all three.
You have concrete project-depth stories for every HM interview.
Expand the referral outreach you started at week 6 (after Project 1 shipped), using blog post links as your introduction.

6. How each project maps to your top companies

Company	Project 1 (GRPO)	Project 2 (Inference)	Project 3 (Scaling)
Anthropic	✅ Core — they hire heavily for post-training/RL	✅ Their serving infra is a major team	✅ Pretraining team uses scaling laws daily
Periodic Labs	✅ RL is their methodology	—	✅ Scaling on scientific data
Together AI	— (they're infra, not post-training)	✅ Core business	✅ Training clusters are the product
Physical Intelligence	✅ RL for robot foundation models	—	✅ Scaling for robotics models
Harmonic	✅ RL + verifiable rewards (math is their domain)	—	✅ Scaling for reasoning
Mechanize	✅ RL environments directly	—	—
World Labs	— (more pre/multimodal)	—	✅ Foundation model scaling
insitro	— (bio-specific)	—	✅ ML training methodology transfers
Fireworks AI	—	✅ Core business	—
Guide Labs	✅ Interpretable models need post-training	—	✅ Scaling interpretable architectures
OpenAI	✅ o-series is RL-heavy	✅ Serving at massive scale	✅ They defined scaling laws

Project 1 (GRPO) is the highest-leverage single project. It covers the most in-demand skill (post-training RL), applies to the most companies you care about (Anthropic, Periodic Labs, Harmonic, Mechanize, Physical Intelligence, OpenAI), and the end-to-end pipeline (data → pretrain → SFT → RL → eval) demonstrates breadth + depth simultaneously. If you only do one project, do this one.

7. Where to host and how to present

GitHub: public repos. One repo per project. Clean README with a "Results" section (graphs, tables, numbers). Don't dump code without documentation.
Blog: host on ravikant.dev (you already have the infrastructure). One post per project. Lead with the result, not the process. Title pattern: "I [did X]. Here's [surprising finding]." Not: "My journey learning about X."
LinkedIn: post each blog post. This is how you get into referral conversations. Senior engineers at frontier labs scroll LinkedIn.
Twitter/X: post a thread summarizing each project with 1-2 key graphs. Tag relevant researchers (Tri Dao, Lilian Weng, etc.) — not desperately, but if your results are interesting, they'll notice.

8. The application strategy

Batch your applications. Don't trickle-apply over 6 months. Finish all 3 projects first, then apply to 8-10 companies in a 2-week window. Competing offers create leverage.
For each company, pick ONE role. Don't shotgun 5 applications at Anthropic. Pick the best-fit role and apply once, with a referral if possible.
The referral email template: "Hi [Name], I'm a senior ML engineer at Meta exploring my next move in LLM training/post-training. I recently [1-sentence summary of best project + link]. I'm interested in [specific role] at [company] — would you be open to a brief chat or a referral? [Your name]"
Start referral outreach at week 6 (after Project 1 ships). Don't wait until all 3 are done — the first project is enough to start conversations.
Interview prep is separate from project work. Weeks 13-14 should be dedicated to: system design mocks (LLM serving scenarios), coding practice (Python concurrency, data structures), and values/culture prep (Anthropic-specific reading).

9. Honest assessment of this plan

What this plan does well:

Covers the three most-transferable skill areas across your target companies
Produces visible artifacts (not just knowledge in your head)
Fits in 12 weeks of part-time work alongside a full-time Meta job
Total cost ~$1000-2000 in GPU rental — trivial relative to the comp delta between jobs

What this plan does NOT do:

Kernel engineering. If you want kernel-focused roles (at Together AI, Fireworks, or the hardware companies), you'd need a 4th project: write a fused attention kernel in Triton. I left this out because it's a distinct skill that takes 4+ weeks alone and only matters for ~5 companies on your list.
Publications. These projects are portfolio pieces, not papers. If you're targeting a research-scientist role (vs research-engineer or ML engineer), you'd need to extend Project 1 or 3 into a paper submission. That adds ~4-8 weeks.
Robotics / embodied AI. If Physical Intelligence or World Labs are top targets, you'd want a project that involves multi-modal training (vision + language or vision + action). That's a different project scope — tell me if you want me to design one.
Domain credentialing for AI-for-science. insitro and Lila Sciences want domain knowledge (biology/chemistry) alongside ML. These projects don't cover that. If bio-AI is a priority, you'd need to pair with a bio collaborator or take a different approach.

10. What to do this week

Set up a Lambda Cloud or Modal account. Run torchrun --nproc_per_node=2 on a hello-world FSDP training script. Confirm multi-GPU works. (1 hour)
Read the DeepSeek-R1 paper. Take notes on GRPO specifically — reward function, KL penalty, group sampling. (2 hours)
Read the vLLM/PagedAttention paper. Draw the block-table diagram by hand. (1.5 hours)
Decide: do all 3 projects, or just Project 1 first? I'd recommend doing Project 1 first, shipping it, then deciding if you need 2 and 3 based on where your applications land.