1. What actually matters (ordered by leverage)
Not everything in "LLM training" is equally valuable. Here's what the companies on your list actually hire for, ranked by how many doors it opens:
| # | Skill area | Opens doors at | Your current gap |
|---|---|---|---|
| 1 | Post-training: RLHF / DPO / GRPO | Anthropic, OpenAI, Periodic Labs, Thinking Machines, Mechanize, Harmonic | Medium — you understand RL conceptually but likely haven't run reward-model training or PPO on a language model |
| 2 | Inference systems: serving, batching, KV cache, speculative decoding | Together AI, Fireworks, Anthropic, OpenAI, Groq, Cerebras, any company deploying models | Low-medium — your Meta infra skills transfer; you need LLM-specific serving patterns |
| 3 | Pretraining: data pipelines, distributed training, scaling laws | Anthropic, OpenAI, Thinking Machines, SSI, Lila Sciences | Medium-high — pretraining at scale requires specific distributed training patterns (FSDP, tensor/pipeline parallelism) you may not have done |
| 4 | Kernels & low-level optimization: Triton, CUDA, FlashAttention | Together AI, Fireworks, Etched, MatX, Cerebras, any hardware co | High — kernel engineering is a distinct skill from ML engineering |
| 5 | Evaluation & benchmarking | LMArena, Guide Labs, all frontier labs (everyone needs eval) | Low — you can build this quickly |
| 6 | Data: curation, filtering, decontamination, synthetic data | Scale AI, Snorkel, Anthropic, OpenAI | Medium — less glamorous but critically important |
2. The three projects
Project 1: Train a small LLM from scratch → post-train it with GRPO
What: Pretrain a ~125M-350M parameter transformer (GPT-2 architecture) on a clean subset of FineWeb or RedPajama. Then post-train it with GRPO (Group Relative Policy Optimization — the method behind DeepSeek-R1) on a verifiable task like math (GSM8K) or code (MBPP). Document the entire pipeline end-to-end.
Why this and not just fine-tune an existing model: Anyone can call trl.SFTTrainer. Training from scratch forces you to understand tokenizer construction, data pipeline architecture, learning rate schedules, loss curves, and the entire data→pretrain→SFT→RL pipeline. GRPO is one of the most actively studied post-training methods right now: DeepSeek used it for R1, and other labs and open-source projects are iterating on variants of it. Showing you've implemented it from the paper is a strong signal.
Scope (not more, not less):
- Build a BPE tokenizer from scratch (don't use tiktoken — build one, then benchmark against tiktoken)
- Pretrain a 125M-350M model on 10-50B tokens using PyTorch FSDP on 2-8 GPUs (rent from Lambda/Together/Modal). Log loss curves, gradient norms, learning rate.
- SFT on a small instruction dataset (UltraChat or OpenHermes subset)
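- Implement GRPO from the DeepSeek-R1 paper (the method itself is introduced in the earlier DeepSeekMath paper). Use a rule-based verifier (math correctness or code execution) as the reward function in place of a learned reward model. Train for ~1000 steps.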
- Evaluate: show the model improves on the verifiable task after GRPO. Compare to SFT-only baseline.
- Write a blog post: "I trained a 125M model from scratch and post-trained it with GRPO. Here's what I learned."
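The tokenizer step in the scope above can be sketched in a few dozen lines. This is a minimal byte-level BPE under simplifying assumptions: no regex pre-tokenization, no special tokens, and a naive merge loop that is far too slow for real corpora. The function names (train_bpe, merge, encode, decode) are hypothetical, not taken from tiktoken or any other library.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = []  # [((left_id, right_id), new_id), ...] in learned order
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        ids = merge(ids, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Rebuild the byte string for each id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

A round trip (decode(encode(text, merges), merges) == text) is the first thing to test before you benchmark compression ratio and speed against tiktoken.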
What you'll be able to talk about in interviews:
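- Why GRPO works without a separate value model (critic), why a verifier can stand in for a learned reward model, and when neither holds
- The KL penalty term and how it limits drift from the reference policy, which mitigates (but does not eliminate) reward hacking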
- How data quality at pretraining affects post-training ceiling
- Practical issues: gradient explosion in early RL steps, learning rate sensitivity, sample efficiency
- Comparison: GRPO vs PPO vs DPO — when each is appropriate
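Two of the talking points above (group-relative advantages and the KL penalty) fit in a short sketch. It is scalar, pure-Python arithmetic for a single token, not a training loop. The function names are hypothetical, the clip_eps and beta defaults are illustrative placeholders, and using the population standard deviation is an assumption; check the paper's formulation before relying on any of it.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled completion's reward against its own group.

    The group mean plays the role of the baseline that a learned value
    model provides in PPO, which is why no critic is needed.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_estimate(logp_policy, logp_ref):
    """Per-token KL estimator exp(d) - d - 1 with d = logp_ref - logp_policy.

    It is non-negative and equals zero when the two log-probs match.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty, for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl_estimate(logp_new, logp_ref)
```

One failure mode falls straight out of the arithmetic: if every completion in a group gets the same reward (all correct or all wrong), every advantage is zero and that prompt contributes no learning signal.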
Cost: ~$200-500 in GPU rental (tens to low hundreds of H100-hours across the 125M-350M range, once you include failed runs and GRPO rollouts). This is cheap.
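A quick way to sanity-check a budget like this is the common approximation that training takes about 6 × parameters × tokens FLOPs. The sketch below assumes an H100 dense BF16 peak of roughly 989 TFLOPS and 35% model FLOPs utilization. Both numbers are assumptions to replace with your own measurements, and the function name is hypothetical.

```python
def pretrain_gpu_hours(params, tokens, peak_flops=989e12, mfu=0.35):
    """Rough single-run estimate: 6 * N * D FLOPs at an assumed utilization."""
    total_flops = 6 * params * tokens
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600


# Low and high ends of the Project 1 scope.
small_run = pretrain_gpu_hours(125e6, 10e9)   # 125M params, 10B tokens
large_run = pretrain_gpu_hours(350e6, 50e9)   # 350M params, 50B tokens
print(f"125M on 10B tokens: {small_run:.1f} GPU-hours")
print(f"350M on 50B tokens: {large_run:.1f} GPU-hours")
```

Multiply the result by your provider's hourly rate, then leave generous headroom: small models rarely hit high utilization, and debugging runs add up.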
Project 2: Build an LLM serving system with continuous batching + speculative decoding
What: Build a toy but functional LLM inference server from scratch in Python that implements: (a) continuous batching, (b) PagedAttention-style KV cache management, (c) speculative decoding with a draft model. Serve a small open model (Llama-3.2-1B or Qwen2.5-1.5B) and benchmark throughput + latency against naive generation.
Why: vLLM has 40K+ GitHub stars but very few people actually understand how it works inside. Building a simplified version from scratch demonstrates you understand the core ideas — not that you can pip-install a library. Every inference-focused company (Together, Fireworks, Anthropic's serving team) interviews on these concepts.
Scope:
- Naive baseline: serve one request at a time, full attention recomputation each step. Benchmark tokens/sec.
- Add KV caching: cache key/value tensors, only compute attention for the new token. Measure speedup.
- Add continuous batching: multiple requests in-flight, new requests join mid-generation (no wait for batch boundary). Measure throughput vs latency tradeoff.
- Add paged KV cache: allocate KV memory in fixed blocks (PagedAttention concept). Show memory utilization improvement.
- Add speculative decoding: use a smaller draft model to propose N tokens, verify with the large model in one forward pass. Measure speedup.
- Write a blog post: "Building an LLM serving system from scratch — what vLLM actually does under the hood."
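The continuous-batching step in the scope above can be explored with a toy scheduler before any GPU is involved. This simulation assumes all requests are queued at time zero, ignores prefill, and treats one decode step as costing the same regardless of batch size. All names are hypothetical, and it models only the scheduling idea, not vLLM's actual scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps still needed (must be >= 1)


def static_batching(lengths, max_batch):
    """Each batch runs until its longest request finishes; nobody joins midway."""
    steps, finish_step = 0, {}
    for start in range(0, len(lengths), max_batch):
        batch = lengths[start:start + max_batch]
        steps += max(batch)
        for offset in range(len(batch)):
            finish_step[start + offset] = steps  # results return with the batch
    return steps, finish_step


def continuous_batching(lengths, max_batch):
    """A finished request frees its slot immediately for the next waiting one."""
    waiting = deque(Request(i, n) for i, n in enumerate(lengths))
    running, steps, finish_step = [], 0, {}
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for req in running:  # one decode step emits one token per running request
            req.tokens_left -= 1
        steps += 1
        for req in running:
            if req.tokens_left == 0:
                finish_step[req.rid] = steps
        running = [req for req in running if req.tokens_left > 0]
    return steps, finish_step


lengths = [2, 8, 2, 2, 2, 2]  # output lengths per request, in tokens
print(static_batching(lengths, max_batch=2)[0])      # total decode steps
print(continuous_batching(lengths, max_batch=2)[0])
```

The long request no longer holds a slot hostage: short requests finish as soon as their own tokens are done, which is where both the throughput and the latency gain come from.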
What you'll talk about in interviews:
- Why continuous batching beats static batching (request-level vs batch-level scheduling)
- PagedAttention: memory fragmentation problem, block table, copy-on-write for beam search
- Speculative decoding: why verification is cheap (parallel forward pass), acceptance rate, when it helps vs hurts
- Prefill vs decode phases and why they have different compute profiles
- Real-world constraints: maximum sequence length, GPU memory budget, SLO guarantees
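The speculative-decoding point above is easiest to internalize with the greedy variant, where the output must match the large model's greedy decoding exactly. The sketch uses two hypothetical toy "models" (plain deterministic functions over integer tokens) so it runs anywhere. The sampling variant replaces the equality check with a probabilistic accept/reject rule, which this sketch does not implement.

```python
def target_next(ctx):
    """Hypothetical toy 'large model': a deterministic next-token function."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Hypothetical toy 'draft model': disagrees whenever the target token is a multiple of 5."""
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 50


def greedy_decode(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n, k=4):
    """Draft proposes k tokens; target verifies them and keeps the matching prefix."""
    seq, target_passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # In a real system the k+1 target predictions come from ONE batched
        # forward pass over the proposed tokens; here they are computed lazily.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + proposal[:i])
            accepted.append(expected)  # always the target's own token
            if proposal[i] != expected:
                break  # first mismatch: stop after the correction
        else:
            accepted.append(target_next(seq + proposal))  # bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], target_passes


out, passes = speculative_decode([1, 2, 3], n=20)
print(out == greedy_decode([1, 2, 3], 20), passes)
```

Every target pass yields at least one token, so the worst case costs the same number of large-model passes as plain decoding plus the wasted draft work. That is the "when it hurts" case: a low acceptance rate.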
Cost: ~$50-100 in GPU rental. You can do most of this on a single A100.
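Whether a single GPU is enough comes down mostly to KV cache arithmetic, so it is worth scripting early. The config numbers below (16 layers, 8 KV heads, head dimension 64, fp16) are what I believe Llama-3.2-1B uses, but treat them as assumptions and read the real values from the model's config file. The block size of 16 and the function names are likewise illustrative.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """One K and one V vector per layer per KV head for every cached token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def reserved_tokens_contiguous(seq_lens, max_seq_len):
    """Naive allocator: every request reserves a full max-length slab up front."""
    return len(seq_lens) * max_seq_len


def reserved_tokens_paged(seq_lens, block_size=16):
    """Paged allocator: each request holds only the fixed-size blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(layers=16, kv_heads=8, head_dim=64)
seq_lens = [100, 900, 37, 2048]  # current lengths of four in-flight requests
used = sum(seq_lens)

for name, reserved in [
    ("contiguous", reserved_tokens_contiguous(seq_lens, max_seq_len=2048)),
    ("paged", reserved_tokens_paged(seq_lens)),
]:
    mib = reserved * per_token / 2**20
    print(f"{name}: {mib:.0f} MiB reserved, {used / reserved:.0%} utilized")
```

With paging, the only waste is the partially filled last block of each request, which is the memory utilization improvement Project 2 asks you to measure.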
Project 3: Distributed pretraining with FSDP + a scaling-law experiment
What: Train 4-5 models at different scales (10M, 30M, 100M, 300M, 1B parameters) on the same data, with matched compute budgets where possible. Plot the scaling curves (loss vs compute, loss vs parameters, loss vs data). Compare your empirical curves to the Chinchilla/Kaplan predictions. Do this on a multi-GPU FSDP setup (4-8 GPUs across 1-2 nodes).
Why: Scaling laws are the intellectual foundation of why frontier labs exist. Everyone at Anthropic/OpenAI/DeepMind understands them intuitively. Most candidates can recite the Chinchilla result but have never reproduced it. Actually fitting the curves on your own models — and finding where they break — is a conversation piece that instantly separates you from every other applicant.
Scope:
- Set up a multi-GPU training pipeline using PyTorch FSDP (or DeepSpeed ZeRO-3). Document the setup — this is itself a useful artifact.
- Train 5 models: 10M, 30M, 100M, 300M, 1B params. Same architecture family (decoder-only transformer), same data (FineWeb subset), same tokenizer.
- For each model, log: final loss, loss vs step, loss vs FLOPs, throughput (tokens/sec/GPU).
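- Plot: loss vs compute, IsoFLOP curves (loss vs N at a fixed compute budget), loss vs N (parameter scaling), loss vs D (data scaling).
- Fit the power law L(N,D) = E + A/N^α + B/D^β. Report your fitted α, β and compare to Chinchilla (α≈0.34, β≈0.28). Separating α from β needs runs where D varies independently of N, so train each size at more than one token budget.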
- Write a blog post: "I reproduced scaling laws on a $1000 GPU budget. Here's what matched and what didn't."
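The curve-fitting step above is less mysterious than it sounds. Here is a deliberately simplified sketch that fits only the parameter term, L(N) = E + A/N^α, by grid-searching E and running least squares in log space. The data is synthetic, generated from illustrative constants (α taken from the 0.34 quoted above, E and A are placeholders), not measurements. The full two-variable fit would need a nonlinear optimizer and runs at several token budgets.

```python
import math


def fit_power_law_with_offset(ns, losses, e_grid):
    """Fit L(N) = E + A / N**alpha.

    For each candidate E, log(L - E) = log(A) - alpha * log(N) is a straight
    line, so ordinary least squares gives A and alpha; keep the E whose line
    fits best.
    """
    xs = [math.log(n) for n in ns]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for e in e_grid:
        if e >= min(losses):  # log(L - E) would be undefined
            continue
        ys = [math.log(loss - e) for loss in losses]
        my = sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
    _, e, a, alpha = best
    return e, a, alpha


# SYNTHETIC losses for the five planned model sizes (placeholder constants).
ns = [10e6, 30e6, 100e6, 300e6, 1e9]
losses = [1.69 + 406.4 / n ** 0.34 for n in ns]
e_grid = [i / 100 for i in range(100, 250)]

e, a, alpha = fit_power_law_with_offset(ns, losses, e_grid)
print(f"E={e:.2f}  A={a:.1f}  alpha={alpha:.3f}")
```

On real runs the fit will be noisy with only five points, and how sensitive α is to the choice of E is itself worth reporting in the write-up.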
What you'll talk about in interviews:
- Compute-optimal training: why Chinchilla says "use more data, fewer params" vs the Llama school of "overtrain smaller models for inference efficiency"
- FSDP vs tensor parallelism vs pipeline parallelism: when each, why
- Practical distributed training: communication bottlenecks, gradient accumulation, mixed precision, activation checkpointing
- Where your curves deviated from theory — this is the most interesting part and proves you actually ran the experiments
Cost: ~$500-1500 in GPU rental. The 1B model is the expensive part; if budget-constrained, do 10M-300M and skip 1B (still excellent).
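To see why the 1B run dominates the budget, compare training FLOPs across the five sizes. The sketch assumes the commonly cited rule of thumb of about 20 training tokens per parameter and the 6 × N × D FLOPs approximation. Both are assumptions, and your actual token budgets may differ.

```python
def training_flops(params, tokens_per_param=20):
    """Approximate training compute: 6 * N * D, with D = tokens_per_param * N."""
    tokens = tokens_per_param * params
    return 6 * params * tokens


sizes = {"10M": 10e6, "30M": 30e6, "100M": 100e6, "300M": 300e6, "1B": 1e9}
flops = {name: training_flops(n) for name, n in sizes.items()}
total = sum(flops.values())
share = {name: f / total for name, f in flops.items()}

for name in sizes:
    print(f"{name:>5}: {flops[name]:.2e} FLOPs ({share[name]:.1%} of total)")
```

Because compute grows with the square of model size under a fixed tokens-per-parameter ratio, dropping the largest model removes most of the cost while keeping four of the five points on the curve.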
3. What NOT to do (common mistakes)
- Don't fine-tune Llama on a niche dataset and call it a project. Every ML hobbyist does this. It demonstrates you can run trl.SFTTrainer, not that you understand training. Zero signal for frontier labs.
- Don't build a RAG chatbot. This is application-layer work. It tells a frontier lab nothing about whether you understand model training.
- Don't do 6 shallow projects. Three deep ones with blog posts beats six GitHub repos with no write-up. The blog post IS the resume item — it proves you can explain what you did and why.
- Don't spend time on Kaggle competitions. Different signal, different audience. Frontier labs don't care about your leaderboard rank.
- Don't skip the write-up. A GitHub repo without a blog post is invisible. The blog post gets shared on Twitter/LinkedIn, gets indexed by Google, and is what a recruiter or hiring manager can actually read in 5 minutes.
4. Required reading (do this BEFORE the projects)
Don't just read these — take notes on what surprised you. The notes become your blog post framing.
| Paper / Resource | Why | Time |
|---|---|---|
| Attention Is All You Need (Vaswani et al, 2017) | The architecture. You probably already know this; skim to confirm. | 1 hr |
| GPT-2 / GPT-3 papers (Radford et al, Brown et al) | Decoder-only design choices, in-context learning discovery. | 2 hrs |
| Chinchilla (Hoffmann et al, 2022) | Scaling laws. This is the intellectual foundation for Project 3. | 2 hrs |
| Llama 2 + Llama 3 papers (Touvron/Grattafiori et al) | The most detailed public training recipes. Data, SFT, RLHF details. | 3 hrs |
| DeepSeek-R1 paper (plus the GRPO section of the DeepSeekMath paper, which introduced the method) | GRPO explained. The post-training method you'll implement in Project 1. | 2 hrs |
| FlashAttention 1 + 2 (Tri Dao) | The kernel that changed everything. Understand IO-awareness. | 2 hrs |
| vLLM / PagedAttention paper (Kwon et al) | How LLM serving actually works. Foundation for Project 2. | 1.5 hrs |
| RLHF paper (Ouyang et al, "InstructGPT") | The canonical RLHF recipe for instruction-following LLMs. Understand the reward model → PPO pipeline. | 2 hrs |
| DPO paper (Rafailov et al) | Why you can skip the reward model. Compare to PPO and GRPO. | 1.5 hrs |
| Andrej Karpathy's "Let's build GPT from scratch" (YouTube) | Fast warmup if you want a hands-on refresher. Watch at 2x. | 1.5 hrs |
| Your own LLM book chapters (the ones you've already written at /docs/llm-learning/) | You've literally been writing about this. Re-read your own work; it's targeted study material. | 2 hrs |
5. Timeline: 10-12 weeks while working full-time at Meta
- Weeks 1-2: Read all papers above. Take handwritten notes (not typed; handwriting forces you to actually think).
- Weeks 1-2: Set up cloud GPU access (Lambda Cloud, Together GPU Clusters, or Modal). Run a "hello world" multi-GPU training job to confirm your setup works.
- Weeks 1-2: Buy the compute budget upfront (~$1000-2000 for all three projects). This is a career investment that pays back 100x.
- Week 3: Build tokenizer + data pipeline. Start pretraining 125M model.
- Week 4: Complete pretraining. Run SFT. Begin GRPO implementation.
- Week 5: Debug GRPO (this WILL be hard — RL training is unstable). Get it working on GSM8K or MBPP.
- Week 6: Run final eval. Write blog post. Push to GitHub.
- Week 7: Naive baseline + KV caching. Benchmark.
- Week 8: Continuous batching + paged KV cache. Benchmark.
- Week 9: Speculative decoding. Final benchmarks. Blog post. Push to GitHub.
- Week 10: Set up FSDP pipeline. Train 10M and 30M models.
- Week 11: Train 100M and 300M models. Start plotting curves.
- Week 12: (Optional) train 1B. Fit power law. Blog post. Push to GitHub.
- After week 12: Your GitHub now has 3 repos with READMEs and blog posts.
- After week 12: Your personal site (ravikant.dev) links to all three.
- After week 12: You have concrete project-depth stories for every hiring-manager (HM) interview.
- After week 12: Widen the referral outreach you started at week 6, using all three blog post links as your introduction.
6. How each project maps to your top companies
| Company | Project 1 (GRPO) | Project 2 (Inference) | Project 3 (Scaling) |
|---|---|---|---|
| Anthropic | ✅ Core — they hire heavily for post-training/RL | ✅ Their serving infra is a major team | ✅ Pretraining team uses scaling laws daily |
| Periodic Labs | ✅ RL is their methodology | — | ✅ Scaling on scientific data |
| Together AI | — (they're infra, not post-training) | ✅ Core business | ✅ Training clusters are the product |
| Physical Intelligence | ✅ RL for robot foundation models | — | ✅ Scaling for robotics models |
| Harmonic | ✅ RL + verifiable rewards (math is their domain) | — | ✅ Scaling for reasoning |
| Mechanize | ✅ RL environments directly | — | — |
| World Labs | — (more pre/multimodal) | — | ✅ Foundation model scaling |
| insitro | — (bio-specific) | — | ✅ ML training methodology transfers |
| Fireworks AI | — | ✅ Core business | — |
| Guide Labs | ✅ Interpretable models need post-training | — | ✅ Scaling interpretable architectures |
| OpenAI | ✅ o-series is RL-heavy | ✅ Serving at massive scale | ✅ They published the original LLM scaling-laws paper (Kaplan et al.) |
7. Where to host and how to present
- GitHub: public repos. One repo per project. Clean README with a "Results" section (graphs, tables, numbers). Don't dump code without documentation.
- Blog: host on ravikant.dev (you already have the infrastructure). One post per project. Lead with the result, not the process. Title pattern: "I [did X]. Here's [surprising finding]." Not: "My journey learning about X."
- LinkedIn: post each blog post. This is how you get into referral conversations. Senior engineers at frontier labs scroll LinkedIn.
- Twitter/X: post a thread summarizing each project with 1-2 key graphs. Tag relevant researchers (Tri Dao, Lilian Weng, etc.) — not desperately, but if your results are interesting, they'll notice.
8. The application strategy
- Batch your applications. Don't trickle-apply over 6 months. Finish all 3 projects first, then apply to 8-10 companies in a 2-week window. Competing offers create leverage.
- For each company, pick ONE role. Don't shotgun 5 applications at Anthropic. Pick the best-fit role and apply once, with a referral if possible.
- The referral email template: "Hi [Name], I'm a senior ML engineer at Meta exploring my next move in LLM training/post-training. I recently [1-sentence summary of best project + link]. I'm interested in [specific role] at [company] — would you be open to a brief chat or a referral? [Your name]"
- Start referral outreach at week 6 (after Project 1 ships). Don't wait until all 3 are done — the first project is enough to start conversations.
- Interview prep is separate from project work. Weeks 13-14 should be dedicated to: system design mocks (LLM serving scenarios), coding practice (Python concurrency, data structures), and values/culture prep (Anthropic-specific reading).
9. Honest assessment of this plan
What this plan gets right:
- Covers the three most-transferable skill areas across your target companies
- Produces visible artifacts (not just knowledge in your head)
- Fits in 12 weeks of part-time work alongside a full-time Meta job
- Total cost ~$1000-2000 in GPU rental — trivial relative to the comp delta between jobs
What it leaves out:
- Kernel engineering. If you want Together AI or hardware companies specifically, you'd need a 4th project: write a fused attention kernel in Triton. I left this out because it's a distinct skill that takes 4+ weeks alone and only matters for ~5 companies on your list.
- Publications. These projects are portfolio pieces, not papers. If you're targeting a research-scientist role (vs research-engineer or ML engineer), you'd need to extend Project 1 or 3 into a paper submission. That adds ~4-8 weeks.
- Robotics / embodied AI. If Physical Intelligence or World Labs are top targets, you'd want a project that involves multi-modal training (vision + language or vision + action). That's a different project scope — tell me if you want me to design one.
- Domain credentialing for AI-for-science. insitro and Lila Sciences want domain knowledge (biology/chemistry) alongside ML. These projects don't cover that. If bio-AI is a priority, you'd need to pair with a bio collaborator or take a different approach.
10. What to do this week
- Set up a Lambda Cloud or Modal account. Run torchrun --nproc_per_node=2 on a hello-world FSDP training script. Confirm multi-GPU works. (1 hour)
- Read the DeepSeek-R1 paper. Take notes on GRPO specifically: reward function, KL penalty, group sampling. (2 hours)
- Read the vLLM/PagedAttention paper. Draw the block-table diagram by hand. (1.5 hours)
- Decide: do all 3 projects, or just Project 1 first? I'd recommend doing Project 1 first, shipping it, then deciding if you need 2 and 3 based on where your applications land.