This path is for review only; the original chapters remain unchanged in the main book.
All four Part III rewrite chapters are collected here, along with a standalone worked-example guide. Once the content and rendering are right, they can replace the existing Part III chapters.
Companion Guide
One standalone walkthrough that keeps every RL quantity tied to the same toy LLM answer
Part III: Reinforcement Learning & Alignment
Alternative chapters for RL fundamentals, policy optimization, RLHF, and DPO/alignment methods
- 9. Reinforcement Learning Foundations: A gentler path from delayed rewards and Bellman equations to policy gradients and REINFORCE (18 KB)
- 10. Policy Optimization: Trust regions, PPO, GAE, reward shaping, and why GRPO matters for LLMs (14 KB)
- 11. RLHF: Preference data, reward models, KL control, process rewards, and the real costs of RLHF (13 KB)
- 12. DPO and Alignment Alternatives: A simpler DPO derivation plus RLAIF, Constitutional AI, IPO, KTO, and when online RL still wins (12 KB)