This path is for review only; the original chapters remain unchanged in the main book.
All four Part III rewrite chapters are collected here, along with a standalone worked-example guide. Once the content and rendering are right, they can replace the existing Part III chapters.
Companion Guide
One standalone walkthrough that keeps every RL quantity tied to the same toy LLM answer
Part III: Reinforcement Learning & Alignment
Alternative chapters for RL fundamentals, policy optimization, RLHF, and DPO/alignment methods
- 9. Reinforcement Learning Foundations: A gentler path from delayed rewards and Bellman equations to policy gradients and REINFORCE (18 KB)
- 10. Policy Optimization: Trust regions, PPO, GAE, reward shaping, and why GRPO matters for LLMs (14 KB)
- 11. RLHF: Preference data, reward models, KL control, process rewards, and the real costs of RLHF (13 KB)
- 12. DPO and Alignment Alternatives: A simpler DPO derivation plus RLAIF, Constitutional AI, IPO, KTO, and when online RL still wins (12 KB)