AI From Scratch

Phase 09 / 12 lessons / ~13 hours

Reinforcement Learning

Agents that learn by doing. The foundation of RLHF.

Lessons

01 MDPs, States, Actions & Rewards
A Markov Decision Process is five things: states, actions, transitions, rewards, a discount. Everything in RL (Q-learning, PPO, DPO, GRPO) optimizes over this shape. Learn it once, read the rest of reinforcement learning for free.
Learn / ~45 minutes / Python

02 Dynamic Programming: Policy Iteration & Value Iteration
Dynamic programming is RL with cheating. You already know the transition and reward functions; you just iterate the Bellman equation until V or π stops moving. It is the benchmark every sampling-based method tries to approach.
Build / ~75 minutes / Python

03 Monte Carlo Methods: Learning from Complete Episodes
Dynamic programming needs a model. Monte Carlo needs nothing but episodes. Run the policy, watch the returns, average them. The simplest idea in RL, and the one that unlocks everything downstream.
Build / ~75 minutes / Python

04 Temporal Difference: Q-Learning & SARSA
Monte Carlo waits until the episode ends. TD updates after every step by bootstrapping from the next value estimate. Q-learning is off-policy and optimistic; SARSA is on-policy and cautious. Both are one line of code. Both underpin every deep-R...
Build / ~75 minutes / Python

05 Deep Q-Networks (DQN)
2013: Mnih et al. trained one Q-learning network on raw pixels and outperformed prior RL approaches on six of seven Atari games. 2015: extended to 49 games, published in Nature, sparked the deep-RL era. DQN is Q-learning plus three tricks that make function a...
Build / ~75 minutes / Python

06 Policy Gradient: REINFORCE from Scratch
Stop estimating value. Parameterize the policy directly, compute the gradient of expected return, step uphill. Williams (1992) wrote it in one theorem. It is why PPO, GRPO, and every LLM RL loop exist.
Build / ~75 minutes / Python

07 Actor-Critic: A2C and A3C
REINFORCE is noisy. Add a critic that learns V̂(s), subtract it from the return, and you get an advantage whose gradient estimate has the same expectation but far lower variance. That is actor-critic. A2C runs it synchronously; A3C runs it across threads. Bo...
Build / ~75 minutes / Python

08 Proximal Policy Optimization (PPO)
A2C throws away each rollout after one update. PPO wraps the policy gradient in a clipped importance ratio so you can do 10+ epochs on the same data without the policy exploding. Schulman et al. (2017). Still the default policy-gradient al...
Build / ~75 minutes / Python

09 Reward Modeling & RLHF
Humans cannot write a reward function for "good assistant response," but they can compare two responses and pick the better one. Fit a reward model to those comparisons, then optimize the language model against it with RL. Christiano 2017. InstructGPT 2...
Build / ~45 minutes / Python

10 Multi-Agent RL
Single-agent RL assumes the environment is stationary. Put two learning agents in the same world and that assumption breaks: each agent is part of the other's environment, and both are changing. Multi-agent RL is the set of tricks to make...
Build / ~45 minutes / Python

11 Sim-to-Real Transfer
A policy trained in a simulator that fails on hardware is a policy that memorized the simulator. Domain randomization, domain adaptation, and system identification are the three tools to make learned controllers cross the reality gap.
Learn / ~45 minutes / Python

12 RL for Games: AlphaZero, MuZero, and the LLM-Reasoning Era
1992: TD-Gammon reached near-champion level at backgammon with pure TD. 2016: AlphaGo beat Lee Sedol. 2017: AlphaZero dominated chess, shogi, and Go from scratch. 2025: DeepSeek-R1 showed the same recipe, with GRPO replacing PPO, works on reasoni...
Build / ~120 minutes / Python
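
As a rough illustration of the lesson 04 remark that Q-learning and SARSA are each one line of code, here is a minimal tabular sketch. The function names, the dictionary-backed Q-table keyed by (state, action), and the default alpha and gamma values are assumptions made for illustration; they are not taken from the course material.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # the "one line" for Q-learning


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # the "one line" for SARSA


# Hypothetical two-state example: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0
q_learning_update(Q, "s0", "right", 0.0, "s1", ["left", "right"])
sarsa_update(Q, "s0", "left", 0.0, "s1", "left")
```

The only difference between the two updates is the bootstrap target: Q-learning takes the maximum over next actions regardless of what the agent does, while SARSA uses the next action the behaviour policy actually chose, which is where the "optimistic" versus "cautious" contrast comes from.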