AI From Scratch/Lesson 06/~75 minutes

Policy Gradient — REINFORCE from Scratch

Stop estimating value. Parameterize the policy directly, compute the gradient of expected return, and step uphill. Williams (1992) wrote it in one theorem. PPO, GRPO, and most LLM RL loops build on it.
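To make "parameterize, take the gradient, step uphill" concrete, here is a minimal sketch of REINFORCE in plain Python. Everything in it is an assumption chosen for illustration, not the lesson's own code: a hypothetical 4-state corridor (start at state 0, reward 1.0 on reaching state 3), a tabular softmax policy with one preference per state and action, and the hyperparameters GAMMA, LR, and MAX_STEPS. The update follows the common form theta += lr * sum_t G_t * grad log pi(a_t | s_t), which drops the extra gamma^t weighting that a strict discounted derivation would carry.

```python
import math
import random

# Hypothetical toy problem (an assumption, not from the lesson):
# a corridor with states 0..3, start at 0, reward 1.0 on reaching state 3.
N_STATES, GOAL = 4, 3
LEFT, RIGHT = 0, 1
MAX_STEPS = 20    # episode length cap
GAMMA = 0.9       # discount factor
LR = 0.1          # step size for gradient ascent


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng):
    """Sample one trajectory as a list of (state, action, reward)."""
    s, traj = 0, []
    for _ in range(MAX_STEPS):
        p_left = softmax(theta[s])[LEFT]
        a = LEFT if rng.random() < p_left else RIGHT
        s_next = max(0, s - 1) if a == LEFT else s + 1
        done = s_next == GOAL
        traj.append((s, a, 1.0 if done else 0.0))
        if done:
            break
        s = s_next
    return traj


def reinforce_gradient(theta, traj, gamma=GAMMA):
    """Monte Carlo estimate: sum over t of G_t * grad log pi(a_t | s_t)."""
    grad = [[0.0, 0.0] for _ in theta]
    g = 0.0
    for s, a, r in reversed(traj):      # walk backwards to build returns-to-go
        g = r + gamma * g
        probs = softmax(theta[s])
        for b in (LEFT, RIGHT):
            # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            grad[s][b] += g * ((1.0 if b == a else 0.0) - probs[b])
    return grad


def train(episodes=2000, seed=0):
    """Run REINFORCE: sample an episode, estimate the gradient, step uphill."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        grad = reinforce_gradient(theta, run_episode(theta, rng))
        for s in range(N_STATES):
            for b in (LEFT, RIGHT):
                theta[s][b] += LR * grad[s][b]
    return theta
```

Calling `theta = train()` and then inspecting `softmax(theta[s])` for each non-goal state shows how strongly the learned policy prefers each action. No value function appears anywhere: the only learned quantity is the policy's own parameters. This sketch uses no baseline, so the gradient estimate is noisy; that is the usual motivation for the variance-reduction tricks that later methods add.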

Build / Python
