Loading lesson page...
AI From Scratch/Lesson 06/~75 minutes
Policy Gradient — REINFORCE from Scratch
Stop estimating value. Parameterize the policy directly, compute the gradient of expected return, step uphill. Williams (1992) wrote it in one theorem. It is why PPO, GRPO, and every LLM RL loop exist.
BuildPython