Phase 09: Reinforcement Learning
AI From Scratch/Lesson 08/~75 minutes

Proximal Policy Optimization (PPO)

A2C discards each rollout after a single gradient update. PPO wraps the policy gradient in a clipped importance ratio, so you can run 10+ epochs of updates on the same data without the policy moving destructively far from the one that collected it. Introduced by Schulman et al. (2017). Still the default policy-gradient al...
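The clipped importance ratio described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the lesson's own implementation: the function name `ppo_clip_loss`, its argument names, and the default `clip_eps=0.2` are chosen here for the example (0.2 is a commonly used value), and log-probabilities and advantages are assumed to be precomputed per sample.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples.

    new_logp:   log pi_new(a|s) under the policy being updated
    old_logp:   log pi_old(a|s) under the policy that collected the rollout
    advantages: advantage estimate for each (s, a) pair
    clip_eps:   clip range (hypothetical default chosen for this sketch)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Importance ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / len(advantages)
```

Because the minimum of the two terms is taken, a sample whose ratio has already moved past the clip range in the direction favored by its advantage contributes a constant, so it stops pushing the policy further. That is what makes repeated epochs on one rollout tolerable. A sample whose ratio has moved the wrong way is not clipped and still receives the full corrective signal.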

Build/Python/No prerequisites