AI From Scratch / Lesson 40 / ~90 minutes
Capstone Lesson 40: Direct Preference Optimization from Scratch
Reward models and PPO are the classical RLHF stack. DPO collapses that stack into a single supervised loss that fits a policy directly against preference pairs. This lesson derives the DPO loss from the reward-difference identity, ships a...
Build / Python (torch, numpy) / No prerequisites
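
As a preview of the "single supervised loss" mentioned above, here is a minimal sketch of the standard per-pair DPO objective, -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. It uses only the Python standard library so it runs anywhere; it is not the lesson's own implementation. The function names, the argument names, and the beta = 0.1 default are hypothetical choices for illustration, and the inputs are assumed to be summed sequence log-probabilities that have already been computed.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a sequence of summed log-probabilities, one entry per
    pair. beta=0.1 is an illustrative default, not a value from the lesson.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit rewards: scaled log-ratio of policy to frozen reference.
        chosen_reward = beta * (pc - rc)
        rejected_reward = beta * (pr - rr)
        # Logistic loss on the reward difference (the preference margin).
        losses.append(-log_sigmoid(chosen_reward - rejected_reward))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss([-5.0], [-7.0], [-5.0], [-7.0]))
    # Policy shifts probability toward the chosen response: loss drops below log(2).
    print(dpo_loss([-4.0], [-8.0], [-5.0], [-7.0]))
```

When the policy equals the reference, every margin is zero and the loss sits at log(2); it falls only as the policy raises the chosen response's log-ratio relative to the rejected one. A torch version would follow the same arithmetic on tensors, with the reference log-probabilities computed under no-gradient.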