AI From Scratch/Lesson 08/~90 minutes

DPO: Direct Preference Optimization

RLHF works. It also requires training three models (SFT, reward model, policy), managing PPO's instability, and tuning a KL penalty. DPO asks: what if you could skip all of that? DPO directly optimizes the language model on preference pair...
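As a rough illustration of what optimizing directly on preference pairs can look like, here is a minimal numpy sketch. It assumes the commonly cited DPO objective, -log σ(β · [(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the lesson.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is an array of summed sequence log-probabilities,
    one entry per preference pair. `beta` is an assumed default.
    """
    policy_chosen_logp = np.asarray(policy_chosen_logp, dtype=float)
    policy_rejected_logp = np.asarray(policy_rejected_logp, dtype=float)
    ref_chosen_logp = np.asarray(ref_chosen_logp, dtype=float)
    ref_rejected_logp = np.asarray(ref_rejected_logp, dtype=float)

    # How far the policy has moved from the frozen reference on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    losses = np.logaddexp(0.0, -margin)
    return losses.mean()

# Hypothetical log-probabilities for two preference pairs.
loss = dpo_loss(
    policy_chosen_logp=[-4.0, -6.5],
    policy_rejected_logp=[-5.0, -6.0],
    ref_chosen_logp=[-4.5, -6.0],
    ref_rejected_logp=[-4.5, -6.0],
)
print(loss)
```

No reward model or PPO loop appears here: the only inputs are log-probabilities from the policy and a frozen reference model. When the policy matches the reference exactly, the margin is zero and the loss equals log 2 (about 0.693); it drops below that as the policy shifts probability toward the chosen response relative to the rejected one.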

Build: Python (with numpy)
