AI From Scratch/Lesson 08/~90 minutes
DPO: Direct Preference Optimization
RLHF works. It also requires training three models (SFT, reward model, policy), managing PPO's instability, and tuning a KL penalty. DPO asks: what if you could skip all of that? DPO directly optimizes the language model on preference pair...
Build/Python (with numpy)
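As a rough illustration of what "optimizing directly on preference pairs" can look like, here is a minimal numpy sketch of the standard DPO loss. It assumes you already have summed log-probabilities of each chosen and rejected response under both the policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical choices for this sketch, not taken from the lesson.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is an array of sequence log-probabilities, one entry per pair.
    beta controls how strongly the policy is held near the reference model.
    """
    policy_chosen_logp = np.asarray(policy_chosen_logp, dtype=float)
    policy_rejected_logp = np.asarray(policy_rejected_logp, dtype=float)
    ref_chosen_logp = np.asarray(ref_chosen_logp, dtype=float)
    ref_rejected_logp = np.asarray(ref_rejected_logp, dtype=float)

    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    losses = np.logaddexp(0.0, -logits)
    return losses.mean()

# Toy numbers (made up): the policy favors the chosen response more than
# the reference does, so the loss falls below log(2).
loss = dpo_loss([-4.0], [-8.0], [-5.0], [-7.0], beta=0.1)
```

When the policy equals the reference, the margin is zero and the loss is exactly log(2); it shrinks as the policy raises the chosen response's likelihood relative to the rejected one. No separate reward model or PPO loop appears anywhere in the computation.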