AI From Scratch/Lesson 03/~75 minutes
The Direct Preference Optimization Family
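Rafailov et al. (2023) showed that the optimum of KL-regularized RLHF has a closed form: the optimal policy is the reference policy reweighted by the exponentiated reward. Inverting that relation expresses the reward in terms of the policy itself, so you can skip the explicit reward model and optimize the policy directly on preference pairs. That insight spawned a family (IPO, KTO, SimPO, ORPO, BPO), each f...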
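To make the "optimize the policy directly" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from the lesson.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs.

    All names and the beta default are assumptions made for illustration.
    """
    # Implicit reward of each response: how far the policy has moved
    # its log-probability away from the reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen log-prob and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model appears anywhere in the computation: the policy's log-ratio against the reference plays that role. In a real training loop these scalars would be batched tensors with gradients flowing only through the policy terms.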