AI From Scratch/Lesson 03/~75 minutes

The Direct Preference Optimization Family

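Rafailov et al. (2023) showed that the KL-regularized RLHF objective has a closed-form optimal policy, which lets the reward be rewritten in terms of the policy and a frozen reference model. As a result, you can skip the explicit reward model and optimize the policy directly on preference pairs. That insight spawned a family (IPO, KTO, SimPO, ORPO, BPO), each f...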
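To make the "skip the reward model" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this lesson; the inputs are assumed to be summed token log-probabilities of each response under the trainable policy and under the reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair; all names are hypothetical."""
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere: the only learned quantity is the policy's log-probability, and the reference model acts as a fixed anchor. In practice this would be computed over batches with an autodiff framework; the scalar version here is only meant to show the shape of the objective.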
