AI From Scratch/Lesson 09/~45 minutes

Reward Modeling & RLHF

Humans cannot write a reward function for a "good assistant response," but they can compare two responses and pick the better one. Fit a reward model to those comparisons, then optimize the language model against it with reinforcement learning. Christiano 2017. InstructGPT 2...
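To make "fit a reward model to those comparisons" concrete, here is a minimal sketch. It assumes the common Bradley-Terry formulation, where the probability that the chosen response beats the rejected one is sigmoid(r(chosen) - r(rejected)). The linear reward model, the two-number feature vectors, and every function name below are hypothetical, chosen only to keep the example small; a real reward model would score text with a neural network.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def reward(w, features):
    # Hypothetical linear reward model: a dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, features))

def pairwise_loss(w, pairs):
    # Mean of -log sigmoid(r(chosen) - r(rejected)) over all comparisons.
    total = 0.0
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)

def fit_reward_model(pairs, dim, lr=0.5, steps=200):
    # Plain batch gradient descent on the pairwise loss.
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            coeff = sigmoid(margin) - 1.0  # derivative of the loss w.r.t. the margin
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

# Toy comparisons (made-up data): each response is (helpfulness, rudeness),
# and each pair is (response the human chose, response the human rejected).
pairs = [
    ([0.9, 0.1], [0.2, 0.3]),
    ([0.7, 0.0], [0.6, 0.8]),
    ([0.8, 0.2], [0.1, 0.1]),
]

w = fit_reward_model(pairs, dim=2)
print("loss before:", pairwise_loss([0.0, 0.0], pairs))
print("loss after: ", pairwise_loss(w, pairs))
```

After fitting, `reward(w, x)` gives a scalar score for any response's features, which is the signal a reinforcement learning step would then maximize.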

Build/Python
