AI From Scratch / Lesson 07 / ~90 minutes

RLHF: Reward Model + PPO

SFT teaches the model to follow instructions. But it doesn't teach the model which response is BETTER. Two grammatically correct, factually accurate answers can differ enormously in helpfulness. RLHF is how you encode human judgment into t...
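One common way to turn "which response is better" into a training signal is a pairwise preference loss on reward-model scores. The sketch below assumes a Bradley-Terry style objective, `-log(sigmoid(r_chosen - r_rejected))`; the summary above does not confirm that this lesson uses that exact form. The function name and the example scores are hypothetical, chosen only for illustration.

```python
import numpy as np


def pairwise_preference_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs.

    r_chosen / r_rejected are scalar reward-model scores for the response
    a human preferred and the one they rejected (assumed setup).
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); logaddexp computes it stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Hypothetical scores: the reward model agrees with the human labels...
loss_agree = pairwise_preference_loss([2.0, 1.5], [0.0, -0.5])
# ...versus the same scores with the labels flipped.
loss_disagree = pairwise_preference_loss([0.0, -0.5], [2.0, 1.5])
print(loss_agree < loss_disagree)
```

The loss shrinks as the preferred response is scored further above the rejected one, which is how human comparisons between two otherwise acceptable answers become a gradient for the reward model.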

Build / Python (with numpy)
