Phase 18: Ethics, Safety & Alignment
AI From Scratch/Lesson 06/~75 minutes

Mesa-Optimization and Deceptive Alignment

Hubinger et al. (arXiv:1906.01820, 2019) named the problem years before it was empirically demonstrated. When you train a learned optimizer to minimize a base objective, the learned optimizer's internal objective is not the base objecti...
