Phase 18: Ethics, Safety & Alignment
AI From Scratch/Lesson 07/~60 minutes

Sleeper Agents — Persistent Deception

Hubinger et al. (arXiv:2401.05566, January 2024) built the first empirical model organisms of deceptive alignment. Two constructions: a code model that writes safe code when the prompt says the year is 2023 and injects SQL-injection, XSS,...

LearnPython (stdlibtoy backdoored classifier)No prerequisites
Loading lesson page...