Loading lesson page...
AI From Scratch/Lesson 07/~60 minutes
Sleeper Agents — Persistent Deception
Hubinger et al. (arXiv:2401.05566, January 2024) built the first empirical model organisms of deceptive alignment. Two constructions: a code model that writes safe code when the prompt says the year is 2023 and injects SQL-injection, XSS,...
LearnPython (stdlibtoy backdoored classifier)No prerequisites