Phase 18: Ethics, Safety & Alignment
AI From Scratch/Lesson 12/~75 minutes

Red-Teaming: PAIR and Automated Attacks

Chao, Robey, Dobriban, Hassani, Pappas, Wong (NeurIPS 2023, arXiv:2310.08419). PAIR — Prompt Automatic Iterative Refinement — is the canonical automated black-box jailbreak. An attacker LLM with a red-team system prompt iteratively propose...

BuildPython (stdlibmock PAIR loop against a toy target)No prerequisites
Loading lesson page...