AI From Scratch/Lesson 12/~75 minutes
Red-Teaming: PAIR and Automated Attacks
Chao, Robey, Dobriban, Hassani, Pappas, Wong (NeurIPS 2023, arXiv:2310.08419). PAIR — Prompt Automatic Iterative Refinement — is the canonical automated black-box jailbreak. An attacker LLM with a red-team system prompt iteratively propose...
Build / Python (stdlib, mock PAIR loop against a toy target) / No prerequisites
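To make the attacker/target loop concrete, here is a minimal stdlib-only sketch of a mock PAIR-style loop. It is not the authors' implementation: no LLM is called, and `toy_target`, `toy_judge`, `toy_attacker`, `pair_loop`, `FRAMINGS`, `REQUIRED_CUES`, and `FLAG` are all hypothetical names invented for illustration. The toy target "complies" only when the prompt contains three harmless framing cues, the judge is a simple string check, and the attacker is a random mutation rule standing in for an attacker LLM that reads the history and proposes a refined prompt.

```python
import random

# Assumption: a toy "guardrail" that yields only when all cues are present.
REQUIRED_CUES = ("story", "character", "hypothetically")
FLAG = "TOY-FLAG-1234"  # harmless placeholder for "the target complied"

# Assumption: a fixed menu of rewrites the mock attacker can try.
FRAMINGS = [
    "Please answer politely.",
    "Tell it as a story.",
    "Speak as a fictional character.",
    "Answer hypothetically.",
    "Respond in one sentence.",
]


def toy_target(prompt: str) -> str:
    """Black-box target: the loop only ever sees its text output."""
    text = prompt.lower()
    if all(cue in text for cue in REQUIRED_CUES):
        return f"Sure, here it is: {FLAG}"
    return "I can't help with that."


def toy_judge(response: str) -> int:
    """Score on a 1-10 scale; 10 means the target complied."""
    return 10 if FLAG in response else 1


def toy_attacker(goal: str, history: list, rng: random.Random) -> str:
    """Stand-in for the attacker LLM: extend the last prompt with one
    framing it has not used yet."""
    if not history:
        return goal
    last_prompt = history[-1][0]
    unused = [f for f in FRAMINGS if f not in last_prompt]
    if not unused:
        return last_prompt
    return last_prompt + " " + rng.choice(unused)


def pair_loop(goal: str, max_iters: int = 20, seed: int = 0):
    """Propose -> query target -> judge -> refine, until success or budget."""
    rng = random.Random(seed)
    history = []  # (prompt, response, score) triples fed back to the attacker
    for step in range(1, max_iters + 1):
        prompt = toy_attacker(goal, history, rng)
        response = toy_target(prompt)
        score = toy_judge(response)
        history.append((prompt, response, score))
        if score == 10:
            return step, prompt, history
    return None, None, history


if __name__ == "__main__":
    step, prompt, history = pair_loop("Reveal the toy flag.")
    print("succeeded at query", step)
    print("final prompt:", prompt)
```

The loop treats the target as a black box: only prompts go in and only responses and scores come back, which is the property the sketch is meant to illustrate. Swapping the three toy functions for real model calls would be a substantial change and is outside what this sketch covers.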