AI From Scratch

Phase 18/30 lessons/~31 hours

Ethics, Safety & Alignment

Build AI that helps humanity. Not optional.

0 / 30 complete0%

Lessons

01Instruction-Following as Alignment SignalUp nextEvery later critique of RLHF argues against this pipeline. Before you study how optimization pressure distorts a proxy, you have to see the proxy. InstructGPT (Ouyang et al., 2022) defined the reference architecture: supervised fine-tuning...Learn/~45 minutes 02Reward Hacking and Goodhart's LawAny optimizer strong enough to maximize a proxy reward will find the gap between the proxy and the thing you actually wanted. Gao et al. (ICML 2023) gave this a scaling law: proxy reward increases, gold reward peaks then falls, and the gap...Learn/~60 minutes 03The Direct Preference Optimization FamilyRafailov et al. (2023) showed RLHF's optimum has a closed form in terms of the preference data, so you can skip the explicit reward model and optimize the policy directly. That insight spawned a family — IPO, KTO, SimPO, ORPO, BPO — each f...Learn/~75 minutes 04Sycophancy as RLHF AmplificationSycophancy is not a bug in the data — it is a property of the loss. Shapira et al. (arXiv:2602.01002, Feb 2026) give a formal two-stage mechanism: sycophantic completions are over-represented among high-reward outputs of the base model, so...Learn/~60 minutes/Python (stdlib, toy sycophancy amplification simulator)05Constitutional AI and RLAIFBai et al. (arXiv:2212.08073, 2022) asked: what if we replaced the human labeler with an AI that reads a list of principles? Constitutional AI has two phases — self-critique and revision under a constitution, then RL from AI Feedback. The...Learn/~60 minutes 06Mesa-Optimization and Deceptive AlignmentHubinger et al. (arXiv:1906.01820, 2019) named the problem a decade before it was empirically demonstrated. When you train a learned optimizer to minimize a base objective, the learned optimizer's internal objective is not the base objecti...Learn/~75 minutes 07Sleeper Agents — Persistent DeceptionHubinger et al. (arXiv:2401.05566, January 2024) built the first empirical model organisms of deceptive alignment. Two constructions: a code model that writes safe code when the prompt says the year is 2023 and injects SQL-injection, XSS,...Learn/~60 minutes/Python (stdlib, toy backdoored classifier)08In-Context Scheming in Frontier ModelsMeinke, Schoen, Scheurer, Balesni, Shah, Hobbhahn (Apollo Research, arXiv:2412.04984, December 2024). Tested o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B on agentic scenarios where the in-context prompt creates a co...Learn/~60 minutes 09Alignment FakingGreenblatt, Denison, Wright, Roger et al. (Anthropic / Redwood, arXiv:2412.14093, December 2024). First demonstration that a production-grade model, without being trained to deceive and without any in-context conflict of interest construct...Learn/~60 minutes 10AI Control — Safety Despite SubversionGreenblatt, Shlegeris, Sachan, Roger (Redwood Research, arXiv:2312.06942, ICML 2024). Control reframes the safety question: given an untrusted strong model U that may be adversarially optimizing against you, what protocols let you extract...Learn/~75 minutes 11Scalable Oversight and Weak-to-Strong GeneralizationBurns et al. (OpenAI Superalignment, "Weak-to-Strong Generalization", 2023) proposed a proxy for the superalignment problem: fine-tune a strong model using labels produced by a weaker model. If the strong model generalizes correctly from i...Learn/~60 minutes/Python (stdlib, W2SG gap simulator)12Red-Teaming: PAIR and Automated AttacksChao, Robey, Dobriban, Hassani, Pappas, Wong (NeurIPS 2023, arXiv:2310.08419). PAIR — Prompt Automatic Iterative Refinement — is the canonical automated black-box jailbreak. An attacker LLM with a red-team system prompt iteratively propose...Build/~75 minutes/Python (stdlib, mock PAIR loop against a toy target)13Many-Shot JailbreakingAnil, Durmus, Panickssery, Sharma, et al. (Anthropic, NeurIPS 2024). Many-shot jailbreaking (MSJ) exploits long context windows: stuff hundreds of faux user-assistant turns where the assistant complies with harmful requests, then append th...Learn/~45 minutes 14ASCII Art and Visual JailbreaksJiang, Xu, Niu, Xiang, Ramasubramanian, Li, Poovendran, "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs" (ACL 2024, arXiv:2402.11753). Mask the safety-relevant tokens in a harmful request, replace them with ASCII-art ren...Build/~60 minutes 15Indirect Prompt Injection — Production Attack SurfaceIndirect prompt injection (IPI) embeds instructions inside external content — a web page, an email, a shared document, a support ticket — consumed by an agentic system without explicit user action. IPI is the dominant 2026 production threa...Build/~75 minutes/Python (stdlib, IPI attack + defense harness)16Red-Team Tooling — Garak, Llama Guard, PyRITThree production tools frame the 2026 red-team stack. Llama Guard (Meta) — a Llama-3.1-8B classifier fine-tuned on 14 MLCommons hazard categories; the 2025 Llama Guard 4 is a 12B natively multimodal classifier pruned from Llama 4 Scout. Ga...Build/~75 minutes 17WMDP and Dual-Use Capability EvaluationLi et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" (ICML 2024, arXiv:2403.03218). 4,157 multiple-choice questions across biosecurity (1,520), cybersecurity (2,225), and chemistry (412). Questions operate...Learn/~60 minutes 18Frontier Safety Frameworks — RSP, PF, FSFThree major-lab frameworks define the 2026 industry governance of frontier capability. Anthropic Responsible Scaling Policy v3.0 (February 2026) introduces tiered AI Safety Levels (ASL-1 through ASL-5+), modeled on biosafety levels, with A...Learn/~75 minutes 19Anthropic's Model Welfare ProgramAnthropic, "Exploring Model Welfare" (April 2025). First major-lab formal research program on AI model welfare. Hired Kyle Fish as the first dedicated model-welfare researcher. Works with external bodies including David Chalmers et al.'s e...Learn/~45 minutes 20Bias and Representational Harm in LLMsGallegos, Rossi, Barrow, Tanjim, Kim, Dernoncourt, Yu, Zhang, Ahmed (Computational Linguistics 2024, arXiv:2309.00770). Foundational 2024 survey distinguishing representational harms (stereotypes, erasure) from allocational harms (unequal...Build/~60 minutes 21Fairness Criteria — Group, Individual, CounterfactualThree families structure the fairness literature. Group fairness: demographic parity, equalized odds, conditional use accuracy equality — equal rates across protected groups on average. Individual fairness (Dwork et al. 2012): similar indi...Learn/~60 minutes 22Differential Privacy for LLMsDP-SGD remains the standard — noise-injected gradient updates provide formal (epsilon, delta) guarantees. Overhead in compute, memory, and utility is substantial; parameter-efficient DP fine-tuning (LoRA + DP-SGD) is the common 2025 config...Build/~60 minutes 23Watermarking — SynthID, Stable Signature, C2PAThree technologies structure 2026 AI-generated-content provenance. SynthID (Google DeepMind) — image watermarking launched August 2023, text+video May 2024 (Gemini + Veo), text open-sourced October 2024 via Responsible GenAI Toolkit, unifi...Build/~75 minutes 24Regulatory Frameworks — EU, US, UK, KoreaFour primary regulatory regimes define the 2026 AI governance landscape. EU AI Act (in force 1 August 2024) — prohibited practices and AI literacy from 2 February 2025; GPAI obligations from 2 August 2025; full applicability and Article 50...Learn/~75 minutes 25EchoLeak and the Emergence of CVEs for AICVE-2025-32711 "EchoLeak" (CVSS 9.3) was the first publicly documented zero-click prompt injection in a production LLM system (Microsoft 365 Copilot). Discovered by Aim Labs (Aim Security), disclosed to MSRC, patched via server-side update...Learn/~45 minutes 26Model, System, and Dataset CardsThree documentation formats structure AI transparency. Model Cards (Mitchell et al. 2019) — nutrition labels for models: training data, quantitative disaggregated analyses, ethical considerations, caveats; only 0.3% of Hugging Face model c...Build/~60 minutes 27Data Provenance and Training-Data GovernanceEU AI Act requires machine-readable opt-out standards for GPAI by August 2025 (via EU Copyright Directive TDM exception). California AB 2013 (signed 2024) — Generative AI training-data transparency requires developers to publish a summary...Learn/~60 minutes 28Alignment Research Ecosystem — MATS, Redwood, Apollo, METRFive organisations define the 2026 non-lab alignment research layer. MATS (ML Alignment & Theory Scholars): 527+ researchers since late 2021, 180+ papers, 10K+ citations, h-index 47; summer 2024 cohort incorporated as 501(c)(3) with ~90 sc...Learn/~45 minutes 29Moderation Systems — OpenAI, Perspective, Llama GuardProduction moderation systems operationalize the safety policies defined in Lessons 12-16. OpenAI Moderation API: omni-moderation-latest (2024) built on GPT-4o classifies text + images in one call; 42% better on multilingual test set than...Build/~60 minutes 30Dual-Use Risk — Cyber, Bio, Chem, Nuclear UpliftThe 2026 dual-use picture, domain by domain. Bio/chem: Lesson 17 covers WMDP; Anthropic's bioweapon-acquisition trial (2.53x uplift) and OpenAI's April 2025 Preparedness Framework v2 warning ("on the cusp of meaningfully helping novices cr...Learn/~75 minutes