AI From Scratch

Phase 10 / 24 lessons / ~26 hours

LLMs from Scratch

Build, train, and understand large language models.

Lessons

01 Tokenizers: BPE, WordPiece, SentencePiece
Your LLM does not read English. It reads integers. The tokenizer decides whether those integers carry meaning or waste it.
Build / ~90 minutes / Python

02 Building a Tokenizer from Scratch
Lesson 01 gave you a toy. This lesson gives you a weapon.
Build / ~90 minutes / Python

03 Data Pipelines for Pre-Training
The model is a mirror. It reflects whatever data you feed it. Feed it garbage, it reflects garbage with perfect fluency.
Build / ~90 minutes / Python

04 Pre-Training a Mini GPT (124M Parameters)
GPT-2 Small has 124 million parameters. That's 12 transformer layers, 12 attention heads, and 768-dimensional embeddings. You can train it from scratch on a single GPU in a few hours. Most people never do this. They use pre-trained checkpo...
Build / ~120 minutes / Python (with numpy)

05 Scaling: Distributed Training, FSDP, DeepSpeed
Your 124M model trained on one GPU. Now try 7 billion parameters. The model doesn't fit in memory. The data takes weeks on a single machine. Distributed training isn't optional at scale. It's the only path forward.
Build / ~120 minutes / Python

06 Instruction Tuning (SFT)
A base model predicts the next token. That's it. It doesn't follow instructions, answer questions, or refuse harmful requests. SFT is the bridge between a token predictor and a useful assistant. Every model you've ever talked to -- Claude,...
Build / ~90 minutes / Python (with numpy)

07 RLHF: Reward Model + PPO
SFT teaches the model to follow instructions. But it doesn't teach the model which response is BETTER. Two grammatically correct, factually accurate answers can differ enormously in helpfulness. RLHF is how you encode human judgment into t...
Build / ~90 minutes / Python (with numpy)

08 DPO: Direct Preference Optimization
RLHF works. It also requires training three models (SFT, reward model, policy), managing PPO's instability, and tuning a KL penalty. DPO asks: what if you could skip all of that? DPO directly optimizes the language model on preference pair...
Build / ~90 minutes / Python (with numpy)

09 Constitutional AI and Self-Improvement
RLHF needs humans in the loop. Constitutional AI replaces most of them with the model itself. Write a list of principles, have the model critique its own outputs against those principles, and train on the critiques. DeepSeek-R1 pushed this...
Build / ~45 minutes / Python (stdlib + numpy)

10 Evaluation: Benchmarks, Evals, LM Harness
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count the number of R's in "strawberry." The only eval that matters i...
Build / ~90 minutes / Python

11 Quantization: Making Models Fit
A 70B model in FP16 needs 140GB. Two A100s just for weights. Quantize to FP8: one 80GB GPU. INT4: a MacBook.
Build / ~120 minutes / Python (with numpy)

12 Inference Optimization
Two phases define LLM inference. Prefill processes your prompt in parallel -- compute-bound. Decode generates tokens one at a time -- memory-bound. Every optimization targets one or both.
Build / ~120 minutes / Python

13 Building a Complete LLM Pipeline
Everything from Lessons 01 to 12 is one stage of one pipeline. This lesson is the scaffold that turns those stages into a single end-to-end run: tokenize, pre-train, scale, SFT, align, evaluate, quantize, serve. You will not train a 70B mo...
Build / ~120 minutes / Python (stdlib)

14 Open Models: Architecture Walkthroughs
You built a GPT-2 Small from scratch in Lesson 04. Frontier open models in 2026 are the same family with five or six concrete changes. RMSNorm instead of LayerNorm. SwiGLU instead of GELU. RoPE instead of learned positions. GQA or MLA inst...
Learn / ~45 minutes / Python (stdlib)

15 Speculative Decoding and EAGLE-3
Phase 7 · Lesson 16 proved the math: the Leviathan rejection rule preserves the verifier's distribution exactly. This lesson is the training-stack view of 2026 production speculative decoding. EAGLE-3 turned the draft model from a cheap ap...
Build / ~75 minutes / Python (stdlib)

16 Differential Attention (V2)
Softmax attention spreads a small amount of probability over every non-matching token. Over 100k tokens that noise adds up and drowns the signal. Differential Transformer (Ye et al., ICLR 2025) fixes it by computing attention as the differ...
Build / ~60 minutes / Python (stdlib)

17 Native Sparse Attention (DeepSeek NSA)
At 64k tokens, attention eats 70-80% of decode latency. Every open-model lab has a plan to fix it. DeepSeek's NSA (ACL 2025 best paper) is the one that stuck: three parallel attention branches -- compressed coarse-grained tokens, selectivel...
Build / ~60 minutes / Python (stdlib)

18 Multi-Token Prediction (MTP)
Every autoregressive LLM from GPT-2 to Llama 3 trains on one loss per position: predict the next token. DeepSeek-V3 added a second loss per position: predict the token after that. The extra 14B of parameters (on a 671B model) got distilled...
Build / ~60 minutes / Python (stdlib)

19 DualPipe Parallelism
DeepSeek-V3 was trained on 2,048 H800 GPUs with MoE experts scattered across nodes. Cross-node expert all-to-all communication cost 1 GPU-hour of comm for every 1 GPU-hour of compute. GPUs were idle half the time. DualPipe (DeepSeek, Dec 2...
Learn / ~60 minutes / Python (stdlib, schedule simulator)

20 DeepSeek-V3 Architecture Walkthrough
Phase 10 · Lesson 14 named the six architectural knobs every open model turns. DeepSeek-V3 (December 2024, 671B parameters total, 37B active) turns all six and adds four more: Multi-Head Latent Attention, auxiliary-loss-free load balancing...
Learn / ~75 minutes / Python (stdlib, parameter calculator)

21 Jamba: Hybrid SSM-Transformer
State space models (SSMs) and transformers want different things. Transformers buy quality via attention at quadratic cost. SSMs buy linear-time inference and constant memory via a recurrence but lag quality. AI21's Jamba (March 2024) and...
Learn / ~60 minutes

22 Async and Hogwild! Inference
Speculative decoding (Phase 10 · 15) parallelizes tokens within one sequence. Multi-agent frameworks parallelize across whole sequences but force explicit coordination (voting, sub-task splitting). Hogwild! Inference (Rodionov et al., arXi...
Build / ~60 minutes / Python (stdlib)

23 Speculative Decoding and EAGLE
A frontier LLM generating one token requires a full forward pass over billions of parameters. That forward pass is massively over-provisioned: most of the time a much smaller model can guess the next 3-5 tokens correctly, and the big model...
Build / ~75 minutes / Python (with numpy)

24 Gradient Checkpointing and Activation Recomputation
Backprop keeps every intermediate activation. At 70B parameters and 128K context that is 3 TB of activations per rank. Checkpointing trades FLOPs for memory: recompute instead of save. The question is which segments to drop, and the answer...
Build / ~70 minutes / Python (with numpy, optional torch)
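
The memory figures in the Lesson 11 summary follow from simple arithmetic: parameter count times bytes per weight. Here is a minimal sketch of that calculation, not the lesson's own code. The helper name `weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical, and the estimate is assumed to cover weights only, ignoring the KV cache, activations, and any per-group scale or zero-point overhead a real quantization scheme adds.

```python
# Assumed storage cost per weight at each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 70B parameters, matching the model size quoted in the summary.
    for dtype in BYTES_PER_PARAM:
        print(f"70B @ {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

Under these assumptions the FP16 figure is 140 GB, which exceeds a single 80 GB GPU, while FP8 halves it to 70 GB and INT4 halves it again to 35 GB.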