AI From Scratch
Phase 07 / 16 lessons / ~14 hours

Transformers Deep Dive

The architecture that changed everything. Understand every layer.

Lessons
01 Why Transformers — The Problems with RNNs
RNNs process tokens one at a time. Transformers process all tokens at once. That single architectural bet changed every scaling curve in deep learning after 2017.
Learn / ~45 minutes / Python

02 Self-Attention from Scratch
Attention is a lookup table where every word asks "who matters to me?" - and learns the answer.
Build / ~90 minutes / Python

03 Multi-Head Attention
One attention head learns one relation at a time. Eight heads learn eight. Heads are free. Take more of them.
Build / ~75 minutes / Python

04 Positional Encoding — Sinusoidal, RoPE, ALiBi
Attention is permutation-equivariant. Without a positional signal, "The cat sat on the mat" and "mat the on sat cat the" produce the same per-token outputs, just reordered. Three algorithms fix it, each with a different bet on what "position" means.
Build / ~45 minutes / Python

05 The Full Transformer — Encoder + Decoder
Attention is the star. Everything else (residuals, normalization, feed-forward, cross-attention) is the scaffolding that lets you stack it deep.
Build / ~75 minutes / Python

06 BERT — Masked Language Modeling
GPT predicts the next word. BERT predicts a missing word. One sentence of difference, and half a decade of everything embedding-shaped.
Build / ~45 minutes / Python

07 GPT — Causal Language Modeling
BERT sees both sides. GPT sees only the past. The triangle mask is the most consequential single line of code in modern AI.
Build / ~75 minutes / Python

08 T5, BART — Encoder-Decoder Models
Encoders understand. Decoders generate. Put them back together and you get a model built for input → output tasks: translate, summarize, rewrite, transcribe.
Learn / ~45 minutes / Python

09 Vision Transformers (ViT)
An image is a grid of patches. A sentence is a grid of tokens. The same transformer eats both.
Build / ~45 minutes / Python

10 Audio Transformers — Whisper Architecture
Audio is an image of frequency over time. Whisper is an encoder-decoder transformer that eats log-mel spectrograms and speaks back.
Learn / ~45 minutes / Python

11 Mixture of Experts (MoE)
A dense 70B transformer activates every parameter for every token. A 671B MoE activates only 37B per token and can match or beat it on many benchmarks. Sparsity is the most important scaling idea of the decade.
Build / ~45 minutes / Python

12 KV Cache, Flash Attention & Inference Optimization
Training is parallel and FLOP-bound. Inference is serial and memory-bound. Different bottleneck, different tricks.
Build / ~75 minutes / Python

13 Scaling Laws
The 2020 Kaplan paper said: bigger model, lower loss. The 2022 Hoffmann paper said: you were under-training. Compute goes into two buckets, parameters and tokens, and the split is not obvious.
Learn / ~45 minutes / Python

14 Build a Transformer from Scratch — The Capstone
Thirteen lessons. One model. No shortcuts.
Build / ~120 minutes / Python

15 Attention Variants — Sliding Window, Sparse, Differential
Full attention is a circle. Every token sees every token, and memory pays the price. Four variants bend the shape of the circle and recover half the cost.
Build / ~60 minutes / Python

16 Speculative Decoding — Draft, Verify, Repeat
Autoregressive decoding is serial. Each token waits for the previous one. Speculative decoding breaks the chain: a cheap model drafts N tokens, the expensive model verifies all N in one forward pass. When the draft is right you paid one bi...
Build / ~60 minutes / Python
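
As a taste of what lessons 02 and 07 build toward, here is a minimal sketch of scaled dot-product attention with an optional causal ("triangle") mask. It is not the course's own code: the function names `softmax` and `attention`, the toy input `x`, and the choice of pure standard-library Python are all assumptions made for illustration, and a real layer would first apply learned query, key, and value projections.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Similarity of query i with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if causal:
            # The triangle mask: position i may not look at positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output i is the attention-weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Hypothetical toy input: three 2-d token vectors used directly as
# queries, keys, and values (no learned projections in this sketch).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = attention(x, x, x)                  # every token sees every token
masked = attention(x, x, x, causal=True)   # each token sees only the past
```

With the mask on, the first token can only attend to itself, so `masked[0]` is just `x[0]`, and changing a later token leaves the earlier outputs untouched. With the mask off and no positional signal, reversing the input simply reverses the outputs, which is the gap lesson 04 addresses.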