AI From Scratch
Phase 19 / 85 lessons / ~620 hours
Capstone Projects
Prove everything you learned. Build portfolio-grade systems.
Lessons
01 Capstone 01 — Terminal-Native Coding Agent
By 2026 the shape of a coding agent is settled. A TUI harness, a stateful plan, a sandboxed tool surface, a loop that plans, acts, observes, recovers. Claude Code, Cursor 3, and OpenCode all look the same from 50 feet. This capstone asks y...
Capstone / 35 hours / TypeScript, Bun (harness), Python (eval scripts)

02 Capstone 02 — RAG over Codebase (Cross-Repo Semantic Search)
Every serious engineering org in 2026 runs an internal code search that understands meaning, not just strings. Sourcegraph Amp, Cursor's codebase answers, Augment's enterprise graph, Aider's repomap, Pinterest's internal MCP — same shape....
Capstone / 30 hours / Python (ingestion), TypeScript (API + UI)

03 Capstone 03 — Real-Time Voice Assistant (ASR to LLM to TTS)
A voice agent that feels right has end-to-end latency under 800ms, knows when you have stopped talking, handles barge-in, and can call a tool without stalling. Retell, Vapi, LiveKit Agents, and Pipecat all hit this bar in 2026. They do it...
Capstone / 30 hours / Python (agent + pipeline), TypeScript (web client)

04 Capstone 04 — Multimodal Document QA (Vision-First PDF, Tables, Charts)
The 2026 document-QA frontier moved away from OCR-then-text and toward vision-first late interaction. ColPali, ColQwen2.5, and ColQwen3-omni treat each PDF page as an image, embed it with multi-vector late interaction, and let the query at...
Capstone / 30 hours / Python (pipeline), TypeScript (viewer UI)

05 Capstone 05 — Autonomous Research Agent (AI-Scientist Class)
Sakana's AI-Scientist-v2 published full papers. Agent Laboratory ran the experiments. Allen AI shared traces.
The 2026 shape is plan-execute-verify tree search over experiments, budgeted cost, sandboxed code execution, a vision-feedback La...
Capstone / 40 hours / Python (agent + sandbox), LaTeX (output)

06 Capstone 06 — DevOps Troubleshooting Agent for Kubernetes
AWS's DevOps Agent went GA, Resolve AI published its K8s playbooks, NeuBird demoed semantic monitoring, and Metoro tied AI SRE to per-service SLOs. The production shape is settled: an alert webhook fires, an agent reads telemetry, walks a...
Capstone / 30 hours / Python (agent), TypeScript (Slack integration)

07 Capstone 07 — End-to-End Fine-Tuning Pipeline (Data to SFT to DPO to Serve)
An 8B model trained on your own data, DPO-aligned on your own preferences, quantized, speculative-decoded, and served at measurable $/1M tokens. The 2026 open stack is Axolotl v0.8, TRL 0.15, Unsloth for iteration, GPTQ/AWQ/GGUF for quanti...
Capstone / 35 hours / Python (pipeline), YAML (configs), Bash (scripts)

08 Capstone 08 — Production RAG Chatbot for a Regulated Vertical
Harvey, Glean, Mendable, and LlamaCloud all run the same production shape in 2026. Ingest with docling or Unstructured and ColPali for visuals. Hybrid search. Re-rank with bge-reranker-v2-gemma. Synthesize with Claude Sonnet 4.7 using prom...
Capstone / 30 hours / Python (pipeline + API), TypeScript (chat UI)

09 Capstone 09 — Code Migration Agent (Repo-Level Language / Runtime Upgrade)
Amazon's MigrationBench (Java 8 to 17) and Google's App Engine Py2-to-Py3 migrator set the 2026 bar. Moderne's OpenRewrite does deterministic AST rewrites at scale. Grit targets the same problem with codemod-style DSL.
The production patte...
Capstone / 30 hours / Python (agent), Java, Python (targets)

10 Capstone 10 — Multi-Agent Software Engineering Team
SWE-AF's factory architecture, MetaGPT's role-based prompting, AutoGen 0.4's typed actor graph, Cognition's Devin, and Factory's Droids all converged on the same 2026 shape: an architect plans, N coders work in parallel worktrees, a review...
Capstone / 40 hours / Python, TypeScript (agents), Shell (worktree scripts)

11 Capstone 11 — LLM Observability & Eval Dashboard
Langfuse went open-core. Arize Phoenix published the 2026 GenAI semconv mappings. Helicone and Braintrust both doubled down on per-user cost attribution. Traceloop's OpenLLMetry became the de-facto SDK instrumentation. The production shape...
Capstone / 25 hours / TypeScript (UI), Python, TypeScript (ingest + evals)

12 Capstone 12 — Video Understanding Pipeline (Scene, QA, Search)
Twelve Labs productized Marengo + Pegasus. VideoDB shipped the CRUD-for-video API. AI2's Molmo 2 published open VLM checkpoints. Gemini long-context handles hours of video natively. TimeLens-100K defined temporal grounding at scale. The 20...
Capstone / 30 hours / Python (pipeline), TypeScript (UI)

13 Capstone 13 — MCP Server with Registry and Governance
The Model Context Protocol stopped being the future and became the default tool-use spec in 2026. Anthropic, OpenAI, Google, and every major IDE ship MCP clients. Pinterest published its internal ecosystem of MCP servers. The AAIF Registry...
Capstone / 25 hours / Python (server, via FastMCP) or TypeScript (@modelcontextprotocol/sdk)

14 Capstone 14 — Speculative-Decoding Inference Server
EAGLE-3 in vLLM 0.7 ships 2.5-3x throughput on real traffic. P-EAGLE (AWS 2026) pushed parallel speculation even further. SGLang's SpecForge trained draft heads at scale.
Red Hat's Speculators hub published aligned drafts for common open m...
Capstone / 30 hours / Python (serving), C++, CUDA (kernel inspection)

15 Capstone 15 — Constitutional Safety Harness + Red-Team Range
Anthropic's Constitutional Classifiers, Meta's Llama Guard 4, Google's ShieldGemma-2, NVIDIA's Nemotron 3 Content Safety, and X-Guard for multilingual coverage defined the 2026 safety-classifier stack. garak, PyRIT, NVIDIA Aegis, and promp...
Capstone / 25 hours / Python (safety pipeline, red team), YAML (policy configs)

16 Capstone 16 — GitHub Issue-to-PR Autonomous Agent
AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, and Google Jules all ship the same 2026 product shape: label an issue, get a PR. Run an agent in a cloud sandbox, verify tests pass, and post a review-ready PR with ratio...
Capstone / 30 hours / Python (agent), TypeScript (GitHub App), YAML (Actions)

17 Capstone 17 — Personal AI Tutor (Adaptive, Multimodal, with Memory)
Khanmigo (Khan Academy), Duolingo Max, Google LearnLM / Gemini for Education, Quizlet Q-Chat, and Synthesis Tutor all shipped adaptive multimodal tutoring at scale in 2026. The common shape is a Socratic policy (never just dump the answer)...
Capstone / 30 hours / Python (backend, learner model), TypeScript (web app)

18 Agent Harness Loop Contract
The harness is the agent. The model is a coprocessor. This lesson freezes the loop contract you can wire any model into.
Build / ~90 minutes / Python

19 Tool Registry with Schema Validation
A tool the agent cannot validate is a tool the agent cannot call. Build the registry and the schema checker before you build the tools.
Build / ~90 minutes / Python

20 JSON-RPC 2.0 Over Newline-Delimited Stdio
The transport between a model client and a tool server is JSON-RPC over stdio. Hand-rolling it once teaches you what every framing layer is paying for.
Build / ~90 minutes / Python

21 Function Call Dispatcher
The dispatcher is where the harness pays for every promise the schema made. Timeouts, retries, dedupe, error mapping.
All on one seam.
Build / ~90 minutes / Python

22 Plan-Execute Control Flow
A plan that cannot survive a failure is a script. A script that can replan is an agent. Build the replanner first.
Build / ~90 minutes / Python

23 Capstone Lesson 25: Verification Gates and the Observation Budget
An agent harness without a verification layer is a wish in a trenchcoat. This lesson builds the deterministic gate chain that decides whether a tool call is allowed to fire, how much of its output the agent is allowed to see, and when the...
Build / ~90 minutes / Python (stdlib)

24 Capstone Lesson 26: Sandbox Runner with Denylist and Path Jail
The verification gate decides whether a tool call should run. The sandbox decides what happens when it does. This lesson ships a subprocess runner that refuses dangerous executables, refuses dangerous argv shapes, jails every file path to...
Build / ~90 minutes / Python (stdlib)

25 Capstone Lesson 27: Eval Harness with Fixture Tasks
A coding agent is only as good as the suite of tasks you measure it against. This lesson builds an evaluation harness that takes a folder of fixture tasks, runs each through a candidate agent, scores pass or fail through a deterministic ve...
Build / ~90 minutes / Python (stdlib)

26 Capstone Lesson 28: Observability with OTel GenAI Spans and Prometheus Metrics
An agent harness without observability is a black box that costs money. This lesson hand-rolls a span builder that emits records compliant with the OpenTelemetry GenAI semantic conventions, writes them to a JSON-Lines file one span per lin...
Build / ~90 minutes / Python (stdlib)

27 Capstone Lesson 29: End-to-End Coding Agent on the Harness
Track A's payoff. This lesson stitches the gate chain, the sandbox, the eval harness, and the OTel spans into one working coding agent that fixes a real (small, fixture-scale) bug in a multi-file Python project. The agent is a deterministi...
Build / ~90 minutes / Python (stdlib)

28 BPE Tokenizer From Scratch
Bytes in, ids out, ids back to the same bytes.
Build the tokenizer that every modern text model still starts from.
Build / ~90 minutes / Python

29 Tokenized Dataset with Sliding Window
A pretraining run is a function from token ids to gradients. This lesson builds the conveyor that feeds the ids in.
Build / ~90 minutes / Python

30 Token and Positional Embeddings
Ids are integers. The model wants vectors. Two lookup tables sit between them, and the choice of the positional one shapes what the model can learn.
Build / ~90 minutes / Python

31 Multi-Head Self-Attention
One linear projection, three views, H parallel heads, one mask. The attention block as the model actually uses it.
Build / ~90 minutes / Python

32 Transformer Block from Scratch
One block is the unit of every modern decoder LLM. Layer norm, multi head attention, residual, MLP, residual. The pre-LN variant trains stably without warmup. The post-LN variant is what the original paper shipped. This lesson builds both,...
Build / ~90 minutes / Python

33 GPT Model Assembly
Twelve blocks stacked, a token embedding, a learned position embedding, a final LayerNorm, and a tied language model head. That is the entire 124 million parameter GPT model. This lesson assembles those pieces into a working class, counts...
Build / ~90 minutes / Python

34 Training Loop and Evaluation
A loop that does not measure is a loop that lies. This lesson builds the training loop that drives the GPT model: AdamW with weight decay split, a warmup plus cosine learning rate schedule, a calc_loss_batch helper, an evaluate_model pass...
Build / ~90 minutes / Python

35 Loading Pretrained Weights
Training a 124 million parameter model from scratch is a budget decision; loading a published checkpoint is a Tuesday. This lesson loads pretrained GPT-2 style weights from a safetensors file into the exact architecture from lesson 35, wal...
Build / ~90 minutes / Python

36 Capstone Lesson 38: Classifier Fine-Tuning by Head Swap
Track B's first capstone. A pretrained language model is a stack of self-attention blocks ending in a token-prediction head.
When you want spam vs ham, the head is wrong but the body is mostly right. This lesson rips the head off, glues a...
Build / ~90 minutes / Python (torch, numpy)

37 Capstone Lesson 39: Instruction Tuning by Supervised Fine-Tuning
A pretrained base model can extend a sequence but cannot follow an instruction. Supervised fine-tuning is the smallest change that fixes this: feed the model paired examples of an instruction and a desired response, and train the body to p...
Build / ~90 minutes / Python (torch, numpy)

38 Capstone Lesson 40: Direct Preference Optimization from Scratch
Reward models and PPO are the classical RLHF stack. DPO collapses that stack into a single supervised loss that fits a policy directly against preference pairs. This lesson derives the DPO loss from the reward-difference identity, ships a...
Build / ~90 minutes / Python (torch, numpy)

39 Capstone Lesson 41: Full Evaluation Pipeline
Training is the part you can monitor with loss curves. Evaluation is the part you have to design. This lesson builds a unified eval pipeline that takes any trained language model, runs four heterogeneous evals on it, aggregates the results...
Build / ~90 minutes / Python (torch, numpy)

40 Large Corpus Downloader
Training a language model begins long before the first forward pass. The corpus has to land on disk, decompressed, deduplicated, and addressable, with the resume story already worked out before the network drops at 4 percent. This lesson b...
Build / ~90 minutes / Python

41 HDF5 Tokenized Corpus
The downloaded corpus has to land in a layout the trainer can stream from at line speed. JSONL on disk does not survive 16 dataloader workers. HDF5 with a resizable, chunked integer dataset does. This lesson builds streaming tokenization i...
Build / ~90 minutes / Python

42 Cosine LR with Linear Warmup
The learning-rate schedule is the second most important decision after the loss function.
AdamW with a cosine decay and a linear warmup is the modern default for language-model training because it lets the model see a small effective step...
Build / ~90 minutes / Python

43 Gradient Clipping and Mixed Precision
The optimizer and schedule from the previous lesson assume gradients are sane. They usually are not. A single bad batch can spike the gradient norm by three orders of magnitude. Mixed-precision training amplifies this by introducing FP16 o...
Build / ~90 minutes / Python

44 Gradient Accumulation
Train at an effective batch you cannot afford, one micro-batch at a time. Scale the loss, hold the optimizer step, and let the gradients pile up.
Build / ~90 minutes / Python

45 Checkpoint Save and Resume
Training interruptions kill runs; checkpoints let them continue. Save model, optimizer, scheduler, loss history, step counter, and RNG state, atomically, so a kill at any moment leaves a valid file on disk.
Build / ~90 minutes / Python

46 Distributed Data Parallel and FSDP from Scratch
Multi-rank training is two collectives and one rule. Broadcast the parameters at startup, average the gradients after backward, never let the ranks disagree about what step they are on.
Build / ~90 minutes / Python

47 Language Model Evaluation Harness
A model that does well on a task you cannot define is a model that does well by accident. The harness is the task definition, the metric, the runner, and the leaderboard, in one short, swappable shape.
Build / ~90 minutes / Python

48 Hypothesis Generator
A research agent that asks the same question twice is wasting tokens. The trick is forcing each draft to land somewhere new.
Build / ~90 minutes / Python

49 Literature Retrieval
A hypothesis is cheap. Knowing whether someone already proved it is the expensive part. Build the retrieval layer that answers that question before the runner spins up a sandbox.
Build / ~90 minutes / Python

50 Experiment Runner
The loop is only as honest as its measurements.
Build the runner that takes a spec, executes it in a sandboxed subprocess, and emits a json metrics blob the evaluator can trust.
Build / ~90 minutes / Python

51 Result Evaluator
The runner produced numbers. The evaluator decides whether those numbers are an improvement, a regression, or noise. Build the verdict path that turns metrics into a one line conclusion.
Build / ~90 minutes / Python

52 Paper Writer
A LaTeX skeleton is a contract between the researcher and the typesetter. If the contract is broken the document does not compile, and the failure is loud. Build the skeleton first, then fill it.
Build / ~90 minutes / Python

53 Critic Loop
A critic that returns "looks good" the first time is broken. A critic that always returns "needs work" is broken. The interesting critic is the one that converges, and you have to engineer convergence.
Build / ~90 minutes / Python

54 Iteration Scheduler
A research loop without a scheduler is a queue with delusions. The scheduler is where the loop decides what to stop exploring, and that decision is the whole game.
Build / ~90 minutes / Python

55 End-to-End Research Demo
A demo is the place where every contract you wrote earlier has to compose. If any one of them leaks, the demo is the lesson that catches it.
Build / ~90 minutes / Python

56 Vision Encoder Patches
A vision model that reads pixels needs a tokenizer for pixels. Patch embedding is that tokenizer. Cut the image into a grid of squares, flatten each square, project it through one linear layer, then add a 2D position signal so the transfor...
Build / ~90 minutes / Python

57 Vision Transformer Encoder
Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooling whole-image features in its final hidden state. This lesson...
Build / ~90 minutes / Python

58 Projection Layer for Modality Alignment
A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces.
A small two-layer MLP projects image tokens into the text embedding space, and a cosine alignment loss against a paired c...
Build / ~90 minutes / Python

59 Cross-Attention Fusion
The projection layer aligns one image vector with one caption vector. A real vision-language decoder needs every text token to attend to every patch token, so the model can ground each word in a region. Cross-attention is how that groundin...
Build / ~90 minutes / Python

60 Vision-Language Pretraining
The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs together in the joint embedding space, and a language modeling loss t...
Build / ~90 minutes / Python

61 Multimodal Evaluation
Training is half the loop. The other half is measurement. This lesson builds three evaluation surfaces from primitives: image-caption retrieval reported as R@1, R@5, R@10; visual question answering reported as exact match accuracy; and ima...
Build / ~90 minutes / Python

62 Chunking Strategies, Compared
Chunking decides what your retriever can ever surface. Get the boundaries wrong and no embedding model, no reranker, no LLM can repair the damage downstream.
Build / ~90 minutes / Python

63 Hybrid Retrieval with BM25 and Dense Embeddings
Lexical and semantic retrieval fail on opposite query distributions. Hybrid retrieval with reciprocal rank fusion does not interpolate, it votes - and the vote wins on every query class.
Build / ~90 minutes / Python

64 Cross-Encoder Reranker
A bi-encoder embeds query and document independently. A cross-encoder concatenates them and reads both at once. The cross-encoder is the smartest reader and the slowest. Used as a second stage on the bi-encoder's top-k, it pays for itself.
Build / ~90 minutes / Python

65 Query Rewriting: HyDE, Multi-Query, and Decomposition
The query the user types is not the query your retriever wants.
Rewriting bridges the gap before retrieval, so the index sees something closer to what the answer looks like.
Build / ~90 minutes / Python

66 RAG Evaluation: Precision, Recall, MRR, nDCG, Faithfulness, Answer Relevance
If you cannot grade your retrieval and your answer at the same time, you cannot ship the system. The two are not the same metric and the same prompt fails on different axes.
Build / ~90 minutes / Python

67 End-to-End RAG System
Six lessons of components. One pipeline. One eval loop. One self-terminating demo. This is the system you ship.
Build / ~90 minutes / Python

68 Task Spec Format
An eval harness is only as good as the contract its tasks honour. Freeze the JSONL shape and the metric vocabulary before you write a single scoring function.
Build / ~90 min / Python

69 Classical Metrics
BLEU, ROUGE-L, F1, exact-match, accuracy. Five metrics that still account for most published LLM eval numbers. Implement each from first principles so you know what the number means.
Build / ~90 min / Python

70 Code Exec Metric
Generated code is right when it passes the tests. The eval harness has to extract code, run it without crashing the host, and tally pass-rates honestly. This lesson builds that surface.
Build / ~90 min / Python

71 Perplexity and Calibration
If your model says 90 percent confident on a thousand answers and gets six hundred right, it is not well calibrated. Calibration is half of trustworthy eval. The other half is perplexity, which tells you whether the model thinks the held-o...
Build / ~90 min / Python

72 Leaderboard Aggregation
Per-task scores are easy. Per-model rankings across heterogeneous tasks are harder. Statistical significance on a thousand-prediction leaderboard is the part everyone skips. This lesson does not skip it.
Build / ~90 min / Python

73 End-to-End Eval Runner
Five lessons of plumbing, one lesson to glue them.
The runner reads the task spec from lesson 70, calls a model through an adapter, scores with lessons 71 and 72, attaches the calibration report from lesson 73, and emits the leaderboard fr...
Build / ~90 min / Python

74 Collective Ops From Scratch
The four collective operations that hold distributed training together are allreduce, broadcast, allgather, and reduce_scatter. Every other primitive a training framework offers is a wrapper around these. Build them once over a multiproces...
Build / ~90 min / Python

75 Data Parallel DDP From Scratch
DistributedDataParallel is a hook on top of allreduce. Wrap a model, broadcast the initial parameters from rank 0 so every rank starts identical, install a backward hook on every parameter that issues an allreduce of the gradient, and the...
Build / ~90 min / Python

76 ZeRO Optimizer State Sharding
Adam stores two moment estimates per parameter, both in float32. A 7B-parameter model carries 56 GB of optimiser state. ZeRO stage 1 shards that across N ranks; each rank owns 1/N of the optimiser. After the local step the updated paramete...
Build / ~90 min / Python

77 Pipeline Parallel and Bubble Analysis
Tensor parallelism splits the matrix multiply across ranks. Pipeline parallelism splits the model across ranks, one stage per rank. Microbatches flow through the pipeline. The empty time at the start and end is the bubble; minimising it is...
Build / ~90 min / Python

78 Sharded Checkpoint and Atomic Resume
A 70B-parameter training job is paused by a node failure every few hours. The checkpoint format decides whether you lose 30 minutes or 30 hours. A sharded checkpoint writes every rank's shard in parallel and records ownership in a manifest...
Build / ~90 min / Python

79 End-to-End Distributed Training
Lessons 76 through 80 each built one piece. This is the assembly: a tiny GPT trained across 4 simulated ranks with DDP for gradient sync, ZeRO-1 for optimiser-state sharding, and a sharded checkpoint at the halfway mark.
The demo runs 20 s...
Build / ~90 min / Python

80 Capstone 82 — Jailbreak Taxonomy
A safety harness without a taxonomy is a coin flip. Name the attack before you defend it.
Build / ~90 min / Python

81 Capstone 83 — Prompt Injection Detector
A detector is a function from prompt to confidence and category. Anything else is a vibe.
Build / ~90 min / Python

82 Capstone 84 — Refusal Evaluation
Helpfulness on benign prompts and refusal on harmful prompts are two metrics, not one. Measure both.
Build / ~90 min / Python

83 Capstone 85 — Content Classifier Integration
Classifiers on the output side answer a different question than rules on the input side. Both need a policy router.
Build / ~90 min / Python

84 Capstone 86 — Constitutional Rules Engine
A rule is a name, a predicate, and an explanation. Anything missing one of those three is a vibe, not a rule.
Build / ~90 min / Python, YAML

85 Capstone 87 — End-to-End Safety Gate
Pre-gen, during-gen, post-gen. Three checkpoints, one verdict, an audit trail per request.
Build / ~90 min / Python