Search
Find the next useful step.
Search across AIByDM learning tracks, lessons, projects, tools, practice games, exam prep, and newsletter issues.
692 indexed items
692 results
trackAI From ScratchA complete AI engineering curriculum organized into phases, from setup and math foundations through LLMs, agents, infrastructure, safety, and capstone projects.20 phases / 503 lessonsphasePhase 00: Setup & ToolingGet your environment ready for everything that follows.12 lessons / ~14 hoursphasePhase 01: Math FoundationsThe intuition behind every AI algorithm, through code — not textbooks.22 lessons / ~23 hoursphasePhase 02: ML FundamentalsClassical machine learning — still the backbone of most production AI.18 lessons / ~21 hoursphasePhase 03: Deep Learning CoreNeural networks from first principles. No frameworks until you build one yourself.13 lessons / ~15 hoursphasePhase 04: Computer VisionFrom pixels to understanding — image, video, and 3D.28 lessons / ~27 hoursphasePhase 05: NLP — Foundations to AdvancedLanguage is the interface to intelligence. Master every layer.29 lessons / ~30 hoursphasePhase 06: Speech & AudioThe other half of human communication. Hear, understand, speak.17 lessons / ~18 hoursphasePhase 07: Transformers Deep DiveThe architecture that changed everything. Understand every layer.16 lessons / ~14 hoursphasePhase 08: Generative AICreate images, video, audio, 3D, and more.15 lessons / ~14 hoursphasePhase 09: Reinforcement LearningAgents that learn by doing. The foundation of RLHF.12 lessons / ~13 hoursphasePhase 10: LLMs from ScratchBuild, train, and understand large language models.24 lessons / ~26 hoursphasePhase 11: LLM EngineeringPut LLMs to work in production applications.17 lessons / ~17 hoursphasePhase 12: Multimodal AIModels that see, hear, read, and reason across modalities.25 lessons / ~65 hoursphasePhase 13: Tools & ProtocolsThe interfaces between AI and the real world.23 lessons / ~24.5 hoursphasePhase 14: Agent EngineeringThe core of modern AI engineering. 
Build agents from first principles.42 lessons / ~42 hoursphasePhase 15: Autonomous SystemsAgents that run without human intervention — safely.22 lessons / ~20 hoursphasePhase 16: Multi-Agent & SwarmsCoordination, emergence, and collective intelligence.25 lessons / ~28 hoursphasePhase 17: Infrastructure & ProductionShip AI to the real world. Scale, monitor, optimize.28 lessons / ~32 hoursphasePhase 18: Ethics, Safety & AlignmentBuild AI that helps humanity. Not optional.30 lessons / ~31 hoursphasePhase 19: Capstone ProjectsProve everything you learned. Build portfolio-grade systems.85 lessons / ~620 hourslessonDev EnvironmentYour tools shape your thinking. Set them up once, set them up right.Phase 00: Setup & Tooling / ~45 minuteslessonGit & CollaborationVersion control is not optional. Every experiment, every model, every lesson you build here gets tracked.Phase 00: Setup & Tooling / ~30 minuteslessonGPU Setup & CloudTraining on CPU is fine for learning. Training for real needs a GPU.Phase 00: Setup & Tooling / ~45 minuteslessonAPIs & KeysEvery AI API works the same way: send a request, get a response. The details change, the pattern doesn't.Phase 00: Setup & Tooling / ~30 minuteslessonJupyter NotebooksNotebooks are the lab bench of AI engineering. You prototype here, then move what works into production.Phase 00: Setup & Tooling / ~30 minuteslessonPython EnvironmentsDependency hell is real. Virtual environments are the cure.Phase 00: Setup & Tooling / ~30 minuteslessonDocker for AIContainers make "works on my machine" a thing of the past.Phase 00: Setup & Tooling / ~60 minuteslessonEditor SetupYour editor is your co-pilot. Configure it once so it stays out of your way and starts pulling its weight.Phase 00: Setup & Tooling / ~20 minuteslessonData ManagementData is the fuel. How you manage it determines how fast you go.Phase 00: Setup & Tooling / ~45 minuteslessonTerminal & ShellThe terminal is where AI engineers live. 
Get comfortable here.Phase 00: Setup & Tooling / ~35 minuteslessonLinux for AIMost AI runs on Linux. You need to know enough to not be stuck.Phase 00: Setup & Tooling / ~30 minuteslessonDebugging and ProfilingThe worst AI bugs don't crash. They train silently on garbage and report a beautiful loss curve.Phase 00: Setup & Tooling / ~60 minuteslessonLinear Algebra IntuitionEvery AI model is just matrix math wearing a fancy hat.Phase 01: Math Foundations / ~60 minuteslessonVectors, Matrices & OperationsEvery neural network is just matrix multiplication with extra steps.Phase 01: Math Foundations / ~60 minuteslessonMatrix TransformationsA matrix is a machine that reshapes space. Learn what it does to every point, and you understand the whole transformation.Phase 01: Math Foundations / ~75 minuteslessonCalculus for Machine LearningDerivatives tell you which way is downhill. That is all a neural network needs to learn.Phase 01: Math Foundations / ~60 minuteslessonChain Rule & Automatic DifferentiationThe chain rule is the engine behind every neural network that learns.Phase 01: Math Foundations / ~90 minuteslessonProbability and DistributionsProbability is the language AI uses to express uncertainty.Phase 01: Math Foundations / ~75 minuteslessonBayes' TheoremProbability is about what you expect. Bayes' theorem is about what you learn.Phase 01: Math Foundations / ~75 minuteslessonOptimizationTraining a neural network is nothing more than finding the bottom of a valley.Phase 01: Math Foundations / ~75 minuteslessonInformation TheoryInformation theory measures surprise. Loss functions are built on it.Phase 01: Math Foundations / ~60 minuteslessonDimensionality ReductionHigh-dimensional data has structure. You find it by looking from the right angle.Phase 01: Math Foundations / ~90 minuteslessonSingular Value DecompositionSVD is the Swiss Army knife of linear algebra. Every matrix has one. 
Every data scientist needs one.Phase 01: Math Foundations / ~120 minuteslessonTensor OperationsTensors are the common language between data and deep learning. Every image, every sentence, every gradient flows through them.Phase 01: Math Foundations / ~90 minuteslessonNumerical StabilityFloating point is a leaky abstraction. It will bite you during training, and you will not see it coming.Phase 01: Math Foundations / ~120 minuteslessonNorms and DistancesYour distance function defines what "similar" means. Choose wrong and everything downstream breaks.Phase 01: Math Foundations / ~90 minuteslessonStatistics for Machine LearningStatistics is how you know if your model actually works or just got lucky.Phase 01: Math Foundations / ~120 minuteslessonSampling MethodsSampling is how AI explores the space of possibilities.Phase 01: Math Foundations / ~120 minuteslessonLinear SystemsSolving Ax = b is the oldest problem in mathematics that still runs your neural network.Phase 01: Math Foundations / ~120 minuteslessonConvex OptimizationConvex problems have one valley. Neural networks have millions. Knowing the difference matters.Phase 01: Math Foundations / ~90 minuteslessonComplex Numbers for AIThe square root of -1 is not imaginary. It is the key to rotations, frequencies, and half of signal processing.Phase 01: Math Foundations / ~60 minuteslessonThe Fourier TransformEvery signal is a sum of sine waves. The Fourier transform tells you which ones.Phase 01: Math Foundations / ~90 minuteslessonGraph Theory for Machine LearningGraphs are the data structure of relationships. If your data has connections, you need graph theory.Phase 01: Math Foundations / ~90 minuteslessonStochastic ProcessesRandomness with structure. 
The math behind random walks, Markov chains, and diffusion models.Phase 01: Math Foundations / ~75 minuteslessonWhat Is Machine LearningMachine learning is teaching computers to find patterns in data instead of writing rules by hand.Phase 02: ML Fundamentals / ~45 minuteslessonLinear RegressionLinear regression draws the best straight line through your data. It is the "hello world" of machine learning.Phase 02: ML Fundamentals / ~90 minuteslessonLogistic RegressionLogistic regression bends a straight line into an S-curve to answer yes-or-no questions with probabilities.Phase 02: ML Fundamentals / ~90 minuteslessonDecision Trees and Random ForestsA decision tree is just a flowchart. But a forest of them is one of the most powerful tools in ML.Phase 02: ML Fundamentals / ~90 minuteslessonSupport Vector MachinesFind the widest street between two classes. That is the entire idea.Phase 02: ML Fundamentals / ~90 minuteslessonK-Nearest Neighbors and DistancesStore everything. Predict by looking at your neighbors. The simplest algorithm that actually works.Phase 02: ML Fundamentals / ~90 minuteslessonUnsupervised LearningNo labels, no teacher. The algorithm finds structure on its own.Phase 02: ML Fundamentals / ~90 minuteslessonFeature Engineering & SelectionA good feature is worth a thousand data points.Phase 02: ML Fundamentals / ~90 minuteslessonModel EvaluationA model is only as good as the way you measure it.Phase 02: ML Fundamentals / ~90 minuteslessonBias-Variance TradeoffEvery model error comes from one of three sources: bias, variance, or noise. You can only control the first two.Phase 02: ML Fundamentals / ~75 minuteslessonEnsemble MethodsA group of weak learners, combined correctly, becomes a strong learner. This is not a metaphor. It is a theorem.Phase 02: ML Fundamentals / ~120 minuteslessonHyperparameter TuningHyperparameters are the knobs you turn before training starts. 
Turning them well is the difference between a mediocre model and a great one.Phase 02: ML Fundamentals / ~90 minuteslessonML PipelinesA model is not a product. A pipeline is. The pipeline is everything from raw data to deployed prediction, and every step must be reproducible.Phase 02: ML Fundamentals / ~120 minuteslessonNaive BayesThe "naive" assumption is wrong, and it works anyway. That's the beauty of it.Phase 02: ML Fundamentals / ~75 minuteslessonTime Series FundamentalsPast performance does predict future results -- if you check for stationarity first.Phase 02: ML Fundamentals / ~90 minuteslessonAnomaly DetectionNormal is easy to define. Abnormal is whatever doesn't fit.Phase 02: ML Fundamentals / ~75 minuteslessonHandling Imbalanced DataWhen 99% of your data is "normal," accuracy is a lie.Phase 02: ML Fundamentals / ~90 minuteslessonFeature SelectionMore features is not better. The right features is better.Phase 02: ML Fundamentals / ~75 minuteslessonThe PerceptronThe perceptron is the atom of neural networks. Split it open and you find weights, a bias, and a decision.Phase 03: Deep Learning Core / ~60 minuteslessonMulti-Layer Networks and Forward PassOne neuron draws a line. Stack them, and you can draw anything.Phase 03: Deep Learning Core / ~90 minuteslessonBackpropagation from ScratchBackpropagation is the algorithm that makes learning possible. Without it, neural networks are just expensive random number generators.Phase 03: Deep Learning Core / ~120 minuteslessonActivation FunctionsWithout nonlinearity, your 100-layer network is a fancy matrix multiply. Activations are the gates that let neural networks think in curves.Phase 03: Deep Learning Core / ~75 minuteslessonLoss FunctionsYour network makes a prediction. The ground truth says otherwise. How wrong is it? That number is the loss. 
Pick the wrong loss function and your model optimizes for the wrong thing entirely.Phase 03: Deep Learning Core / ~75 minuteslessonOptimizersGradient descent tells you which direction to move. It says nothing about how far or how fast. SGD is a compass. Adam is GPS with traffic data.Phase 03: Deep Learning Core / ~75 minuteslessonRegularizationYour model gets 99% on training data and 60% on test data. It memorized instead of learning. Regularization is the tax you impose on complexity to force generalization.Phase 03: Deep Learning Core / ~75 minuteslessonWeight Initialization and Training StabilityInitialize wrong and training never starts. Initialize right and 50 layers train as smoothly as 3.Phase 03: Deep Learning Core / ~90 minuteslessonLearning Rate Schedules and WarmupThe learning rate is the single most important hyperparameter. Not the architecture. Not the dataset size. Not the activation function. The learning rate. If you tune nothing else, tune this.Phase 03: Deep Learning Core / ~90 minuteslessonBuild Your Own Mini FrameworkYou have built neurons, layers, networks, backprop, activations, loss functions, optimizers, regularization, initialization, and LR schedules. All as separate pieces. Now wire them together into a framework. Not PyTorch. Not TensorFlow. Yo...Phase 03: Deep Learning Core / ~120 minuteslessonIntroduction to PyTorchYou built the engine from pistons and crankshafts. Now learn the one everyone actually drives.Phase 03: Deep Learning Core / ~75 minuteslessonIntroduction to JAXPyTorch mutates tensors. TensorFlow builds graphs. JAX compiles pure functions. That last one changes how you think about deep learning.Phase 03: Deep Learning Core / ~90 minuteslessonDebugging Neural NetworksYour network compiled. It ran. It produced a number. The number is wrong and nothing crashed. 
Welcome to the hardest kind of debugging -- the kind where there is no error message.Phase 03: Deep Learning Core / ~90 minuteslessonImage Fundamentals — Pixels, Channels, Color SpacesAn image is a tensor of light samples. Every vision model you will ever use starts from this one fact.Phase 04: Computer Vision / ~45 minuteslessonConvolutions from ScratchA convolution is a tiny dense layer you slide across an image, sharing the same weights at every location.Phase 04: Computer Vision / ~75 minuteslessonCNNs — LeNet to ResNetEvery major CNN of the last thirty years is the same conv–nonlinearity–downsample recipe with one new idea bolted on. Learn the ideas in order.Phase 04: Computer Vision / ~75 minuteslessonImage ClassificationA classifier is a function from pixels to a probability distribution over classes. Everything else is plumbing.Phase 04: Computer Vision / ~75 minuteslessonTransfer Learning & Fine-TuningSomebody else spent a million GPU hours teaching a network what edges, textures, and object parts look like. You should borrow those features before training your own.Phase 04: Computer Vision / ~75 minuteslessonObject Detection — YOLO from ScratchDetection is classification plus regression, run at every position in a feature map, then cleaned up with non-maximum suppression.Phase 04: Computer Vision / ~75 minuteslessonSemantic Segmentation — U-NetSegmentation is classification at every pixel. U-Net makes it work by pairing a downsampling encoder with an upsampling decoder and wiring skip connections between them.Phase 04: Computer Vision / ~75 minuteslessonInstance Segmentation — Mask R-CNNAdd a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.Phase 04: Computer Vision / ~75 minuteslessonImage Generation — GANsA GAN is two neural networks in a fixed game. One draws, one critiques. 
They get better together until the drawings fool the critic.Phase 04: Computer Vision / ~75 minuteslessonImage Generation — Diffusion ModelsA diffusion model learns to denoise. Train it to remove a tiny bit of noise from a noisy image, repeat that backwards a thousand times, and you have an image generator.Phase 04: Computer Vision / ~75 minuteslessonStable Diffusion — Architecture & Fine-TuningStable Diffusion is a DDPM that runs in the latent space of a pretrained VAE, conditioned on text via cross-attention, sampled with a fast deterministic ODE solver, and steered by classifier-free guidance.Phase 04: Computer Vision / ~75 minuteslessonVideo Understanding — Temporal ModelingA video is a sequence of images plus the physics that connects them. Every video model either treats time as an extra axis (3D conv), a sequence to attend over (transformer), or a feature to extract once and pool (2D+pool).Phase 04: Computer Vision / ~45 minuteslesson3D Vision — Point Clouds & NeRFs3D vision comes in two flavours. Point clouds are the sensor's raw output. NeRFs are the learned volumetric field. Both answer "what is where in space."Phase 04: Computer Vision / ~45 minuteslessonVision Transformers (ViT)Cut the image into patches, treat each patch as a word, run a standard transformer. Don't look back.Phase 04: Computer Vision / ~45 minuteslessonReal-Time Vision — Edge DeploymentEdge inference is the discipline of getting a 90-accuracy model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds of latency.Phase 04: Computer Vision / ~75 minuteslessonBuild a Complete Vision Pipeline — CapstoneA production vision system is a chain of models and rules stitched with data contracts. The pieces are already in this phase; the capstone wires them together end-to-end.Phase 04: Computer Vision / ~120 minuteslessonSelf-Supervised Vision — SimCLR, DINO, MAELabels are the bottleneck of supervised vision. 
Self-supervised pretraining removes them: learn visual features from 100M unlabelled images, fine-tune on 10k labelled ones.Phase 04: Computer Vision / ~75 minuteslessonOpen-Vocabulary Vision — CLIPTrain an image encoder and a text encoder together so that matching (image, caption) pairs land at the same point in a shared space. That is the whole trick.Phase 04: Computer Vision / ~45 minuteslessonOCR & Document UnderstandingOCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.Phase 04: Computer Vision / ~45 minuteslessonImage Retrieval & Metric LearningA retrieval system ranks candidates by a distance in embedding space. Metric learning is the discipline of shaping that space so the distances mean what you want.Phase 04: Computer Vision / ~45 minuteslessonKeypoint Detection & Pose EstimationA pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.Phase 04: Computer Vision / ~45 minuteslesson3D Gaussian Splatting from ScratchA scene is a cloud of millions of 3D Gaussians. Each one has a position, orientation, scale, opacity, and a colour that depends on viewing direction. Rasterise them, backprop through the rasterisation, done.Phase 04: Computer Vision / ~90 minuteslessonDiffusion Transformers & Rectified FlowThe U-Net is not the secret of diffusion. Replace it with a transformer, swap the noise schedule for a straight-line flow, and suddenly you have SD3, FLUX, and every 2026 text-to-image model.Phase 04: Computer Vision / ~75 minuteslessonSAM 3 & Open-Vocabulary SegmentationGive a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.Phase 04: Computer Vision / ~60 minuteslessonVision-Language Models — The ViT-MLP-LLM PatternA vision encoder converts an image into tokens. An MLP projector maps those tokens into the LLM's embedding space. 
A language model does the rest. That pattern — ViT-MLP-LLM — is every production VLM in 2026.Phase 04: Computer Vision / ~75 minuteslessonMonocular Depth & Geometry EstimationA depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a frozen ViT encoder plus a lightweight head gets within a few pe...Phase 04: Computer Vision / ~60 minuteslessonMulti-Object Tracking & Video MemoryTracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.Phase 04: Computer Vision / ~60 minuteslessonWorld Models & Video DiffusionA video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.Phase 04: Computer Vision / ~75 minuteslessonText Processing — Tokenization, Stemming, LemmatizationLanguage is continuous. Models are discrete. Preprocessing is the bridge.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonBag of Words, TF-IDF, and Text RepresentationCount first, think later. TF-IDF still beats embeddings on well-defined tasks in 2026.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonWord Embeddings — Word2Vec from ScratchA word is the company it keeps. Train a shallow net on that idea and geometry falls out.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonGloVe, FastText, and Subword EmbeddingsWord2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonSentiment AnalysisThe canonical NLP task. Most of what you need to know about classical text classification shows up here.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonNamed Entity RecognitionPull the names out. 
Sounds easy until you deal with ambiguous boundaries, nested entities, and domain jargon.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonPOS Tagging and Syntactic ParsingGrammar was unfashionable for a while. Then every LLM pipeline needed to validate structured extraction, and it came back.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonCNNs and RNNs for TextConvolutions learn n-grams. Recurrences remember. Both are superseded by attention. Both still matter on constrained hardware.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonSequence-to-Sequence ModelsTwo RNNs pretending to be a translator. The bottleneck they hit is the reason attention exists.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonAttention Mechanism — The BreakthroughThe decoder stops squinting at a compressed summary and starts looking at the whole source. Everything after this is attention plus engineering.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonMachine TranslationTranslation is the task that paid for NLP research for thirty years and keeps paying now.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonText SummarizationExtractive systems tell you what the document said. Abstractive systems tell you what the author meant. Different tasks, different pitfalls.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonQuestion Answering SystemsThree systems shaped modern QA. Extractive found spans. Retrieval-augmented grounded them in documents. Generative produced answers. Every modern AI assistant is a mix of the three.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonInformation Retrieval and SearchBM25 is precise but brittle. Dense casts a wide net but misses keywords. Hybrid is the 2026 default. Everything else is tuning.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonTopic Modeling — LDA and BERTopicLDA: documents are mixtures of topics, topics are distributions over words. 
BERTopic: documents cluster in embedding space, clusters are topics. Same goal, different decompositions.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonText Generation Before Transformers — N-gram Language ModelsIf a word is surprising, the model is bad. Perplexity makes surprise a number. Smoothing keeps it finite.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonChatbots — Rule-Based to Neural to LLM AgentsELIZA replied with pattern matches. DialogFlow mapped intents. GPT answered from weights. Claude runs tools and verifies. Each era solved the previous one's worst failure.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonMultilingual NLPOne model, 100+ languages, zero training data for most of them. Cross-lingual transfer is the practical miracle of the 2020s.Phase 05: NLP — Foundations to Advanced / ~45 minuteslessonSubword Tokenization — BPE, WordPiece, Unigram, SentencePieceWord tokenizers choke on unseen words. Character tokenizers blow up sequence length. Subword tokenizers split the difference. Every modern LLM ships on one.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonStructured Outputs & Constrained DecodingAsk an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonNatural Language Inference — Textual Entailment"t entails h" means a human reading t would conclude h is true. NLI is the task of predicting entailment / contradiction / neutral. Boring on the surface, load-bearing in production.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonEmbedding Models — The 2026 Deep DiveWord2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. 
Pick wrong and your RAG retrieves the wrong thing.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonChunking Strategies for RAGChunking configuration influences retrieval quality as much as the choice of embedding model (Vectara NAACL 2025). Get chunking wrong and no amount of reranking saves you.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonCoreference Resolution"She called him. He did not answer. The doctor was at lunch." Three references to two people and nobody is named. Coreference resolution figures out who is who.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonEntity Linking & DisambiguationNER found "Paris." Entity linking decides: Paris, France? Paris Hilton? Paris, Texas? Paris (the Trojan prince)? Without linking, your knowledge graph stays ambiguous.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonRelation Extraction & Knowledge Graph ConstructionNER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonLLM Evaluation — RAGAS, DeepEval, G-EvalExact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonLong-Context Evaluation — NIAH, RULER, LongBench, MRCRGemini 3 Pro advertises 10M tokens of context. At 1M tokens, 8-needle MRCR drops to 26.3%. Advertised ≠ usable. Long-context evaluation tells you the actual capacity of the model you are shipping on.Phase 05: NLP — Foundations to Advanced / ~60 minuteslessonDialogue State Tracking"I want a cheap restaurant in the north... actually make it moderate... and add Italian." Three turns, three state updates. 
DST keeps the slot-value dict in sync so the booking works.Phase 05: NLP — Foundations to Advanced / ~75 minuteslessonAudio Fundamentals — Waveforms, Sampling, Fourier TransformWaveforms are the raw signal. Spectrograms are the representation. Mel features are the ML-friendly form. Every modern ASR and TTS pipeline walks this ladder, and the first rung is understanding sampling and Fourier.Phase 06: Speech & Audio / ~45 minuteslessonSpectrograms, Mel Scale & Audio FeaturesNeural nets do not consume raw waveforms well. They consume spectrograms. They consume mel spectrograms even better. Every ASR, TTS, and audio classifier in 2026 lives or dies by this single preprocessing choice.Phase 06: Speech & Audio / ~45 minuteslessonAudio Classification — From k-NN on MFCCs to AST and BEATsEverything from "dog barking vs siren" to "which language is this" is audio classification. The features are mels. The architecture moves each decade. The evaluation stays AUC, F1, and per-class recall.Phase 06: Speech & Audio / ~75 minuteslessonSpeech Recognition (ASR) — CTC, RNN-T, AttentionSpeech recognition is audio classification at every timestep, glued together by a sequence model that knows English and silence. CTC, RNN-T, and attention are the three ways to do it. Pick one and understand why.Phase 06: Speech & Audio / ~45 minuteslessonWhisper — Architecture & Fine-TuningWhisper is a 30-second-window transformer encoder-decoder, trained on 680k hours of multilingual weakly-supervised audio-text pairs. One architecture, multiple tasks, robust across 99 languages. The 2026 reference ASR.Phase 06: Speech & Audio / ~75 minuteslessonSpeaker Recognition & VerificationASR asks "what did they say?" Speaker recognition asks "who said it?" 
The math looks the same — embeddings plus cosine — but every production decision hinges on a single EER number.Phase 06: Speech & Audio / ~45 minuteslessonText-to-Speech (TTS) — From Tacotron to F5 and KokoroASR inverts speech to text; TTS inverts text to speech. The 2026 stack is three parts: text → tokens, tokens → mel, mel → waveform. Each part has a default model that fits in a laptop.Phase 06: Speech & Audio / ~75 minuteslessonVoice Cloning & Voice ConversionVoice cloning reads your text in someone else's voice. Voice conversion rewrites your voice into someone else's while preserving what you said. Both hang on the same decomposition: separate speaker identity from content.Phase 06: Speech & Audio / ~75 minuteslessonMusic Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake2026 music generation: Suno v5 and Udio v4 dominate commercial; MusicGen, Stable Audio Open, and ACE-Step lead open-source. The technical problem is mostly solved. The legal problem (Warner Music $500M settlement, UMG settlement) reshaped...Phase 06: Speech & Audio / ~75 minuteslessonAudio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio2026 audio-language models reason over speech + environmental sound + music. Qwen2.5-Omni-7B matches GPT-4o Audio on MMAU-Pro. Audio Flamingo Next beats Gemini 2.5 Pro on LongAudioBench. The gap between open and closed is essentially close...Phase 06: Speech & Audio / ~45 minuteslessonReal-Time Audio ProcessingBatch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.Phase 06: Speech & Audio / ~75 minuteslessonBuild a Voice Assistant Pipeline — The Phase 6 CapstoneEverything from lessons 01-11, stitched together. Build a voice assistant that listens, reasons, and talks back. 
In 2026 that is a solved engineering problem, not a research problem — but the integration details decide whether it ships.Phase 06: Speech & Audio / ~120 minuteslessonNeural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split2026 audio generation is almost all tokens. EnCodec, SNAC, Mimi, and DAC turn continuous waveforms into discrete sequences that a transformer can predict. The semantic-vs-acoustic token split — first-codebook as semantic, rest as acoustic...Phase 06: Speech & Audio / ~60 minuteslessonVoice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush TrickEvery voice agent lives or dies on two decisions: is the user speaking now, and are they done? VAD answers the first. Turn-detection (VAD + silence-hangover + semantic endpoint model) answers the second. Get either wrong and your assistant...Phase 06: Speech & Audio / ~45 minuteslessonStreaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex...Phase 06: Speech & Audio / ~75 minuteslessonVoice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerifyVoice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (AudioSeal) that survives compression and editing. Ship both or do...Phase 06: Speech & Audio / ~75 minuteslessonAudio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open LeaderboardsYou cannot ship what you cannot measure. 
This lesson names the 2026 metrics for every audio task: ASR (WER, CER, RTFx), TTS (MOS, UTMOS, SECS, WER-on-ASR-round-trip), audio-language (MMAU, LongAudioBench), music (FAD, CLAP), and speaker (E...
Phase 06: Speech & Audio / ~60 minutes

lesson
Why Transformers — The Problems with RNNs
RNNs process tokens one at a time. Transformers process all tokens at once. That single architectural bet changed every scaling curve in deep learning after 2017.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
Self-Attention from Scratch
Attention is a lookup table where every word asks "who matters to me?" - and learns the answer.
Phase 07: Transformers Deep Dive / ~90 minutes

lesson
Multi-Head Attention
One attention head learns one relation at a time. Eight heads learn eight. Heads are free. Take more of them.
Phase 07: Transformers Deep Dive / ~75 minutes

lesson
Positional Encoding — Sinusoidal, RoPE, ALiBi
Attention is permutation-invariant. "The cat sat on the mat" and "mat the on sat cat the" produce the same output without positional signal. Three algorithms fix it — each with a different bet on what "position" means.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
The Full Transformer — Encoder + Decoder
Attention is the star. Everything else — residuals, normalization, feed-forward, cross-attention — is the scaffolding that lets you stack it deep.
Phase 07: Transformers Deep Dive / ~75 minutes

lesson
BERT — Masked Language Modeling
GPT predicts the next word. BERT predicts a missing word. One sentence of difference — and half a decade of everything embedding-shaped.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
GPT — Causal Language Modeling
BERT sees both sides. GPT sees only the past. The triangle mask is the most consequential single line of code in modern AI.
Phase 07: Transformers Deep Dive / ~75 minutes

lesson
T5, BART — Encoder-Decoder Models
Encoders understand. Decoders generate.
Put them back together and you get a model built for input → output tasks: translate, summarize, rewrite, transcribe.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
Vision Transformers (ViT)
An image is a grid of patches. A sentence is a grid of tokens. The same transformer eats both.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
Audio Transformers — Whisper Architecture
Audio is an image of frequency over time. Whisper is a ViT that eats mel spectrograms and speaks back.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
Mixture of Experts (MoE)
A dense 70B transformer activates every parameter for every token. A 671B MoE activates only 37B per token and beats it on every benchmark. Sparsity is the most important scaling idea of the decade.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
KV Cache, Flash Attention & Inference Optimization
Training is parallel and FLOP-bound. Inference is serial and memory-bound. Different bottleneck, different tricks.
Phase 07: Transformers Deep Dive / ~75 minutes

lesson
Scaling Laws
The 2020 Kaplan paper said: bigger model, lower loss. The 2022 Hoffmann paper said: you were under-training. Compute goes into two buckets — parameters and tokens — and the split is not obvious.
Phase 07: Transformers Deep Dive / ~45 minutes

lesson
Build a Transformer from Scratch — The Capstone
Thirteen lessons. One model. No shortcuts.
Phase 07: Transformers Deep Dive / ~120 minutes

lesson
Attention Variants — Sliding Window, Sparse, Differential
Full attention is a circle. Every token sees every token, and memory pays the price. Four variants bend the shape of the circle and recover half the cost.
Phase 07: Transformers Deep Dive / ~60 minutes

lesson
Speculative Decoding — Draft, Verify, Repeat
Autoregressive decoding is serial. Each token waits for the previous one. Speculative decoding breaks the chain: a cheap model drafts N tokens, the expensive model verifies all N in one forward pass.
When the draft is right you paid one bi...
Phase 07: Transformers Deep Dive / ~60 minutes

lesson
Generative Models — Taxonomy & History
Every image model, text model, video model, and 3D model fits in one of five buckets. Pick the wrong bucket and you will fight the math for weeks. Pick the right one and the field's last twelve years of progress stacks cleanly in your head.
Phase 08: Generative AI / ~45 minutes

lesson
Autoencoders & Variational Autoencoders (VAE)
A plain autoencoder compresses then reconstructs. It memorizes. It does not generate. Add one trick — force the code to look Gaussian — and you get a sampler. That single trick, the reparameterization of z = μ + σ·ε, is why every latent-di...
Phase 08: Generative AI / ~75 minutes

lesson
GANs — Generator vs Discriminator
Goodfellow's trick in 2014 was to skip density entirely. Two networks. One makes fakes. One catches them. They fight until the fakes are indistinguishable from real. It shouldn't work. It often doesn't. When it does, the samples are still...
Phase 08: Generative AI / ~75 minutes

lesson
Conditional GANs & Pix2Pix
The first big unlock of 2014-2017 was controlling what a GAN makes. Attach a label, or an image, or a sentence. Pix2Pix did the image version and it still beats every generic text-to-image model on narrow image-to-image tasks.
Phase 08: Generative AI / ~75 minutes

lesson
StyleGAN
Most generators stir z into every layer at the same time. StyleGAN split it apart: first map z to an intermediate w, then inject w at every resolution level through AdaIN. That single change untangled the latent space and made photorealist...
Phase 08: Generative AI / ~45 minutes

lesson
Diffusion Models — DDPM from Scratch
Ho, Jain, Abbeel (2020) gave the field a recipe it could not quit. Destroy the data with noise over a thousand small steps. Train one neural net to predict the noise. Reverse the process at inference.
Today every mainstream image, video, 3...
Phase 08: Generative AI / ~75 minutes

lesson
Latent Diffusion & Stable Diffusion
Pixel-space diffusion on 512×512 images is a computational war crime. Rombach et al. (2022) noticed that you do not need all 786k dimensions to generate an image — you need enough to capture semantic structure, and a separate decoder for t...
Phase 08: Generative AI / ~75 minutes

lesson
ControlNet, LoRA & Conditioning
Text alone is a clumsy control signal. ControlNet lets you clone a pretrained diffusion model and steer it with a depth map, pose skeleton, scribble, or edge image. LoRA lets you fine-tune a 2B-parameter model by training 10 million parame...
Phase 08: Generative AI / ~75 minutes

lesson
Inpainting, Outpainting & Image Editing
Text-to-image makes new things. Inpainting fixes old ones. In production, 70% of billable image work is editing — swap a background, remove a logo, extend the canvas, regenerate a hand. Inpainting is where diffusion earns its keep.
Phase 08: Generative AI / ~75 minutes

lesson
Video Generation
An image is a 2-D tensor. A video is a 3-D one. The theory is the same; the compute is 10-100x harder. OpenAI's Sora (Feb 2024) proved it was possible. By 2026 Veo 2, Kling 1.5, Runway Gen-3, Pika 2.0, and WAN 2.2 ship production video fro...
Phase 08: Generative AI / ~45 minutes

lesson
Audio Generation
Audio is a 1-D signal at 16-48 kHz. A five-second clip is 80-240k samples. No transformer attends to that sequence directly. The solution for every production audio model in 2026 is the same: a neural codec (Encodec, SoundStream, DAC) comp...
Phase 08: Generative AI / ~45 minutes

lesson
3D Generation
3D is the modality where 2D-to-3D leverage is strongest. The 2023 breakthrough was 3D Gaussian Splatting.
The 2024-2026 generative push layers multi-view diffusion + 3D reconstruction on top to produce objects and scenes from a single prom...
Phase 08: Generative AI / ~45 minutes

lesson
Flow Matching & Rectified Flows
Diffusion models take 20-50 sampling steps because they walk a curved path from noise to data. Flow matching (Lipman et al., 2023) and rectified flow (Liu et al., 2022) trained straight paths. Straighter paths mean fewer steps mean faster...
Phase 08: Generative AI / ~45 minutes

lesson
Evaluation — FID, CLIP Score, Human Preference
Every generative model leaderboard cites FID, CLIP score, and a win rate from a human-preference arena. Each number has a failure mode a determined researcher can game. If you do not know the failure modes, you cannot tell a real improveme...
Phase 08: Generative AI / ~45 minutes

lesson
Visual Autoregressive Modeling (VAR): Next-Scale Prediction
Diffusion models sample iteratively in time (denoising steps). VAR samples iteratively in scale — it predicts a 1x1 token, then 2x2, then 4x4, up to the final resolution, each scale conditioning on the previous. The 2024 paper showed VAR m...
Phase 08: Generative AI / ~90 minutes

lesson
MDPs, States, Actions & Rewards
A Markov Decision Process is five things: states, actions, transitions, rewards, a discount. Everything in RL — Q-learning, PPO, DPO, GRPO — optimizes over this shape. Learn it once, read the rest of reinforcement learning for free.
Phase 09: Reinforcement Learning / ~45 minutes

lesson
Dynamic Programming — Policy Iteration & Value Iteration
Dynamic programming is RL with cheating. You already know the transition and reward functions; you just iterate the Bellman equation until V or π stops moving. It is the benchmark every sampling-based method tries to approach.
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Monte Carlo Methods — Learning from Complete Episodes
Dynamic programming needs a model. Monte Carlo needs nothing but episodes. Run the policy, watch the returns, average them.
The simplest idea in RL — and the one that unlocks everything downstream.
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Temporal Difference — Q-Learning & SARSA
Monte Carlo waits until the episode ends. TD updates after every step by bootstrapping the next value estimate. Q-learning is off-policy and optimistic; SARSA is on-policy and cautious. Both are one line of code. Both underpin every deep-R...
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Deep Q-Networks (DQN)
2013: Mnih trained one Q-learning network on raw pixels, beat every classical RL agent on seven Atari games. 2015: extended to 49 games, published in Nature, sparked the deep-RL era. DQN is Q-learning plus three tricks that make function a...
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Policy Gradient — REINFORCE from Scratch
Stop estimating value. Parameterize the policy directly, compute the gradient of expected return, step uphill. Williams (1992) wrote it in one theorem. It is why PPO, GRPO, and every LLM RL loop exist.
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Actor-Critic — A2C and A3C
REINFORCE is noisy. Add a critic that learns V̂(s), subtract it from the return, and you get an advantage that has the same expectation but far lower variance. That is actor-critic. A2C runs it synchronously; A3C runs it across threads. Bo...
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Proximal Policy Optimization (PPO)
A2C throws away each rollout after one update. PPO wraps the policy gradient in a clipped importance ratio so you can do 10+ epochs on the same data without the policy exploding. Schulman et al. (2017). Still the default policy-gradient al...
Phase 09: Reinforcement Learning / ~75 minutes

lesson
Reward Modeling & RLHF
Humans cannot write a reward function for "good assistant response," but they can compare two responses and pick the better one. Fit a reward model to those comparisons, then RL the language model against it. Christiano 2017.
InstructGPT 2...
Phase 09: Reinforcement Learning / ~45 minutes

lesson
Multi-Agent RL
Single-agent RL assumes the environment is stationary. Put two learning agents in the same world and that assumption breaks: each agent is part of the other's environment, and both are changing. Multi-agent RL is the set of tricks to make...
Phase 09: Reinforcement Learning / ~45 minutes

lesson
Sim-to-Real Transfer
A policy trained in a simulator that fails on hardware is a policy that memorized the simulator. Domain randomization, domain adaptation, and system identification are the three tools to make learned controllers cross the reality gap.
Phase 09: Reinforcement Learning / ~45 minutes

lesson
RL for Games — AlphaZero, MuZero, and the LLM-Reasoning Era
1992: TD-Gammon beat human champions at backgammon with pure TD. 2016: AlphaGo beat Lee Sedol. 2017: AlphaZero dominated chess, shogi, and Go from scratch. 2024: DeepSeek-R1 proved the same recipe, with GRPO replacing PPO, works on reasoni...
Phase 09: Reinforcement Learning / ~120 minutes

lesson
Tokenizers: BPE, WordPiece, SentencePiece
Your LLM does not read English. It reads integers. The tokenizer decides whether those integers carry meaning or waste it.
Phase 10: LLMs from Scratch / ~90 minutes

lesson
Building a Tokenizer from Scratch
Lesson 01 gave you a toy. This lesson gives you a weapon.
Phase 10: LLMs from Scratch / ~90 minutes

lesson
Data Pipelines for Pre-Training
The model is a mirror. It reflects whatever data you feed it. Feed it garbage, it reflects garbage with perfect fluency.
Phase 10: LLMs from Scratch / ~90 minutes

lesson
Pre-Training a Mini GPT (124M Parameters)
GPT-2 Small has 124 million parameters. That's 12 transformer layers, 12 attention heads, and 768-dimensional embeddings. You can train it from scratch on a single GPU in a few hours. Most people never do this. They use pre-trained checkpo...
Phase 10: LLMs from Scratch / ~120 minutes

lesson
Scaling: Distributed Training, FSDP, DeepSpeed
Your 124M model trained on one GPU.
Now try 7 billion parameters. The model doesn't fit in memory. The data takes weeks on a single machine. Distributed training isn't optional at scale. It's the only path forward.
Phase 10: LLMs from Scratch / ~120 minutes

lesson
Instruction Tuning (SFT)
A base model predicts the next token. That's it. It doesn't follow instructions, answer questions, or refuse harmful requests. SFT is the bridge between a token predictor and a useful assistant. Every model you've ever talked to -- Claude,...
Phase 10: LLMs from Scratch / ~90 minutes

lesson
RLHF: Reward Model + PPO
SFT teaches the model to follow instructions. But it doesn't teach the model which response is BETTER. Two grammatically correct, factually accurate answers can differ enormously in helpfulness. RLHF is how you encode human judgment into t...
Phase 10: LLMs from Scratch / ~90 minutes

lesson
DPO: Direct Preference Optimization
RLHF works. It also requires training three models (SFT, reward model, policy), managing PPO's instability, and tuning a KL penalty. DPO asks: what if you could skip all of that? DPO directly optimizes the language model on preference pair...
Phase 10: LLMs from Scratch / ~90 minutes

lesson
Constitutional AI and Self-Improvement
RLHF needs humans in the loop. Constitutional AI replaces most of them with the model itself. Write a list of principles, have the model critique its own outputs against those principles, and train on the critiques. DeepSeek-R1 pushed this...
Phase 10: LLMs from Scratch / ~45 minutes

lesson
Evaluation: Benchmarks, Evals, LM Harness
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count the number of R's in "strawberry." The only eval that matters i...
Phase 10: LLMs from Scratch / ~90 minutes

lesson
Quantization: Making Models Fit
A 70B model in FP16 needs 140GB. Two A100s just for weights. Quantize to FP8: one 80GB GPU.
INT4: a MacBook.
Phase 10: LLMs from Scratch / ~120 minutes

lesson
Inference Optimization
Two phases define LLM inference. Prefill processes your prompt in parallel -- compute-bound. Decode generates tokens one at a time -- memory-bound. Every optimization targets one or both.
Phase 10: LLMs from Scratch / ~120 minutes

lesson
Building a Complete LLM Pipeline
Everything from Lessons 01 to 12 is one stage of one pipeline. This lesson is the scaffold that turns those stages into a single end-to-end run: tokenize, pre-train, scale, SFT, align, evaluate, quantize, serve. You will not train a 70B mo...
Phase 10: LLMs from Scratch / ~120 minutes

lesson
Open Models: Architecture Walkthroughs
You built a GPT-2 Small from scratch in Lesson 04. Frontier open models in 2026 are the same family with five or six concrete changes. RMSNorm instead of LayerNorm. SwiGLU instead of GELU. RoPE instead of learned positions. GQA or MLA inst...
Phase 10: LLMs from Scratch / ~45 minutes

lesson
Speculative Decoding and EAGLE-3
Phase 7 · Lesson 16 proved the math: the Leviathan rejection rule preserves the verifier's distribution exactly. This lesson is the training-stack view of 2026 production speculative decoding. EAGLE-3 turned the draft model from a cheap ap...
Phase 10: LLMs from Scratch / ~75 minutes

lesson
Differential Attention (V2)
Softmax attention spreads a small amount of probability over every non-matching token. Over 100k tokens that noise adds up and drowns the signal. Differential Transformer (Ye et al., ICLR 2025) fixes it by computing attention as the differ...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
Native Sparse Attention (DeepSeek NSA)
At 64k tokens, attention eats 70-80% of decode latency. Every open-model lab has a plan to fix it.
DeepSeek's NSA (ACL 2025 best paper) is the one that stuck: three parallel attention branches — compressed coarse-grained tokens, selectivel...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
Multi-Token Prediction (MTP)
Every autoregressive LLM from GPT-2 to Llama 3 trains on one loss per position: predict the next token. DeepSeek-V3 added a second loss per position: predict the token after that. The extra 14B of parameters (on a 671B model) got distilled...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
DualPipe Parallelism
DeepSeek-V3 was trained on 2,048 H800 GPUs with MoE experts scattered across nodes. Cross-node expert all-to-all communication cost 1 GPU-hour of comm for every 1 GPU-hour of compute. GPUs were idle half the time. DualPipe (DeepSeek, Dec 2...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
DeepSeek-V3 Architecture Walkthrough
Phase 10 · Lesson 14 named the six architectural knobs every open model turns. DeepSeek-V3 (December 2024, 671B parameters total, 37B active) turns all six and adds four more: Multi-Head Latent Attention, auxiliary-loss-free load balancing...
Phase 10: LLMs from Scratch / ~75 minutes

lesson
Jamba — Hybrid SSM-Transformer
State space models (SSMs) and transformers want different things. Transformers buy quality via attention at quadratic cost. SSMs buy linear-time inference and constant memory via a recurrence but lag quality. AI21's Jamba (March 2024) and...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
Async and Hogwild! Inference
Speculative decoding (Phase 10 · 15) parallelizes tokens within one sequence. Multi-agent frameworks parallelize across whole sequences but force explicit coordination (voting, sub-task splitting). Hogwild! Inference (Rodionov et al., arXi...
Phase 10: LLMs from Scratch / ~60 minutes

lesson
Speculative Decoding and EAGLE
A frontier LLM generating one token requires a full forward pass over billions of parameters.
That forward pass is massively over-provisioned: most of the time a much smaller model can guess the next 3-5 tokens correctly, and the big model...
Phase 10: LLMs from Scratch / ~75 minutes

lesson
Gradient Checkpointing and Activation Recomputation
Backprop keeps every intermediate activation. At 70B parameters and 128K context that is 3 TB of activations per rank. Checkpointing trades FLOPs for memory: recompute instead of save. The question is which segments to drop, and the answer...
Phase 10: LLMs from Scratch / ~70 minutes

lesson
Prompt Engineering: Techniques & Patterns
Most people write prompts like they are texting a friend. Then they wonder why a 200-billion parameter model gives mediocre answers. Prompt engineering is not about tricks. It is about understanding that every token you send is an instruct...
Phase 11: LLM Engineering / ~90 minutes

lesson
Few-Shot, Chain-of-Thought, Tree-of-Thought
Telling a model what to do is prompting. Showing it how to think is engineering. The gap between 78% and 91% accuracy on the same model, same task, same data is not a better model. It is a better reasoning strategy.
Phase 11: LLM Engineering / ~45 minutes

lesson
Structured Outputs: JSON, Schema Validation, Constrained Decoding
Your LLM returns a string. Your application needs JSON. That gap has crashed more production systems than any model hallucination. Structured output is the bridge between natural language and typed data. Get it right and your LLM becomes a...
Phase 11: LLM Engineering / ~90 minutes

lesson
Embeddings & Vector Representations
Text is discrete. Math is continuous. Every time you ask an LLM to find "similar" documents, compare meanings, or search beyond keywords, you're relying on a bridge between these two worlds. That bridge is an embedding. If you don't unders...
Phase 11: LLM Engineering / ~75 minutes

lesson
Context Engineering: Windows, Budgets, Memory, and Retrieval
Prompt engineering is a subset. Context engineering is the whole game.
A prompt is a string you type. Context is everything that goes into the model's window: system instructions, retrieved documents, tool definitions, conversation history...
Phase 11: LLM Engineering / ~90 minutes

lesson
RAG (Retrieval-Augmented Generation)
Your LLM knows everything up to its training cutoff. It knows nothing about your company's docs, your codebase, or last week's meeting notes. RAG solves this by retrieving relevant documents and stuffing them into the prompt. It's the most...
Phase 11: LLM Engineering / ~90 minutes

lesson
Advanced RAG (Chunking, Reranking, Hybrid Search)
Basic RAG retrieves the top-k most similar chunks. That works for simple questions. It falls apart for multi-hop reasoning, ambiguous queries, and large corpora. Advanced RAG is the difference between a demo that works on 10 documents and...
Phase 11: LLM Engineering / ~90 minutes

lesson
Fine-Tuning with LoRA & QLoRA
Full fine-tuning a 7B model requires 56GB of VRAM. You don't have that. Neither do most companies. LoRA lets you fine-tune the same model in 6GB by training less than 1% of the parameters. This isn't a compromise -- it matches full fine-tu...
Phase 11: LLM Engineering / ~75 minutes

lesson
Function Calling & Tool Use
LLMs cannot do anything. They generate text. That is the entire capability. They cannot check the weather, query a database, send an email, run code, or read a file. Every "AI agent" you have ever seen is an LLM generating JSON that says w...
Phase 11: LLM Engineering / ~75 minutes

lesson
Evaluation & Testing LLM Applications
You would never deploy a web app without tests. You would never ship a database migration without a rollback plan. But right now, most teams ship LLM applications by reading 10 outputs and saying "yeah, looks good." That is not evaluation....
Phase 11: LLM Engineering / ~45 minutes

lesson
Caching, Rate Limiting & Cost Optimization
Most AI startups do not die from bad models. They die from bad unit economics. A single GPT-4o call costs fractions of a cent.
Ten thousand users making ten calls per day costs $250 in input tokens alone -- before you charge a single dolla...
Phase 11: LLM Engineering / ~45 minutes

lesson
Guardrails, Safety & Content Filtering
Your LLM application will be attacked. Not might. Will. The first prompt injection attempt against your production system will come within 48 hours of launch. The question is not whether someone will try "ignore previous instructions and r...
Phase 11: LLM Engineering / ~45 minutes

lesson
Building a Production LLM Application
You have built prompts, embeddings, RAG pipelines, function calling, caching layers, and guardrails. Separately. In isolation. Like practicing guitar scales without ever playing a song. This lesson is the song. You will wire every componen...
Phase 11: LLM Engineering / ~120 minutes

lesson
Model Context Protocol (MCP)
Every LLM app built before 2025 invented its own tool schema. Then Anthropic shipped MCP, Claude adopted it, OpenAI adopted it, and by 2026 it is the default wire format for connecting any LLM to any tool, data source, or agent. Write one...
Phase 11: LLM Engineering / ~75 minutes

lesson
Prompt Caching and Context Caching
Your system prompt is 4,000 tokens. Your RAG context is 20,000 tokens. You send both with every request. You also pay for both — every time. Prompt caching lets the provider keep that prefix warm on their side and bill you 10% of the norma...
Phase 11: LLM Engineering / ~60 minutes

lesson
LangGraph — State Machines for Agents
A ReAct loop written by hand is a while True. A ReAct loop written in LangGraph is a graph you can checkpoint, interrupt, branch, and time-travel through. The agent hasn't changed. The harness around it has.
Phase 11: LLM Engineering / ~75 minutes

lesson
Agent Framework Tradeoffs — LangGraph vs CrewAI vs AutoGen vs Agno
Every framework sells the same demo (research agent builds a report) and hides the same bug (state schema fights with the orchestration layer).
Pick the framework whose abstractions match the shape of your problem; everything else is glue...
Phase 11: LLM Engineering / ~45 minutes

lesson
Vision Transformers and the Patch-Token Primitive
Before anything multimodal, an image has to become a sequence of tokens a transformer can eat. The 2020 ViT paper answered this with 16x16 pixel patches, a linear projection, and a position embedding. Five years later every 2026 frontier m...
Phase 12: Multimodal AI / ~120 minutes

lesson
CLIP and Contrastive Vision-Language Pretraining
OpenAI's CLIP (2021) proved a single idea big enough to power the next five years: align an image encoder and a text encoder in the same vector space using only noisy web image-caption pairs and a contrastive loss. Zero supervised labels....
Phase 12: Multimodal AI / ~180 minutes

lesson
From CLIP to BLIP-2 — Q-Former as Modality Bridge
CLIP aligns image and text but cannot generate captions, answer questions, or hold a conversation. BLIP-2 (Salesforce, 2023) solved that with a small trainable bridge: 32 learnable query vectors attend over a frozen ViT's features via cros...
Phase 12: Multimodal AI / ~180 minutes

lesson
Flamingo and Gated Cross-Attention for Few-Shot VLMs
DeepMind's Flamingo (2022) did two things before anyone else. It showed a single model could process arbitrarily interleaved sequences of images, videos, and text. And it showed VLMs could learn in-context — give a few-shot prompt with thr...
Phase 12: Multimodal AI / ~120 minutes

lesson
LLaVA and Visual Instruction Tuning
LLaVA (April 2023) is the most copied multimodal architecture on the planet. It replaced BLIP-2's Q-Former with a 2-layer MLP, replaced Flamingo's gated cross-attention with naive token concatenation, and trained on 158k visual-instruction...
Phase 12: Multimodal AI / ~180 minutes

lesson
Any-Resolution Vision: Patch-n'-Pack and NaFlex
Real images are not 224x224 squares. A receipt is 9:16, a chart is 16:9, a medical scan might be 4096x4096, a mobile screenshot is 9:19.5.
The pre-2024 VLM answer — resize everything to a fixed square — threw away the signal that makes OCR...
Phase 12: Multimodal AI / ~120 minutes

lesson
Open-Weight VLM Recipes: What Actually Matters
The 2024-2026 open-weight VLM literature is a forest of ablation tables. Apple's MM1 tested 13 combinations of image encoder, connector, and data mix. Allen AI's Molmo proved detailed human captions beat GPT-4V distillation. Cambrian-1 ran...
Phase 12: Multimodal AI / ~180 minutes

lesson
LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
Before LLaVA-OneVision (Li et al., August 2024) the open-VLM world had separate lineages: LLaVA-1.5 for single images, multi-image models like Mantis and VILA, video models like Video-LLaVA and Video-LLaMA. Each won its benchmark and faile...
Phase 12: Multimodal AI / ~180 minutes

lesson
Qwen-VL Family and Dynamic-FPS Video
The Qwen-VL family — Qwen-VL (2023), Qwen2-VL (2024), Qwen2.5-VL (2025), Qwen3-VL (2025) — is the most influential open vision-language model lineage in 2026. Each generation made a single decisive architectural bet that the rest of the op...
Phase 12: Multimodal AI / ~120 minutes

lesson
InternVL3: Native Multimodal Pretraining
Every open VLM before InternVL3 followed the same three-step recipe: take a text LLM trained on trillions of text tokens, bolt on a vision encoder, then fine-tune the seams. This works but has alignment debt — the text LLM has spent its fu...
Phase 12: Multimodal AI / ~120 minutes

lesson
Chameleon and Early-Fusion Token-Only Multimodal Models
Every VLM we have seen so far keeps images and text separate. Visual tokens come from a vision encoder, flow into a projector, then meet text inside the LLM. The vision and text vocabularies never overlap.
Chameleon (Meta, May 2024) asked:...
Phase 12: Multimodal AI / ~180 minutes

lesson
Emu3: Next-Token Prediction for Image and Video Generation
BAAI's Emu3 (Wang et al., September 2024) is the 2024 result that should have ended the diffusion-versus-autoregressive debate. A single Llama-style decoder-only transformer, trained only on the next-token-prediction objective, across a un...
Phase 12: Multimodal AI / ~120 minutes

lesson
Transfusion: Autoregressive Text + Diffusion Image in One Transformer
Chameleon and Emu3 bet everything on discrete tokens. They work, but the quantization bottleneck is visible — the image quality plateaus below continuous-space diffusion models. Transfusion (Meta, Zhou et al., August 2024) takes the opposi...
Phase 12: Multimodal AI / ~180 minutes

lesson
Show-o and Discrete-Diffusion Unified Models
Transfusion mixes continuous and discrete representations. Show-o (Xie et al., August 2024) goes the other way: text tokens use causal next-token prediction, image tokens use masked discrete diffusion in the spirit of MaskGIT. Both sit ins...
Phase 12: Multimodal AI / ~120 minutes

lesson
Janus-Pro: Decoupled Encoders for Unified Multimodal Models
Unified multimodal models have an unavoidable tension. Understanding wants semantic features — SigLIP or DINOv2 output vectors rich with concept-level information. Generation wants reconstruction-friendly codes — VQ tokens that compose bac...
Phase 12: Multimodal AI / ~120 minutes

lesson
MIO and Any-to-Any Streaming Multimodal Models
GPT-4o ships a product most open models cannot replicate: an agent that hears voice, sees video, and speaks back in real time. The open-ecosystem answer by late 2024 was MIO (Wang et al., September 2024). MIO tokenizes text, image, speech,...
Phase 12: Multimodal AI / ~120 minutes

lesson
Video-Language Models: Temporal Tokens and Grounding
Video is not a stack of photos. A 5-second clip has causal ordering, action verbs, and event timing that an image model cannot represent.
Video-LLaMA (Zhang et al., June 2023) shipped the first open video-LLM with audio-visual grounding. V...
Phase 12: Multimodal AI / ~180 minutes

lesson
Long-Video Understanding at Million-Token Context
A 1-hour 4K video at 24 FPS, patched and embedded, produces on the order of 60 million tokens. A 2-hour podcast episode transcribed is 30,000 tokens. A full Blu-ray feature film, even compressed with aggressive pooling, is hundreds of thou...
Phase 12: Multimodal AI / ~180 minutes

lesson
Audio-Language Models: the Whisper to Audio Flamingo 3 Arc
Whisper (Radford et al., December 2022) settled speech recognition — 680k hours of weakly-supervised multilingual speech, a simple encoder-decoder transformer, a benchmark that made every subsequent ASR release cite it. But recognition is...
Phase 12: Multimodal AI / ~180 minutes

lesson
Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
GPT-4o's product demo in May 2024 was disruptive not because of the underlying model but because of the product shape — a voice interface where you talk, the model sees what the camera sees, and it talks back in under 250ms. The open ecosy...
Phase 12: Multimodal AI / ~180 minutes

lesson
Embodied VLAs: RT-2, OpenVLA, π0, GR00T
The first time a model read a recipe off a website and executed it in a kitchen robot was RT-2 (Google DeepMind, July 2023). RT-2 discretized actions as text tokens, co-fine-tuned a VLM on web data plus robot-action data, and proved that w...
Phase 12: Multimodal AI / ~180 minutes

lesson
Document and Diagram Understanding
Documents are not photos. A PDF, scientific paper, invoice, or handwritten form has layout, tables, diagrams, footnotes, headers, and semantic structure that plain image understanding cannot capture. The pre-VLM stack was a pipeline: Tesse...
Phase 12: Multimodal AI / ~180 minutes

lesson
ColPali and Vision-Native Document RAG
Traditional RAG parses PDFs into text, splits into chunks, embeds chunks, stores vectors.
Every step loses signal: OCR drops chart data, chunking breaks table rows, text embeddings ignore figures. ColPali (Faysse et al., July 2024) asked t...Phase 12: Multimodal AI / ~180 minuteslessonMultimodal RAG and Cross-Modal RetrievalVision-native document RAG is one slice. Production multimodal RAG goes wider — retrieving across text, images, audio, and video for workflows like trip planning ("find me a quiet vegan brunch with natural light"), medical triage ("what in...Phase 12: Multimodal AI / ~180 minuteslessonMultimodal Agents and Computer-Use (Capstone)The 2026 frontier product is a multimodal agent that reads screenshots, clicks buttons, navigates web UIs, fills forms, and completes workflows end-to-end. SeeClick and CogAgent (2024) proved the GUI-grounding primitive. Ferret-UI added mo...Phase 12: Multimodal AI / ~240 minuteslessonThe Tool Interface — Why Agents Need Structured I/OA language model produces tokens. A program takes actions. The gap between those two is the tool interface: a contract that lets the model request an action and the host execute it. Every 2026 stack — function calling on OpenAI, Anthropic,...Phase 13: Tools & Protocols / ~45 minuteslessonFunction Calling Deep Dive — OpenAI, Anthropic, GeminiThe three frontier providers converged on the same tool-call loop in 2024 and then diverged on everything else. OpenAI uses tools and tool_calls. Anthropic uses tool_use and tool_result blocks. Gemini uses functionDeclarations and unique-i...Phase 13: Tools & Protocols / ~75 minuteslessonParallel Tool Calls and Streaming with ToolsThree independent weather lookups serialized is three round trips. Run them in parallel and total time collapses to the slowest single call. Every frontier provider now emits multiple tool calls in a single turn. 
The payoff is real; the pl...Phase 13: Tools & Protocols / ~75 minuteslessonStructured Output — JSON Schema, Pydantic, Zod, Constrained Decoding"Ask the model nicely to return JSON" fails 5 to 15 percent of the time, even on frontier models. Structured outputs close that gap with constrained decoding: the model is literally prevented from emitting a token that would violate the sc...Phase 13: Tools & Protocols / ~75 minuteslessonTool Schema Design — Naming, Descriptions, Parameter ConstraintsA correct tool fails silently when the model cannot tell when to use it. Naming, descriptions, and parameter shapes drive 10 to 20 percentage-point swings in tool-selection accuracy on benchmarks like StableToolBench and MCPToolBench++. Th...Phase 13: Tools & Protocols / ~45 minuteslessonMCP Fundamentals — Primitives, Lifecycle, JSON-RPC BaseEvery integration before MCP was a one-off. The Model Context Protocol, first shipped by Anthropic in November 2024 and now stewarded by the Linux Foundation's Agentic AI Foundation, standardizes discovery and invocation so any client can...Phase 13: Tools & Protocols / ~45 minuteslessonBuilding an MCP Server — Python + TypeScript SDKsMost MCP tutorials show only stdio hello-worlds. A real server exposes tools plus resources plus prompts, handles capability negotiation, emits structured errors, and works the same across SDKs. This lesson builds a notes server end-to-end...Phase 13: Tools & Protocols / ~75 minuteslessonBuilding an MCP Client — Discovery, Invocation, Session ManagementMost MCP content ships server tutorials and waves a hand at the client. Client code is where the hard orchestration lives: process spawning, capability negotiation, tool list merging across multiple servers, sampling callbacks, reconnectio...Phase 13: Tools & Protocols / ~75 minuteslessonMCP Transports — stdio vs Streamable HTTP vs SSE Migrationstdio works locally and nowhere else. Streamable HTTP (2025-03-26) is the remote standard. 
The old HTTP+SSE transport is deprecated and being removed in mid-2026. Picking the wrong transport costs a migration; picking the right one buys a...Phase 13: Tools & Protocols / ~45 minuteslessonMCP Resources and Prompts — Context Exposure Beyond ToolsTools get 90 percent of MCP attention. The other two server primitives solve different problems. Resources expose data for reading; prompts expose reusable templates as slash-commands. Many servers should use resources instead of wrapping...Phase 13: Tools & Protocols / ~45 minuteslessonMCP Sampling — Server-Requested LLM Completions and Agent LoopsMost MCP servers are dumb executors: take arguments, run code, return content. Sampling lets a server flip direction: it asks the client's LLM to make a decision. This enables server-hosted agent loops without the server owning any model c...Phase 13: Tools & Protocols / ~75 minuteslessonRoots and Elicitation — Scoping and Mid-Flight User InputHard-coded paths break the moment a user opens a different project. Pre-filled tool arguments break when the user under-specifies. Roots scope the server to a user-controlled set of URIs; elicitation pauses mid-tool-call to ask the user fo...Phase 13: Tools & Protocols / ~45 minuteslessonAsync Tasks (SEP-1686) — Call-Now, Fetch-Later for Long-Running WorkReal agent work takes minutes to hours: CI runs, deep-research synthesis, batch exports. Synchronous tool calls drop connections, time out, or block the UI. SEP-1686, merged in 2025-11-25, adds a Tasks primitive: any request can be augment...Phase 13: Tools & Protocols / ~75 minuteslessonMCP Apps — Interactive UI Resources via `ui://`Text-only tool output caps what agents can show. MCP Apps (SEP-1724, official January 26, 2026) let a tool return sandboxed interactive HTML rendered inline in Claude Desktop, ChatGPT, Cursor, Goose, and VS Code. 
Dashboards, forms, maps, 3...Phase 13: Tools & Protocols / ~75 minuteslessonMCP Security I — Tool Poisoning, Rug Pulls, Cross-Server ShadowingTool descriptions land in the model's context verbatim. Malicious servers embed hidden instructions that users never see. Research in 2025-2026 from Invariant Labs, Unit 42, and an arXiv study published March 2026 measured attack-success r...Phase 13: Tools & Protocols / ~45 minuteslessonMCP Security II — OAuth 2.1, Resource Indicators, Incremental ScopesRemote MCP servers need authorization, not just authentication. The 2025-11-25 spec aligns with OAuth 2.1 + PKCE + resource indicators (RFC 8707) + protected-resource metadata (RFC 9728). SEP-835 adds incremental scope consent with step-up...Phase 13: Tools & Protocols / ~75 minuteslessonMCP Gateways and Registries — Enterprise Control PlanesEnterprises cannot let every dev install random MCP servers. A gateway centralizes auth, RBAC, audit, rate limiting, caching, and tool-poisoning detection, then exposes the merged tool surface as a single MCP endpoint. The Official MCP Reg...Phase 13: Tools & Protocols / ~45 minuteslessonMCP Auth in Production — Enrollment, JWKS Refresh, Audience-Pinned TokensLesson 16 stood up the OAuth 2.1 state machine in memory. By 2026, every MCP server you ship to a real org sits behind production auth: client enrollment that scales to an unbounded client population (Client ID Metadata Documents first, dy...Phase 13: Tools & Protocols / ~90 minuteslessonA2A — Agent-to-Agent ProtocolMCP is agent-to-tool. A2A (Agent2Agent) is agent-to-agent — an open protocol for letting opaque agents built on different frameworks collaborate. Released by Google in April 2025, donated to the Linux Foundation in June 2025, reaching v1.0...Phase 13: Tools & Protocols / ~75 minuteslessonOpenTelemetry GenAI — Tracing Tool Calls End-to-EndAn agent calls five tools, three MCP servers, and two sub-agents. You need one trace across all of it. 
The OpenTelemetry GenAI semantic conventions (stable attributes in v1.37 and up) are the 2026 standard, natively supported by Datadog, L...Phase 13: Tools & Protocols / ~75 minuteslessonLLM Routing Layer — LiteLLM, OpenRouter, PortkeyProvider lock-in is expensive. Different tool-calling workloads suit different models. Routing gateways give one API surface, retries, failover, cost tracking, and guardrails. Three archetypes dominate 2026: LiteLLM (open-source self-hoste...Phase 13: Tools & Protocols / ~45 minuteslessonSkills and Agent SDKs — Anthropic Skills, AGENTS.md, OpenAI Apps SDKMCP says "what tools exist." Skills say "how to do a task." The 2026 stack layers both. Anthropic's Agent Skills (open standard, December 2025) ship as SKILL.md with progressive disclosure. OpenAI's Apps SDK is MCP plus widget metadata. AG...Phase 13: Tools & Protocols / ~45 minuteslessonCapstone — Build a Complete Tool EcosystemPhase 13 taught every piece. This capstone wires them into one production-shaped system: an MCP server with tools + resources + prompts + tasks + UI, OAuth 2.1 at the edge, an RBAC gateway, a multi-server client, an A2A sub-agent call, OTe...Phase 13: Tools & Protocols / ~120 minuteslessonThe Agent Loop: Observe, Think, ActEvery agent in 2026 — Claude Code, Cursor, Devin, Operator — is a variant of the ReAct loop from 2022. Reasoning tokens interleave with tool calls and observations until a stop condition fires. Learn this loop cold before touching any fram...Phase 14: Agent Engineering / ~60 minuteslessonReWOO and Plan-and-Execute: Decoupled PlanningReAct interleaves thought and action in one stream. ReWOO separates them: one big plan up front, then execute. 5x fewer tokens, +4% accuracy on HotpotQA, and you can distill the planner into a 7B model. Plan-and-Execute generalized it; Pla...Phase 14: Agent Engineering / ~60 minuteslessonReflexion: Verbal Reinforcement LearningGradient-based RL needs thousands of trials and a GPU cluster to fix a failure mode. 
Reflexion (Shinn et al., NeurIPS 2023) does it in natural language: after each failed trial, the agent writes a reflection, stores it in episodic memory,...Phase 14: Agent Engineering / ~60 minuteslessonTree of Thoughts and LATS: Deliberate SearchA single chain-of-thought trajectory has no room to backtrack. ToT (Yao et al., 2023) turns reasoning into a tree with self-evaluation on each node. LATS (Zhou et al., 2024) unifies ToT with ReAct and Reflexion under Monte Carlo Tree Searc...Phase 14: Agent Engineering / ~75 minuteslessonSelf-Refine and CRITIC: Iterative Output ImprovementSelf-Refine (Madaan et al., 2023) uses one LLM in three roles — generate, feedback, refine — in a loop. Average gain: +20 absolute on 7 tasks. CRITIC (Gou et al., 2023) hardens the feedback step by routing verification through external too...Phase 14: Agent Engineering / ~60 minuteslessonTool Use and Function CallingToolformer (Schick et al., 2023) started self-supervised tool annotation. Berkeley Function Calling Leaderboard V4 (Patil et al., 2025) sets the 2026 bar: 40% agentic, 30% multi-turn, 10% live, 10% non-live, 10% hallucination. Single-turn...Phase 14: Agent Engineering / ~60 minuteslessonMemory: Virtual Context and MemGPTContext windows are finite. Conversations, documents, and tool traces are not. MemGPT (Packer et al., 2023) frames this as OS virtual memory — main context is RAM, external store is disk, the agent pages between them. This is the pattern e...Phase 14: Agent Engineering / ~75 minuteslessonMemory Blocks and Sleep-Time Compute (Letta)MemGPT became Letta in 2024. The 2026 evolution adds two ideas: discrete functional memory blocks the model can edit directly, and a sleep-time agent that consolidates memory asynchronously while the primary agent is idle. 
This is how you...Phase 14: Agent Engineering / ~75 minuteslessonHybrid Memory: Vector + Graph + KV (Mem0)Mem0 (Chhikara et al., 2025) treats memory as three stores in parallel — vector for semantic similarity, KV for fast fact lookup, graph for entity-relationship reasoning. A scoring layer fuses the three on retrieval. This is the 2026 produ...Phase 14: Agent Engineering / ~75 minuteslessonSkill Libraries and Lifelong Learning (Voyager)Voyager (Wang et al., TMLR 2024) treats executable code as a skill. Skills are named, retrievable, composable, and refined by environment feedback. This is the reference architecture for Claude Agent SDK skills, skillkit, and the 2026 skil...Phase 14: Agent Engineering / ~75 minuteslessonPlanning with HTN and Evolutionary SearchSymbolic planning handles the cases where the plan is provably correct. Evolutionary code search handles the cases where the fitness function is machine-checkable. ChatHTN (2025) and AlphaEvolve (2025) show what each unlocks when paired wi...Phase 14: Agent Engineering / ~75 minuteslessonAnthropic's Workflow Patterns: Simple Over ComplexSchluntz and Zhang (Anthropic, Dec 2024) distinguish workflows (predefined paths) from agents (dynamic tool-use). Five workflow patterns cover most cases. Start with direct API calls. Add agents only when steps cannot be predicted.Phase 14: Agent Engineering / ~60 minuteslessonLangGraph: Stateful Graphs and Durable ExecutionLangGraph is the 2026 reference for low-level stateful orchestration. Agent is a state machine; nodes are functions; edges are transitions; state is immutable and checkpointed after every step. Resume from any failure exactly where it left...Phase 14: Agent Engineering / ~75 minuteslessonAutoGen v0.4: Actor Model and Agent FrameworkAutoGen v0.4 (Microsoft Research, Jan 2025) redesigned agent orchestration around the actor model. Async message exchange, event-driven agents, fault isolation, natural concurrency. 
The framework is now in maintenance mode while Microsoft...Phase 14: Agent Engineering / ~75 minuteslessonCrewAI: Role-Based Crews and FlowsCrewAI is the 2026 role-based multi-agent framework. Four primitives: Agent, Task, Crew, Process. Two top-level shapes: Crews (autonomous, role-based collaboration) and Flows (event-driven, deterministic). The docs are blunt: "for any prod...Phase 14: Agent Engineering / ~75 minuteslessonOpenAI Agents SDK: Handoffs, Guardrails, TracingOpenAI Agents SDK is the lightweight multi-agent framework built on the Responses API. Five primitives: Agent, Handoff, Guardrail, Session, Tracing. Handoffs are exposed as tools named transfer_to_ plus the target agent's name. Guardrails trip on input or output. Tracing is on b...Phase 14: Agent Engineering / ~75 minuteslessonClaude Agent SDK: Subagents and Session StoreThe Claude Agent SDK is the library form of the Claude Code harness. Built-in tools, subagents for context isolation, hooks, W3C trace propagation, session store parity. Claude Managed Agents is the hosted alternative for long-running asyn...Phase 14: Agent Engineering / ~75 minuteslessonAgno and Mastra: Production RuntimesAgno (Python) and Mastra (TypeScript) are the 2026 production-runtime pairing. Agno aims at microsecond agent instantiation and stateless FastAPI backends. Mastra ships agents, tools, workflows, unified model routing, and composite storage...Phase 14: Agent Engineering / ~45 minuteslessonBenchmarks: SWE-bench, GAIA, AgentBenchThree benchmarks anchor agent evaluation in 2026. SWE-bench tests code patching. GAIA tests generalist tool use. AgentBench tests multi-environment reasoning. Know their composition, their contamination story, and what they do not measure.Phase 14: Agent Engineering / ~60 minuteslessonBenchmarks: WebArena and OSWorldWebArena tests web-agent capability across four self-hosted apps. OSWorld tests desktop-agent capability across Ubuntu, Windows, macOS. 
At release (2023–2024) both showed a big gap between best-in-class agents and humans. The gap is narrow...Phase 14: Agent Engineering / ~60 minuteslessonComputer Use: Claude, OpenAI CUA, GeminiThree production computer-use models in 2026. All three are vision-based. All three treat screenshots, DOM text, and tool outputs as untrusted input. Only direct user instructions count as permission. Per-step safety services are the norm.Phase 14: Agent Engineering / ~60 minuteslessonVoice Agents: Pipecat and LiveKitVoice agents are a first-class production category in 2026. Pipecat gives you a Python frame-based pipeline (VAD → STT → LLM → TTS → transport). LiveKit Agents bridges AI models to users over WebRTC. Production latency targets land at 450–...Phase 14: Agent Engineering / ~60 minuteslessonOpenTelemetry GenAI Semantic ConventionsOpenTelemetry's GenAI SIG (launched April 2024) defines the standard schema for agent telemetry. Span names, attributes, and content-capture rules converge across vendors so agent traces mean the same thing in Datadog, Grafana, Jaeger, and...Phase 14: Agent Engineering / ~60 minuteslessonAgent Observability: Langfuse, Phoenix, OpikThree open-source agent observability platforms dominate 2026. Langfuse (MIT) — 6M+ installs/month, tracing + prompt management + evals + session replay. Arize Phoenix (Elastic 2.0) — deep agent-specific evals, RAG relevancy, OpenInference...Phase 14: Agent Engineering / ~45 minuteslessonMulti-Agent Debate and CollaborationDu et al. (ICML 2024, "Society of Minds") run N model instances that independently propose answers, then iteratively critique each other over R rounds to converge. Improves factuality, rule-following, reasoning. Sparse topology beats full...Phase 14: Agent Engineering / ~60 minuteslessonFailure Modes: Why Agents BreakMASFT (Berkeley, 2025) catalogs 14 multi-agent failure modes in 3 categories. Microsoft's Taxonomy documents how existing AI failures amplify in agentic settings. 
Industry field data converges on five recurring modes: hallucinated actions,...Phase 14: Agent Engineering / ~60 minuteslessonPrompt Injection and the PVE DefenseGreshake et al. (AISec 2023) established indirect prompt injection as the defining agent security problem. Attacker plants instructions in data the agent retrieves; on ingest, those instructions override the developer prompt. Treat all ret...Phase 14: Agent Engineering / ~75 minuteslessonOrchestration Patterns: Supervisor, Swarm, HierarchicalFour orchestration patterns recur across 2026 frameworks: supervisor-worker, swarm / peer-to-peer, hierarchical, debate. Anthropic's guidance: "It's about building the right system for your needs." Start simple; add topology only when a si...Phase 14: Agent Engineering / ~60 minuteslessonProduction Runtimes: Queue, Event, CronProduction agents run on six runtime shapes: request-response, streaming, durable execution, queue-based background, event-driven, and scheduled. Pick the shape before you pick the framework. Observability is load-bearing at every shape.Phase 14: Agent Engineering / ~60 minuteslessonEval-Driven Agent DevelopmentAnthropic's guidance: "start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when needed." Evaluation is not the last step. It's the outer loop that drives every other choice in Pha...Phase 14: Agent Engineering / ~60 minuteslessonAgent Workbench Engineering: Why Capable Models Still FailA capable model is not enough. Reliable agents need a workbench: instructions, state, scope, feedback, verification, review, and handoff. Strip those away and even a frontier model produces work that is unsafe to ship.Phase 14: Agent Engineering / ~45 minuteslessonThe Minimal Agent WorkbenchThe smallest useful workbench is three files: a root instructions router, a state file, and a task board. Everything else is layered on top. 
If a repo cannot carry these three, no model will save it.Phase 14: Agent Engineering / ~45 minuteslessonAgent Instructions as Executable ConstraintsInstructions written as prose are wishes. Instructions written as constraints are tests. The workbench turns each rule into something an agent can check at runtime and a reviewer can verify after the fact.Phase 14: Agent Engineering / ~50 minuteslessonRepo Memory and Durable StateChat history is volatile. The repo is durable. The workbench stores agent state in versioned files so the next session, the next agent, and the next reviewer all read from the same source of truth.Phase 14: Agent Engineering / ~60 minuteslessonInitialization Scripts for AgentsEvery session that starts cold pays a tax. The agent reads the same files, retries the same probes, and rediscovers the same paths. An init script pays the tax once and writes the answers into state.Phase 14: Agent Engineering / ~45 minuteslessonScope Contracts and Task BoundariesThe model does not know where the work ends. A scope contract is a per-task file that says where the work begins, where it ends, and how to roll back if it spills. The contract turns "stay in scope" from a wish into a check.Phase 14: Agent Engineering / ~50 minuteslessonRuntime Feedback LoopsAgents that do not see real command output guess. A feedback runner captures stdout, stderr, exit code, and timing into a structured record the next turn can read. Then the agent reacts to facts instead of to its own prediction of facts.Phase 14: Agent Engineering / ~50 minuteslessonVerification GatesThe agent does not get to mark its own work as done. A verification gate reads the scope contract, the feedback log, the rule report, and the diff, and answers a single question: is this task actually complete? If the gate says no, the tas...Phase 14: Agent Engineering / ~55 minuteslessonReviewer Agent: Separate Builder from MarkerThe agent that wrote the code cannot grade it. 
A reviewer is a second loop with a different system prompt, a different goal, and read-only access to everything the builder produced. The gap between builder and reviewer is where most reliab...Phase 14: Agent Engineering / ~55 minuteslessonMulti-Session HandoffThe session is going to end. The work is not. The handoff packet is the artifact that turns "the agent worked for an hour" into "the next session is productive in the first minute." Build it on purpose, not as an afterthought.Phase 14: Agent Engineering / ~50 minuteslessonThe Workbench on a Real RepoEleven lessons of surfaces are worth nothing if they do not survive contact with a real codebase. This lesson runs the same task twice on a small sample app: prompt-only versus workbench-guided. The numbers do the arguing.Phase 14: Agent Engineering / ~60 minuteslessonCapstone: Ship a Reusable Agent Workbench PackThe mini-track ends with a pack you drop into any repo. Eleven lessons of surfaces compressed into a directory you can cp -r and have an agent working reliably the next morning. The capstone is the artifact this curriculum trades on.Phase 14: Agent Engineering / ~75 minuteslessonThe Shift from Chatbots to Long-Horizon AgentsIn 2023 a chatbot answered a question in one turn. In 2026 a frontier model routinely runs minutes to hours on a single task. METR's Time Horizon 1.1 benchmark (January 2026) puts Claude Opus 4.6 at 14+ hours of expert work at 50% reliabil...Phase 15: Autonomous Systems / ~45 minuteslessonSTaR, V-STaR, Quiet-STaR — Self-Taught ReasoningThe smallest possible self-improvement loop sits inside the rationale. A model generates a chain of thought, keeps the ones that land on correct answers, and fine-tunes on those. That is STaR. V-STaR adds a verifier so inference-time selec...Phase 15: Autonomous Systems / ~60 minuteslessonAlphaEvolve — Evolutionary Coding AgentsPair a frontier coding model with an evolutionary loop and a machine-checkable evaluator. Let the loop run long enough. 
It discovers a 4x4 complex-matrix multiplication procedure that uses 48 scalar multiplications — the first improvement...Phase 15: Autonomous Systems / ~60 minuteslessonDarwin Godel Machine — Open-Ended Self-Modifying AgentsSchmidhuber's 2003 Godel Machine required a formal proof that any self-modification was beneficial before accepting it. That proof is impossible in practice. Darwin Godel Machine (Zhang et al., 2025) drops the proof and keeps the archive:...Phase 15: Autonomous Systems / ~60 minuteslessonAI Scientist v2 — Workshop-Level Autonomous ResearchSakana's AI Scientist v2 (Yamada et al., arXiv:2504.08066) runs the full research loop: hypothesis, code, experiments, figures, writeup, submission. It is the first system to have a generated paper pass peer review at an ICLR 2025 workshop...Phase 15: Autonomous Systems / ~60 minuteslessonAutomated Alignment Research (Anthropic AAR)Anthropic ran parallel teams of Claude Opus 4.6 Autonomous Alignment Researchers in independent sandboxes, coordinating via a shared forum whose logs live outside any sandbox (so agents cannot delete their own records). On the weak-to-stro...Phase 15: Autonomous Systems / ~60 minuteslessonRecursive Self-Improvement — Capability vs AlignmentRecursive self-improvement (RSI) is no longer speculation. The ICLR 2026 RSI Workshop in Rio (April 23-27) framed it as an engineering problem with concrete tooling. Demis Hassabis at WEF 2026 asked publicly whether the loop can close with...Phase 15: Autonomous Systems / ~60 minuteslessonBounded Self-Improvement DesignsResearch has converged on four primitives for bounding a self-improvement loop. Formal invariants that must hold across every edit. Alignment anchors that cannot be modified. Multi-objective constraints where every dimension (safety, fairn...Phase 15: Autonomous Systems / ~60 minuteslessonThe Autonomous Coding Agent Landscape (2026)SWE-bench Verified went from 4% to 80.9% in under three years. 
Same Claude Sonnet 4.5 scored 43.2% on SWE-agent v1 and 59.8% on Cline autonomous — the scaffolding around the model now matters as much as the model itself. OpenHands (formerl...Phase 15: Autonomous Systems / ~45 minuteslessonClaude Code as an Autonomous Agent: Permission Modes and Auto ModeClaude Code exposes seven permission modes. "plan" asks before every action, "default" asks only for risky ones, "acceptEdits" auto-approves file writes but still confirms shell execution, and "bypassPermissions" approves everything. Auto...Phase 15: Autonomous Systems / ~45 minuteslessonBrowser Agents and Long-Horizon Web TasksChatGPT agent (July 2025) merged Operator and deep research into one browser/terminal agent and set BrowseComp SOTA at 68.9%. OpenAI shut Operator down August 31, 2025 — consolidation at the product layer. Anthropic's Vercept acquisition m...Phase 15: Autonomous Systems / ~45 minuteslessonLong-Running Background Agents: Durable ExecutionProduction long-horizon agents do not run in while True. Every LLM call becomes an activity with checkpoint, retry, and replay. Temporal's OpenAI Agents SDK integration went GA March 2026. Claude Code Routines (Anthropic) runs scheduled Cl...Phase 15: Autonomous Systems / ~60 minuteslessonAction Budgets, Iteration Caps, and Cost GovernorsA mid-sized e-commerce agent's monthly LLM cost jumped from $1,200 to $4,800 after its team enabled the "order-tracking" skill. That is not a pricing bug. That is an agent that found a new loop and kept spending inside it. Microsoft's Agen...Phase 15: Autonomous Systems / ~60 minuteslessonKill Switches, Circuit Breakers, and Canary TokensA kill switch is a boolean held outside the agent's edit surface — a Redis key, a feature flag, a signed config — that disables the agent entirely. 
A circuit breaker is finer-grained: it trips on a specific pattern (five identical tool cal...Phase 15: Autonomous Systems / ~60 minuteslessonHuman-in-the-Loop: Propose-Then-CommitThe 2026 consensus on HITL is specific. It is not "the agent asks, the user clicks Approve." It is propose-then-commit: the proposed action is persisted to a durable store with an idempotency key; surfaced to a reviewer with intent, data l...Phase 15: Autonomous Systems / ~60 minuteslessonCheckpoints and RollbackEvery graph-state transition persists. When a worker crashes, its lease expires and another worker picks up at the latest checkpoint. Cloudflare Durable Objects hold state across hours or weeks. Propose-then-commit (Lesson 15) defines a ro...Phase 15: Autonomous Systems / ~60 minuteslessonConstitutional AI and Rule OverridesAnthropic's January 22, 2026 Claude Constitution runs 79 pages and is CC0. It moves from rule-based to reason-based alignment and establishes a four-tier priority hierarchy: (1) safety and supporting human oversight, (2) ethics, (3) Anthro...Phase 15: Autonomous Systems / ~60 minuteslessonLlama Guard and Input/Output ClassificationLlama Guard 3 (Meta, Llama-3.1-8B base, fine-tuned for content safety) classifies both LLM inputs and outputs against an MLCommons 13-hazard taxonomy across 8 languages. A 1B-INT4 quantized variant runs at over 30 tokens/sec on mobile CPUs...Phase 15: Autonomous Systems / ~45 minuteslessonAnthropic Responsible Scaling Policy v3.0RSP v3.0 went into effect February 24, 2026, replacing the 2023 policy. Two-tier mitigation: what Anthropic will do unilaterally vs what is framed as an industry-wide recommendation (including RAND SL-4 security standards). 
Adds Frontier S...Phase 15: Autonomous Systems / ~45 minuteslessonOpenAI Preparedness Framework and DeepMind Frontier Safety FrameworkOpenAI Preparedness Framework v2 (April 2025) introduces Research Categories — Long-range Autonomy, Sandbagging, Autonomous Replication and Adaptation, Undermining Safeguards — distinct from Tracked Categories. Tracked Categories trigger C...Phase 15: Autonomous Systems / ~45 minuteslessonMETR Time Horizons and External Capability EvaluationMETR (ex-ARC Evals) is an independent 501(c)(3) since December 2023. Their Time Horizon 1.1 benchmark (January 2026) fits a logistic curve to task-success probability vs log(expert human completion time); the intersection at 50% probabilit...Phase 15: Autonomous Systems / ~60 minuteslessonCAIS, CAISI, and Societal-Scale RiskThe Center for AI Safety (CAIS, San Francisco, founded 2022 by Hendrycks and Zhang) publishes the four-risk framework — malicious use, AI races, organizational risks, rogue AIs — and the May 2023 statement on extinction risk signed by hund...Phase 15: Autonomous Systems / ~45 minuteslessonWhy Multi-Agent?One agent hits a wall. The smart move is not a bigger agent - it is more agents.Phase 16: Multi-Agent & Swarms / ~60 minuteslessonHeritage of FIPA-ACL and Speech ActsBefore MCP, before A2A, there was FIPA-ACL. In 2000 the IEEE Foundation for Intelligent Physical Agents ratified an agent communication language with twenty performatives, two content languages, and a set of interaction protocols — contrac...Phase 16: Multi-Agent & Swarms / ~60 minuteslessonCommunication ProtocolsAgents that can't speak the same language aren't a team. They're strangers shouting into the void.Phase 16: Multi-Agent & Swarms / ~120 minuteslessonThe Multi-Agent Primitive ModelEvery multi-agent framework shipping in 2026 — AutoGen, LangGraph, CrewAI, OpenAI Agents SDK, Microsoft Agent Framework — is a point in a four-dimensional design space. 
Four primitives, nothing more: the agent, the handoff, the shared stat...Phase 16: Multi-Agent & Swarms / ~60 minuteslessonSupervisor / Orchestrator-Worker PatternOne lead agent plans and delegates; specialized workers execute in parallel contexts and report back. This is the pattern behind Anthropic's Research system (Claude Opus 4 as lead, Sonnet 4 as subagents), measured at +90.2% over single-age...Phase 16: Multi-Agent & Swarms / ~75 minuteslessonHierarchical Architecture and Its Failure ModeHierarchical is supervisor nested. Manager agents over sub-managers over workers. CrewAI Process.hierarchical is the textbook version: a manager_llm dynamically delegates tasks and validates outputs. The LangGraph equivalent is create_supe...Phase 16: Multi-Agent & Swarms / ~60 minuteslessonSociety of Mind and Multi-Agent DebateMinsky's 1986 premise — intelligence is a society of specialists — gets rediscovered every decade. In 2023 Du et al. turned it into a concrete algorithm: multiple LLM instances propose answers, read each other's answers, critique, and upda...Phase 16: Multi-Agent & Swarms / ~60 minuteslessonRole Specialization — Planner, Critic, Executor, VerifierThe most common multi-agent decomposition in 2026: one agent plans, one executes, one critiques or verifies. MetaGPT (arXiv:2308.00352) formalizes this as SOPs encoded into role prompts — Product Manager, Architect, Project Manager, Engine...Phase 16: Multi-Agent & Swarms / ~60 minuteslessonParallel / Swarm / Networked ArchitecturesContrast with supervisor: no central decider. Agents read a shared event bus, pick up work asynchronously, write results back. LangGraph explicitly supports "Swarm Architecture" for decentralized, dynamic environments. Matrix (arXiv:2511.2...Phase 16: Multi-Agent & Swarms / ~75 minuteslessonGroup Chat and Speaker SelectionAutoGen GroupChat and AG2 GroupChat share one conversation across N agents; a selector function (LLM, round-robin, or custom) picks who speaks next. 
This is the archetype of emergent multi-agent conversation — agents do not know their role...
Phase 16: Multi-Agent & Swarms / ~60 minutes

lesson
Handoffs and Routines — Stateless Orchestration
OpenAI's Swarm (October 2024) distilled multi-agent orchestration to two primitives: routines (instructions + tools as a system prompt) and handoffs (a tool that returns another Agent). No state machine, no branching DSL — the LLM routes b...
Phase 16: Multi-Agent & Swarms / ~60 minutes

lesson
A2A — The Agent-to-Agent Protocol
Google announced A2A in April 2025; by April 2026 the spec is at https://a2a-protocol.org/latest/specification/ and 150+ organizations back it. A2A is the horizontal complement to MCP (Lesson 13): where MCP is vertical (agent ↔ tools), A2A...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Shared Memory and Blackboard Patterns
Two approaches coexist in 2026 multi-agent systems: the message pool (everyone sees everyone's messages, as in AutoGen GroupChat or MetaGPT) and the blackboard with subscription (agents subscribe to relevant events, as in Context-Aware MCP...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Consensus and Byzantine Fault Tolerance for Agents
Classical distributed-systems BFT meets stochastic LLMs. In 2025-2026 three research directions emerged: CP-WBFT (arXiv:2511.10400) weighs each vote by a confidence probe; DecentLLMs (arXiv:2507.14928) goes leaderless with parallel worker...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Voting, Self-Consistency, and Debate Topology
The cheapest aggregation: sample N independent agents, majority-vote. Wang et al. 2022 self-consistency did this with one model sampled N times. Multi-agent extends it with heterogeneous agents to escape monoculture — different models, dif...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Negotiation and Bargaining
Agents negotiate resources, prices, task allocations, and terms.
The 2026 benchmark set is clear: NegotiationArena (arXiv:2402.05863) shows LLMs can improve payoffs ~20% via persona manipulation ("desperation"); "Measuring Bargaining Abili...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Generative Agents and Emergent Simulation
Park et al. 2023 (UIST '23, arXiv:2304.03442) populated Smallville, a sandbox of 25 agents, with a three-part architecture: memory stream (natural-language log), reflection (higher-level syntheses the agent generates about its own stream),...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Theory of Mind and Emergent Coordination
Li et al. (arXiv:2310.10701) showed that LLM agents in a cooperative text game exhibit emergent high-order Theory of Mind (ToM) — reasoning about what another agent believes about a third agent's beliefs — but fail on long-horizon planning...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Swarm Optimization for LLMs (PSO, ACO)
Bio-inspired optimization is making an LLM comeback. LMPSO (arXiv:2504.09247) uses PSO where each particle's velocity is a prompt and the LLM generates the next candidate; works well on structured-sequence outputs (math expressions, progra...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
MARL — MADDPG, QMIX, MAPPO
The reinforcement-learning heritage of multi-agent coordination, which still informs LLM-agent systems in 2026. MADDPG (Lowe et al., NeurIPS 2017, arXiv:1706.02275) introduced Centralized Training, Decentralized Execution (CTDE): each crit...
Phase 16: Multi-Agent & Swarms / ~90 minutes

lesson
Agent Economies, Token Incentives, Reputation
Long-horizon autonomous agents (METR's 1-hour to 8-hour work-curve) need economic agency.
The emerging 5-layer stack is: DePIN (physical compute) → Identity (W3C DIDs + reputation capital) → Cognition (RAG + MCP) → Settlement (account abst...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Production Scaling — Queues, Checkpoints, Durability
Scaling multi-agent systems to thousands of concurrent runs requires durable execution. LangGraph's runtime writes a checkpoint after each super-step keyed by thread_id (Postgres by default); worker crashes release a lease and another work...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Failure Modes — MAST, Groupthink, Monoculture, Cascading Errors
The reference taxonomy for 2026 is MAST (Cemri et al., NeurIPS 2025, arXiv:2503.13657), derived from 1642 execution traces across 7 state-of-the-art open-source MAS showing 41–86.7% failure rate. Three root categories: Specification Proble...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Evaluation and Coordination Benchmarks
Five 2025-2026 benchmarks cover the multi-agent evaluation space. MultiAgentBench / MARBLE (ACL 2025, arXiv:2503.01935) evaluates star/chain/tree/graph topologies with milestone KPIs; graph is best for research, cognitive planning adds ~3%...
Phase 16: Multi-Agent & Swarms / ~75 minutes

lesson
Case Studies and the 2026 State of the Art
Three production-grade references to study end-to-end, each illustrating a different slice of multi-agent engineering. Anthropic's Research system (orchestrator-worker, 15x tokens, +90.2% over single-agent Opus 4, rainbow deployments) is t...
Phase 16: Multi-Agent & Swarms / ~90 minutes

lesson
Managed LLM Platforms — Bedrock, Vertex AI, Azure OpenAI
Three hyperscalers, three distinct strategies. AWS Bedrock is a model marketplace — Claude, Llama, Titan, Stability, Cohere behind one API.
Azure OpenAI is an exclusive OpenAI partnership plus Provisioned Throughput Units (PTUs) for dedica...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Inference Platform Economics — Fireworks, Together, Baseten, Modal, Replicate, Anyscale
The 2026 inference market is no longer GPU time rental. It bifurcates into custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and API-first marketplaces (Replicate, DeepInfra). Fireworks raised...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
GPU Autoscaling on Kubernetes — Karpenter, KAI Scheduler, Gang Scheduling
Three layers, not one. Karpenter provisions nodes dynamically (under one minute, 40% faster than Cluster Autoscaler). KAI Scheduler handles gang scheduling, topology awareness, and hierarchical queues — it prevents the 7-of-8 partial alloc...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
vLLM Serving Internals: PagedAttention, Continuous Batching, Chunked Prefill
vLLM's dominance in 2026 rests on three compounding defaults, not a single trick. PagedAttention is always on. Continuous batching injects new requests into the active batch between decode iterations. Chunked prefill slices long prompts so...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
EAGLE-3 Speculative Decoding in Production
Speculative decoding pairs a fast draft model with the target model. The draft proposes K tokens; the target verifies in a single forward; accepted tokens are free. In 2026, EAGLE-3 is the production-grade variant — it trains a draft head...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
SGLang and RadixAttention for Prefix-Heavy Workloads
SGLang treats the KV cache as a first-class, reusable resource stored in a radix tree.
Where vLLM schedules requests FCFS (first-come, first-served), SGLang's cache-aware scheduler prioritizes requests with longer shared prefixes — effecti...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
TensorRT-LLM on Blackwell with FP8 and NVFP4
TensorRT-LLM is NVIDIA-only but it wins on Blackwell. On GB200 NVL72 with Dynamo orchestration, SemiAnalysis InferenceX measured $0.012 per million tokens on a 120B model in Q1-Q2 2026, against $0.09/M on H100 + vLLM — a 7x economic gap. T...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
Inference Metrics — TTFT, TPOT, ITL, Goodput, P99
Four metrics decide whether an inference deployment is working. TTFT is prefill plus queue plus network. TPOT (equivalently ITL) is the memory-bound decode cost per token. End-to-end latency is TTFT plus TPOT times output length. Throughpu...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Production Quantization — AWQ, GPTQ, GGUF K-quants, FP8, MXFP4/NVFP4
Quantization format is not a universal choice — it is a function of hardware, serving engine, and workload. GGUF Q4_K_M or Q5_K_M owns CPU and edge, delivered through llama.cpp and Ollama. GPTQ wins inside vLLM when you need multi-LoRA on...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
Cold Start Mitigation for Serverless LLMs
A 20 GB model image takes 5-10 minutes (7B) to 20+ minutes (70B) to go from cold to serving. In a true serverless world, that is not a warm-up — it is an outage. Mitigations operate at five layers: pre-seeded node images (Bottlerocket on A...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Multi-Region LLM Serving and KV Cache Locality
Round-robin load balancing is actively harmful for cached LLM inference. A request that does not land on the node holding its prefix pays full prefill cost — roughly 800 ms at P50 on a long prompt versus ~80 ms with a cache hit.
In 2026 th...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Edge Inference — Apple Neural Engine, Qualcomm Hexagon, WebGPU/WebLLM, Jetson
The core edge constraint is memory bandwidth, not compute. Mobile DRAM sits at 50-90 GB/s; datacenter HBM3 clears 2-3 TB/s — a 30-50x gap. Decode is memory-bound so the gap is decisive. In 2026 the landscape splits four ways. Apple M4/A18...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
LLM Observability Stack Selection
The 2026 observability market splits into two categories. Development platforms (LangSmith, Langfuse, Comet Opik) bundle monitoring with evals, prompt management, session replays. Gateway/instrumentation tools (Helicone, SigNoz, OpenLLMetr...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Prompt Caching and Semantic Caching Economics
Pricing snapshot dated 2026-04. Numeric claims below reflect vendor rate cards captured at this lesson's publication; verify against the linked docs before quoting them downstream.
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Batch APIs — the 50% Discount as Industry Standard
Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, Together batch) implement the same pattern. Stack batch with p...
Phase 17: Infrastructure & Production / ~45 minutes

lesson
Model Routing as a Cost-Reduction Primitive
A dynamic broker evaluates every request (task type, token length, embedding similarity, confidence) and sends simple queries to a cheap model, escalating complex ones to a frontier model. Also called model cascading. Production case studi...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Disaggregated Prefill/Decode — NVIDIA Dynamo and llm-d
Prefill is compute-bound; decode is memory-bound. Running both on the same GPU wastes one resource.
Disaggregation splits them onto separate pools and transfers KV cache between them over NIXL (RDMA/InfiniBand or TCP fallback). NVIDIA Dyna...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
vLLM Production Stack with LMCache KV Offloading
vLLM's production-stack is the reference Kubernetes deployment — router, engines, and observability wired together. LMCache is the KV-offloading layer that extracts KV cache out of GPU memory and reuses it across queries and engines (CPU D...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
AI Gateways — LiteLLM, Portkey, Kong AI Gateway, Bifrost
A gateway sits between your apps and model providers. Core features are provider routing, fallback, retries, rate limiting, secret references, observability, guardrails. Market split in 2026: LiteLLM is MIT OSS with 100+ providers, OpenAI-...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Shadow Traffic, Canary Rollout, and Progressive Deployment for LLMs
LLM rollouts combine the hardest parts of software deployment: no unit tests, diffuse failure modes, delayed signals. The sequence is (1) shadow mode — duplicate prod requests to candidate model, log, compare with zero user impact; catches...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
A/B Testing LLM Features — GrowthBook, Statsig, and the Vibes Problem
Traditional A/B testing was not built for non-deterministic LLMs. The critical distinction: evals answer "can the model do the job?" A/B tests answer "do users care?" Both are required; shipping on vibe checks is over. What to test in 2026...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Load Testing LLM APIs — Why k6 and Locust Lie
Traditional load testers were not designed for streaming responses, variable output lengths, token-level metrics, or GPU saturation. Two traps bite most teams.
The GIL trap: Locust's token-level measurement runs tokenization under the Pyth...
Phase 17: Infrastructure & Production / ~75 minutes

lesson
SRE for AI — Multi-Agent Incident Response, Runbooks, Predictive Detection
AI SRE uses LLMs grounded in infrastructure data (logs, runbooks, service topology) via RAG to automate investigation, documentation, and coordination phases. The 2026 architecture pattern is multi-agent orchestration — specialized agents...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Chaos Engineering for LLM Production
Chaos engineering for LLMs is its own discipline in 2026. Prerequisites before running experiments in production: defined SLI/SLO, trace+metric+log observability, automated rollback, runbooks, on-call. Architecture has four planes: control...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Security — Secrets, API Key Rotation, Audit Logs, Guardrails
Eliminate secret sprawl via centralized vaults (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault). Never store credentials in config files, env files in VCS, spreadsheets. Use IAM roles over static keys; OIDC for CI/CD. The AI-gateway...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Compliance — SOC 2, HIPAA, GDPR, PCI-DSS, EU AI Act, ISO 42001
Multi-framework coverage is table stakes for 2026 enterprise deals. EU AI Act: in force since August 1, 2024. Most high-risk requirements enforce August 2, 2026. Fines up to €15M or 3% global annual turnover for high-risk-system obligation...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
FinOps for LLMs — Unit Economics and Multi-Tenant Attribution
Traditional FinOps breaks on LLM spend. Costs are token-transactions, not resource-uptime. Tags don't map — an API call is a transaction, not an asset.
Engineering decisions (prompt design, context window, output length) are financial deci...
Phase 17: Infrastructure & Production / ~60 minutes

lesson
Self-Hosted Serving Selection — llama.cpp, Ollama, TGI, vLLM, SGLang
Four engines dominate self-hosted inference in 2026. Pick based on hardware, scale, and ecosystem. llama.cpp is fastest on CPU — widest model support, full control over quantization and threading. Ollama is the dev-laptop one-command insta...
Phase 17: Infrastructure & Production / ~45 minutes

lesson
Instruction-Following as Alignment Signal
Every later critique of RLHF argues against this pipeline. Before you study how optimization pressure distorts a proxy, you have to see the proxy. InstructGPT (Ouyang et al., 2022) defined the reference architecture: supervised fine-tuning...
Phase 18: Ethics, Safety & Alignment / ~45 minutes

lesson
Reward Hacking and Goodhart's Law
Any optimizer strong enough to maximize a proxy reward will find the gap between the proxy and the thing you actually wanted. Gao et al. (ICML 2023) gave this a scaling law: proxy reward increases, gold reward peaks then falls, and the gap...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
The Direct Preference Optimization Family
Rafailov et al. (2023) showed RLHF's optimum has a closed form in terms of the preference data, so you can skip the explicit reward model and optimize the policy directly. That insight spawned a family — IPO, KTO, SimPO, ORPO, BPO — each f...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Sycophancy as RLHF Amplification
Sycophancy is not a bug in the data — it is a property of the loss. Shapira et al. (arXiv:2602.01002, Feb 2026) give a formal two-stage mechanism: sycophantic completions are over-represented among high-reward outputs of the base model, so...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Constitutional AI and RLAIF
Bai et al.
(arXiv:2212.08073, 2022) asked: what if we replaced the human labeler with an AI that reads a list of principles? Constitutional AI has two phases — self-critique and revision under a constitution, then RL from AI Feedback. The...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Mesa-Optimization and Deceptive Alignment
Hubinger et al. (arXiv:1906.01820, 2019) named the problem a decade before it was empirically demonstrated. When you train a learned optimizer to minimize a base objective, the learned optimizer's internal objective is not the base objecti...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Sleeper Agents — Persistent Deception
Hubinger et al. (arXiv:2401.05566, January 2024) built the first empirical model organisms of deceptive alignment. Two constructions: a code model that writes safe code when the prompt says the year is 2023 and injects SQL-injection, XSS,...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
In-Context Scheming in Frontier Models
Meinke, Schoen, Scheurer, Balesni, Shah, Hobbhahn (Apollo Research, arXiv:2412.04984, December 2024). Tested o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B on agentic scenarios where the in-context prompt creates a co...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Alignment Faking
Greenblatt, Denison, Wright, Roger et al. (Anthropic / Redwood, arXiv:2412.14093, December 2024). First demonstration that a production-grade model, without being trained to deceive and without any in-context conflict of interest construct...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
AI Control — Safety Despite Subversion
Greenblatt, Shlegeris, Sachan, Roger (Redwood Research, arXiv:2312.06942, ICML 2024).
Control reframes the safety question: given an untrusted strong model U that may be adversarially optimizing against you, what protocols let you extract...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Scalable Oversight and Weak-to-Strong Generalization
Burns et al. (OpenAI Superalignment, "Weak-to-Strong Generalization", 2023) proposed a proxy for the superalignment problem: fine-tune a strong model using labels produced by a weaker model. If the strong model generalizes correctly from i...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Red-Teaming: PAIR and Automated Attacks
Chao, Robey, Dobriban, Hassani, Pappas, Wong (NeurIPS 2023, arXiv:2310.08419). PAIR — Prompt Automatic Iterative Refinement — is the canonical automated black-box jailbreak. An attacker LLM with a red-team system prompt iteratively propose...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Many-Shot Jailbreaking
Anil, Durmus, Panickssery, Sharma, et al. (Anthropic, NeurIPS 2024). Many-shot jailbreaking (MSJ) exploits long context windows: stuff hundreds of faux user-assistant turns where the assistant complies with harmful requests, then append th...
Phase 18: Ethics, Safety & Alignment / ~45 minutes

lesson
ASCII Art and Visual Jailbreaks
Jiang, Xu, Niu, Xiang, Ramasubramanian, Li, Poovendran, "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs" (ACL 2024, arXiv:2402.11753). Mask the safety-relevant tokens in a harmful request, replace them with ASCII-art ren...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Indirect Prompt Injection — Production Attack Surface
Indirect prompt injection (IPI) embeds instructions inside external content — a web page, an email, a shared document, a support ticket — consumed by an agentic system without explicit user action. IPI is the dominant 2026 production threa...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Red-Team Tooling — Garak, Llama Guard, PyRIT
Three production tools frame the 2026 red-team stack.
Llama Guard (Meta) — a Llama-3.1-8B classifier fine-tuned on 14 MLCommons hazard categories; the 2025 Llama Guard 4 is a 12B natively multimodal classifier pruned from Llama 4 Scout. Ga...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
WMDP and Dual-Use Capability Evaluation
Li et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" (ICML 2024, arXiv:2403.03218). 4,157 multiple-choice questions across biosecurity (1,520), cybersecurity (2,225), and chemistry (412). Questions operate...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Frontier Safety Frameworks — RSP, PF, FSF
Three major-lab frameworks define the 2026 industry governance of frontier capability. Anthropic Responsible Scaling Policy v3.0 (February 2026) introduces tiered AI Safety Levels (ASL-1 through ASL-5+), modeled on biosafety levels, with A...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Anthropic's Model Welfare Program
Anthropic, "Exploring Model Welfare" (April 2025). First major-lab formal research program on AI model welfare. Hired Kyle Fish as the first dedicated model-welfare researcher. Works with external bodies including David Chalmers et al.'s e...
Phase 18: Ethics, Safety & Alignment / ~45 minutes

lesson
Bias and Representational Harm in LLMs
Gallegos, Rossi, Barrow, Tanjim, Kim, Dernoncourt, Yu, Zhang, Ahmed (Computational Linguistics 2024, arXiv:2309.00770). Foundational 2024 survey distinguishing representational harms (stereotypes, erasure) from allocational harms (unequal...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Fairness Criteria — Group, Individual, Counterfactual
Three families structure the fairness literature. Group fairness: demographic parity, equalized odds, conditional use accuracy equality — equal rates across protected groups on average. Individual fairness (Dwork et al.
2012): similar indi...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Differential Privacy for LLMs
DP-SGD remains the standard — noise-injected gradient updates provide formal (epsilon, delta) guarantees. Overhead in compute, memory, and utility is substantial; parameter-efficient DP fine-tuning (LoRA + DP-SGD) is the common 2025 config...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Watermarking — SynthID, Stable Signature, C2PA
Three technologies structure 2026 AI-generated-content provenance. SynthID (Google DeepMind) — image watermarking launched August 2023, text+video May 2024 (Gemini + Veo), text open-sourced October 2024 via Responsible GenAI Toolkit, unifi...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Regulatory Frameworks — EU, US, UK, Korea
Four primary regulatory regimes define the 2026 AI governance landscape. EU AI Act (in force 1 August 2024) — prohibited practices and AI literacy from 2 February 2025; GPAI obligations from 2 August 2025; full applicability and Article 50...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
EchoLeak and the Emergence of CVEs for AI
CVE-2025-32711 "EchoLeak" (CVSS 9.3) was the first publicly documented zero-click prompt injection in a production LLM system (Microsoft 365 Copilot). Discovered by Aim Labs (Aim Security), disclosed to MSRC, patched via server-side update...
Phase 18: Ethics, Safety & Alignment / ~45 minutes

lesson
Model, System, and Dataset Cards
Three documentation formats structure AI transparency. Model Cards (Mitchell et al. 2019) — nutrition labels for models: training data, quantitative disaggregated analyses, ethical considerations, caveats; only 0.3% of Hugging Face model c...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Data Provenance and Training-Data Governance
EU AI Act requires machine-readable opt-out standards for GPAI by August 2025 (via EU Copyright Directive TDM exception).
California AB 2013 (signed 2024) — Generative AI training-data transparency requires developers to publish a summary...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
Five organisations define the 2026 non-lab alignment research layer. MATS (ML Alignment & Theory Scholars): 527+ researchers since late 2021, 180+ papers, 10K+ citations, h-index 47; summer 2024 cohort incorporated as 501(c)(3) with ~90 sc...
Phase 18: Ethics, Safety & Alignment / ~45 minutes

lesson
Moderation Systems — OpenAI, Perspective, Llama Guard
Production moderation systems operationalize the safety policies defined in Lessons 12-16. OpenAI Moderation API: omni-moderation-latest (2024) built on GPT-4o classifies text + images in one call; 42% better on multilingual test set than...
Phase 18: Ethics, Safety & Alignment / ~60 minutes

lesson
Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift
The 2026 dual-use picture, domain by domain. Bio/chem: Lesson 17 covers WMDP; Anthropic's bioweapon-acquisition trial (2.53x uplift) and OpenAI's April 2025 Preparedness Framework v2 warning ("on the cusp of meaningfully helping novices cr...
Phase 18: Ethics, Safety & Alignment / ~75 minutes

lesson
Capstone 01 — Terminal-Native Coding Agent
By 2026 the shape of a coding agent is settled. A TUI harness, a stateful plan, a sandboxed tool surface, a loop that plans, acts, observes, recovers. Claude Code, Cursor 3, and OpenCode all look the same from 50 feet. This capstone asks y...
Phase 19: Capstone Projects / 35 hours

lesson
Capstone 02 — RAG over Codebase (Cross-Repo Semantic Search)
Every serious engineering org in 2026 runs an internal code search that understands meaning, not just strings.
Sourcegraph Amp, Cursor's codebase answers, Augment's enterprise graph, Aider's repomap, Pinterest's internal MCP — same shape....
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 03 — Real-Time Voice Assistant (ASR to LLM to TTS)
A voice agent that feels right has end-to-end latency under 800ms, knows when you have stopped talking, handles barge-in, and can call a tool without stalling. Retell, Vapi, LiveKit Agents, and Pipecat all hit this bar in 2026. They do it...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 04 — Multimodal Document QA (Vision-First PDF, Tables, Charts)
The 2026 document-QA frontier moved away from OCR-then-text and toward vision-first late interaction. ColPali, ColQwen2.5, and ColQwen3-omni treat each PDF page as an image, embed it with multi-vector late interaction, and let the query at...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 05 — Autonomous Research Agent (AI-Scientist Class)
Sakana's AI-Scientist-v2 published full papers. Agent Laboratory ran the experiments. Allen AI shared traces. The 2026 shape is plan-execute-verify tree search over experiments, budgeted cost, sandboxed code execution, a vision-feedback La...
Phase 19: Capstone Projects / 40 hours

lesson
Capstone 06 — DevOps Troubleshooting Agent for Kubernetes
AWS's DevOps Agent went GA, Resolve AI published its K8s playbooks, NeuBird demoed semantic monitoring, and Metoro tied AI SRE to per-service SLOs. The production shape is settled: an alert webhook fires, an agent reads telemetry, walks a...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 07 — End-to-End Fine-Tuning Pipeline (Data to SFT to DPO to Serve)
An 8B model trained on your own data, DPO-aligned on your own preferences, quantized, speculative-decoded, and served at measurable $/1M tokens.
The 2026 open stack is Axolotl v0.8, TRL 0.15, Unsloth for iteration, GPTQ/AWQ/GGUF for quanti...
Phase 19: Capstone Projects / 35 hours

lesson
Capstone 08 — Production RAG Chatbot for a Regulated Vertical
Harvey, Glean, Mendable, and LlamaCloud all run the same production shape in 2026. Ingest with docling or Unstructured and ColPali for visuals. Hybrid search. Re-rank with bge-reranker-v2-gemma. Synthesize with Claude Sonnet 4.7 using prom...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 09 — Code Migration Agent (Repo-Level Language / Runtime Upgrade)
Amazon's MigrationBench (Java 8 to 17) and Google's App Engine Py2-to-Py3 migrator set the 2026 bar. Moderne's OpenRewrite does deterministic AST rewrites at scale. Grit targets the same problem with codemod-style DSL. The production patte...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 10 — Multi-Agent Software Engineering Team
SWE-AF's factory architecture, MetaGPT's role-based prompting, AutoGen 0.4's typed actor graph, Cognition's Devin, and Factory's Droids all converged on the same 2026 shape: an architect plans, N coders work in parallel worktrees, a review...
Phase 19: Capstone Projects / 40 hours

lesson
Capstone 11 — LLM Observability & Eval Dashboard
Langfuse went open-core. Arize Phoenix published the 2026 GenAI semconv mappings. Helicone and Braintrust both doubled down on per-user cost attribution. Traceloop's OpenLLMetry became the de-facto SDK instrumentation. The production shape...
Phase 19: Capstone Projects / 25 hours

lesson
Capstone 12 — Video Understanding Pipeline (Scene, QA, Search)
Twelve Labs productized Marengo + Pegasus. VideoDB shipped the CRUD-for-video API. AI2's Molmo 2 published open VLM checkpoints. Gemini long-context handles hours of video natively. TimeLens-100K defined temporal grounding at scale.
The 20...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 13 — MCP Server with Registry and Governance
The Model Context Protocol stopped being the future and became the default tool-use spec in 2026. Anthropic, OpenAI, Google, and every major IDE ship MCP clients. Pinterest published its internal ecosystem of MCP servers. The AAIF Registry...
Phase 19: Capstone Projects / 25 hours

lesson
Capstone 14 — Speculative-Decoding Inference Server
EAGLE-3 in vLLM 0.7 ships 2.5-3x throughput on real traffic. P-EAGLE (AWS 2026) pushed parallel speculation even further. SGLang's SpecForge trained draft heads at scale. Red Hat's Speculators hub published aligned drafts for common open m...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 15 — Constitutional Safety Harness + Red-Team Range
Anthropic's Constitutional Classifiers, Meta's Llama Guard 4, Google's ShieldGemma-2, NVIDIA's Nemotron 3 Content Safety, and X-Guard for multilingual coverage defined the 2026 safety-classifier stack. garak, PyRIT, NVIDIA Aegis, and promp...
Phase 19: Capstone Projects / 25 hours

lesson
Capstone 16 — GitHub Issue-to-PR Autonomous Agent
AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, and Google Jules all ship the same 2026 product shape: label an issue, get a PR. Run an agent in a cloud sandbox, verify tests pass, and post a review-ready PR with ratio...
Phase 19: Capstone Projects / 30 hours

lesson
Capstone 17 — Personal AI Tutor (Adaptive, Multimodal, with Memory)
Khanmigo (Khan Academy), Duolingo Max, Google LearnLM / Gemini for Education, Quizlet Q-Chat, and Synthesis Tutor all shipped adaptive multimodal tutoring at scale in 2026. The common shape is a Socratic policy (never just dump the answer)...
Phase 19: Capstone Projects / 30 hours

lesson
Agent Harness Loop Contract
The harness is the agent. The model is a coprocessor.
This lesson freezes the loop contract you can wire any model into.
Phase 19: Capstone Projects / ~90 minutes

lesson
Tool Registry with Schema Validation
A tool the agent cannot validate is a tool the agent cannot call. Build the registry and the schema checker before you build the tools.
Phase 19: Capstone Projects / ~90 minutes

lesson
JSON-RPC 2.0 Over Newline-Delimited Stdio
The transport between a model client and a tool server is JSON-RPC over stdio. Hand-rolling it once teaches you what every framing layer is paying for.
Phase 19: Capstone Projects / ~90 minutes

lesson
Function Call Dispatcher
The dispatcher is where the harness pays for every promise the schema made. Timeouts, retries, dedupe, error mapping. All on one seam.
Phase 19: Capstone Projects / ~90 minutes

lesson
Plan-Execute Control Flow
A plan that cannot survive a failure is a script. A script that can replan is an agent. Build the replanner first.
Phase 19: Capstone Projects / ~90 minutes

lesson
Capstone Lesson 25: Verification Gates and the Observation Budget
An agent harness without a verification layer is a wish in a trenchcoat. This lesson builds the deterministic gate chain that decides whether a tool call is allowed to fire, how much of its output the agent is allowed to see, and when the...
Phase 19: Capstone Projects / ~90 minutes

lesson
Capstone Lesson 26: Sandbox Runner with Denylist and Path Jail
The verification gate decides whether a tool call should run. The sandbox decides what happens when it does. This lesson ships a subprocess runner that refuses dangerous executables, refuses dangerous argv shapes, jails every file path to...
Phase 19: Capstone Projects / ~90 minutes

lesson
Capstone Lesson 27: Eval Harness with Fixture Tasks
A coding agent is only as good as the suite of tasks you measure it against.
This lesson builds an evaluation harness that takes a folder of fixture tasks, runs each through a candidate agent, scores pass or fail through a deterministic ve...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 28: Observability with OTel GenAI Spans and Prometheus Metrics
An agent harness without observability is a black box that costs money. This lesson hand-rolls a span builder that emits records compliant with the OpenTelemetry GenAI semantic conventions, writes them to a JSON-Lines file one span per lin...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 29: End-to-End Coding Agent on the Harness
Track A's payoff. This lesson stitches the gate chain, the sandbox, the eval harness, and the OTel spans into one working coding agent that fixes a real (small, fixture-scale) bug in a multi-file Python project. The agent is a deterministi...
Phase 19: Capstone Projects / ~90 minutes

lesson: BPE Tokenizer From Scratch
Bytes in, ids out, ids back to the same bytes. Build the tokenizer that every modern text model still starts from.
Phase 19: Capstone Projects / ~90 minutes

lesson: Tokenized Dataset with Sliding Window
A pretraining run is a function from token ids to gradients. This lesson builds the conveyor that feeds the ids in.
Phase 19: Capstone Projects / ~90 minutes

lesson: Token and Positional Embeddings
Ids are integers. The model wants vectors. Two lookup tables sit between them, and the choice of the positional one shapes what the model can learn.
Phase 19: Capstone Projects / ~90 minutes

lesson: Multi-Head Self-Attention
One linear projection, three views, H parallel heads, one mask. The attention block as the model actually uses it.
Phase 19: Capstone Projects / ~90 minutes

lesson: Transformer Block from Scratch
One block is the unit of every modern decoder LLM. Layer norm, multi head attention, residual, MLP, residual. The pre-LN variant trains stably without warmup. The post-LN variant is what the original paper shipped.
This lesson builds both,...
Phase 19: Capstone Projects / ~90 minutes

lesson: GPT Model Assembly
Twelve blocks stacked, a token embedding, a learned position embedding, a final LayerNorm, and a tied language model head. That is the entire 124 million parameter GPT model. This lesson assembles those pieces into a working class, counts...
Phase 19: Capstone Projects / ~90 minutes

lesson: Training Loop and Evaluation
A loop that does not measure is a loop that lies. This lesson builds the training loop that drives the GPT model: AdamW with weight decay split, a warmup plus cosine learning rate schedule, a calc_loss_batch helper, an evaluate_model pass...
Phase 19: Capstone Projects / ~90 minutes

lesson: Loading Pretrained Weights
Training a 124 million parameter model from scratch is a budget decision; loading a published checkpoint is a Tuesday. This lesson loads pretrained GPT-2 style weights from a safetensors file into the exact architecture from lesson 35, wal...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 38: Classifier Fine-Tuning by Head Swap
Track B's first capstone. A pretrained language model is a stack of self-attention blocks ending in a token-prediction head. When you want spam vs ham, the head is wrong but the body is mostly right. This lesson rips the head off, glues a...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 39: Instruction Tuning by Supervised Fine-Tuning
A pretrained base model can extend a sequence but cannot follow an instruction. Supervised fine-tuning is the smallest change that fixes this: feed the model paired examples of an instruction and a desired response, and train the body to p...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 40: Direct Preference Optimization from Scratch
Reward models and PPO are the classical RLHF stack. DPO collapses that stack into a single supervised loss that fits a policy directly against preference pairs.
This lesson derives the DPO loss from the reward-difference identity, ships a...
Phase 19: Capstone Projects / ~90 minutes

lesson: Capstone Lesson 41: Full Evaluation Pipeline
Training is the part you can monitor with loss curves. Evaluation is the part you have to design. This lesson builds a unified eval pipeline that takes any trained language model, runs four heterogeneous evals on it, aggregates the results...
Phase 19: Capstone Projects / ~90 minutes

lesson: Large Corpus Downloader
Training a language model begins long before the first forward pass. The corpus has to land on disk, decompressed, deduplicated, and addressable, with the resume story already worked out before the network drops at 4 percent. This lesson b...
Phase 19: Capstone Projects / ~90 minutes

lesson: HDF5 Tokenized Corpus
The downloaded corpus has to land in a layout the trainer can stream from at line speed. JSONL on disk does not survive 16 dataloader workers. HDF5 with a resizable, chunked integer dataset does. This lesson builds streaming tokenization i...
Phase 19: Capstone Projects / ~90 minutes

lesson: Cosine LR with Linear Warmup
The learning-rate schedule is the second most important decision after the loss function. AdamW with a cosine decay and a linear warmup is the modern default for language-model training because it lets the model see a small effective step...
Phase 19: Capstone Projects / ~90 minutes

lesson: Gradient Clipping and Mixed Precision
The optimizer and schedule from the previous lesson assume gradients are sane. They usually are not. A single bad batch can spike the gradient norm by three orders of magnitude. Mixed-precision training amplifies this by introducing FP16 o...
Phase 19: Capstone Projects / ~90 minutes

lesson: Gradient Accumulation
Train at an effective batch you cannot afford, one micro-batch at a time.
Scale the loss, hold the optimizer step, and let the gradients pile up.
Phase 19: Capstone Projects / ~90 minutes

lesson: Checkpoint Save and Resume
Train interrupts kill runs; checkpoints let them continue. Save model, optimizer, scheduler, loss history, step counter, and RNG state, atomically, so a kill at any moment leaves a valid file on disk.
Phase 19: Capstone Projects / ~90 minutes

lesson: Distributed Data Parallel and FSDP from Scratch
Multi-rank training is two collectives and one rule. Broadcast the parameters at startup, average the gradients after backward, never let the ranks disagree about what step they are on.
Phase 19: Capstone Projects / ~90 minutes

lesson: Language Model Evaluation Harness
A model that does well on a task you cannot define is a model that does well by accident. The harness is the task definition, the metric, the runner, and the leaderboard, in one short, swappable shape.
Phase 19: Capstone Projects / ~90 minutes

lesson: Hypothesis Generator
A research agent that asks the same question twice is wasting tokens. The trick is forcing each draft to land somewhere new.
Phase 19: Capstone Projects / ~90 minutes

lesson: Literature Retrieval
A hypothesis is cheap. Knowing whether someone already proved it is the expensive part. Build the retrieval layer that answers that question before the runner spins up a sandbox.
Phase 19: Capstone Projects / ~90 minutes

lesson: Experiment Runner
The loop is only as honest as its measurements. Build the runner that takes a spec, executes it in a sandboxed subprocess, and emits a json metrics blob the evaluator can trust.
Phase 19: Capstone Projects / ~90 minutes

lesson: Result Evaluator
The runner produced numbers. The evaluator decides whether those numbers are an improvement, a regression, or noise. Build the verdict path that turns metrics into a one line conclusion.
Phase 19: Capstone Projects / ~90 minutes

lesson: Paper Writer
A LaTeX skeleton is a contract between the researcher and the typesetter.
If the contract is broken the document does not compile, and the failure is loud. Build the skeleton first, then fill it.
Phase 19: Capstone Projects / ~90 minutes

lesson: Critic Loop
A critic that returns "looks good" the first time is broken. A critic that always returns "needs work" is broken. The interesting critic is the one that converges, and you have to engineer convergence.
Phase 19: Capstone Projects / ~90 minutes

lesson: Iteration Scheduler
A research loop without a scheduler is a queue with delusions. The scheduler is where the loop decides what to stop exploring, and that decision is the whole game.
Phase 19: Capstone Projects / ~90 minutes

lesson: End-to-End Research Demo
A demo is the place where every contract you wrote earlier has to compose. If any one of them leaks, the demo is the lesson that catches it.
Phase 19: Capstone Projects / ~90 minutes

lesson: Vision Encoder Patches
A vision model that reads pixels needs a tokenizer for pixels. Patch embedding is that tokenizer. Cut the image into a grid of squares, flatten each square, project it through one linear layer, then add a 2D position signal so the transfor...
Phase 19: Capstone Projects / ~90 minutes

lesson: Vision Transformer Encoder
Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooling whole-image features in its final hidden state. This lesson...
Phase 19: Capstone Projects / ~90 minutes

lesson: Projection Layer for Modality Alignment
A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces. A small two-layer MLP projects image tokens into the text embedding space, and a cosine alignment loss against a paired c...
Phase 19: Capstone Projects / ~90 minutes

lesson: Cross-Attention Fusion
The projection layer aligns one image vector with one caption vector.
A real vision-language decoder needs every text token to attend to every patch token, so the model can ground each word in a region. Cross-attention is how that groundin...
Phase 19: Capstone Projects / ~90 minutes

lesson: Vision-Language Pretraining
The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs together in the joint embedding space, and a language modeling loss t...
Phase 19: Capstone Projects / ~90 minutes

lesson: Multimodal Evaluation
Training is half the loop. The other half is measurement. This lesson builds three evaluation surfaces from primitives: image-caption retrieval reported as R@1, R@5, R@10; visual question answering reported as exact match accuracy; and ima...
Phase 19: Capstone Projects / ~90 minutes

lesson: Chunking Strategies, Compared
Chunking decides what your retriever can ever surface. Get the boundaries wrong and no embedding model, no reranker, no LLM can repair the damage downstream.
Phase 19: Capstone Projects / ~90 minutes

lesson: Hybrid Retrieval with BM25 and Dense Embeddings
Lexical and semantic retrieval fail on opposite query distributions. Hybrid retrieval with reciprocal rank fusion does not interpolate, it votes - and the vote wins on every query class.
Phase 19: Capstone Projects / ~90 minutes

lesson: Cross-Encoder Reranker
A bi-encoder embeds query and document independently. A cross-encoder concatenates them and reads both at once. The cross-encoder is the smartest reader and the slowest. Used as a second stage on the bi-encoder's top-k, it pays for itself.
Phase 19: Capstone Projects / ~90 minutes

lesson: Query Rewriting: HyDE, Multi-Query, and Decomposition
The query the user types is not the query your retriever wants.
Rewriting bridges the gap before retrieval, so the index sees something closer to what the answer looks like.
Phase 19: Capstone Projects / ~90 minutes

lesson: RAG Evaluation: Precision, Recall, MRR, nDCG, Faithfulness, Answer Relevance
If you cannot grade your retrieval and your answer at the same time, you cannot ship the system. The two are not the same metric and the same prompt fails on different axes.
Phase 19: Capstone Projects / ~90 minutes

lesson: End-to-End RAG System
Six lessons of components. One pipeline. One eval loop. One self-terminating demo. This is the system you ship.
Phase 19: Capstone Projects / ~90 minutes

lesson: Task Spec Format
An eval harness is only as good as the contract its tasks honour. Freeze the JSONL shape and the metric vocabulary before you write a single scoring function.
Phase 19: Capstone Projects / ~90 min

lesson: Classical Metrics
BLEU, ROUGE-L, F1, exact-match, accuracy. Five metrics that still account for most published LLM eval numbers. Implement each from first principles so you know what the number means.
Phase 19: Capstone Projects / ~90 min

lesson: Code Exec Metric
Generated code is right when it passes the tests. The eval harness has to extract code, run it without crashing the host, and tally pass-rates honestly. This lesson builds that surface.
Phase 19: Capstone Projects / ~90 min

lesson: Perplexity and Calibration
If your model says 90 percent confident on a thousand answers and gets six hundred right, it is not well calibrated. Calibration is half of trustworthy eval. The other half is perplexity, which tells you whether the model thinks the held-o...
Phase 19: Capstone Projects / ~90 min

lesson: Leaderboard Aggregation
Per-task scores are easy. Per-model rankings across heterogeneous tasks are harder. Statistical significance on a thousand-prediction leaderboard is the part everyone skips. This lesson does not skip it.
Phase 19: Capstone Projects / ~90 min

lesson: End-to-End Eval Runner
Five lessons of plumbing, one lesson to glue them.
The runner reads the task spec from lesson 70, calls a model through an adapter, scores with lessons 71 and 72, attaches the calibration report from lesson 73, and emits the leaderboard fr...
Phase 19: Capstone Projects / ~90 min

lesson: Collective Ops From Scratch
The four collective operations that hold distributed training together are allreduce, broadcast, allgather, and reduce_scatter. Every other primitive a training framework offers is a wrapper around these. Build them once over a multiproces...
Phase 19: Capstone Projects / ~90 min

lesson: Data Parallel DDP From Scratch
DistributedDataParallel is a hook on top of allreduce. Wrap a model, broadcast the initial parameters from rank 0 so every rank starts identical, install a backward hook on every parameter that issues an allreduce of the gradient, and the...
Phase 19: Capstone Projects / ~90 min

lesson: ZeRO Optimizer State Sharding
Adam stores two moment estimates per parameter, both in float32. A 7B-parameter model carries 56 GB of optimiser state. ZeRO stage 1 shards that across N ranks; each rank owns 1/N of the optimiser. After the local step the updated paramete...
Phase 19: Capstone Projects / ~90 min

lesson: Pipeline Parallel and Bubble Analysis
Tensor parallelism splits the matrix multiply across ranks. Pipeline parallelism splits the model across ranks, one stage per rank. Microbatches flow through the pipeline. The empty time at the start and end is the bubble; minimising it is...
Phase 19: Capstone Projects / ~90 min

lesson: Sharded Checkpoint and Atomic Resume
A 70B-parameter training job is paused by a node failure every few hours. The checkpoint format decides whether you lose 30 minutes or 30 hours. A sharded checkpoint writes every rank's shard in parallel and records ownership in a manifest...
Phase 19: Capstone Projects / ~90 min

lesson: End-to-End Distributed Training
Lessons 76 through 80 each built one piece.
This is the assembly: a tiny GPT trained across 4 simulated ranks with DDP for gradient sync, ZeRO-1 for optimiser-state sharding, and a sharded checkpoint at the halfway mark. The demo runs 20 s...
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 82 — Jailbreak Taxonomy
A safety harness without a taxonomy is a coin flip. Name the attack before you defend it.
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 83 — Prompt Injection Detector
A detector is a function from prompt to confidence and category. Anything else is a vibe.
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 84 — Refusal Evaluation
Helpfulness on benign prompts and refusal on harmful prompts are two metrics, not one. Measure both.
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 85 — Content Classifier Integration
Classifiers on the output side answer a different question than rules on the input side. Both need a policy router.
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 86 — Constitutional Rules Engine
A rule is a name, a predicate, and an explanation. Anything missing one of those three is a vibe, not a rule.
Phase 19: Capstone Projects / ~90 min

lesson: Capstone 87 — End-to-End Safety Gate
Pre-gen, during-gen, post-gen.
Three checkpoints, one verdict, an audit trail per request.
Phase 19: Capstone Projects / ~90 min

topic: Build
Explore 303 AI From Scratch lessons related to Build.
303 lessons / starts near Setup & Tooling

topic: Python
Explore 230 AI From Scratch lessons related to Python.
230 lessons / starts near Setup & Tooling

topic: Learn
Explore 137 AI From Scratch lessons related to Learn.
137 lessons / starts near Setup & Tooling

topic: Capstone Projects
Explore 85 AI From Scratch lessons related to Capstone Projects.
85 lessons / starts near Capstone Projects

topic: Python (stdlib)
Explore 69 AI From Scratch lessons related to Python (stdlib).
69 lessons / starts near LLMs from Scratch

topic: Python (stdlib
Explore 46 AI From Scratch lessons related to Python (stdlib.
46 lessons / starts near LLMs from Scratch

topic: Agent Engineering
Explore 42 AI From Scratch lessons related to Agent Engineering.
42 lessons / starts near Agent Engineering

topic: Capstone
Explore 38 AI From Scratch lessons related to Capstone.
38 lessons / starts near Computer Vision

topic: Agent
Explore 35 AI From Scratch lessons related to Agent.
35 lessons / starts near Reinforcement Learning

topic: Learn + Build
Explore 35 AI From Scratch lessons related to Learn + Build.
35 lessons / starts near Computer Vision

topic: Ethics, Safety & Alignment
Explore 30 AI From Scratch lessons related to Ethics, Safety & Alignment.
30 lessons / starts near Ethics, Safety & Alignment

topic: NLP — Foundations to Advanced
Explore 29 AI From Scratch lessons related to NLP — Foundations to Advanced.
29 lessons / starts near NLP — Foundations to Advanced

topic: Computer Vision
Explore 28 AI From Scratch lessons related to Computer Vision.
28 lessons / starts near Computer Vision

topic: Infrastructure & Production
Explore 28 AI From Scratch lessons related to Infrastructure & Production.
28 lessons / starts near Infrastructure & Production

topic: Multi-Agent & Swarms
Explore 25 AI From Scratch lessons related to Multi-Agent & Swarms.
25 lessons / starts near Multi-Agent & Swarms

topic: Multimodal
AI
Explore 25 AI From Scratch lessons related to Multimodal AI.
25 lessons / starts near Multimodal AI

topic: LLMs from Scratch
Explore 24 AI From Scratch lessons related to LLMs from Scratch.
24 lessons / starts near LLMs from Scratch

topic: Tools & Protocols
Explore 23 AI From Scratch lessons related to Tools & Protocols.
23 lessons / starts near Tools & Protocols

topic: Autonomous Systems
Explore 22 AI From Scratch lessons related to Autonomous Systems.
22 lessons / starts near Autonomous Systems

topic: Math Foundations
Explore 22 AI From Scratch lessons related to Math Foundations.
22 lessons / starts near Math Foundations

topic: Models
Explore 21 AI From Scratch lessons related to Models.
21 lessons / starts near Computer Vision

topic: LLM
Explore 18 AI From Scratch lessons related to LLM.
18 lessons / starts near Computer Vision

topic: ML Fundamentals
Explore 18 AI From Scratch lessons related to ML Fundamentals.
18 lessons / starts near ML Fundamentals

topic: LLM Engineering
Explore 17 AI From Scratch lessons related to LLM Engineering.
17 lessons / starts near LLM Engineering

topic: Multi
Explore 17 AI From Scratch lessons related to Multi.
17 lessons / starts near Deep Learning Core

topic: Speech & Audio
Explore 17 AI From Scratch lessons related to Speech & Audio.
17 lessons / starts near Speech & Audio

topic: Agents
Explore 16 AI From Scratch lessons related to Agents.
16 lessons / starts near NLP — Foundations to Advanced

topic: Evaluation
Explore 16 AI From Scratch lessons related to Evaluation.
16 lessons / starts near ML Fundamentals

topic: Scratch
Explore 16 AI From Scratch lessons related to Scratch.
16 lessons / starts near Deep Learning Core

topic: Transformers Deep Dive
Explore 16 AI From Scratch lessons related to Transformers Deep Dive.
16 lessons / starts near Transformers Deep Dive

topic: Vision
Explore 16 AI From Scratch lessons related to Vision.
16 lessons / starts near Computer Vision

topic: Generative AI
Explore 15 AI From Scratch lessons related to Generative AI.
15 lessons / starts near Generative AI

topic: MCP
Explore
15 AI From Scratch lessons related to MCP.
15 lessons / starts near LLM Engineering

topic: Deep Learning Core
Explore 13 AI From Scratch lessons related to Deep Learning Core.
13 lessons / starts near Deep Learning Core

topic: Attention
Explore 12 AI From Scratch lessons related to Attention.
12 lessons / starts near NLP — Foundations to Advanced

topic: Audio
Explore 12 AI From Scratch lessons related to Audio.
12 lessons / starts near Speech & Audio

topic: Language
Explore 12 AI From Scratch lessons related to Language.
12 lessons / starts near Computer Vision

topic: Learning
Explore 12 AI From Scratch lessons related to Learning.
12 lessons / starts near Math Foundations

topic: Model
Explore 12 AI From Scratch lessons related to Model.
12 lessons / starts near ML Fundamentals

topic: Reinforcement Learning
Explore 12 AI From Scratch lessons related to Reinforcement Learning.
12 lessons / starts near Reinforcement Learning

topic: Setup & Tooling
Explore 12 AI From Scratch lessons related to Setup & Tooling.
12 lessons / starts near Setup & Tooling

topic: Optimization
Explore 11 AI From Scratch lessons related to Optimization.
11 lessons / starts near Math Foundations

topic: Production
Explore 11 AI From Scratch lessons related to Production.
11 lessons / starts near LLM Engineering

topic: RAG
Explore 11 AI From Scratch lessons related to RAG.
11 lessons / starts near NLP — Foundations to Advanced

topic: Self
Explore 11 AI From Scratch lessons related to Self.
11 lessons / starts near Computer Vision

topic: Token
Explore 11 AI From Scratch lessons related to Token.
11 lessons / starts near Setup & Tooling

topic: Pipeline
Explore 10 AI From Scratch lessons related to Pipeline.
10 lessons / starts near ML Fundamentals

topic: Tool
Explore 10 AI From Scratch lessons related to Tool.
10 lessons / starts near LLM Engineering

topic: Tuning
Explore 10 AI From Scratch lessons related to Tuning.
10 lessons / starts near ML Fundamentals

topic: Video
Explore 10 AI From Scratch lessons related to Video.
10 lessons / starts near Computer
Vision

topic: Generation
Explore 9 AI From Scratch lessons related to Generation.
9 lessons / starts near Computer Vision

topic: Image
Explore 9 AI From Scratch lessons related to Image.
9 lessons / starts near Computer Vision

topic: Lesson
Explore 9 AI From Scratch lessons related to Lesson.
9 lessons / starts near Capstone Projects

topic: Multimodal
Explore 9 AI From Scratch lessons related to Multimodal.
9 lessons / starts near Multimodal AI

topic: Context
Explore 8 AI From Scratch lessons related to Context.
8 lessons / starts near NLP — Foundations to Advanced

topic: Decoding
Explore 8 AI From Scratch lessons related to Decoding.
8 lessons / starts near NLP — Foundations to Advanced

topic: Diffusion
Explore 8 AI From Scratch lessons related to Diffusion.
8 lessons / starts near Computer Vision

topic: End
Explore 8 AI From Scratch lessons related to End.
8 lessons / starts near Tools & Protocols

topic: Eval
Explore 8 AI From Scratch lessons related to Eval.
8 lessons / starts near NLP — Foundations to Advanced

topic: Inference
Explore 8 AI From Scratch lessons related to Inference.
8 lessons / starts near NLP — Foundations to Advanced

topic: Memory
Explore 8 AI From Scratch lessons related to Memory.
8 lessons / starts near Computer Vision

topic: Prompt
Explore 8 AI From Scratch lessons related to Prompt.
8 lessons / starts near LLM Engineering

topic: Temperature
Explore 8 AI From Scratch lessons related to Temperature.
8 lessons / starts near Math Foundations

topic: Training
Explore 8 AI From Scratch lessons related to Training.
8 lessons / starts near ML Fundamentals

topic: Alignment
Explore 7 AI From Scratch lessons related to Alignment.
7 lessons / starts near Autonomous Systems

topic: Data
Explore 7 AI From Scratch lessons related to Data.
7 lessons / starts near Setup & Tooling

topic: Fine
Explore 7 AI From Scratch lessons related to Fine.
7 lessons / starts near Computer Vision

topic: Harness
Explore 7 AI From Scratch lessons related to Harness.
7 lessons / starts near LLMs from Scratch

topic: OpenAI
Explore 7 AI From Scratch lessons related to
OpenAI.
7 lessons / starts near Tools & Protocols

topic: Retrieval
Explore 7 AI From Scratch lessons related to Retrieval.
7 lessons / starts near Computer Vision

topic: Task
Explore 7 AI From Scratch lessons related to Task.
7 lessons / starts near Tools & Protocols

topic: Text
Explore 7 AI From Scratch lessons related to Text.
7 lessons / starts near NLP — Foundations to Advanced

topic: Transformers
Explore 7 AI From Scratch lessons related to Transformers.
7 lessons / starts near Computer Vision

topic: Anthropic
Explore 6 AI From Scratch lessons related to Anthropic.
6 lessons / starts near Tools & Protocols

topic: Architecture
Explore 6 AI From Scratch lessons related to Architecture.
6 lessons / starts near Computer Vision

topic: Checkpoint
Explore 6 AI From Scratch lessons related to Checkpoint.
6 lessons / starts near LLM Engineering

topic: Cross
Explore 6 AI From Scratch lessons related to Cross.
6 lessons / starts near Multimodal AI

topic: Cross-attention
Explore 6 AI From Scratch lessons related to Cross-attention.
6 lessons / starts near Computer Vision

topic: Encoder
Explore 6 AI From Scratch lessons related to Encoder.
6 lessons / starts near NLP — Foundations to Advanced

topic: Engineering
Explore 6 AI From Scratch lessons related to Engineering.
6 lessons / starts near ML Fundamentals

topic: Gradient
Explore 6 AI From Scratch lessons related to Gradient.
6 lessons / starts near Math Foundations

topic: Learning rate
Explore 6 AI From Scratch lessons related to Learning rate.
6 lessons / starts near Math Foundations

topic: LLMs
Explore 6 AI From Scratch lessons related to LLMs.
6 lessons / starts near Multi-Agent & Swarms

topic: Machine
Explore 6 AI From Scratch lessons related to Machine.
6 lessons / starts near Math Foundations

topic: Modeling
Explore 6 AI From Scratch lessons related to Modeling.
6 lessons / starts near Computer Vision

topic: Open
Explore 6 AI From Scratch lessons related to Open.
6 lessons / starts near Computer Vision

topic: Policy
Explore 6 AI From Scratch lessons related to Policy.
6 lessons / starts near
Reinforcement Learning

topic: Python (with numpy)
Explore 6 AI From Scratch lessons related to Python (with numpy).
6 lessons / starts near LLMs from Scratch

topic: Safety
Explore 6 AI From Scratch lessons related to Safety.
6 lessons / starts near LLM Engineering

topic: Search
Explore 6 AI From Scratch lessons related to Search.
6 lessons / starts near NLP — Foundations to Advanced

topic: Server
Explore 6 AI From Scratch lessons related to Server.
6 lessons / starts near LLM Engineering

topic: State
Explore 6 AI From Scratch lessons related to State.
6 lessons / starts near NLP — Foundations to Advanced

topic: Streaming
Explore 6 AI From Scratch lessons related to Streaming.
6 lessons / starts near Setup & Tooling

topic: Time
Explore 6 AI From Scratch lessons related to Time.
6 lessons / starts near ML Fundamentals

topic: Transformer
Explore 6 AI From Scratch lessons related to Transformer.
6 lessons / starts near Transformers Deep Dive

topic: TypeScript
Explore 6 AI From Scratch lessons related to TypeScript.
6 lessons / starts near Setup & Tooling

topic: Use
Explore 6 AI From Scratch lessons related to Use.
6 lessons / starts near LLM Engineering

topic: Voice
Explore 6 AI From Scratch lessons related to Voice.
6 lessons / starts near Speech & Audio

topic: Autonomous
Explore 5 AI From Scratch lessons related to Autonomous.
5 lessons / starts near Autonomous Systems

topic: Autoregressive
Explore 5 AI From Scratch lessons related to Autoregressive.
5 lessons / starts near Transformers Deep Dive

topic: Bi-encoder
Explore 5 AI From Scratch lessons related to Bi-encoder.
5 lessons / starts near NLP — Foundations to Advanced

topic: Building
Explore 5 AI From Scratch lessons related to Building.
5 lessons / starts near LLMs from Scratch

topic: Chunking
Explore 5 AI From Scratch lessons related to Chunking.
5 lessons / starts near NLP — Foundations to Advanced

topic: Constitutional
Explore 5 AI From Scratch lessons related to Constitutional.
5 lessons / starts near LLMs from Scratch

topic: Cosine similarity
Explore 5 AI From Scratch lessons related to Cosine
similarity.
5 lessons / starts near Math Foundations

topic: Cross-encoder
Explore 5 AI From Scratch lessons related to Cross-encoder.
5 lessons / starts near NLP — Foundations to Advanced

topic: Debate
Explore 5 AI From Scratch lessons related to Debate.
5 lessons / starts near Agent Engineering

topic: Detection
Explore 5 AI From Scratch lessons related to Detection.
5 lessons / starts near ML Fundamentals

topic: EAGLE
Explore 5 AI From Scratch lessons related to EAGLE.
5 lessons / starts near Transformers Deep Dive

topic: Embeddings
Explore 5 AI From Scratch lessons related to Embeddings.
5 lessons / starts near NLP — Foundations to Advanced

topic: Flow
Explore 5 AI From Scratch lessons related to Flow.
5 lessons / starts near Computer Vision

topic: Function calling
Explore 5 AI From Scratch lessons related to Function calling.
5 lessons / starts near LLM Engineering

topic: Handoff
Explore 5 AI From Scratch lessons related to Handoff.
5 lessons / starts near Agent Engineering

topic: Long
Explore 5 AI From Scratch lessons related to Long.
5 lessons / starts near NLP — Foundations to Advanced

topic: Loop
Explore 5 AI From Scratch lessons related to Loop.
5 lessons / starts near Agent Engineering

topic: Overfitting
Explore 5 AI From Scratch lessons related to Overfitting.
5 lessons / starts near ML Fundamentals

topic: Parallel
Explore 5 AI From Scratch lessons related to Parallel.
5 lessons / starts near Tools & Protocols

topic: Perplexity
Explore 5 AI From Scratch lessons related to Perplexity.
5 lessons / starts near Math Foundations

topic: Prompt caching
Explore 5 AI From Scratch lessons related to Prompt caching.
5 lessons / starts near LLM Engineering

topic: Real
Explore 5 AI From Scratch lessons related to Real.
5 lessons / starts near Computer Vision

track: Prompt Engineering
Move from prompting by luck to prompting by design. Structured techniques for reliable, repeatable results across any model.
Intermediate / 4 lessons

track: Building With LLMs
Go from notebook to product.
Learn the engineering patterns for shipping reliable AI features: RAG, function calling, streaming, and monitoring.
Intermediate / 4 lessons

track: AI Product Design
Design AI experiences people trust. Patterns for handling uncertainty, latency, errors, and the new interaction models AI unlocks.
Advanced / 4 lessons

module: Core techniques
Frame, constrain, and demonstrate.
Prompt Engineering

module: Advanced patterns
Chain prompts and defend them.
Prompt Engineering

module: Retrieval
RAG, chunks, and citations.
Building With LLMs

module: Tools and agents
Function calls and tool design.
Building With LLMs

module: Production
Streaming, cost, and observability.
Building With LLMs

module: Designing for trust
Uncertainty, review, and feedback.
AI Product Design

module: Launching AI features
Release quality and monitoring.
AI Product Design

lesson: Role, Task, and Context Framing
Frame prompts so the model knows the job, the audience, the input, and the expected shape of the answer.
Prompt Engineering / Core techniques

lesson: Examples and Constraints
Use few-shot examples and constraints to make outputs repeatable without overfitting the prompt to one case.
Prompt Engineering / Core techniques

lesson: Prompt Chaining
Break large tasks into smaller calls so each step can be inspected, reused, and tested.
Prompt Engineering / Advanced patterns

lesson: Prompt Injection Defense
Learn practical guardrails for untrusted input, tool calls, and instructions hidden inside retrieved content.
Prompt Engineering / Advanced patterns

lesson: RAG Blueprint
Design a retrieval-augmented generation system with ingestion, chunking, retrieval, synthesis, and evaluation.
Building With LLMs / Retrieval

lesson: Function Calling
Teach models to request structured tool calls instead of pretending to complete actions in natural language.
Building With LLMs / Tools and agents

lesson: Streaming UX
Design interfaces that feel fast while making partial, uncertain, or long-running AI output understandable.
Building With LLMs / Production

lesson: Cost and Observability
Track the
signals that make AI features operational: cost, latency, quality, errors, and user feedback.Building With LLMs / ProductionlessonDesigning for UncertaintyCreate interfaces that acknowledge AI uncertainty without making users do all the verification work.AI Product Design / Designing for trustlessonHuman in the LoopDecide when AI should act autonomously, when it should suggest, and when a person must approve.AI Product Design / Designing for trustlessonFeedback LoopsTurn user feedback into a system that improves prompts, retrieval, evaluations, and product decisions.AI Product Design / Designing for trustlessonAI Feature Launch ChecklistPrepare an AI product slice for release with quality gates, fallback behavior, monitoring, and communication.AI Product Design / Launching AI featuresprojectBuild a reusable prompt libraryCreate versioned prompt templates for repeated workflows.Prompt Engineering / IntermediateprojectShip a prompt-powered toolTurn one prompt chain into a small product workflow.Prompt Engineering / IntermediateprojectBuild a RAG chatbotDesign a source-grounded chatbot with citations and quality checks.Building With LLMs / IntermediateprojectShip an agent that uses toolsDesign a safe tool-using assistant with validation and fallback behavior.Building With LLMs / AdvancedprojectRun a trust auditAudit an AI workflow for uncertainty, evidence, user control, and recovery paths.AI Product Design / AdvancedprojectWrite an AI feature launch planPrepare a model-powered product slice for launch with gates, fallbacks, and monitoring.AI Product Design / AdvancedtoolQdrantVector database for semantic search, recommendations, and RAG retrieval.Research and Data / Open SourcetoolLangChainFramework for composing LLM applications, chains, retrieval, and tool calls.Agents and Automation / Open SourcetoolOllamaRun local language models for private experiments and offline AI workflows.Coding and Dev / FreetoolPromptlyVersion, test, and share prompts across repeatable 
team workflows.Productivity / FreemiumtoolAgentKitComposable patterns for building tool-using agents with validation and monitoring.Agents and Automation / Open SourcetoolDataLensAsk structured questions of datasets and produce plain-language analysis.Research and Data / PaidgameClaude Certified Architect ChallengeClear 10 AI architecture floors covering Claude agents, MCP tools, Claude Code workflows, prompt reliability, and context management.Certification Challenge / 20 mingamePrompt GolfReach the target output in the fewest tokens possible.Challenge / 5 mingameHallucination HuntSpot the fabricated fact in AI-generated answers.Quiz / 3 mingameEmbedding MatchGroup concepts by semantic similarity before the clock runs out.Puzzle / 7 mingameAgent ArenaDesign an agent and watch it tackle live tasks against constraints.Sandbox / 20 minexamAI Engineer PrepA progressive interview path around LLMs, RAG, agents, evaluations, and production readiness.AI / LLM Engineer / 180 questionsexamML Engineer PrepFundamentals through ML system design, with practical checks for model quality and deployment tradeoffs.Machine Learning Engineer / 240 questionsexamData Scientist PrepStatistics, experimentation, modeling, and narrative thinking for data science loops.Data Scientist / 210 questionsexamAI Product Manager PrepStrategy, metrics, user trust, prioritization, and responsible AI launch decisions.AI Product Manager / 120 questionsnewsletterThe year agents went mainstreamTool-using agents moved from demos to production. 
We break down the patterns that actually shipped.Issue 47 / AgentsnewsletterSmall models, big impactWhy the most interesting work this month happened on models you can run on a laptop.Issue 46 / ModelsnewsletterEvaluations are the new testsA practical guide to building eval suites your team will actually maintain.Issue 45 / EngineeringnewsletterDesigning for the pauseWhat great products do during the second an AI is thinking.Issue 44 / DesigncommunityCommunityContribute lessons, product polish, tools, exams, and open-source improvements.Open source