AI From Scratch

Phase 17 | 28 lessons | ~32 hours

Infrastructure & Production

Ship AI to the real world. Scale, monitor, optimize.

Lessons

01 Managed LLM Platforms: Bedrock, Vertex AI, Azure OpenAI
Three hyperscalers, three distinct strategies. AWS Bedrock is a model marketplace: Claude, Llama, Titan, Stability, Cohere behind one API. Azure OpenAI is an exclusive OpenAI partnership plus Provisioned Throughput Units (PTUs) for dedica...
Learn | ~60 minutes

02 Inference Platform Economics: Fireworks, Together, Baseten, Modal, Replicate, Anyscale
The 2026 inference market is no longer GPU time rental. It splits into custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and API-first marketplaces (Replicate, DeepInfra). Fireworks raised...
Learn | ~60 minutes

03 GPU Autoscaling on Kubernetes: Karpenter, KAI Scheduler, Gang Scheduling
Three layers, not one. Karpenter provisions nodes dynamically (under one minute, 40% faster than Cluster Autoscaler). KAI Scheduler handles gang scheduling, topology awareness, and hierarchical queues; it prevents the 7-of-8 partial alloc...
Learn | ~75 minutes

04 vLLM Serving Internals: PagedAttention, Continuous Batching, Chunked Prefill
vLLM's dominance in 2026 rests on three compounding defaults, not a single trick. PagedAttention is always on. Continuous batching injects new requests into the active batch between decode iterations. Chunked prefill slices long prompts so...
Learn | ~75 minutes | Python (stdlib, toy continuous batching scheduler)

05 EAGLE-3 Speculative Decoding in Production
Speculative decoding pairs a fast draft model with the target model. The draft proposes K tokens; the target verifies them in a single forward pass; accepted tokens are free. In 2026, EAGLE-3 is the production-grade variant: it trains a draft head...
Learn | ~60 minutes

06 SGLang and RadixAttention for Prefix-Heavy Workloads
SGLang treats the KV cache as a first-class, reusable resource stored in a radix tree. Where vLLM schedules requests FCFS (first-come, first-served), SGLang's cache-aware scheduler prioritizes requests with longer shared prefixes, effecti...
Learn | ~75 minutes

07 TensorRT-LLM on Blackwell with FP8 and NVFP4
TensorRT-LLM is NVIDIA-only but it wins on Blackwell. On GB200 NVL72 with Dynamo orchestration, SemiAnalysis InferenceX measured $0.012 per million tokens on a 120B model in Q1-Q2 2026, against $0.09/M on H100 + vLLM, a 7x economic gap. T...
Learn | ~75 minutes | Python (stdlib, toy FP8, NVFP4 memory and cost calculator)

08 Inference Metrics: TTFT, TPOT, ITL, Goodput, P99
Four metrics decide whether an inference deployment is working. TTFT is prefill plus queue plus network. TPOT (equivalently ITL) is the memory-bound decode cost per token. End-to-end latency is TTFT plus TPOT times output length. Throughpu...
Learn | ~60 minutes | Python (stdlib, toy percentile calculator and goodput reporter)

09 Production Quantization: AWQ, GPTQ, GGUF K-quants, FP8, MXFP4/NVFP4
Quantization format is not a universal choice; it is a function of hardware, serving engine, and workload. GGUF Q4_K_M or Q5_K_M owns CPU and edge, delivered through llama.cpp and Ollama. GPTQ wins inside vLLM when you need multi-LoRA on...
Learn | ~75 minutes | Python (stdlib, toy memory and throughput comparison across formats)

10 Cold Start Mitigation for Serverless LLMs
A 20 GB model image takes 5-10 minutes (7B) to 20+ minutes (70B) to go from cold to serving. In a true serverless world, that is not a warm-up; it is an outage. Mitigations operate at five layers: pre-seeded node images (Bottlerocket on A...
Learn | ~60 minutes

11 Multi-Region LLM Serving and KV Cache Locality
Round-robin load balancing is actively harmful for cached LLM inference. A request that does not land on the node holding its prefix pays full prefill cost: roughly 800 ms at P50 on a long prompt versus ~80 ms with a cache hit. In 2026 th...
Learn | ~60 minutes

12 Edge Inference: Apple Neural Engine, Qualcomm Hexagon, WebGPU/WebLLM, Jetson
The core edge constraint is memory bandwidth, not compute. Mobile DRAM sits at 50-90 GB/s; datacenter HBM3 clears 2-3 TB/s, a 30-50x gap. Decode is memory-bound, so the gap is decisive. In 2026 the landscape splits four ways. Apple M4/A18...
Learn | ~60 minutes

13 LLM Observability Stack Selection
The 2026 observability market splits into two categories. Development platforms (LangSmith, Langfuse, Comet Opik) bundle monitoring with evals, prompt management, session replays. Gateway/instrumentation tools (Helicone, SigNoz, OpenLLMetr...
Learn | ~60 minutes

14 Prompt Caching and Semantic Caching Economics
Pricing snapshot dated 2026-04. Numeric claims below reflect vendor rate cards captured at this lesson's publication; verify against the linked docs before quoting them downstream.
Learn | ~60 minutes

15 Batch APIs: the 50% Discount as Industry Standard
Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, Together batch) implement the same pattern. Stack batch with p...
Learn | ~45 minutes

16 Model Routing as a Cost-Reduction Primitive
A dynamic broker evaluates every request (task type, token length, embedding similarity, confidence) and sends simple queries to a cheap model, escalating complex ones to a frontier model. Also called model cascading. Production case studi...
Learn | ~60 minutes | Python (stdlib, toy cascading router simulator)

17 Disaggregated Prefill/Decode: NVIDIA Dynamo and llm-d
Prefill is compute-bound; decode is memory-bound. Running both on the same GPU wastes one resource. Disaggregation splits them onto separate pools and transfers KV cache between them over NIXL (RDMA/InfiniBand or TCP fallback). NVIDIA Dyna...
Learn | ~75 minutes

18 vLLM Production Stack with LMCache KV Offloading
vLLM's production-stack is the reference Kubernetes deployment: router, engines, and observability wired together. LMCache is the KV-offloading layer that extracts KV cache out of GPU memory and reuses it across queries and engines (CPU D...
Learn | ~60 minutes

19 AI Gateways: LiteLLM, Portkey, Kong AI Gateway, Bifrost
A gateway sits between your apps and model providers. Core features are provider routing, fallback, retries, rate limiting, secret references, observability, guardrails. Market split in 2026: LiteLLM is MIT OSS with 100+ providers, OpenAI-...
Learn | ~60 minutes

20 Shadow Traffic, Canary Rollout, and Progressive Deployment for LLMs
LLM rollouts combine the hardest parts of software deployment: no unit tests, diffuse failure modes, delayed signals. The sequence is (1) shadow mode: duplicate prod requests to the candidate model, log, compare with zero user impact; catches...
Learn | ~60 minutes

21 A/B Testing LLM Features: GrowthBook, Statsig, and the Vibes Problem
Traditional A/B testing was not built for non-deterministic LLMs. The critical distinction: evals answer "can the model do the job?" A/B tests answer "do users care?" Both are required; shipping on vibe checks is over. What to test in 2026...
Learn | ~60 minutes | Python (stdlib, toy sequential test simulator)

22 Load Testing LLM APIs: Why k6 and Locust Lie
Traditional load testers were not designed for streaming responses, variable output lengths, token-level metrics, or GPU saturation. Two traps bite most teams. The GIL trap: Locust's token-level measurement runs tokenization under the Pyth...
Build | ~75 minutes

23 SRE for AI: Multi-Agent Incident Response, Runbooks, Predictive Detection
AI SRE uses LLMs grounded in infrastructure data (logs, runbooks, service topology) via RAG to automate investigation, documentation, and coordination phases. The 2026 architecture pattern is multi-agent orchestration: specialized agents...
Learn | ~60 minutes

24 Chaos Engineering for LLM Production
Chaos engineering for LLMs is its own discipline in 2026. Prerequisites before running experiments in production: defined SLI/SLO, trace+metric+log observability, automated rollback, runbooks, on-call. Architecture has four planes: control...
Learn | ~60 minutes | Python (stdlib, toy chaos experiment runner)

25 Security: Secrets, API Key Rotation, Audit Logs, Guardrails
Eliminate secret sprawl via centralized vaults (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault). Never store credentials in config files, env files in VCS, or spreadsheets. Use IAM roles over static keys; OIDC for CI/CD. The AI-gateway...
Learn | ~60 minutes

26 Compliance: SOC 2, HIPAA, GDPR, PCI-DSS, EU AI Act, ISO 42001
Multi-framework coverage is table stakes for 2026 enterprise deals. EU AI Act: in force since August 1, 2024. Most high-risk requirements apply from August 2, 2026. Fines up to €15M or 3% global annual turnover for high-risk-system obligation...
Learn | ~60 minutes

27 FinOps for LLMs: Unit Economics and Multi-Tenant Attribution
Traditional FinOps breaks on LLM spend. Costs are token-transactions, not resource-uptime. Tags don't map: an API call is a transaction, not an asset. Engineering decisions (prompt design, context window, output length) are financial deci...
Learn | ~60 minutes

28 Self-Hosted Serving Selection: llama.cpp, Ollama, TGI, vLLM, SGLang
Four engines dominate self-hosted inference in 2026. Pick based on hardware, scale, and ecosystem. llama.cpp is fastest on CPU: widest model support, full control over quantization and threading. Ollama is the dev-laptop one-command insta...
Learn | ~45 minutes
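
The lesson 08 summary states that end-to-end latency is TTFT plus TPOT times output length. A minimal sketch of that arithmetic follows; the function name and the sample numbers are hypothetical and chosen only for illustration, not taken from the lesson.

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency as TTFT + TPOT * output length.

    ttft_ms: time to first token (prefill + queue + network), in milliseconds.
    tpot_ms: time per output token during decode, in milliseconds.
    output_tokens: number of generated tokens.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_ms + tpot_ms * output_tokens


# Hypothetical request: 200 ms TTFT, 20 ms per token, 500 output tokens.
latency = end_to_end_latency_ms(ttft_ms=200.0, tpot_ms=20.0, output_tokens=500)
print(latency)  # 10200.0
```

With these assumed numbers, decode accounts for 10,000 ms of the 10,200 ms total, which is why output length tends to dominate the estimate for long generations.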