AI From Scratch

Phase 14 / 42 lessons / ~42 hours

Agent Engineering

The core of modern AI engineering. Build agents from first principles.

Lessons

01 The Agent Loop: Observe, Think, Act
Every agent in 2026 (Claude Code, Cursor, Devin, Operator) is a variant of the ReAct loop from 2022. Reasoning tokens interleave with tool calls and observations until a stop condition fires. Learn this loop cold before touching any fram...
Build / ~60 minutes / Python (stdlib)

02 ReWOO and Plan-and-Execute: Decoupled Planning
ReAct interleaves thought and action in one stream. ReWOO separates them: one big plan up front, then execute. 5x fewer tokens, +4% accuracy on HotpotQA, and you can distill the planner into a 7B model. Plan-and-Execute generalized it; Pla...
Build / ~60 minutes / Python (stdlib)

03 Reflexion: Verbal Reinforcement Learning
Gradient-based RL needs thousands of trials and a GPU cluster to fix a failure mode. Reflexion (Shinn et al., NeurIPS 2023) does it in natural language: after each failed trial, the agent writes a reflection, stores it in episodic memory,...
Build / ~60 minutes / Python (stdlib)

04 Tree of Thoughts and LATS: Deliberate Search
A single chain-of-thought trajectory has no room to backtrack. ToT (Yao et al., 2023) turns reasoning into a tree with self-evaluation on each node. LATS (Zhou et al., 2024) unifies ToT with ReAct and Reflexion under Monte Carlo Tree Searc...
Build / ~75 minutes / Python (stdlib)

05 Self-Refine and CRITIC: Iterative Output Improvement
Self-Refine (Madaan et al., 2023) uses one LLM in three roles (generate, feedback, refine) in a loop. Average gain: +20 absolute on 7 tasks. CRITIC (Gou et al., 2023) hardens the feedback step by routing verification through external too...
Build / ~60 minutes / Python (stdlib)

06 Tool Use and Function Calling
Toolformer (Schick et al., 2023) started self-supervised tool annotation. Berkeley Function Calling Leaderboard V4 (Patil et al., 2025) sets the 2026 bar: 40% agentic, 30% multi-turn, 10% live, 10% non-live, 10% hallucination. Single-turn...
Build / ~60 minutes / Python (stdlib)

07 Memory: Virtual Context and MemGPT
Context windows are finite. Conversations, documents, and tool traces are not. MemGPT (Packer et al., 2023) frames this as OS virtual memory: main context is RAM, external store is disk, the agent pages between them. This is the pattern e...
Build / ~75 minutes / Python (stdlib)

08 Memory Blocks and Sleep-Time Compute (Letta)
MemGPT became Letta in 2024. The 2026 evolution adds two ideas: discrete functional memory blocks the model can edit directly, and a sleep-time agent that consolidates memory asynchronously while the primary agent is idle. This is how you...
Build / ~75 minutes / Python (stdlib)

09 Hybrid Memory: Vector + Graph + KV (Mem0)
Mem0 (Chhikara et al., 2025) treats memory as three stores in parallel: vector for semantic similarity, KV for fast fact lookup, graph for entity-relationship reasoning. A scoring layer fuses the three on retrieval. This is the 2026 produ...
Build / ~75 minutes / Python (stdlib)

10 Skill Libraries and Lifelong Learning (Voyager)
Voyager (Wang et al., TMLR 2024) treats executable code as a skill. Skills are named, retrievable, composable, and refined by environment feedback. This is the reference architecture for Claude Agent SDK skills, skillkit, and the 2026 skil...
Build / ~75 minutes / Python (stdlib)

11 Planning with HTN and Evolutionary Search
Symbolic planning handles the cases where the plan is provably correct. Evolutionary code search handles the cases where the fitness function is machine-checkable. ChatHTN (2025) and AlphaEvolve (2025) show what each unlocks when paired wi...
Build / ~75 minutes / Python (stdlib)

12 Anthropic's Workflow Patterns: Simple Over Complex
Schluntz and Zhang (Anthropic, Dec 2024) distinguish workflows (predefined paths) from agents (dynamic tool-use). Five workflow patterns cover most cases. Start with direct API calls. Add agents only when steps cannot be predicted.
Learn + Build / ~60 minutes / Python (stdlib)

13 LangGraph: Stateful Graphs and Durable Execution
LangGraph is the 2026 reference for low-level stateful orchestration. Agent is a state machine; nodes are functions; edges are transitions; state is immutable and checkpointed after every step. Resume from any failure exactly where it left...
Learn + Build / ~75 minutes / Python (stdlib)

14 AutoGen v0.4: Actor Model and Agent Framework
AutoGen v0.4 (Microsoft Research, Jan 2025) redesigned agent orchestration around the actor model. Async message exchange, event-driven agents, fault isolation, natural concurrency. The framework is now in maintenance mode while Microsoft...
Learn + Build / ~75 minutes / Python (stdlib)

15 CrewAI: Role-Based Crews and Flows
CrewAI is the 2026 role-based multi-agent framework. Four primitives: Agent, Task, Crew, Process. Two top-level shapes: Crews (autonomous, role-based collaboration) and Flows (event-driven, deterministic). The docs are blunt: "for any prod...
Learn + Build / ~75 minutes / Python (stdlib)

16 OpenAI Agents SDK: Handoffs, Guardrails, Tracing
OpenAI Agents SDK is the lightweight multi-agent framework built on the Responses API. Five primitives: Agent, Handoff, Guardrail, Session, Tracing. Handoffs are tools named transfer_to_. Guardrails trip on input or output. Tracing is on b...
Learn + Build / ~75 minutes / Python (stdlib)

17 Claude Agent SDK: Subagents and Session Store
The Claude Agent SDK is the library form of the Claude Code harness. Built-in tools, subagents for context isolation, hooks, W3C trace propagation, session store parity. Claude Managed Agents is the hosted alternative for long-running asyn...
Learn + Build / ~75 minutes / Python (stdlib)

18 Agno and Mastra: Production Runtimes
Agno (Python) and Mastra (TypeScript) are the 2026 production-runtime pairing. Agno aims at microsecond agent instantiation and stateless FastAPI backends. Mastra ships agents, tools, workflows, unified model routing, and composite storage...
Learn / ~45 minutes / Python, TypeScript

19 Benchmarks: SWE-bench, GAIA, AgentBench
Three benchmarks anchor agent evaluation in 2026. SWE-bench tests code patching. GAIA tests generalist tool use. AgentBench tests multi-environment reasoning. Know their composition, their contamination story, and what they do not measure.
Learn / ~60 minutes / Python (stdlib)

20 Benchmarks: WebArena and OSWorld
WebArena tests web-agent capability across four self-hosted apps. OSWorld tests desktop-agent capability across Ubuntu, Windows, macOS. At release (2023–2024) both showed a big gap between best-in-class agents and humans. The gap is narrow...
Learn / ~60 minutes / Python (stdlib)

21 Computer Use: Claude, OpenAI CUA, Gemini
Three production computer-use models in 2026. All three are vision-based. All three treat screenshots, DOM text, and tool outputs as untrusted input. Only direct user instructions count as permission. Per-step safety services are the norm.
Learn / ~60 minutes / Python (stdlib)

22 Voice Agents: Pipecat and LiveKit
Voice agents are a first-class production category in 2026. Pipecat gives you a Python frame-based pipeline (VAD → STT → LLM → TTS → transport). LiveKit Agents bridges AI models to users over WebRTC. Production latency targets land at 450–...
Learn / ~60 minutes / Python (stdlib)

23 OpenTelemetry GenAI Semantic Conventions
OpenTelemetry's GenAI SIG (launched April 2024) defines the standard schema for agent telemetry. Span names, attributes, and content-capture rules converge across vendors so agent traces mean the same thing in Datadog, Grafana, Jaeger, and...
Learn + Build / ~60 minutes / Python (stdlib)

24 Agent Observability: Langfuse, Phoenix, Opik
Three open-source agent observability platforms dominate 2026. Langfuse (MIT): 6M+ installs/month, tracing + prompt management + evals + session replay. Arize Phoenix (Elastic 2.0): deep agent-specific evals, RAG relevancy, OpenInference...
Learn / ~45 minutes / Python (stdlib)

25 Multi-Agent Debate and Collaboration
Du et al. (ICML 2024, "Society of Minds") run N model instances that independently propose answers, then iteratively critique each other over R rounds to converge. Improves factuality, rule-following, reasoning. Sparse topology beats full...
Learn + Build / ~60 minutes / Python (stdlib)

26 Failure Modes: Why Agents Break
MASFT (Berkeley, 2025) catalogs 14 multi-agent failure modes in 3 categories. Microsoft's Taxonomy documents how existing AI failures amplify in agentic settings. Industry field data converges on five recurring modes: hallucinated actions,...
Learn + Build / ~60 minutes / Python (stdlib)

27 Prompt Injection and the PVE Defense
Greshake et al. (AISec 2023) established indirect prompt injection as the defining agent security problem. Attacker plants instructions in data the agent retrieves; on ingest, those instructions override the developer prompt. Treat all ret...
Build / ~75 minutes / Python (stdlib)

28 Orchestration Patterns: Supervisor, Swarm, Hierarchical
Four orchestration patterns recur across 2026 frameworks: supervisor-worker, swarm / peer-to-peer, hierarchical, debate. Anthropic's guidance: "It's about building the right system for your needs." Start simple; add topology only when a si...
Learn + Build / ~60 minutes / Python (stdlib)

29 Production Runtimes: Queue, Event, Cron
Production agents run on six runtime shapes: request-response, streaming, durable execution, queue-based background, event-driven, and scheduled. Pick the shape before you pick the framework. Observability is load-bearing at every shape.
Learn / ~60 minutes / Python (stdlib)

30 Eval-Driven Agent Development
Anthropic's guidance: "start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when needed." Evaluation is not the last step. It's the outer loop that drives every other choice in Pha...
Learn + Build / ~60 minutes / Python (stdlib)

31 Agent Workbench Engineering: Why Capable Models Still Fail
A capable model is not enough. Reliable agents need a workbench: instructions, state, scope, feedback, verification, review, and handoff. Strip those away and even a frontier model produces work that is unsafe to ship.
Learn + Build / ~45 minutes / Python (stdlib)

32 The Minimal Agent Workbench
The smallest useful workbench is three files: a root instructions router, a state file, and a task board. Everything else is layered on top. If a repo cannot carry these three, no model will save it.
Build / ~45 minutes / Python (stdlib)

33 Agent Instructions as Executable Constraints
Instructions written as prose are wishes. Instructions written as constraints are tests. The workbench turns each rule into something an agent can check at runtime and a reviewer can verify after the fact.
Build / ~50 minutes / Python (stdlib)

34 Repo Memory and Durable State
Chat history is volatile. The repo is durable. The workbench stores agent state in versioned files so the next session, the next agent, and the next reviewer all read from the same source of truth.
Build / ~60 minutes / Python (stdlib + jsonschema optional)

35 Initialization Scripts for Agents
Every session that starts cold pays a tax. The agent reads the same files, retries the same probes, and rediscovers the same paths. An init script pays the tax once and writes the answers into state.
Build / ~45 minutes / Python (stdlib)

36 Scope Contracts and Task Boundaries
The model does not know where the work ends. A scope contract is a per-task file that says where the work begins, where it ends, and how to roll back if it spills. The contract turns "stay in scope" from a wish into a check.
Build / ~50 minutes / Python (stdlib)

37 Runtime Feedback Loops
Agents that do not see real command output guess. A feedback runner captures stdout, stderr, exit code, and timing into a structured record the next turn can read. Then the agent reacts to facts instead of to its own prediction of facts.
Build / ~50 minutes / Python (stdlib)

38 Verification Gates
The agent does not get to mark its own work as done. A verification gate reads the scope contract, the feedback log, the rule report, and the diff, and answers a single question: is this task actually complete? If the gate says no, the tas...
Build / ~55 minutes / Python (stdlib)

39 Reviewer Agent: Separate Builder from Marker
The agent that wrote the code cannot grade it. A reviewer is a second loop with a different system prompt, a different goal, and read-only access to everything the builder produced. The gap between builder and reviewer is where most reliab...
Build / ~55 minutes / Python (stdlib)

40 Multi-Session Handoff
The session is going to end. The work is not. The handoff packet is the artifact that turns "the agent worked for an hour" into "the next session is productive in the first minute." Build it on purpose, not as an afterthought.
Build / ~50 minutes / Python (stdlib)

41 The Workbench on a Real Repo
Eleven lessons of surfaces are worth nothing if they do not survive contact with a real codebase. This lesson runs the same task twice on a small sample app: prompt-only versus workbench-guided. The numbers do the arguing.
Build / ~60 minutes / Python (stdlib)

42 Capstone: Ship a Reusable Agent Workbench Pack
The mini-track ends with a pack you drop into any repo. Eleven lessons of surfaces compressed into a directory you can cp -r and have an agent working reliably the next morning. The capstone is the artifact this curriculum trades on.
Build / ~75 minutes / Python (stdlib)
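
A taste of lesson 01: the observe-think-act loop, where reasoning interleaves with tool calls and observations until a stop condition fires, can be sketched in stdlib Python. This is a minimal illustration, not the course's implementation. The model is replaced by a hypothetical scripted function, and every name here (`Step`, `calc`, `TOOLS`, `scripted_model`, `run_agent`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One model decision: either call a tool or finish with an answer."""
    thought: str
    tool: Optional[str] = None    # tool name, or None when finishing
    arg: str = ""
    answer: Optional[str] = None  # set only on the final step


def calc(expression: str) -> str:
    """Hypothetical tool: add integers written as 'a+b+c'."""
    return str(sum(int(part) for part in expression.split("+")))


TOOLS: dict[str, Callable[[str], str]] = {"calc": calc}


def scripted_model(history: list[str]) -> Step:
    """Stand-in for an LLM call (assumption): picks the next step from the transcript."""
    observations = [h for h in history if h.startswith("observation:")]
    if not observations:
        return Step(thought="I need to add the numbers.", tool="calc", arg="2+3")
    result = observations[-1].split(":", 1)[1].strip()
    return Step(thought="I have the result.", answer=result)


def run_agent(model, task: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Loop think -> act -> observe until the model answers or the budget runs out."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        step = model(history)                          # think
        history.append(f"thought: {step.thought}")
        if step.answer is not None:                    # stop condition: final answer
            return step.answer, history
        tool = TOOLS.get(step.tool)
        observation = tool(step.arg) if tool else f"unknown tool {step.tool}"  # act
        history.append(f"action: {step.tool}({step.arg})")
        history.append(f"observation: {observation}")  # observe
    return "stopped: step budget exhausted", history   # stop condition: budget


answer, transcript = run_agent(scripted_model, "What is 2 + 3?")
```

The step budget is the second stop condition: a model that never produces an answer is cut off after `max_steps` iterations rather than looping forever. Swapping `scripted_model` for a real model call leaves the loop unchanged.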