AI From Scratch

Phase 12 / 25 lessons / ~65 hours

Multimodal AI

Models that see, hear, read, and reason across modalities.

Lessons

01 Vision Transformers and the Patch-Token Primitive
Before anything multimodal, an image has to become a sequence of tokens a transformer can eat. The 2020 ViT paper answered this with 16x16 pixel patches, a linear projection, and a position embedding. Five years later every 2026 frontier m...
Learn / ~120 minutes / Python (stdlib, patch tokenizer + geometry calculator)

02 CLIP and Contrastive Vision-Language Pretraining
OpenAI's CLIP (2021) proved a single idea big enough to power the next five years: align an image encoder and a text encoder in the same vector space using only noisy web image-caption pairs and a contrastive loss. Zero supervised labels....
Build / ~180 minutes / Python (stdlib, InfoNCE + sigmoid loss implementations)

03 From CLIP to BLIP-2: Q-Former as Modality Bridge
CLIP aligns image and text but cannot generate captions, answer questions, or hold a conversation. BLIP-2 (Salesforce, 2023) solved that with a small trainable bridge: 32 learnable query vectors attend over a frozen ViT's features via cros...
Build / ~180 minutes

04 Flamingo and Gated Cross-Attention for Few-Shot VLMs
DeepMind's Flamingo (2022) did two things before anyone else. It showed a single model could process arbitrarily interleaved sequences of images, videos, and text. And it showed VLMs could learn in-context: give a few-shot prompt with thr...
Learn / ~120 minutes

05 LLaVA and Visual Instruction Tuning
LLaVA (April 2023) is the most copied multimodal architecture on the planet. It replaced BLIP-2's Q-Former with a 2-layer MLP, replaced Flamingo's gated cross-attention with naive token concatenation, and trained on 158k visual-instruction...
Build / ~180 minutes

06 Any-Resolution Vision: Patch-n'-Pack and NaFlex
Real images are not 224x224 squares. A receipt is 9:16, a chart is 16:9, a medical scan might be 4096x4096, a mobile screenshot is 9:19.5. The pre-2024 VLM answer (resize everything to a fixed square) threw away the signal that makes OCR...
Build / ~120 minutes

07 Open-Weight VLM Recipes: What Actually Matters
The 2024-2026 open-weight VLM literature is a forest of ablation tables. Apple's MM1 tested 13 combinations of image encoder, connector, and data mix. Allen AI's Molmo proved detailed human captions beat GPT-4V distillation. Cambrian-1 ran...
Learn + lab / ~180 minutes / Python (stdlib, ablation table parser + recipe picker)

08 LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
Before LLaVA-OneVision (Li et al., August 2024) the open-VLM world had separate lineages: LLaVA-1.5 for single images, multi-image models like Mantis and VILA, video models like Video-LLaVA and Video-LLaMA. Each won its benchmark and faile...
Build / ~180 minutes / Python (stdlib, token budget solver + curriculum planner)

09 Qwen-VL Family and Dynamic-FPS Video
The Qwen-VL family, spanning Qwen-VL (2023), Qwen2-VL (2024), Qwen2.5-VL (2025), and Qwen3-VL (2025), is the most influential open vision-language model lineage in 2026. Each generation made a single decisive architectural bet that the rest of the op...
Learn / ~120 minutes

10 InternVL3: Native Multimodal Pretraining
Every open VLM before InternVL3 followed the same three-step recipe: take a text LLM trained on trillions of text tokens, bolt on a vision encoder, then fine-tune the seams. This works but has alignment debt: the text LLM has spent its fu...
Learn / ~120 minutes

11 Chameleon and Early-Fusion Token-Only Multimodal Models
Every VLM we have seen so far keeps images and text separate. Visual tokens come from a vision encoder, flow into a projector, then meet text inside the LLM. The vision and text vocabularies never overlap. Chameleon (Meta, May 2024) asked:...
Build / ~180 minutes

12 Emu3: Next-Token Prediction for Image and Video Generation
BAAI's Emu3 (Wang et al., September 2024) is the 2024 result that should have ended the diffusion-versus-autoregressive debate. A single Llama-style decoder-only transformer, trained only on the next-token-prediction objective, across a un...
Learn / ~120 minutes / Python (stdlib, 3D video tokenizer math + autoregressive sampler skeleton)

13 Transfusion: Autoregressive Text + Diffusion Image in One Transformer
Chameleon and Emu3 bet everything on discrete tokens. They work, but the quantization bottleneck is visible: the image quality plateaus below continuous-space diffusion models. Transfusion (Meta, Zhou et al., August 2024) takes the opposi...
Build / ~180 minutes

14 Show-o and Discrete-Diffusion Unified Models
Transfusion mixes continuous and discrete representations. Show-o (Xie et al., August 2024) goes the other way: text tokens use causal next-token prediction, image tokens use masked discrete diffusion in the spirit of MaskGIT. Both sit ins...
Learn / ~120 minutes

15 Janus-Pro: Decoupled Encoders for Unified Multimodal Models
Unified multimodal models have an unavoidable tension. Understanding wants semantic features: SigLIP or DINOv2 output vectors rich with concept-level information. Generation wants reconstruction-friendly codes: VQ tokens that compose bac...
Build / ~120 minutes

16 MIO and Any-to-Any Streaming Multimodal Models
GPT-4o ships a product most open models cannot replicate: an agent that hears voice, sees video, and speaks back in real time. The open-ecosystem answer by late 2024 was MIO (Wang et al., September 2024). MIO tokenizes text, image, speech,...
Learn / ~120 minutes

17 Video-Language Models: Temporal Tokens and Grounding
Video is not a stack of photos. A 5-second clip has causal ordering, action verbs, and event timing that an image model cannot represent. Video-LLaMA (Zhang et al., June 2023) shipped the first open video-LLM with audio-visual grounding. V...
Build / ~180 minutes

18 Long-Video Understanding at Million-Token Context
A 1-hour 4K video at 24 FPS, patched and embedded, produces on the order of 60 million tokens. A 2-hour podcast episode transcribed is 30,000 tokens. A full Blu-ray feature film, even compressed with aggressive pooling, is hundreds of thou...
Build / ~180 minutes

19 Audio-Language Models: the Whisper to Audio Flamingo 3 Arc
Whisper (Radford et al., December 2022) settled speech recognition: 680k hours of weakly-supervised multilingual speech, a simple encoder-decoder transformer, a benchmark that made every subsequent ASR release cite it. But recognition is...
Build / ~180 minutes

20 Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
GPT-4o's product demo in May 2024 was disruptive not because of the underlying model but because of the product shape: a voice interface where you talk, the model sees what the camera sees, and it talks back in under 250ms. The open ecosy...
Build / ~180 minutes / Python (stdlib, streaming pipeline latency simulator + VAD loop)

21 Embodied VLAs: RT-2, OpenVLA, π0, GR00T
The first time a model read a recipe off a website and executed it in a kitchen robot was RT-2 (Google DeepMind, July 2023). RT-2 discretized actions as text tokens, co-fine-tuned a VLM on web data plus robot-action data, and proved that w...
Learn / ~180 minutes / Python (stdlib, action tokenizer + VLA inference skeleton)

22 Document and Diagram Understanding
Documents are not photos. A PDF, scientific paper, invoice, or handwritten form has layout, tables, diagrams, footnotes, headers, and semantic structure that plain image understanding cannot capture. The pre-VLM stack was a pipeline: Tesse...
Build / ~180 minutes

23 ColPali and Vision-Native Document RAG
Traditional RAG parses PDFs into text, splits into chunks, embeds chunks, stores vectors. Every step loses signal: OCR drops chart data, chunking breaks table rows, text embeddings ignore figures. ColPali (Faysse et al., July 2024) asked t...
Build / ~180 minutes

24 Multimodal RAG and Cross-Modal Retrieval
Vision-native document RAG is one slice. Production multimodal RAG goes wider, retrieving across text, images, audio, and video for workflows like trip planning ("find me a quiet vegan brunch with natural light"), medical triage ("what in...
Build / ~180 minutes

25 Multimodal Agents and Computer-Use (Capstone)
The 2026 frontier product is a multimodal agent that reads screenshots, clicks buttons, navigates web UIs, fills forms, and completes workflows end-to-end. SeeClick and CogAgent (2024) proved the GUI-grounding primitive. Ferret-UI added mo...
Capstone / ~240 minutes / Python (stdlib, action schema + agent loop skeleton)
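
The patch-token arithmetic that lesson 01 describes (16x16 pixel patches) and that lesson 06 stresses with non-square inputs can be previewed in a few lines of stdlib Python. This is only a rough sketch, not the course's own geometry calculator: the function name `patch_token_count` is hypothetical, and the choice to pad each side up to a multiple of the patch size is an assumption (cropping or resizing are equally common conventions).

```python
def patch_token_count(width: int, height: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens for a width x height image.

    Assumption (not from the course text): each side is padded up to the
    next multiple of `patch`, so partial patches still count as one token.
    """
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height, and patch must be positive")
    cols = -(-width // patch)   # ceiling division without importing math
    rows = -(-height // patch)
    return rows * cols


if __name__ == "__main__":
    # A classic square input, a 16:9 frame, and a large scan.
    for w, h in [(224, 224), (1920, 1080), (4096, 4096)]:
        print(f"{w}x{h}: {patch_token_count(w, h)} tokens")
```

Under these assumptions a 224x224 image yields 196 tokens while a 4096x4096 scan yields 65,536, which gives a sense of why fixed-square resizing was attractive and why any-resolution schemes need a token budget.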