AI From Scratch

Phase 08 / 15 lessons / ~14 hours

Generative AI

Create images, video, audio, 3D, and more.

Lessons

01 Generative Models: Taxonomy & History
Every image model, text model, video model, and 3D model fits in one of five buckets. Pick the wrong bucket and you will fight the math for weeks. Pick the right one and the field's last twelve years of progress stacks cleanly in your head.
Learn / ~45 minutes / Python

02 Autoencoders & Variational Autoencoders (VAE)
A plain autoencoder compresses then reconstructs. It memorizes. It does not generate. Add one trick (force the code to look Gaussian) and you get a sampler. That single trick, the reparameterization of z = μ + σ·ε, is why every latent-di...
Build / ~75 minutes / Python

03 GANs: Generator vs Discriminator
Goodfellow's trick in 2014 was to skip density entirely. Two networks. One makes fakes. One catches them. They fight until the fakes are indistinguishable from real. It shouldn't work. It often doesn't. When it does, the samples are still...
Build / ~75 minutes / Python

04 Conditional GANs & Pix2Pix
The first big unlock of 2014-2017 was controlling what a GAN makes. Attach a label, or an image, or a sentence. Pix2Pix did the image version and it still beats every generic text-to-image model on narrow image-to-image tasks.
Build / ~75 minutes / Python

05 StyleGAN
Most earlier generators feed z in once, at the input layer. StyleGAN changed that: first map z to an intermediate w, then inject w at every resolution level through AdaIN. That single change untangled the latent space and made photorealist...
Build / ~45 minutes / Python

06 Diffusion Models: DDPM from Scratch
Ho, Jain, Abbeel (2020) gave the field a recipe it could not quit. Destroy the data with noise over a thousand small steps. Train one neural net to predict the noise. Reverse the process at inference. Today every mainstream image, video, 3...
Build / ~75 minutes / Python

07 Latent Diffusion & Stable Diffusion
Pixel-space diffusion on 512×512 images is a computational war crime. Rombach et al. (2022) noticed that you do not need all 786k dimensions to generate an image; you need enough to capture semantic structure, and a separate decoder for t...
Build / ~75 minutes / Python

08 ControlNet, LoRA & Conditioning
Text alone is a clumsy control signal. ControlNet lets you clone a pretrained diffusion model and steer it with a depth map, pose skeleton, scribble, or edge image. LoRA lets you fine-tune a 2B-parameter model by training 10 million parame...
Build / ~75 minutes / Python

09 Inpainting, Outpainting & Image Editing
Text-to-image makes new things. Inpainting fixes old ones. In production, 70% of billable image work is editing: swap a background, remove a logo, extend the canvas, regenerate a hand. Inpainting is where diffusion earns its keep.
Build / ~75 minutes / Python

10 Video Generation
An image is a 2-D tensor. A video is a 3-D one. The theory is the same; the compute is 10-100x harder. OpenAI's Sora (Feb 2024) proved it was possible. By 2026 Veo 2, Kling 1.5, Runway Gen-3, Pika 2.0, and WAN 2.2 ship production video fro...
Build / ~45 minutes / Python

11 Audio Generation
Audio is a 1-D signal at 16-48 kHz. A five-second clip is 80-240k samples. No transformer attends to that sequence directly. The solution for every production audio model in 2026 is the same: a neural codec (Encodec, SoundStream, DAC) comp...
Build / ~45 minutes / Python

12 3D Generation
3D is the modality where 2D-to-3D leverage is strongest. The 2023 breakthrough was 3D Gaussian Splatting. The 2024-2026 generative push layers multi-view diffusion + 3D reconstruction on top to produce objects and scenes from a single prom...
Learn / ~45 minutes / Python

13 Flow Matching & Rectified Flows
Diffusion models take 20-50 sampling steps because they walk a curved path from noise to data. Flow matching (Lipman et al., 2023) and rectified flow (Liu et al., 2022) train straight paths instead. Straighter paths mean fewer steps mean faster...
Build / ~45 minutes / Python

14 Evaluation: FID, CLIP Score, Human Preference
Every generative model leaderboard cites FID, CLIP score, and a win rate from a human-preference arena. Each number has a failure mode a determined researcher can game. If you do not know the failure modes, you cannot tell a real improveme...
Build / ~45 minutes / Python

15 Visual Autoregressive Modeling (VAR): Next-Scale Prediction
Diffusion models sample iteratively in time (denoising steps). VAR samples iteratively in scale: it predicts a 1x1 token, then 2x2, then 4x4, up to the final resolution, each scale conditioning on the previous. The 2024 paper showed VAR m...
Build / ~90 minutes / Python (with PyTorch)
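
A few of the numbers in the blurbs above are easy to check before starting. The sketch below is not taken from any lesson: it recomputes the 786k figure from lesson 07 (assuming three RGB channels per pixel, which the blurb does not state), the 80-240k sample count from lesson 11, and draws one sample with the z = μ + σ·ε reparameterization from lesson 02. The function name `reparameterize` and the example values for μ and σ are hypothetical, chosen only for illustration.

```python
import random


def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps elementwise, with eps drawn from N(0, 1).

    mu and sigma are plain lists of floats of equal length; rng is a
    random.Random instance so the draw is reproducible.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Lesson 07: values in one 512x512 image, assuming 3 channels (RGB).
pixel_dims = 512 * 512 * 3

# Lesson 11: samples in a five-second clip at 16 kHz and at 48 kHz.
audio_samples = (5 * 16_000, 5 * 48_000)

# Lesson 02: one latent sample. mu and sigma here are made-up values;
# in a VAE they would come from the encoder.
rng = random.Random(0)
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)

print(pixel_dims)      # 786432
print(audio_samples)   # (80000, 240000)
print(len(z))          # 2
```

The randomness lives entirely in ε, so μ and σ enter the sample through ordinary arithmetic. Setting σ to zero returns μ unchanged, which is a quick sanity check on any implementation.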