AI From Scratch

Phase 04 / 28 lessons / ~27 hours

Computer Vision

From pixels to understanding — image, video, and 3D.


Lessons

01 Image Fundamentals: Pixels, Channels, Color Spaces
An image is a tensor of light samples. Every vision model you will ever use starts from this one fact.
Build / ~45 minutes / Python

02 Convolutions from Scratch
A convolution is a tiny dense layer you slide across an image, sharing the same weights at every location.
Build / ~75 minutes / Python

03 CNNs: LeNet to ResNet
Every major CNN of the last thirty years is the same conv–nonlinearity–downsample recipe with one new idea bolted on. Learn the ideas in order.
Learn + Build / ~75 minutes / Python

04 Image Classification
A classifier is a function from pixels to a probability distribution over classes. Everything else is plumbing.
Build / ~75 minutes / Python

05 Transfer Learning & Fine-Tuning
Somebody else spent a million GPU hours teaching a network what edges, textures, and object parts look like. You should borrow those features before training your own.
Build / ~75 minutes / Python

06 Object Detection: YOLO from Scratch
Detection is classification plus regression, run at every position in a feature map, then cleaned up with non-maximum suppression.
Build / ~75 minutes / Python

07 Semantic Segmentation: U-Net
Segmentation is classification at every pixel. U-Net makes it work by pairing a downsampling encoder with an upsampling decoder and wiring skip connections between them.
Build / ~75 minutes / Python

08 Instance Segmentation: Mask R-CNN
Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.
Build + Learn / ~75 minutes / Python

09 Image Generation: GANs
A GAN is two neural networks in a fixed game. One draws, one critiques. They get better together until the drawings fool the critic.
Build / ~75 minutes / Python

10 Image Generation: Diffusion Models
A diffusion model learns to denoise. Train it to remove a tiny bit of noise from a noisy image, repeat that backwards a thousand times, and you have an image generator.
Build / ~75 minutes / Python

11 Stable Diffusion: Architecture & Fine-Tuning
Stable Diffusion is a DDPM that runs in the latent space of a pretrained VAE, conditioned on text via cross-attention, sampled with a fast deterministic ODE solver, and steered by classifier-free guidance.
Learn + Use / ~75 minutes / Python

12 Video Understanding: Temporal Modeling
A video is a sequence of images plus the physics that connects them. Every video model either treats time as an extra axis (3D conv), a sequence to attend over (transformer), or a feature to extract once and pool (2D+pool).
Learn + Build / ~45 minutes / Python

13 3D Vision: Point Clouds & NeRFs
3D vision comes in two flavours. Point clouds are the sensor's raw output. NeRFs are the learned volumetric field. Both answer "what is where in space."
Learn + Build / ~45 minutes / Python

14 Vision Transformers (ViT)
Cut the image into patches, treat each patch as a word, run a standard transformer. Don't look back.
Build / ~45 minutes / Python

15 Real-Time Vision: Edge Deployment
Edge inference is the discipline of getting a 90%-accuracy model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds of latency.
Learn + Build / ~75 minutes / Python

16 Build a Complete Vision Pipeline: Capstone
A production vision system is a chain of models and rules stitched with data contracts. The pieces are already in this phase; the capstone wires them together end-to-end.
Build / ~120 minutes / Python

17 Self-Supervised Vision: SimCLR, DINO, MAE
Labels are the bottleneck of supervised vision. Self-supervised pretraining removes them: learn visual features from 100M unlabelled images, fine-tune on 10k labelled ones.
Learn + Build / ~75 minutes / Python

18 Open-Vocabulary Vision: CLIP
Train an image encoder and a text encoder together so that matching (image, caption) pairs land at the same point in a shared space. That is the whole trick.
Build + Use / ~45 minutes / Python

19 OCR & Document Understanding
OCR is a three-stage pipeline: detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.
Learn + Use / ~45 minutes / Python

20 Image Retrieval & Metric Learning
A retrieval system ranks candidates by a distance in embedding space. Metric learning is the discipline of shaping that space so the distances mean what you want.
Build / ~45 minutes / Python

21 Keypoint Detection & Pose Estimation
A pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.
Build / ~45 minutes / Python

22 3D Gaussian Splatting from Scratch
A scene is a cloud of millions of 3D Gaussians. Each one has a position, orientation, scale, opacity, and a colour that depends on viewing direction. Rasterise them, backprop through the rasterisation, done.
Build / ~90 minutes / Python

23 Diffusion Transformers & Rectified Flow
The U-Net is not the secret of diffusion. Replace it with a transformer, swap the noise schedule for a straight-line flow, and suddenly you have SD3, FLUX, and every 2026 text-to-image model.
Learn + Build / ~75 minutes / Python

24 SAM 3 & Open-Vocabulary Segmentation
Give a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.
Use + Build / ~60 minutes / Python

25 Vision-Language Models: The ViT-MLP-LLM Pattern
A vision encoder converts an image into tokens. An MLP projector maps those tokens into the LLM's embedding space. A language model does the rest. That pattern, ViT-MLP-LLM, is every production VLM in 2026.
Learn + Use / ~75 minutes / Python

26 Monocular Depth & Geometry Estimation
A depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a frozen ViT encoder plus a lightweight head gets within a few pe...
Build + Use / ~60 minutes / Python

27 Multi-Object Tracking & Video Memory
Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.
Build / ~60 minutes / Python

28 World Models & Video Diffusion
A video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.
Learn + Build / ~75 minutes / Python
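
As a taste of the phase, lesson 02's claim that a convolution is a tiny dense layer slid across an image, with the same weights at every location, can be sketched in a few lines of NumPy. This is a minimal illustration and not the lesson's own code: the function name `conv2d`, the 5x5 ramp image, and the Prewitt-style kernel are all assumptions chosen for the example. It computes valid-mode cross-correlation (no padding, stride 1), which is what deep-learning frameworks usually call convolution.

```python
import numpy as np


def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image.

    Hypothetical helper for illustration: no padding, stride 1.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)

    # Flatten the kernel once: these are the "dense layer" weights.
    weights = kernel.ravel()

    for i in range(out_h):
        for j in range(out_w):
            # Each output value is a dot product between one image patch
            # and the same shared weights.
            patch = image[i:i + kh, j:j + kw].ravel()
            out[i, j] = patch @ weights
    return out


# Assumed toy input: a 5x5 ramp where pixel (r, c) has value 5*r + c.
image = np.arange(25, dtype=np.float64).reshape(5, 5)

# Assumed kernel: a Prewitt-style horizontal-gradient filter.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

result = conv2d(image, kernel)
print(result.shape)  # (3, 3)
print(result)        # every entry is 6.0: the ramp has a constant horizontal slope
```

The double loop is deliberately naive so that the weight sharing is visible; a practical implementation would vectorise the patch extraction or call a library routine.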