AI From Scratch/Lesson 18/~45 minutes

Open-Vocabulary Vision — CLIP

Train an image encoder and a text encoder together so that matching (image, caption) pairs land close together in a shared embedding space, while mismatched pairs are pushed apart. That is the whole trick.
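To make the training objective concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It assumes the two encoders have already produced one embedding per image and per caption, and that row i of each list belongs to the same pair. The function names, the toy 2-D vectors, and the fixed temperature of 0.07 are illustrative assumptions, not this lesson's implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    temperature is an assumed fixed value here; it could also be learned.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and caption j, scaled
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_t)
    return 0.5 * (image_to_text + text_to_image)

# Toy batch of two pairs (hypothetical embeddings, not real encoder output).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # caption i points the same way as image i
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]   # captions deliberately mismatched

loss_aligned = clip_style_loss(image_embs, aligned_texts)
loss_swapped = clip_style_loss(image_embs, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is low when each image is most similar to its own caption and high when the pairs are scrambled, which is the signal that pulls matching pairs together and pushes the rest apart during training.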

Build + Use · Python · No prerequisites
