AI From Scratch / Lesson 58 / ~90 minutes

Vision Encoder Patches

A vision model that reads pixels needs a tokenizer for pixels. Patch embedding is that tokenizer. Cut the image into a grid of squares, flatten each square, project it through one linear layer, then add a 2D position signal so the transfor...
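The steps named above (cut into a grid, flatten, project, add a 2D position signal) can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own code: the function names `patchify` and `embed_patches`, the 32x32 image, the patch size of 8, the embedding width of 16, and the choice of a row embedding plus a column embedding as the 2D position signal are all assumptions made for the example. The weights are random here; in a trained model they would be learned.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened square patches.

    Returns an array of shape (N, patch * patch * C) in row-major grid order,
    plus the grid size (rows, cols).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch size")
    gh, gw = h // patch, w // patch
    # Split each spatial axis into (grid index, offset inside the patch),
    # then bring the two grid indices to the front.
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c), (gh, gw)


def embed_patches(image, patch, weight, bias, row_pos, col_pos):
    """Project flattened patches with one linear layer and add 2D positions."""
    patches, (gh, gw) = patchify(image, patch)
    tokens = patches @ weight + bias  # shape (N, D)
    # Assumed position scheme: one vector per grid row plus one per grid column.
    pos = row_pos[:, None, :] + col_pos[None, :, :]  # shape (gh, gw, D)
    return tokens + pos.reshape(gh * gw, -1)


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch, dim = 8, 16
grid = 32 // patch

weight = rng.normal(0.0, 0.02, (patch * patch * 3, dim))
bias = np.zeros(dim)
row_pos = rng.normal(0.0, 0.02, (grid, dim))
col_pos = rng.normal(0.0, 0.02, (grid, dim))

tokens = embed_patches(image, patch, weight, bias, row_pos, col_pos)
print(tokens.shape)  # (16, 16): 16 patches, each a 16-dimensional token
```

Each row of `tokens` plays the role that a word embedding plays in a text model: one vector per patch, with its grid location mixed in so that two identical patches at different positions produce different tokens.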

Build · Python · No prerequisites
