Loading lesson page...
AI From Scratch/Lesson 43/~90 minutes
HDF5 Tokenized Corpus
The downloaded corpus has to land in a layout the trainer can stream from at line speed. JSONL on disk does not survive 16 dataloader workers. HDF5 with a resizable, chunked integer dataset does. This lesson builds streaming tokenization i...
BuildPythonNo prerequisites