Figure 2.
Tokenization of data for large language models. (A) Tokenization of text data. The sentence, which consists of 4 words, 3 spaces, and 1 punctuation mark, is split into distinct “tokens”. Each token is represented by a unique integer code and can consist of a whole word, part of a word, a word with its preceding space, or a punctuation mark. (B) Image data can also be tokenized: the information within the image is split into smaller individual pieces that can then be passed to the model.
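The text tokenization in panel A can be sketched as a greedy longest-match lookup against a fixed vocabulary. This is a minimal sketch: the example sentence, the token boundaries, and the integer IDs below are illustrative assumptions, not the vocabulary of any particular model (real LLM tokenizers learn subword vocabularies from data, e.g. via byte-pair encoding).

```python
# Hypothetical toy vocabulary mapping token strings to integer codes.
# The splits and IDs are illustrative assumptions, not a real model's.
# Note that some tokens bundle a word with its preceding space.
VOCAB = {
    "This": 1212,
    " is": 318,
    " a": 257,
    " sent": 1908,
    "ence": 594,
    ".": 13,
}

def tokenize(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"No token matches at position {i}")
    return ids

# A 4-word sentence becomes 6 tokens: "sentence" is split into
# a sub-word pair, and the period gets its own token.
print(tokenize("This is a sentence.", VOCAB))
```

Note that the number of tokens (6) differs from the number of words (4): common words map to single tokens, while rarer words are broken into sub-word pieces.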
