2025 Aug 20;18:1959–1969. doi: 10.2147/CCID.S522271

Figure 2.

Tokenization of data for large language models. (A) Tokenization of text data. A sentence consisting of 4 words, 3 spaces, and 1 punctuation mark is split into distinct “tokens”. Each token is represented by a unique code and may be a whole word, part of a word, a word together with an adjacent space, or a punctuation mark. (B) Image data can also be tokenized: the information within the image is split into smaller individual pieces that are then passed to the model.
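As a rough illustration of both panels, the sketch below splits a sentence into tokens, assigns each token an integer code, and cuts a small image grid into patches. Everything in it is an assumption made for illustration: the example sentence, the regular expression, and the function names `tokenize`, `encode`, and `split_into_patches` are hypothetical and are not taken from the figure. Production language models typically learn their vocabulary from data (for example with byte-pair encoding) rather than using a fixed rule like this one.

```python
import re

# Toy rule (an assumption, not a real model's tokenizer): a token is a word
# with an optional leading space, a lone whitespace character, or any other
# single character such as a punctuation mark.
TOKEN_PATTERN = re.compile(r" ?[A-Za-z]+|\s|[^\sA-Za-z]")


def tokenize(text):
    """Split text into a list of string tokens."""
    return TOKEN_PATTERN.findall(text)


def encode(tokens, vocab):
    """Map each token to an integer code, adding unseen tokens to vocab."""
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next unused code
        ids.append(vocab[token])
    return ids


def split_into_patches(image, size):
    """Cut a 2D grid (list of rows) into flattened size x size patches.

    Assumes the image height and width are multiples of size.
    Patches are returned left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            patch = [
                image[row][col]
                for row in range(top, top + size)
                for col in range(left, left + size)
            ]
            patches.append(patch)
    return patches


# Hypothetical sentence with 4 words, 3 spaces, and 1 punctuation mark.
sentence = "Skin cancer is common."
tokens = tokenize(sentence)
token_ids = encode(tokens, vocab={})

# A 4 x 4 grid of pixel values standing in for an image.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
patches = split_into_patches(image, size=2)
```

With this toy rule the sentence yields five tokens, because each space is attached to the word that follows it, and the 4 x 4 grid yields four patches of four values each. A real tokenizer may split the same sentence differently, for instance by breaking a rare word into several sub-word pieces.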