2022 Nov;36(5-6):587–602. doi: 10.1177/10943420221121804

Figure 3.

The vocabulary generated by WordPiece tokenization represents commonly occurring sub-sequences from the training data as individual tokens. The histogram shows the distribution of the number of characters per token across the vocabulary, along with the chemical structures of sample tokens of different lengths.
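
A histogram of this kind could be computed from a vocabulary list roughly as follows. This is a minimal sketch, not the authors' code: the toy vocabulary, the function name `token_length_histogram`, the stripping of the `##` continuation prefix, and the exclusion of special tokens are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy vocabulary of SMILES sub-sequences; the real vocabulary
# from the paper is not reproduced here.
toy_vocab = ["C", "c", "##O", "##cc", "C(=O)", "##c1ccccc1", "[UNK]"]

# Assumed set of special tokens to leave out of the length count.
SPECIAL_TOKENS = ("[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]")


def token_length_histogram(vocab, special_tokens=SPECIAL_TOKENS):
    """Count how many vocabulary tokens have each character length."""
    lengths = []
    for token in vocab:
        if token in special_tokens:
            continue  # special tokens do not encode chemical sub-sequences
        # Assumption: "##" marks a WordPiece continuation piece and is not
        # counted as part of the token's characters.
        piece = token[2:] if token.startswith("##") else token
        lengths.append(len(piece))
    return Counter(lengths)


histogram = token_length_histogram(toy_vocab)
for length in sorted(histogram):
    print(length, histogram[length])
```

The resulting mapping from character length to token count is what a bar plot like the one in the figure would display.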