2021 Sep 21;9(9):e30223. doi: 10.2196/30223

Table 3.

Comparison of different tokenizers and the numbers of input tokens.

| Tokenizer | Average (train) | Average (valid) | Average (test) | Maximum (train) | Maximum (valid) | Maximum (test) |
| --- | --- | --- | --- | --- | --- | --- |
| MeCab-ko | 494.2 | 522.3 | 520.1 | 2227 | 2219 | 2240 |
| WordPiece for BERT^a | 664.7 | 698.7 | 695.9 | 4096 | 2943 | 4171 |
| WordPiece for ELECTRA^b | 564.9 | 596.3 | 593.7 | 2656 | 2472 | 2500 |
| MeCab-ko and WordPiece | 540.6 | 570.0 | 567.6 | 2608 | 2431 | 2435 |
| MeCab-ko (trimmed^c) | 370.4 | 419.6 | 418.8 | 2162 | 2219 | 2216 |
| MeCab-ko and WordPiece (trimmed^c) | 404.9 | 457.6 | 456.6 | 2365 | 2431 | 2412 |

^a BERT: bidirectional encoder representations from transformers.

^b ELECTRA: efficiently learning an encoder that classifies token replacements accurately.

^c Trimmed: the data sets were trimmed based on the keyword "thyroid" in the free-text portion of the comprehensive medical examination records.
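The per-split statistics in the table (average and maximum token counts for each tokenizer) can be sketched as follows. This is a minimal illustration, not the study's code: the real pipeline tokenizes Korean clinical text with MeCab-ko and/or WordPiece, whereas here a plain whitespace split stands in for the tokenizer, and the split names and sample texts are placeholders.

```python
# Sketch: compute average and maximum token counts per data split.
# `tokenize` is any callable mapping a text to a list of tokens;
# str.split is used below as a stand-in for MeCab-ko/WordPiece.

def token_count_stats(texts, tokenize):
    """Return (average, maximum) number of tokens over a list of texts."""
    counts = [len(tokenize(t)) for t in texts]
    return round(sum(counts) / len(counts), 1), max(counts)

# Hypothetical splits mirroring the table's train/valid/test layout.
splits = {
    "train": ["thyroid nodule observed in right lobe", "no abnormal finding"],
    "valid": ["thyroid function within normal range"],
    "test":  ["diffuse enlargement of the thyroid gland noted"],
}

for name, texts in splits.items():
    avg, mx = token_count_stats(texts, str.split)
    print(f"{name}: average={avg}, maximum={mx}")
```

Swapping `str.split` for an actual tokenizer's encode function (e.g. a trained WordPiece model) reproduces the kind of comparison shown in the table.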