Table 3. Comparison of different tokenizers and the numbers of input tokens.
| Tokenizer | Average number of tokens | | | Maximum number of tokens | | |
|---|---|---|---|---|---|---|
| | Train | Valid | Test | Train | Valid | Test |
| MeCab-ko | 494.2 | 522.3 | 520.1 | 2227 | 2219 | 2240 |
| WordPiece for BERT^a | 664.7 | 698.7 | 695.9 | 4096 | 2943 | 4171 |
| WordPiece for ELECTRA^b | 564.9 | 596.3 | 593.7 | 2656 | 2472 | 2500 |
| MeCab-ko and WordPiece | 540.6 | 570.0 | 567.6 | 2608 | 2431 | 2435 |
| MeCab-ko (trimmed^c) | 370.4 | 419.6 | 418.8 | 2162 | 2219 | 2216 |
| MeCab-ko and WordPiece (trimmed^c) | 404.9 | 457.6 | 456.6 | 2365 | 2431 | 2412 |
^a BERT: bidirectional encoder representations from transformers.
^b ELECTRA: efficiently learning an encoder that classifies token replacements accurately.
^c Trimmed: the data sets were trimmed based on the keyword "thyroid" in the comprehensive medical examination text.
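For reference, the following is a minimal sketch of how per-split token statistics such as those in Table 3 could be computed. The implementation details are assumptions, not the authors' code: MeCab-ko is accessed through KoNLPy, the WordPiece tokenizer is loaded with the Hugging Face transformers library, and `klue/bert-base` is only a placeholder checkpoint.

```python
# Sketch of computing average and maximum token counts per split (assumed tooling).
from statistics import mean

from konlpy.tag import Mecab             # MeCab-ko morphological analyzer (assumption)
from transformers import AutoTokenizer   # WordPiece tokenizer, e.g. for BERT (assumption)

mecab = Mecab()
wordpiece = AutoTokenizer.from_pretrained("klue/bert-base")  # placeholder checkpoint


def token_stats(texts, tokenize):
    """Return (average, maximum) number of tokens over a list of texts."""
    counts = [len(tokenize(t)) for t in texts]
    return mean(counts), max(counts)


def mecab_then_wordpiece(text):
    """MeCab-ko morpheme split followed by WordPiece sub-tokenization."""
    return [sub for morph in mecab.morphs(text) for sub in wordpiece.tokenize(morph)]


# Hypothetical input: comprehensive medical examination texts for one split.
train_texts = ["..."]
print("MeCab-ko:", token_stats(train_texts, mecab.morphs))
print("WordPiece:", token_stats(train_texts, wordpiece.tokenize))
print("MeCab-ko and WordPiece:", token_stats(train_texts, mecab_then_wordpiece))
```

The same statistics would be computed for the valid and test splits, and for the trimmed data sets described in footnote c.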