Table 3. Comparison of different tokenizers and the numbers of input tokens.
| Tokenizer | Average number of tokens | | | Maximum number of tokens | | |
|---|---|---|---|---|---|---|
| | Train | Valid | Test | Train | Valid | Test |
| MeCab-ko | 494.2 | 522.3 | 520.1 | 2227 | 2219 | 2240 |
| WordPiece for BERT^a | 664.7 | 698.7 | 695.9 | 4096 | 2943 | 4171 |
| WordPiece for ELECTRA^b | 564.9 | 596.3 | 593.7 | 2656 | 2472 | 2500 |
| MeCab-ko and WordPiece | 540.6 | 570.0 | 567.6 | 2608 | 2431 | 2435 |
| MeCab-ko (trimmed^c) | 370.4 | 419.6 | 418.8 | 2162 | 2219 | 2216 |
| MeCab-ko and WordPiece (trimmed^c) | 404.9 | 457.6 | 456.6 | 2365 | 2431 | 2412 |
^a BERT: bidirectional encoder representations from transformers.
^b ELECTRA: efficiently learning an encoder that classifies token replacements accurately.
^c Trimmed: the data sets were trimmed based on the keyword "thyroid" in the comprehensive medical examination text.
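For reference, the following is a minimal sketch of how per-split token statistics such as those in Table 3 could be computed. The implementation details are assumptions, not the authors' code: MeCab-ko is accessed through KoNLPy, the WordPiece tokenizer is loaded with the Hugging Face transformers library, and `klue/bert-base` is only a placeholder checkpoint.

```python
# Sketch of computing average and maximum token counts per split (assumed tooling).
from statistics import mean

from konlpy.tag import Mecab             # MeCab-ko morphological analyzer (assumption)
from transformers import AutoTokenizer   # WordPiece tokenizer, e.g. for BERT (assumption)

mecab = Mecab()
wordpiece = AutoTokenizer.from_pretrained("klue/bert-base")  # placeholder checkpoint


def token_stats(texts, tokenize):
    """Return (average, maximum) number of tokens over a list of texts."""
    counts = [len(tokenize(t)) for t in texts]
    return mean(counts), max(counts)


def mecab_then_wordpiece(text):
    """MeCab-ko morpheme split followed by WordPiece sub-tokenization."""
    return [sub for morph in mecab.morphs(text) for sub in wordpiece.tokenize(morph)]


# Hypothetical input: comprehensive medical examination texts for one split.
train_texts = ["..."]
print("MeCab-ko:", token_stats(train_texts, mecab.morphs))
print("WordPiece:", token_stats(train_texts, wordpiece.tokenize))
print("MeCab-ko and WordPiece:", token_stats(train_texts, mecab_then_wordpiece))
```

The same statistics would be computed for the valid and test splits, and for the trimmed data sets described in footnote c.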