Table 2. The specifications of each BERT.
| | UTH-BERT | KU-BERT | TU-BERT | mBERT |
|---|---|---|---|---|
| Publisher | The University of Tokyo Hospital | The University of Kyoto | The University of Tohoku | |
| Language | Japanese | Japanese | Japanese | Multilingual |
| Pre-training corpus | Clinical text (120 million) | JP Wikipedia (18 million) | JP Wikipedia (18 million) | 104 languages of Wikipedias |
| Tokenizer: morphological analyzer | MeCab | Juman++ | MeCab | - |
| Tokenizer: external dictionary | Mecab-ipadic-neologd, J-MeDic | - | Mecab-ipadic | - |
| Vocabulary size | 25,000 | 32,000 | 32,000 | 119,448 |
| Total number of [UNK] tokens in the MedWeb dataset | 253 (0.68%) | 394 (1.11%) | 369 (0.94%) | 1 (0.00%) |
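
The [UNK] row reports a count together with a percentage. As a minimal sketch of how such a figure could be computed, the snippet below assumes the percentage is the number of [UNK] tokens divided by the total number of tokens produced for the dataset; the table does not state the denominator. The function name `count_unk`, the toy vocabulary, and the toy token list are hypothetical and are not taken from MedWeb or from any of the models above. The tokens are assumed to be already split by the model's tokenizer.

```python
def count_unk(tokens, vocab, unk_token="[UNK]"):
    """Replace out-of-vocabulary tokens with unk_token and report (count, percent).

    Assumption: percent = 100 * (number of unknown tokens) / (total tokens).
    """
    mapped = [t if t in vocab else unk_token for t in tokens]
    n_unk = sum(1 for t in mapped if t == unk_token)
    percent = 100.0 * n_unk / len(mapped) if mapped else 0.0
    return n_unk, percent


# Toy data for illustration only (hypothetical, not MedWeb).
vocab = {"i", "have", "a", "fever", "cough"}
tokens = ["i", "have", "a", "fever", "and", "cough"]

n_unk, percent = count_unk(tokens, vocab)
print(f"{n_unk} ({percent:.2f}%)")  # same "count (percent)" layout as the table
```

Under this reading, a larger or domain-matched vocabulary would be expected to leave fewer tokens unmapped, which is what the [UNK] row compares across models.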