2021 Nov 9;16(11):e0259763. doi: 10.1371/journal.pone.0259763

Table 2. Specifications of each BERT model.

| | UTH-BERT | KU-BERT | TU-BERT | mBERT |
|---|---|---|---|---|
| Publisher | The University of Tokyo Hospital | The University of Kyoto | The University of Tohoku | Google |
| Language | Japanese | Japanese | Japanese | Multilingual |
| Pre-training corpus | Clinical text (120 million) | JP Wikipedia (18 million) | JP Wikipedia (18 million) | Wikipedias in 104 languages |
| Tokenizer: morphological analyzer | MeCab | Juman++ | MeCab | - |
| Tokenizer: external dictionary | Mecab-ipadic-neologd, J-MeDic | - | Mecab-ipadic | - |
| Vocabulary size | 25,000 | 32,000 | 32,000 | 119,448 |
| Total number of [UNK] tokens in the MedWeb dataset | 253 (0.68%) | 394 (1.11%) | 369 (0.94%) | 1 (0.00%) |
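
The last row reports a count of [UNK] tokens together with a percentage. The table does not say how the percentage was derived. The sketch below assumes it is the number of [UNK] tokens divided by the total number of tokens produced for the dataset. The function names, the toy vocabulary, and the whitespace tokenizer are hypothetical stand-ins for illustration only; they are not the tokenizers listed in the table, which combine a morphological analyzer with subword splitting.

```python
from collections import Counter

UNK = "[UNK]"


def count_unk(token_sequences):
    """Return (unk_count, total_tokens, unk_percent) over tokenized sentences.

    Assumption: the percentage is unk_count / total_tokens * 100.
    """
    counts = Counter()
    for tokens in token_sequences:
        counts["total"] += len(tokens)
        counts["unk"] += sum(1 for t in tokens if t == UNK)
    total = counts["total"]
    percent = 100.0 * counts["unk"] / total if total else 0.0
    return counts["unk"], total, percent


def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer: out-of-vocabulary words become [UNK]."""
    return [w if w in vocab else UNK for w in text.split()]


# Toy data, not taken from MedWeb.
vocab = {"fever", "cough", "has", "patient", "the"}
sentences = ["the patient has fever", "the patient has dyspnea"]

tokenized = [toy_tokenize(s, vocab) for s in sentences]
unk, total, pct = count_unk(tokenized)
print(f"{unk} ({pct:.2f}%) of {total} tokens")
```

With a real model, the same counting function could be applied to the token lists returned by that model's own tokenizer, so that the vocabularies are compared on identical input text.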