2021 Feb 22;9(2):e21679. doi: 10.2196/21679

Table 1.

Description of word embedding sources used.

| Name | Author and source | Training source | Unit | Context window | Preprocess (reduce case, remove stop words, term types) | Embedding technology (gensim, FastText, GloVe^a, BERT^b, ELMO, etc) | Returned unit (1-3 ngrams) | Embedding size | Vocab size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BioNLP^c Lab PubMed + PMC^d W2V | Pyysalo et al 2013 [14,33] | PubMed/PMC articles | Token | 5 | Mixed case, no stop words, skip-grams | word2Vec | 1 ngram | 200 | ~4 billion tokens |
| BioNLP Lab Wiki + PubMed + PMC W2V | Pyysalo et al 2013 [14,33] | Wikipedia, PubMed/PMC articles | Token | 5 | Mixed case, no stop words, skip-grams | word2Vec | 1 ngram | 200 | ~5.4 billion tokens |
| BioASQ | Tsatsaronis et al 2015 [13,34] | PubMed abstracts | Token | 5 | Lowercase, no stop words, continuous bag of words | word2Vec | 1 ngram | 200 | ~1.7 billion tokens |
| Clinical Embeddings W2V300 | Flamholz et al 2019 [16,35] | PubMed/PMC/MIMIC III^e | Token | 7 | Lowercase, include stop words, skip-grams | word2Vec | 1-3 ngrams | 300 | ~300k tokens |
| BioWordVec Extrinsic | Zhang et al 2019 [15,36] | PubMed + MeSH^f | Character | 5 | Lowercase, include stop words | FastText | 1-3 ngrams | 200 | ~2.3 billion tokens |
| BioWordVec Intrinsic | Zhang et al 2019 [15,36] | PubMed + MeSH | Character | 20 | Lowercase, include stop words | FastText | 1-3 ngrams | 200 | ~2.3 million tokens |
| Standard GloVe Embeddings | Pennington et al 2014 [11] | Common Crawl | Token | 10 | Mixed case | GloVe | 1 ngram | 300 | ~2.1 billion tokens |

^a GloVe: global vectors for word representation.

^b BERT: Bidirectional Encoder Representations from Transformers.

^c BioNLP: biomedical natural language processing.

^d PMC: PubMed Central.

^e MIMIC III: Medical Information Mart for Intensive Care III.

^f MeSH: Medical Subject Headings.
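
The "Context window" and "skip-grams" entries in Table 1 can be made concrete with a small sketch. The function below is illustrative only and is not taken from any of the cited embedding sources. The name `skipgram_pairs` and the example sentence are hypothetical. It enumerates the (center, context) training pairs that a skip-gram model with a fixed symmetric window would see. Real word2vec implementations may also shrink the window at random and subsample frequent words, both of which are omitted here.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for tokens within `window` positions.

    Hypothetical helper for illustration; uses a fixed symmetric window.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # skip the center token itself
                yield center, tokens[j]


# Hypothetical example sentence, already tokenized and lowercased.
sentence = ["patient", "denies", "chest", "pain"]
pairs = list(skipgram_pairs(sentence, window=1))
for center, context in pairs:
    print(center, "->", context)
```

With a larger window, such as the values of 5 to 20 listed in the table, each token is paired with more distant neighbors, so the number of training pairs per sentence grows until the window covers the whole sentence.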