Table 1.
Description of word embedding sources used.
| Name | Author and source | Training source | Unit | Context window | Preprocess (reduce case, remove stop words, term types) | Embedding technology (gensim, FastText, GloVe^a, BERT^b, ELMo, etc) | Returned unit (1-3 ngrams) | Embedding size | Vocab size |
| BioNLP^c Lab PubMed + PMC^d W2V | Pyysalo et al 2013 [14,33] | PubMed/PMC articles | Token | 5 | Mixed case, no stop words, skip-grams | word2vec | 1 ngram | 200 | ~4 billion tokens |
| BioNLP Lab Wiki + PubMed + PMC W2V | Pyysalo et al 2013 [14,33] | Wikipedia, PubMed/PMC articles | Token | 5 | Mixed case, no stop words, skip-grams | word2vec | 1 ngram | 200 | ~5.4 billion tokens |
| BioASQ | Tsatsaronis et al 2015 [13,34] | PubMed abstracts | Token | 5 | Lowercase, no stop words, continuous bag of words | word2vec | 1 ngram | 200 | ~1.7 billion tokens |
| Clinical Embeddings W2V300 | Flamholz et al 2019 [16,35] | PubMed/PMC/MIMIC III^e | Token | 7 | Lowercase, include stop words, skip-grams | word2vec | 1-3 ngrams | 300 | ~300k tokens |
| BioWordVec Extrinsic | Zhang et al 2019 [15,36] | PubMed + MeSH^f | Character | 5 | Lowercase, include stop words | FastText | 1-3 ngrams | 200 | ~2.3 billion tokens |
| BioWordVec Intrinsic | Zhang et al 2019 [15,36] | PubMed + MeSH | Character | 20 | Lowercase, include stop words | FastText | 1-3 ngrams | 200 | ~2.3 million tokens |
| Standard GloVe Embeddings | Pennington et al 2014 [11] | Common Crawl | Token | 10 | Mixed case | GloVe | 1 ngram | 300 | ~2.1 billion tokens |
^a GloVe: global vectors for word representation.
^b BERT: Bidirectional Encoder Representations from Transformers.
^c BioNLP: biomedical natural language processing.
^d PMC: PubMed Central.
^e MIMIC III: Medical Information Mart for Intensive Care III.
^f MeSH: Medical Subject Headings.
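
To make the "Context window" and "skip-grams" columns concrete, the following is a minimal sketch of how a fixed symmetric window turns a token sequence into (center, context) training pairs. The function name `skipgram_pairs` and the example sentence are hypothetical and not taken from any of the listed sources. Real word2vec implementations typically also sample a smaller effective window at random for each position, which this sketch omits.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens within `window` positions.

    Hypothetical helper for illustration only; it uses a fixed window and
    does no subsampling or negative sampling.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sequence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up example sentence (an assumption, not from the source corpora).
tokens = ["patient", "denies", "chest", "pain"]
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 5 or more, every token in this four-token sentence would be paired with every other token, which illustrates why a larger context window (such as the 20 listed for BioWordVec Intrinsic) captures broader co-occurrence than a window of 5.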