Table 10.
Sizes of the raw text entries that we used to perform knowledge base lookup and the corresponding embedding sizes.
| Knowledge base | Raw size | Embedding size |
|---|---|---|
| NCBI Gene<sup>a</sup> (10) | 3.9 GB | 5.5 GB |
| CTD diseases (16, 17) | 6 MB | 376 MB |
| MeSH (42) | 46 MB | 2.6 GB |
| dbSNP<sup>b</sup> (64) | – | – |
| NCBI Taxonomy (63) | 317 MB | 16 GB |
| Cellosaurus (5) | 6.3 MB | 595 MB |
| Total | 4.28 GB | 25 GB |
<sup>a</sup> We only embedded the genes for the most frequent species.
<sup>b</sup> As mentioned, we use LitVar2 to perform lookups on dbSNP.
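
The Total row can be reproduced from the individual rows. The following is a minimal sketch of that arithmetic, not part of the original pipeline: the dictionary and function names are hypothetical, the values are transcribed from Table 10, and decimal units (1 GB = 1000 MB) are assumed. dbSNP is omitted because the table reports no sizes for it.

```python
# Sizes transcribed from Table 10 as (raw, embedding) pairs in MB.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.
SIZES_MB = {
    "NCBI Gene": (3900.0, 5500.0),
    "CTD diseases": (6.0, 376.0),
    "MeSH": (46.0, 2600.0),
    "NCBI Taxonomy": (317.0, 16000.0),
    "Cellosaurus": (6.3, 595.0),
}


def totals_gb(sizes_mb):
    """Return (total raw size, total embedding size) in GB."""
    raw = sum(raw_mb for raw_mb, _ in sizes_mb.values()) / 1000
    emb = sum(emb_mb for _, emb_mb in sizes_mb.values()) / 1000
    return raw, emb


raw_gb, emb_gb = totals_gb(SIZES_MB)
print(f"raw: {raw_gb:.2f} GB, embeddings: {emb_gb:.0f} GB")
```

Under these assumptions the sums round to the 4.28 GB and 25 GB reported in the Total row.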