Table 1.
Data Type | Description | Databases (name, reference, date of last data update*, URL, size*) | API |
---|---|---|---|
Genomic data | (1) Compilation of disease-gene associations. CTD covers several species, whereas the other two databases cover human only. In CTD, some interactions are manually curated rather than computationally inferred. OpenTargets and DisGeNET gather data from several curated sources. All of these databases provide a coefficient for each disease-gene association that quantifies its level of evidence. | OpenTargets [97], (2019–11) https://www.opentargets.org/ 27,069 targets × 13,579 diseases | Yes |
Genomic data | | Comparative Toxicogenomics Database (CTD) [98], (2019–11) https://ctdbase.org/ Curated: 8,637 × 5,816; Inferred: 48,634 × 3,168 | Yes |
Genomic data | | DisGeNET [99], (2019–07) http://www.disgenet.org/ 17,549 targets × 24,166 diseases/traits | Yes |
Genomic data | (2) SNP reporting; COSMIC reports data manually curated by experts. | COSMIC [100], (2019–09) https://cancer.sanger.ac.uk/cosmic 1,207,190 copy number variants; 9,197,630 gene expression variants; 7,929,832 differentially methylated CpGs; 13,099,101 non-coding variants | Yes |
Genomic data | (3) Regulatory system (e.g., cis-regulatory modules) data. CisView focuses on the mouse (Mus musculus); its data is collected using a TF binding motif analysis on ChIP-seq experiments, and it reports several measures of interest, such as conservation scores and quality assessment of the inferred bindings. UK BioBank collects various types of information (genomics, imaging) from a large anonymized human cohort (around 500,000 people). | CisView [101], (2016–12) https://lgsun.irp.nia.nih.gov/geneindex/cisview.html | No |
Genomic data | | UK BioBank [102], (2019–09) https://www.ukbiobank.ac.uk/ | Yes |
Interaction data | (1) Protein-protein or pathway information. STRING reports protein-protein interactions (PPIs) for thousands of organisms, classified according to their level of evidence: computationally inferred (via functional enrichment analysis), experimentally proven, or extracted from curated databases. A score combining all this information is assigned to each PPI. KEGG gathers manually assembled biological (signaling and metabolic) pathways. | STRING database [103], (2019–01) https://string-db.org/ 24,584,628 proteins and 3,123,056,667 interactions | Yes |
Interaction data | | KEGG Pathway database [104], (2019–11) https://www.genome.jp/kegg/pathway.html | Yes |
Interaction data | (2) Biological models of gene and pathway interactions. CausalBioNet collects manually curated, machine-readable rat, mouse and human models (encoded in BEL, convertible into SBML). BioModels lists literature-based models, some of them manually curated, as well as computationally inferred ones, mostly in SBML format. | Causal BioNet [105] http://causalbionet.com/ | No |
Interaction data | | BioModels [106], (2017–06) https://www.ebi.ac.uk/biomodels/ Manually curated: 831 models; Literature-based: 1,640 models | Yes |
Interaction data | (3) Drug signatures (gene-wise expression changes due to treatment) in immortalized human cell lines, from standardized experiments. CMap is an earlier version of LINCS L1000 and is no longer supported. | Connectivity Map (CMap) [107] https://portals.broadinstitute.org/cmap/ 1,309 compounds × 4 cell lines × 154 concentrations | Yes |
Interaction data | | LINCS [92] https://clue.io/lincs 51,423 perturbation types; 2,570 cell lines; 4 doses | Yes |
Drug-Disease associations | These databases provide information about potential therapeutic targets for diseases, along with the chemical compounds that interact with them. PROMISCUOUS reports associations text-mined from the literature, although some of the texts are manually curated. | Therapeutic Target Database (TTD) [108], (2019–07) http://bidd.nus.edu.sg/group/cjttd/ 3,419 targets × 37,316 drugs | No |
Drug-Disease associations | | PROMISCUOUS [109] http://bioinformatics.charite.de/promiscuous/ 10,208,308 proteins × 25,170 compounds | No |
Clinical trials | Repositories of clinical trial settings, status, and results. ClinicalTrials.gov is a large database that mostly collects information about US-based trials (formatted in XML), whereas RepoDB provides visualization and data querying. Clinical trial data is a good source of information for machine learning methods because it also lists negative results (that is, drugs that failed to prove useful in treatment) and, potentially, the reasons for failure. | RepoDB [110], (2017–07) http://apps.chiragjpgroup.org/repoDB/ 1,571 approved drugs × 2,051 diseases | No |
Clinical trials | | ClinicalTrials.gov https://clinicaltrials.gov 323,890 studies | Yes |
Chemical & Drug data | (1) Protein-related; automatic annotations. | UniProt [111], (2019–11) https://www.uniprot.org/ 561,356 proteins (Swiss-Prot dataset); 181,787,788 proteins (TrEMBL) | Yes |
Chemical & Drug data | (2) Drug-related; comprises approved and withdrawn drugs, as well as tool chemical compounds, and reports their potential indications. | DrugBank [112], (2019–07) https://www.drugbank.ca/ 13,450 drugs | Yes |
Chemical & Drug data | (3) ADMET drug properties (among other types of relevant drug information). | ChEMBL [113], (2018–12) https://www.ebi.ac.uk/chembl/ 1,879,206 compounds × 12,482 targets | Yes |
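
Several of the disease-gene resources in Table 1 attach an evidence coefficient to each association. As an illustration only, the sketch below shows how such coefficients might be used to keep the better-supported associations. The `Association` record, the `filter_by_evidence` helper, the assumed [0, 1] score range, and the toy values are all hypothetical and do not reflect the schema or API of any listed database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """Hypothetical disease-gene record with an evidence coefficient."""
    gene: str
    disease: str
    score: float  # assumed to lie in [0, 1]; real databases use their own scales


def filter_by_evidence(associations, min_score):
    """Keep associations scoring at least min_score, strongest first."""
    kept = [a for a in associations if a.score >= min_score]
    return sorted(kept, key=lambda a: a.score, reverse=True)


# Toy records invented for this example, not taken from any database.
toy = [
    Association("GENE_A", "disease_X", 0.92),
    Association("GENE_B", "disease_X", 0.15),
    Association("GENE_C", "disease_Y", 0.60),
]

strong = filter_by_evidence(toy, min_score=0.5)
print([(a.gene, a.disease, a.score) for a in strong])
```

The threshold is a design choice: a stricter cut-off trades coverage for confidence, and a suitable value would depend on how a given database defines and calibrates its coefficient.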