. 2019 Dec 26;18:241–252. doi: 10.1016/j.csbj.2019.12.006

Table 1.

List of datasets that are relevant for drug development, ordered according to their type. *When provided by the contributors to the database.

Data Type Description Databases (name, reference, date of last data update*, URL, size*) API
Genomic data (1) Compilation of disease-gene associations; CTD covers several species, whereas the other two databases cover human only. In CTD, some interactions are manually curated rather than computationally inferred. OpenTargets and DisGeNET gather data from several curated sources. Each of these databases assigns every disease-gene association a coefficient quantifying its level of evidence. OpenTargets [97], (2019–11)
https://www.opentargets.org/
27,069 targets × 13,579 diseases
Yes
Comparative Toxicogenomics Database (CTD) [98], (2019–11)
https://ctdbase.org/
Curated: 8,637 × 5816
Inferred: 48,634 × 3168
Yes
DisGeNET [99], (2019–07)
http://www.disgenet.org/
17,549 targets × 24,166 diseases/traits
Yes
(2) SNP reporting; COSMIC reports data manually curated by experts. COSMIC [100], (2019–09)
https://cancer.sanger.ac.uk/cosmic
1,207,190 copy number variants
9,197,630 gene expression variants
7,929,832 differentially methylated CpGs
13,099,101 non coding variants
Yes
(3) Regulatory system data (e.g., cis-regulatory modules): CisView focuses on the mouse (Mus musculus), and its data are obtained by transcription factor (TF) binding motif analysis of ChIP-seq experiments. It reports several measures of interest, such as conservation scores and quality assessments of the inferred binding sites. UK BioBank collects various types of information (genomics, imaging) from a large anonymized human cohort (around 500,000 people). CisView [101], (2016–12)
https://lgsun.irp.nia.nih.gov/geneindex/cisview.html
No
UK BioBank [102], (2019–09)
https://www.ukbiobank.ac.uk/
Yes
Interaction data (1) Protein-protein or pathway information; STRING reports protein-protein interactions (PPIs) for thousands of organisms, classified according to their level of evidence: computationally inferred (via functional enrichment analysis), experimentally proven, or extracted from curated databases. Each PPI is assigned a score combining all of this information. KEGG gathers manually assembled biological (signaling and metabolic) pathways. STRING database [103], (2019–01)
https://string-db.org/
24,584,628 proteins and 3,123,056,667 interactions
Yes
KEGG Pathway database [104] (2019–11)
https://www.genome.jp/kegg/pathway.html
Yes
(2) Biological models of gene and pathway interactions; Causal BioNet collects manually curated, machine-readable rat, mouse and human models (encoded in BEL, convertible to SBML). BioModels lists literature-based models (some of them manually curated) and computationally inferred ones, mostly in SBML format. Causal BioNet [105]
http://causalbionet.com/
No
BioModels [106], (2017–06)
https://www.ebi.ac.uk/biomodels/
Manually curated: 831 models
Literature-based: 1640 models
Yes
(3) Drug signatures (gene-wise expression changes due to treatment) in immortalized human cell lines, from standardized experiments. CMap is an earlier version of LINCS L1000 and is no longer supported. Connectivity Map (CMap) [107]
https://portals.broadinstitute.org/cmap/
1309 compounds × 4 cell lines × 154 concentrations
Yes
LINCS [92]
https://clue.io/lincs
51,423 perturbation types
2570 cell lines
4 doses
Yes
Drug-Disease associations These databases provide information about potential therapeutic targets for diseases, along with the chemical compounds that interact with them. PROMISCUOUS reports associations text-mined from the literature, although some of the texts are manually curated. Therapeutic Target Database (TTD) [108], (2019–07)
http://bidd.nus.edu.sg/group/cjttd/
3419 targets × 37,316 drugs
No
PROMISCUOUS [109]
http://bioinformatics.charite.de/promiscuous/
10,208,308 proteins × 25,170 compounds
No
Clinical trials Repositories of clinical trial settings, status, and results. ClinicalTrials.gov is a large database that mostly collects information about US-located trials (formatted in XML), whereas RepoDB provides visualization and data querying. Clinical trial data are a good source of information for machine learning methods because they also list negative results (that is, drugs that failed to show a benefit in treatment) and, potentially, the reasons for failure. RepoDB [110], (2017–07)
http://apps.chiragjpgroup.org/repoDB/
1571 approved drugs × 2051 diseases
No
ClinicalTrials.gov
https://clinicaltrials.gov
323,890 studies
Yes
Chemical & Drug data (1) Protein-related; automatic annotations. UniProt [111], (2019–11)
https://www.uniprot.org/
561,356 proteins (Swiss-Prot dataset)
181,787,788 proteins (TrEMBL)
Yes
(2) Drug-related; covers approved and withdrawn drugs as well as tool compounds, and reports their potential indications. DrugBank [112], (2019–07)
https://www.drugbank.ca/
13,450 drugs
Yes
(3) ADMET drug properties (among other types of relevant drug information). ChEMBL [113], (2018–12)
https://www.ebi.ac.uk/chembl/
1,879,206 compounds × 12,482 targets
Yes
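
When choosing data sources for an automated pipeline, the API column of Table 1 is often the first filter. The following is a minimal sketch, not part of the original article, that transcribes the name, data type, and API columns of the table into Python records and summarizes programmatic availability. The class name `DatabaseEntry` and the helper functions `without_api` and `api_coverage_by_type` are hypothetical names introduced only for illustration; the values are copied from the table as listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseEntry:
    """One row of Table 1, reduced to three columns (hypothetical helper type)."""
    name: str
    data_type: str
    has_api: bool


# Rows transcribed from Table 1 (name, data type, API column).
TABLE_1 = [
    DatabaseEntry("OpenTargets", "Genomic data", True),
    DatabaseEntry("CTD", "Genomic data", True),
    DatabaseEntry("DisGeNET", "Genomic data", True),
    DatabaseEntry("COSMIC", "Genomic data", True),
    DatabaseEntry("CisView", "Genomic data", False),
    DatabaseEntry("UK BioBank", "Genomic data", True),
    DatabaseEntry("STRING", "Interaction data", True),
    DatabaseEntry("KEGG Pathway", "Interaction data", True),
    DatabaseEntry("Causal BioNet", "Interaction data", False),
    DatabaseEntry("BioModels", "Interaction data", True),
    DatabaseEntry("CMap", "Interaction data", True),
    DatabaseEntry("LINCS", "Interaction data", True),
    DatabaseEntry("TTD", "Drug-Disease associations", False),
    DatabaseEntry("PROMISCUOUS", "Drug-Disease associations", False),
    DatabaseEntry("RepoDB", "Clinical trials", False),
    DatabaseEntry("ClinicalTrials.gov", "Clinical trials", True),
    DatabaseEntry("UniProt", "Chemical & Drug data", True),
    DatabaseEntry("DrugBank", "Chemical & Drug data", True),
    DatabaseEntry("ChEMBL", "Chemical & Drug data", True),
]


def without_api(entries):
    """Return the names of databases listed with API = No, in table order."""
    return [e.name for e in entries if not e.has_api]


def api_coverage_by_type(entries):
    """Map each data type to a (databases with API, total databases) pair."""
    coverage = {}
    for e in entries:
        with_api, total = coverage.get(e.data_type, (0, 0))
        coverage[e.data_type] = (with_api + int(e.has_api), total + 1)
    return coverage


if __name__ == "__main__":
    print("No API:", ", ".join(without_api(TABLE_1)))
    for data_type, (with_api, total) in api_coverage_by_type(TABLE_1).items():
        print(f"{data_type}: {with_api}/{total} databases offer an API")
```

The records deliberately omit sizes and update dates, which the table reports only when the database contributors provide them; those fields could be added as optional attributes if needed.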