2019 Dec 26;18:241–252. doi: 10.1016/j.csbj.2019.12.006

Table 1.

List of datasets that are relevant for drug development, ordered according to their type. *When provided by the contributors to the database.

Data Type	Description	Databases (name, reference, date of last data update, URL, size)	API
Genomic data	(1) Compilation of disease-gene associations. CTD covers several species, whereas the other two databases cover human only. In CTD, some interactions are manually curated rather than computationally inferred. OpenTargets and DisGeNET gather data from several curated sources. All three databases assign each disease-gene association a coefficient quantifying its level of evidence.	OpenTargets [97], (2019–11) https://www.opentargets.org/ 27,069 targets × 13,579 diseases	Yes
		Comparative Toxicogenomics Database (CTD) [98], (2019–11) https://ctdbase.org/ Curated: 8,637 × 5,816 Inferred: 48,634 × 3,168	Yes
		DisGeNET [99], (2019–07) http://www.disgenet.org/ 17,549 targets × 24,166 diseases/traits	Yes
	(2) SNP reporting; COSMIC reports data manually curated by experts.	COSMIC [100], (2019–09) https://cancer.sanger.ac.uk/cosmic 1,207,190 copy number variants 9,197,630 gene expression variants 7,929,832 differentially methylated CpGs 13,099,101 non coding variants	Yes
	(3) Regulatory system (e.g., cis-regulatory modules) data. CisView focuses on the mouse (Mus musculus); its data are collected by TF binding motif analysis of ChIP-seq experiments, and it reports several measures of interest, such as conservation scores and quality assessments of the inferred bindings. UK BioBank collects various types of information (genomics, imaging) from a large anonymized human cohort (around 500,000 people).	CisView [101], (2016–12) https://lgsun.irp.nia.nih.gov/geneindex/cisview.html	No
		UK BioBank [102], (2019–09) https://www.ukbiobank.ac.uk/	Yes
Interaction data	(1) Protein-protein or pathway information. STRING reports protein-protein interactions (PPIs) for thousands of organisms, classified according to their level of evidence: computationally inferred (via functional enrichment analysis), experimentally proven, or extracted from curated databases. A score combining all this information is associated with each PPI. KEGG gathers manually assembled biological (signaling and metabolic) pathways.	STRING database [103], (2019–01) https://string-db.org/ 24,584,628 proteins and 3,123,056,667 interactions	Yes
		KEGG Pathway database [104] (2019–11) https://www.genome.jp/kegg/pathway.html	Yes
	(2) Biological models of gene and pathway interactions. Causal BioNet collects manually curated, machine-readable rat, mouse and human models (encoded in the BEL language and convertible to SBML). BioModels lists literature-based models, some of which are manually curated, as well as computationally inferred ones, mostly in SBML format.	Causal BioNet [105] http://causalbionet.com/	No
		BioModels [106], (2017–06) https://www.ebi.ac.uk/biomodels/ Manually curated: 831 models Literature-based: 1640 models	Yes
	(3) Drug signatures (genewise expression changes due to treatment) in human immortalized cell lines, from standardized experiments. CMap is a preliminary version of LINCS L1000 and is no longer supported.	Connectivity Map (CMap) [107] https://portals.broadinstitute.org/cmap/ 1309 compounds × 4 cell lines × 154 concentrations	Yes
		LINCS [92] https://clue.io/lincs 51,423 perturbation types 2570 cell lines 4 doses	Yes
Drug-Disease associations	These databases provide information about potential therapeutic targets for diseases, along with the chemical compounds that interact with them. PROMISCUOUS reports associations obtained by text mining of the literature, although some of the texts are manually curated.	Therapeutic Target Database (TTD) [108], (2019–07) http://bidd.nus.edu.sg/group/cjttd/ 3419 targets × 37,316 drugs	No
		PROMISCUOUS [109] http://bioinformatics.charite.de/promiscuous/ 10,208,308 proteins × 25,170 compounds	No
Clinical trials	Repositories of clinical trial settings, status, and results. ClinicalTrials.gov is a large database that mostly collects information about US-located trials (formatted in XML), whereas RepoDB provides visualization and data querying. Clinical trial data is a good source of information for machine learning methods because it also lists negative results (that is, drugs that failed to show a benefit in treatment) and, potentially, the reasons for failure.	RepoDB [110], (2017–07) http://apps.chiragjpgroup.org/repoDB/ 1571 approved drugs × 2051 diseases	No
		ClinicalTrials.gov https://clinicaltrials.gov 323,890 studies	Yes
Chemical & Drug data	(1) Protein-related; automatic annotations.	UniProt [111], (2019–11) https://www.uniprot.org/ 561,356 proteins (Swiss-Prot dataset) 181,787,788 proteins (TrEMBL)	Yes
	(2) Drug-related; covers approved and withdrawn drugs as well as tool compounds, and reports their potential indications.	DrugBank [112], (2019–07) https://www.drugbank.ca/ 13,450 drugs	Yes
	(3) ADMET drug properties (among other types of relevant drug information).	ChEMBL [113], (2018–12) https://www.ebi.ac.uk/chembl/ 1,879,206 compounds × 12,482 targets	Yes
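
A reader who wants to filter Table 1 programmatically, for instance to keep only the databases that expose an API, might proceed roughly as sketched below. This is a minimal illustration, not part of the original work. It assumes the table has been saved as tab-separated text and reduced to three columns (data type, database name, API), and that a blank data type on a continuation row means "same as the row above". The helper names `load_rows` and `with_api` and the `TABLE_EXCERPT` sample are hypothetical. The sample rows are copied from the table.

```python
import csv
import io

# A few rows copied from Table 1, reduced to three columns for brevity
# (data type, database name, API availability). The blank data-type cells
# on continuation rows mirror the layout of the table itself.
TABLE_EXCERPT = (
    "Genomic data\tOpenTargets\tYes\n"
    "\tComparative Toxicogenomics Database (CTD)\tYes\n"
    "\tCisView\tNo\n"
    "Interaction data\tSTRING database\tYes\n"
    "\tCausal BioNet\tNo\n"
    "Clinical trials\tRepoDB\tNo\n"
    "\tClinicalTrials.gov\tYes\n"
)


def load_rows(text):
    """Parse tab-separated rows, forward-filling blank data-type cells."""
    rows = []
    current_type = None
    for data_type, name, api in csv.reader(io.StringIO(text), delimiter="\t"):
        if data_type.strip():
            # A non-empty first cell starts a new data-type group.
            current_type = data_type.strip()
        rows.append(
            {"type": current_type, "name": name.strip(), "api": api.strip() == "Yes"}
        )
    return rows


def with_api(rows):
    """Group the names of API-enabled databases by data type."""
    grouped = {}
    for row in rows:
        if row["api"]:
            grouped.setdefault(row["type"], []).append(row["name"])
    return grouped


rows = load_rows(TABLE_EXCERPT)
api_by_type = with_api(rows)
```

Forward-filling works whether or not a continuation row repeats its data-type label, so the same loader handles both layouts that appear in the table.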