Abstract
In recent decades, the advancement of computational algorithms and the availability of big data have enabled artificial intelligence (AI) to dramatically improve predictive performance in nearly all research areas. Specifically, machine learning (ML) techniques, a major branch of AI, have been widely used in many tasks of drug discovery and development, including predicting treatment effects, identifying target genes and functional pathways, as well as selecting potential biomarkers. However, in practice, blindly applying ML methods may lead to common pitfalls, including overfitting and lack of generalizability. Therefore, how to improve the robustness and prediction accuracy of ML methods has become a crucial problem for researchers. In this review, we summarize the application of ML models to drug discovery by introducing the top-performing methods developed from large-scale drug-related data challenges in recent years.
Keywords: artificial intelligence, data challenge, drug discovery, machine learning
INTRODUCTION/BACKGROUND
The term artificial intelligence was coined in 1956 (1), and since then, people have been captivated by the idea of computers that can learn on their own, as shown in numerous movies and novels. Although these works frequently depict AI systems that rival or even surpass human intelligence, current AI systems are estimated to possess less than 1% of the computational power of the human brain (2). Despite this gap, today’s AI systems are extraordinarily useful, and with the advent of large datasets and advances in algorithms and hardware, AI has become an essential tool for engineers, researchers, and scientists.
Machine learning (ML) is a branch of AI that allows a machine to use prior knowledge and data to learn and improve a model (3–4). The process starts with giving the computer a set of data and a set of algorithms and letting the machine use these algorithms to find patterns in the data (5). The goal is that, after this process, the machine can apply what it has learned to a previously unseen dataset. A distinctive feature of ML is that the computer tunes the parameters of a model itself rather than relying on rules written by hand. ML algorithms are often categorized as either supervised learning (6) or unsupervised learning (7). Supervised learning algorithms use existing data with labels, marked by humans or measured by experiments, to build a model that then predicts labels for other, unlabeled data. Supervised learning algorithms are useful because they can automate processes that are very hard to reduce to a conventional algorithm, such as image analysis. Unsupervised learning aims to explore patterns in datasets that do not contain human-generated labels. Because the learned models are not specified in advance, evaluating ML models is a major hurdle for researchers. During the evaluation of supervised models, the machine-generated labels are compared with conventionally generated labels, usually experimental data, using metrics such as Pearson’s correlation or Spearman’s correlation. Evaluating unsupervised models can be very difficult because there are no predefined labels for the data. As a result, most research fields have developed their own field-specific evaluation metrics for unsupervised models. ML algorithms have been successfully applied to various biological and biomedical problems, including pharmaceutical research (8–10).
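As an illustration of how supervised predictions are scored against experimental measurements, the following minimal sketch computes Pearson’s and Spearman’s correlation with the Python standard library only. The function names and the toy values are assumptions made for this example and do not come from any of the challenges discussed below.

```python
import math

def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data (hypothetical): measured labels and model predictions.
measured = [0.1, 0.4, 0.35, 0.8, 0.9]
predicted = [0.2, 0.3, 0.5, 0.7, 1.5]
print(pearson(measured, predicted), spearman(measured, predicted))
```

Pearson’s correlation is sensitive to the scale of the predictions, whereas Spearman’s correlation depends only on their ordering, which is one reason challenges often report both.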
Drug discovery is a complex and time-consuming process, which comprises target identification, experimental validation, and clinical trials (11–13). Developing a drug takes 12 years on average and requires considerable research resources and effort. When combined with standard wet-lab experiments, cutting-edge ML models can greatly assist and accelerate drug discovery and development (14–16). Specifically, ML models can predict monotherapy and polytherapy effects of small-molecule candidates, as well as identify target genes, biomarkers, and related functional pathways. However, it is important to note that current AI techniques are not omnipotent in drug discovery, as overfitting is a very common issue for ML. Overfitting means that an ML model fits the training data very closely but performs poorly on unseen data. Although cross-validation strategies have been developed to alleviate overfitting during ML model development, the most effective way to detect overfitting is to evaluate ML models on new, previously unseen data. In recent years, data challenges have emerged as platforms to compare and examine, without bias, the predictive performance of ML algorithms for various drug-related tasks. Just as the Olympic Games bring out the best in athletes, data challenges stimulate the research community to develop better ML algorithms and set up new methods. Although the algorithms developed for challenges are often very effective, their true value is that they can be integrated and combined with other ML methods. Combining well-developed methods with novel algorithms has become a common way to improve the robustness and accuracy of ML methods. More importantly, novel ideas and strategies from different research groups can be integrated to further improve predictive performance. In this review, we summarize recent large-scale data challenges (Fig. 1) related to drug discovery and top-performing algorithms generated from these challenges (Fig. 2).
Fig. 1. The recent large-scale data challenges related to drug discovery
Fig. 2. The top-performing algorithms in the recent drug-related challenges
TOP-PERFORMING METHODS IN RECENT DRUG-RELATED DATA CHALLENGES
Multiple Kernel Learning and Mathematical Modeling in Drug Sensitivity and Synergy Prediction
The National Cancer Institute (NCI)-DREAM Drug Sensitivity and Drug Synergy Challenge (17), launched in 2012, was one of the first public challenges aimed at predicting drug treatment effects in breast cancer cell lines. The challenge consisted of two subparts: subchallenge1, where teams created algorithms for predicting drug sensitivity scores, and subchallenge2, where teams created algorithms for predicting drug synergy scores. The weighted average of the probabilistic concordance index (wpc-index) over the testing compounds was used to evaluate predictive performance. This index is similar to the standard c-index (18), except that the variation in the experimental dose-response measurements is also integrated into the calculation. In both subchallenges, the organizers provided the training and testing datasets; the training data comprised (1) the genomic profiles of 53 cell lines and (2) the GI50 concentrations for 31 compounds in 35 cell lines. This challenge opened a wide platform for researchers to develop algorithms for predicting drug responses and introduced various ML algorithms to the field.
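To make the scoring idea concrete, the following minimal sketch computes a plain concordance index for uncensored, continuous measurements: the fraction of comparable pairs whose predicted order matches the measured order. It is not the wpc-index used in the challenge, which additionally accounts for measurement variation and averages over compounds; the function name and the handling of ties are assumptions made for illustration.

```python
def concordance_index(measured, predicted):
    """Fraction of comparable pairs ranked in the same order by both lists.

    Pairs with tied measurements are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(measured)
    for i in range(n):
        for j in range(i + 1, n):
            if measured[i] == measured[j]:
                continue  # tied measurements carry no ordering information
            comparable += 1
            dm = measured[i] - measured[j]
            dp = predicted[i] - predicted[j]
            if dp == 0:
                concordant += 0.5
            elif (dm > 0) == (dp > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example (hypothetical values): one of three pairs is mis-ordered.
print(concordance_index([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which puts the reported scores of around 0.6 in context.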
The best-performing algorithm in subchallenge1, Bayesian multitask multiple kernel learning (17), achieved a wpc-index of 0.583 and combines four ML principles: kernelized regression, multiview learning, multitask learning, and Bayesian inference. In the preprocessing stage, the best-performing team used their biological knowledge to create three data views, which they then combined with the six datasets to create 22 views for each cell line. For each data view, they used either a Gaussian kernel (for real values) or a Jaccard similarity coefficient (for binary values) to construct an N×N kernel matrix for kernelized regression. In the multiview learning stage, they constructed a combined kernel matrix K* as a weighted sum of the K kernel matrices. The kernel weights were learned through multiple kernel learning (MKL) (19). In the multitask learning stage, to improve the generalization of the model, they trained on all the drugs concurrently with shared kernel weights. In the Bayesian inference stage, all model parameters were learned using a variational approximation scheme. To summarize, Bayesian multitask multiple kernel learning improved the prediction of drug sensitivity scores by first introducing biological information during preprocessing, then learning weights for the input datasets, and finally sharing the weights across drugs.
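The kernel construction and combination steps can be sketched as follows. This is a simplified illustration, not the winning team’s implementation: the kernel weights are fixed by hand here, whereas in the actual method they are learned by Bayesian MKL, and the toy data views, the bandwidth `sigma`, and all function names are assumptions.

```python
import math

def gaussian_kernel(rows, sigma=1.0):
    """N x N Gaussian kernel for a real-valued data view (one row per cell line)."""
    n = len(rows)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            K[i][j] = math.exp(-sq / (2 * sigma ** 2))
    return K

def jaccard_kernel(rows):
    """N x N Jaccard similarity for a binary data view."""
    n = len(rows)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both = sum(1 for a, b in zip(rows[i], rows[j]) if a and b)
            either = sum(1 for a, b in zip(rows[i], rows[j]) if a or b)
            K[i][j] = both / either if either else 1.0
    return K

def combine_kernels(kernels, weights):
    """Combined kernel K* as a weighted sum of the per-view kernels."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]

# Two hypothetical views of three cell lines.
expression = [[1.0, 2.0], [1.5, 1.8], [5.0, 0.2]]  # real-valued view
mutations = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]      # binary view

# Hand-picked weights (assumption); the original method learns them.
K_star = combine_kernels(
    [gaussian_kernel(expression), jaccard_kernel(mutations)], [0.7, 0.3])
for row in K_star:
    print([round(v, 3) for v in row])
```

In a multitask setting, the same weights would be shared across the regression problems for all drugs, so that views informative for many drugs receive larger weights.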
Other participants in subchallenge1 proposed several further approaches. The second best-performing team used regression trees, applying bootstrapped sampling to each profiling dataset and imputing missing values, and made the final prediction from a weighted sum of the models for each dataset. Another team used Pearson’s correlation to weight features and then used the correlation of these weighted features to make the final prediction. This method achieved the third best performance despite using a very succinct algorithm. Another interesting approach in subchallenge1 removed the lowly expressed and/or low-variance features and trained multiple least squares regression models to make the final prediction. Removing lowly expressed and/or low-variance features improves the robustness of a model because such features mainly add noise to the training dataset. This team ranked only 10th, perhaps due to the quality of the data in subchallenge1 in 2012, but their approach opened a new view to the research field, and this filtering step was applied during feature preprocessing in later years.
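The feature-filtering idea can be sketched in a few lines. The thresholds `min_mean` and `min_var` and the toy expression values below are hypothetical; the source does not report the cutoffs the team used.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def filter_features(samples, min_mean=1.0, min_var=0.1):
    """Return indices of feature columns that pass both thresholds.

    A feature is dropped if it is lowly expressed (mean below min_mean)
    or nearly constant across samples (variance below min_var).
    """
    n_features = len(samples[0])
    kept = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        if sum(column) / len(column) >= min_mean and variance(column) >= min_var:
            kept.append(j)
    return kept

# Hypothetical expression matrix: rows are cell lines, columns are genes.
samples = [
    [5.0, 0.1, 3.0],
    [7.0, 0.2, 3.0],
    [9.0, 0.0, 3.1],
]
print(filter_features(samples))  # gene 1 is lowly expressed, gene 2 barely varies
```

Only the retained columns would then be passed to the downstream regression models.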
The other top-performing algorithms in subchallenge1 shared a common theme: all of them used nonlinear ML methods (kernels, regression trees). Although the application of ML to this field is still young, the success of ML algorithms in the challenge is very promising.
Despite the dominance of ML algorithms in subchallenge1, the best-performing algorithm in subchallenge2 was not a machine learning algorithm. The winning team proposed the DIGRE (drug-induced genomic residual effect) model (20) and achieved a pc-index of 0.613. This model contains three steps. The first step is to generate similarity scores based on gene expression profiles and then refine the profiles using pathway information and an external dataset that was not provided by the challenge organizers. In the second step, a mathematical model estimates the drug-induced genomic residual effect by considering both the similarity scores and the drug-response curves. Finally, the synergy score is computed as the average of the combined scores estimated for the two sequential orders of treatment.
Although machine learning algorithms were not the best-performing methods for predicting drug synergy scores (subchallenge2) in 2012, three years later a machine learning algorithm won the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, which is very similar to subchallenge2 from 2012.
Network Propagation–Based Machine Learning Model in Drug Combination Prediction
To accelerate progress in drug synergy prediction, the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge (21) was launched in 2015 and provided a platform for researchers to explore drug synergy prediction models. The competition used a large-scale dataset of nearly 11,500 laboratory experiments covering 910 drug combinations across 85 cancer cell lines. Monotherapy drug response data and molecular profiles of these cancer cell lines were provided to the participants as well. Pearson’s correlation and three-way ANOVA between predictions and experimental measurements were used as scoring metrics to evaluate the predictive performance of different submissions. The best-performing method achieved a Pearson’s correlation of 0.48 and an ANOVA score of 74.89. The challenge was organized by AstraZeneca, together with the European Bioinformatics Institute, the Sanger Institute, Sage Bionetworks, and the distributed DREAM community, and nearly 160 teams participated.
Although the best-performing team used a relatively simple single random forest model, they also proposed a novel simulation approach to refine features (22,23). Starting from baseline molecular data, they first filtered out irrelevant genes based on drug target information and a gene interaction network. Second, they mapped the target genes to the rest of the genes. They then applied one of two mathematical formulas, depending on whether a feature was a mutation feature. After simulating the posttreatment molecular features, they combined them with the monotherapy data to yield the training features for a random forest model. The final prediction combined the predictions from two routines. In the first routine, a model was trained on the whole training dataset, with cross-validation (24) used during training. In the second routine, the training dataset was first split into several overlapping (not mutually exclusive) subsets, a sub-model was trained on each subset (again with cross-validation), and the sub-model predictions were assembled into the routine’s prediction. This approach ranked first among all participants in the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge and established a new state-of-the-art algorithm in the field; for the first time, machine learning approaches came close to the accuracy of the laboratory measurements of drug pairs.
In addition to the challenges discussed, there are also two other challenges from recent years that focus on drug development: the Illuminating the Druggable Genome (IDG)-DREAM Challenge and the CTD2 Pancancer Drug Activity DREAM Challenge.
Deep Learning–Based Approach for Druggable Genome Prediction
The goal of the IDG-DREAM Challenge is to evaluate how machine learning (ML) models can catalyze compound-target interaction mapping efforts (https://www.synapse.org/#!Synapse:syn15667962/wiki/583305). In drug discovery, it is crucial to map the complete target space of drugs and drug-like compounds, because this helps researchers predict possible adverse effects more precisely before clinical trials (25,26). In this challenge, the organizers provided an open data web platform called DrugTargetCommons, and participants were also allowed to use other compound-target data resources in the public domain. For subchallenge1, the bootstrapped Spearman’s correlation of the predictions was used as the evaluation metric, and for subchallenge2, the bootstrapped RMSE of the predictions was used.
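A bootstrapped metric can be sketched as follows: the metric is recomputed on many resamples (with replacement) of the prediction set and the results are averaged. The exact resampling procedure used by the challenge organizers is not described here, so the number of resamples, the seed, and the function names are assumptions.

```python
import math
import random

def rmse(measured, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def bootstrap_metric(metric, measured, predicted, n_boot=1000, seed=0):
    """Mean of `metric` over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(measured)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([measured[i] for i in idx],
                             [predicted[i] for i in idx]))
    return sum(scores) / n_boot

# Hypothetical binding affinities and predictions.
measured = [5.0, 6.2, 7.1, 8.4, 5.9]
predicted = [5.3, 6.0, 7.5, 8.0, 6.4]
print(rmse(measured, predicted), bootstrap_metric(rmse, measured, predicted))
```

Any other metric with the same signature, for example a Spearman’s correlation function, can be passed in place of `rmse`. Resampling makes the score less dependent on a handful of individual test compounds.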
The best-performing team in subchallenge1 used the Kronecker kernel regularized least squares (CGKronRLS) regression model and achieved a Spearman’s correlation of 0.776; in subchallenge2, the best-performing team used an XGBoost model and achieved an RMSE of 0.773. Here we introduce a deep learning method, DeepDTA (27), because several participating teams cited it. DeepDTA is a deep learning (DL) model (28) that uses only the sequence information of drugs and targets to predict drug-target binding affinities. In this model, the SMILES string and the protein sequence are preprocessed by label encoding and an embedding layer and are then used as the inputs of two separate CNN blocks with the same architecture. Each CNN block consists of three convolutional layers with an increasing number of filters, followed by a max-pooling layer. The two representations output by the CNN blocks are concatenated and passed to the final block. The final block consists of three fully connected (FC) layers, with a dropout layer with a ratio of 0.1 after each of the first two FC layers. The final outputs are generated after the final FC layer. In the original paper, the model was compared with two baseline algorithms, the KronRLS (29) regression algorithm and the SimBoost (30) method. The results show that DeepDTA outperforms the two baselines, and the authors attribute the improvement to the ability of the two CNN blocks to learn better representations of the drug and protein sequence data. The paper contributed a novel DL model that learns representations from only the raw sequence data of drugs and targets, which is why it was used and explored by researchers who participated in the IDG-DREAM Challenge.
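The label-encoding step that precedes the embedding layer can be sketched as follows. This is only an illustration of character-level integer encoding with zero padding: the vocabulary here is built from the example strings, whereas a real model would use a fixed vocabulary, and the maximum lengths and the toy protein fragment are hypothetical.

```python
def build_vocab(sequences):
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted({c for seq in sequences for c in seq})
    return {c: i + 1 for i, c in enumerate(chars)}

def label_encode(seq, vocab, max_len):
    """Encode a string as integer ids, truncated or zero-padded to max_len.

    Characters missing from the vocabulary are also mapped to 0 in this sketch.
    """
    ids = [vocab.get(c, 0) for c in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Hypothetical inputs: two SMILES strings and one short protein fragment.
smiles = ["CCO", "c1ccccc1"]
proteins = ["MKTAYIAK"]

smiles_vocab = build_vocab(smiles)
protein_vocab = build_vocab(proteins)

encoded_drugs = [label_encode(s, smiles_vocab, max_len=6) for s in smiles]
encoded_protein = label_encode(proteins[0], protein_vocab, max_len=10)
print(encoded_drugs)
print(encoded_protein)
```

The resulting fixed-length integer vectors are what an embedding layer would turn into dense matrices for the drug and protein CNN blocks.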
Deep Learning–Based and Matrix Factorization Approaches for Predicting Chemotherapeutic Compounds from Transcriptional Data
The CTD-squared Pancancer Drug Activity DREAM Challenge was launched last year and aims at improving the prediction of the targets of chemotherapeutic compounds by using transcriptional data collected after treatment (https://www.synapse.org/#!Synapse:syn20968331/wiki/597042). Participants were allowed to learn from the previous models in the NCI-DREAM Challenge and the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, and posttreatment transcriptional data was provided for them to use. The participants were provided with the drug perturbational RNA-seq profiles of 11 cell lines treated with 30 chosen compounds, as well as the RNA-seq and mRNA-dependency data for 515 cell lines from the Cancer Cell-Line Encyclopedia (CCLE) and Achilles databases. The organizers allowed participants to use any gene expression data during training as long as the data was available in the public domain. The final predictions were the key high-affinity binding targets of the compounds in each condition. The results were evaluated in two subparts: in subchallenge1, the ten highest-ranked predicted targets were compared with the experimentally measured targets to calculate a p value for each compound, and in subchallenge2, the average confidence scores assigned to the actual targets of each compound were compared. The evaluation was thus based on the number of top predicted targets belonging to the experimentally measured drug targets and on the average prediction rank. The −log2(p value) calculated from comparing predictions with randomly ranked drug targets was used as the scoring metric.
The best-performing team in subchallenge1, with a score of 17.423, is from Tsinghua University, and they made the final prediction in three stages. In the first stage, they used a single-layer neural network with layer normalization and dropout with a ratio of 0.1. They applied a Bayesian personalized ranking loss and a multitask constraint loss during training. Tenfold cross-validation (24) was used to limit overfitting, with 10% of the data held out as test data in each fold. To improve the efficiency of their model, they also applied early stopping, so that training stops if the prediction does not improve for a fixed number of epochs. In the second stage, they further improved the prediction by ensemble learning: they took the predictions generated by 100 neural networks with different numbers of parameters and averaged them to produce the final scores. In the final stage, the scores were normalized by min-max normalization if their range was not within 0–1.
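The second and third stages, averaging an ensemble and rescaling only when needed, can be sketched as follows. The three toy prediction vectors stand in for the 100 networks, and the function names and the handling of constant scores are assumptions made for this example.

```python
def average_predictions(all_predictions):
    """Element-wise mean over the predictions of several models."""
    n_models = len(all_predictions)
    return [sum(column) / n_models for column in zip(*all_predictions)]

def min_max_if_needed(scores):
    """Rescale scores to [0, 1] only if they fall outside that range."""
    lo, hi = min(scores), max(scores)
    if 0.0 <= lo and hi <= 1.0:
        return list(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # constant scores: nothing to rescale
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical scores from three models for three candidate targets.
model_outputs = [
    [0.2, 1.4, -0.6],
    [0.4, 1.0, -0.2],
    [0.6, 1.2, -0.4],
]
ensemble = average_predictions(model_outputs)
final_scores = min_max_if_needed(ensemble)
print(final_scores)
```

Averaging reduces the variance of individual networks, and the conditional rescaling leaves scores untouched when they already lie within 0–1.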
In subchallenge2, Jing Tang’s group achieved the best score of 70.899 with a matrix multiplication method. They made the final prediction simply by multiplying two matrices: a Pearson correlation matrix and a drug-target matrix (data quantitation applied). Although they also tried several regression models, such as random forest and gradient boosting algorithms, they found that the simple matrix multiplication performed better than any machine learning model they tried.
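The following minimal sketch shows how multiplying a correlation matrix by a drug-target matrix yields target scores. The interpretation of the matrices is an assumption made for illustration: rows of `correlation` are taken to be query compounds and columns to be reference drugs with known targets, and `drug_target` is a binary reference-drug-by-target matrix. The source does not specify these details, and all values are invented.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Assumed layout: query compounds (rows) x reference drugs (columns),
# entries are Pearson correlations between transcriptional profiles.
correlation = [
    [0.9, 0.1, -0.2],
    [0.0, 0.8, 0.3],
]

# Assumed layout: reference drugs (rows) x targets (columns), 1 = known target.
drug_target = [
    [1, 0],
    [0, 1],
    [0, 1],
]

scores = matmul(correlation, drug_target)
# For each query compound, rank targets by descending score.
ranking = [sorted(range(len(row)), key=lambda t: -row[t]) for row in scores]
print(scores)
print(ranking)
```

Under these assumptions, a target scores highly for a query compound when the compound’s profile correlates with those of reference drugs known to hit that target.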
There are two points of view on Jing Tang’s work. One is that an AI winter is coming. Recently, some researchers have asserted that many mathematical models outperform ML and DL models, and to some extent, Jing Tang’s work shows that the final prediction from a single matrix multiplication can be better than those from complicated machine learning models. The second view, however, is that ML and DL models have not yet reached their potential because of data quality. If a robust dataset that contains sufficient features for each sample is provided, ML/DL models may outperform traditional methods. In addition, as discussed in our previous commentary paper (31), the prediction can be further improved if we gain a deeper understanding of hidden biological information and integrate it into ML models. Moreover, many current ML and DL models have not yet been tried by scholars in this field, and researchers in the AI community are still inventing models with more precise predictions, so it remains unknown whether those models can achieve better results in drug-related research. As a result, it is too early to conclude that ML and DL models are no longer predictive in drug-related research; on the contrary, we believe these methods can achieve better results in this field.
CONCLUSION
In this review, we survey several applications of ML models to drug-related research by summarizing the top-performing algorithms (Fig. 2) in four recent drug-related DREAM challenges (Fig. 1), and we conclude that ML methods are still developing and improving in various prediction tasks (e.g., drug synergistic effects, drug targets) as AI develops.
FUNDING INFORMATION
This work is supported by NSF#1452656 and NIH R35-GM133346.
REFERENCES
1. Artificial intelligence abstracts. Artif Intell. 1987;32:414–5.
2. Ahmet C. Artificial intelligence: how advance machine learning will shape the future of our world. Shockwave Publishing via PublishDrive; 2018.
3. Michalski RS, Carbonell JG, Mitchell TM. Machine learning: an artificial intelligence approach. Springer Science & Business Media; 2013.
4. Alpaydin E. Machine learning: the new AI. MIT Press; 2016.
5. Bishop CM. Pattern recognition and machine learning. Springer; 2016.
6. Supervised Learning. Neural Smithing. 1999. doi:10.7551/mitpress/4937.003.0003.
7. Unsupervised Learning. Unsupervised learning. 1999. doi:10.7551/mitpress/7011.003.0002.
8. Deo RC. Machine learning in medicine. Circulation. 2015;132:1920–30.
9. Burbidge R, Trotter M, Buxton B, Holden S. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem. 2001;26:5–14.
10. Ekins S. The next era: deep learning in pharmaceutical research. Pharm Res. 2016;33:2594–603.
11. Flower DR. Drug discovery: today and tomorrow. Bioinformation. 2020;16:1–3.
12. Hughes JP, Rees S, Kalindjian SB, Philpott KL. Principles of early drug discovery. Br J Pharmacol. 2011;162:1239–49.
13. Mohs RC, Greig NH. Drug discovery and development: role of basic biological research. Alzheimers Dement. 2017;3:651–7.
14. Vamathevan J, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18:463–77.
15. Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23:1241–50.
16. Hamet P, Tremblay J. Artificial intelligence in medicine. Metabolism. 2017;69S:S36–40.
17. Costello JC, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32:1202–12.
18. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–87.
19. Wilson CM, Li K, Yu X, Kuan P-F, Wang X. Multiple-kernel learning for genomic data mining and prediction. BMC Bioinformatics. 2019;20:426.
20. Bansal M, et al. A community computational challenge to predict the activity of pairs of compounds. Nat Biotechnol. 2014;32:1213–22.
21. Menden MP, et al. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nat Commun. 2019;10.
22. Li H, Li T, Quang D, Guan Y. Network propagation predicts drug synergy in cancers. Cancer Res. 2018. doi:10.1158/0008-5472.can-18-0740.
23. Li H, Hu S, Neamati N, Guan Y. TAIJI: approaching experimental replicates-level accuracy for drug synergy prediction. Bioinformatics. 2019;35:2338–9.
24. Cristianini N. Cross-Validation (K-Fold Cross-Validation, Leave-One-Out, Jackknife, Bootstrap). Dictionary of bioinformatics and computational biology. 2004. doi:10.1002/9780471650126.dob0148.pub2.
25. Elkins JM, et al. Comprehensive characterization of the published kinase inhibitor set. Nat Biotechnol. 2016;34:95–103.
26. Santos R, et al. A comprehensive map of molecular drug targets. Nat Rev Drug Discov. 2017;16:19–34.
27. Öztürk H, Özgür A, Ozkirimli E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics. 2018;34:i821–9.
28. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
29. Pahikkala T, et al. Toward more realistic drug-target interaction predictions. Brief Bioinform. 2015;16:325–37.
30. He T, Heidemeyer M, Ban F, Cherkasov A, Ester M. SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines. J Cheminform. 2017;9:24.
31. Wang Z, Li H, Guan Y. Machine learning for cancer drug combination. Clin Pharmacol Ther. 2020;107:749–52.
