Abstract
Despite advances in molecular biology, genetics, computation, and medicinal chemistry, infectious disease remains an ominous threat to public health. Addressing the challenges posed by pathogen outbreaks, pandemics, and antimicrobial resistance will require concerted interdisciplinary efforts. In conjunction with systems and synthetic biology, artificial intelligence (AI) is now leading to rapid progress, expanding anti-infective drug discovery, enhancing our understanding of infection biology, and accelerating the development of new diagnostics. In this Review, we discuss approaches for detecting, treating, and understanding infectious diseases, underscoring the progress supported by AI in each case. We suggest future applications of AI and how it might be harnessed to help control infectious disease outbreaks and pandemics.
Teaser
Recent advances in artificial intelligence are empowering medical and biotechnological research aimed at fighting infectious diseases.
Infectious diseases, caused by transmissible pathogens including bacteria, eukarya, and viruses, continue to challenge scientists and clinicians despite advances in medicine and basic research over the past few decades. Limitations to the fast and accurate detection of infections, as well as expanding antimicrobial resistance, exacerbate these challenges (Box 1). Basic research has aimed to address these challenges, including development of anti-infective therapies, preventative measures, and fast and accurate diagnostic tools. In particular, systems and synthetic biology approaches have led to biotechnological and medical innovations—including drug treatments and modalities, vaccines, and diagnostics—that have improved how we deal with infectious diseases.
Box 1. Overarching challenges in infectious diseases and concepts in artificial intelligence.
Pathogen outbreaks and pandemics:
Recent outbreaks include COVID-19, monkeypox, Marburg virus, H5N1 influenza, Ebola, measles, Zika, E. coli, and MERS. Challenges include detecting outbreaks and new pathogens, understanding disease biology, and developing preventive measures.
Antimicrobial resistance and anti-infective drug discovery:
Problematic pathogens include carbapenem-resistant Enterobacteriaceae (CRE), methicillin-resistant Staphylococcus aureus (MRSA), multidrug-resistant tuberculosis (MDR-TB), vancomycin-resistant Enterococcus (VRE), extended-spectrum beta-lactamase (ESBL)-producing bacteria, drug-resistant Candida auris, Neisseria gonorrhoeae, Plasmodium falciparum, and Toxoplasma gondii. Challenges include practicing antimicrobial stewardship (the appropriate and responsible use of anti-infective drugs), developing new classes of anti-infective drugs, potentiating existing drugs against resistant infections, and understanding drug mechanisms of action.
Neglected, persistent, and difficult-to-treat infections:
Examples include neglected tropical diseases, chronic hepatitis B and C, chronic fungal infections, Lyme disease, infections in low-resource populations, and HIV/AIDS. Challenges include developing low-cost and field-deployable diagnostics, improving the accuracy of diagnostic tests, improving the detection of antimicrobial resistance, and making effective disease treatments available.
Artificial intelligence (AI) and machine learning (ML):
ML is a subfield of AI, and its approaches can be classified as supervised (model is told what property to predict), unsupervised (model is not told what property to predict), or reinforcement learning (model optimizes for feedback). Neural networks are a common ML architecture comprising interconnected layers of basic processing units (neurons). Different types of neural networks exist, including those that predict properties of graph-based inputs (graph neural networks), generate data by sampling from a compressed representation learned from the training data (variational autoencoders), process sequential data (long short-term memory networks), and model complex dependencies by using attention mechanisms to focus on specific input elements (transformers). Not all models are neural networks, and simpler models include random forests (ensembles of decision trees), support vector machines (classifiers that separate classes of datapoints with a maximum-margin boundary), and regression models (functions that explicitly model the input-output relationship).
The fields of systems and synthetic biology emerged from two key developments: 1) the generation and synthesis of quantitative biological hypotheses and data from wet-lab experiments, sequencing, and systems-level modeling; and 2) an understanding of the modularity and programmability of nucleic acids, peptides, and other biomolecules, which enables control of biology. Artificial intelligence (AI), which focuses on developing machines capable of reasoning with data, has recently matured as an exciting field that draws on both these features to accelerate scientific discovery. Because AI-based approaches can integrate large amounts of quantitative and omics data, they are particularly adept at dealing with biological complexity, extending our knowledge and facilitating our efforts to reverse-engineer and control biology. AI-based approaches are particularly useful in addressing the problem of infectious diseases, which are complex across different scales, ranging from cells to communities, and for which advances in medicine and biotechnology are essential drivers of progress. In this Review, we discuss major areas in which AI-based approaches, applied to systems and synthetic biology, are substantively empowering our research to fight infectious diseases.
Artificial intelligence for anti-infective drug discovery
Anti-infective drugs, comprising antibacterials, antivirals, antifungals, and anti-parasitics, have become less effective treatments as a result of the spread of drug resistance. There is therefore an urgent need for new anti-infective treatments, particularly ones that represent unprecedented chemical spaces or therapeutic modalities. AI, and in particular machine learning (ML), a subfield of AI that uses data to train machines to make predictions, has foremost been helpful in facilitating searches of small molecule databases, such as ZINC15 (1). ML approaches to anti-infective drug discovery have centered on training models to identify new drugs or new uses of existing drugs (Fig. 1). As the number of drug-like small molecules is essentially infinite [as large as ~10^60 (2), and possibly larger, given that typical antibiotics may not be traditionally drug-like (3)], a major benefit of ML approaches is that they can virtually screen compound libraries at a scale (>10^9 compounds) that would be impossible to reach empirically.
Fig. 1. Artificial intelligence can predict anti-infective drug activity, drug-target interactions, and therapeutic design.
Examples of AI model inputs, model architectures or types, and model outputs relevant to anti-infective drug discovery include those focusing on drug activity (A), drug-target interactions and mechanisms of action (B), and programmable therapeutic design (C). Inputs, models, and outputs shown are representative, in part, of those discussed in refs (9–15,19–22,28–33,36,39,40).
Anti-infective drug discovery has benefitted particularly from AI integration for several reasons. First, in contrast to cancer or other diseases in which mechanism-driven approaches have remained dominant, infectious diseases are generally phenotype-driven; that is, these diseases proceed from the physiological characteristics of infectious agents, rather than their genetic or molecular compositions. The discovery of some of the first widely used antibacterials, antivirals, antiparasitics, and antifungals was based on observations of their inhibitory effects against pathogens or the symptoms caused by infections. This phenotypic line of discovery is as relevant today as it was decades ago, especially as innovations in high-throughput screening and the design of chemical libraries have enabled more quantitative and customizable discovery efforts. The focus on phenotypes implies that drug polypharmacologic effects can be common to anti-infective drugs and that biological information can be integrated across different macromolecular drug targets (4). Phenotypic properties are well-suited for analysis by ML because ML can both unify and disentangle the different types of biological information that impinge on these readouts. Second, most anti-infective drugs are small molecules, whose chemical structures can be modeled computationally as graphs comprising vertices and edges, and additional programmable modalities, including target-binding nucleic acids called aptamers (5) as well as antimicrobial peptides (AMPs) (6–8), are currently in development. Supervised graph neural networks (9–11), unsupervised generative models (12,13)—which refer to ML models capable of producing outputs similar to their training data—and other recent advances in ML architectures (Box 1) enable computers to learn, predict, or design patterns in chemical structures, offering powerful tools for modeling small molecules. 
The use of ML to make biologically relevant predictions from sequences of nucleic acids or amino acids allows for ML-guided design relevant to these therapeutic modalities, as exemplified by protein structure prediction platforms such as AlphaFold and RoseTTAFold (14,15). Lastly, infectious diseases are typically caused by pathogens that are, or can be, well-characterized. This biological tractability contrasts with complex diseases like neurodegeneration, for which our incomplete mechanistic understanding remains a major bottleneck. Our clearer understanding and larger databases (16–18) of the gene and protein networks of bacteria, viruses, and even simple eukaryotes—as compared to human cell types—may allow ML-driven approaches to make more accurate predictions and better identify drug mechanisms of action (MoAs; 19–21).
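The graph view of small molecules described above (atoms as vertices, bonds as edges) can be made concrete with a short sketch. The example below is illustrative only and is not the architecture used in refs (9–11): it encodes the heavy atoms of ethanol as a graph, applies one round of neighbor aggregation (the core operation that graph neural networks repeat with learned weights), and sums the atom features into a molecule-level vector. The feature encoding and all function names are assumptions made for this illustration.

```python
# Toy molecular graph for ethanol (heavy atoms only): C-C-O.
# Vertices are atoms and edges are bonds. The feature vectors are
# illustrative one-hot encodings [is_carbon, is_oxygen], not learned embeddings.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]


def one_hot(symbol):
    """Hypothetical two-element atom featurization (carbon vs. oxygen)."""
    return [1.0, 0.0] if symbol == "C" else [0.0, 1.0]


def build_adjacency(atoms, bonds):
    """Turn an undirected bond list into a neighbor list per atom."""
    adj = {i: [] for i in atoms}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing_step(features, adj):
    """New atom feature = own feature + sum of neighbor features.
    A real graph neural network would apply learned weights here."""
    updated = {}
    for i, own in features.items():
        agg = list(own)
        for j in adj[i]:
            agg = [x + y for x, y in zip(agg, features[j])]
        updated[i] = agg
    return updated


def readout(features):
    """Sum atom features into a single molecule-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(f[k] for f in features.values()) for k in range(dim)]


adjacency = build_adjacency(atoms, bonds)
features = {i: one_hot(s) for i, s in atoms.items()}
features_1 = message_passing_step(features, adjacency)
molecule_vector = readout(features_1)
```

After one step, each atom's feature reflects its immediate chemical neighborhood; stacking several such steps lets information propagate across larger substructures, which is what allows a trained model to relate structure to a property such as growth inhibition.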
Despite these advantages, there are outstanding issues in applying ML, and more broadly AI, to anti-infective drug discovery. One major challenge is that it is unclear how well ML models generalize to unexplored biomolecular spaces. For instance, we have previously screened a library of small molecules for growth inhibitory activity against Escherichia coli and used this phenotypic information to train graph neural networks to predict the antibiotic activities of small molecules—including halicin—based on their chemical structures (9). Yet, these models performed best at predicting compounds in well-known antibiotic classes, such as β-lactams and quinolones. In order to tap into previously unexplored sequence spaces, different approaches are needed. For example, a sub-optimal solution was implemented during the course of a genetic algorithm—an algorithm that iteratively evolves its inputs to optimize a property—to identify the novel synthetic peptide guavanin 2. This peptide was subsequently synthesized and effectively killed bacteria in a preclinical mouse model, suggesting that the model could generalize at the cost of optimality (22). Recently, emerging computational approaches have made it possible for the first time to also mine proteomes for antibiotic discovery, leading to the identification of thousands of new antimicrobials in both extant and extinct organisms (6,23).
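As a rough illustration of what a genetic algorithm does, the sketch below iteratively evolves a population of peptide sequences through selection, crossover, and mutation. It is a generic toy and not the procedure used to identify guavanin 2 (22). The fitness function (fraction of cationic residues), the mutation rate, and the population settings are all hypothetical stand-ins; a real pipeline would score candidates with a trained model or an experimental readout.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(peptide):
    """Hypothetical stand-in score: fraction of cationic residues (K, R).
    This is NOT a validated predictor of antimicrobial activity."""
    return sum(residue in "KR" for residue in peptide) / len(peptide)


def mutate(peptide, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in peptide
    )


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length peptides."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(population, generations, rng, n_keep=5):
    """Run selection, crossover, and mutation for a fixed number of generations."""
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        best_per_generation.append(toy_fitness(ranked[0]))
        parents = ranked[:n_keep]  # selection; survivors carry over unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(len(population) - n_keep)
        ]
        population = parents + children
    return population, best_per_generation


rng = random.Random(0)  # fixed seed so the run is repeatable
start = ["".join(rng.choice(AMINO_ACIDS) for _ in range(12)) for _ in range(30)]
final_population, history = evolve(start, generations=20, rng=rng)
best_peptide = max(final_population, key=toy_fitness)
```

Because the top-ranked sequences are carried over unchanged, the best score in `history` never decreases. Stopping such a search early, or selecting an intermediate candidate, is one way a solution that is sub-optimal with respect to the scoring function can still be chosen for synthesis and testing.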
Overall, lead molecules are only as structurally novel as the chemical spaces that are explored, and ML-driven approaches are limited by both the structural diversity of the training sets and the ability of model architectures to prioritize novelty. Organocatalysis and cascade reaction sequences, which are chemical synthesis methods that have recently opened up chemical spaces, can provide useful experimental starting points for generating structurally diverse small molecules (24). In contrast, the computational enumeration of all feasible small molecules containing atoms found in most drugs, as provided by the GDB datasets (25), presents opportunities to exhaustively sample chemical spaces of small molecules, with the caveat that other computational models are needed to accurately predict synthesizability. Nucleic acid- and peptide-encoded combinatorial libraries of small molecules (26) and peptides (27), as well as designable aptamers (28), can further extend search spaces of interest. In each case, generalizability is paramount to ML models. Improving generalizability will require the application of new paradigms and models with improved inference capabilities, for instance, few-shot models, which are ML models that extrapolate from scarce training data (corresponding to under-sampled regions of search spaces), or multi-task models, which are ML models that combine information from diverse inputs. Models such as these will help to identify only the most promising drug candidates (29). Providing “negative” data (e.g., tested compounds that are not active) is also essential for ML model training and benchmarking, and when ML models are applied to challenging test sets, it is important that their limitations are clearly expressed (e.g., through confidence information). 
To express these limitations, interpretable or explainable ML approaches can be used to capture the specific aspects of training data that models have learned by pinpointing the input structural features (explainable ML) or the parts of the model that lead to a prediction (interpretable ML; 30).
Another key challenge in AI for anti-infective discovery is the need for improved mechanistic models to complement phenotypic approaches. Whereas ML models have been useful for identifying drug candidates based on phenotypic information (9,12,31–33), more work is needed so that models can accurately predict drug-target interactions and MoAs. These drug attributes remain important in light of antimicrobial resistance and the fact that we are still learning about the MoAs of anti-infective drugs discovered decades ago (34). Protein structure predictions (14,15) and other resources now provide structural information that can inform target-based predictions; yet, not knowing a protein’s structure has not typically limited drug discovery (35). Recent studies have highlighted that improvements in molecular docking—which predicts binding affinities between ligands and targets based on structural information—are still needed to accurately identify antibiotic MoAs, and that ML-driven approaches can improve prediction accuracy (36). Molecular docking approaches have largely focused on small-molecule ligands, but target predictability is just as important for AMPs, which often have unspecific membrane-active MoAs (22,31,32), as well as aptamers. Improvements in target-centric approaches can facilitate the discovery of compounds with specific binding activity and lead to improved biological understanding, which can inform predictions of emergent properties such as drug interactions and synergies. Of particular relevance to antibiotic resistance, a better understanding of how compounds interact with membranes is crucial for discovering drugs that are active against Gram-negative bacteria, whose outer membranes have proven particularly difficult to penetrate (37).
Drug development is a lengthy and intricate process influenced by numerous factors such as safety, cost, manufacturing, and clinical trial outcomes. For anti-infective drugs in particular, toxicity to host cells is a common liability. Drugs can be toxic in different ways (e.g., cytotoxic, hemolytic, and genotoxic), and ML models predicting toxicity have been limited by factors such as the lack of high-quality datasets (38). Absorption, distribution, metabolism, and excretion (ADME) properties, including chemical instability in solution and metabolic breakdown, are also needed to filter out drug candidates that are unselective or unsuitable for medicinal use. Furthermore, while high-throughput screens have focused on in vitro testing, there is substantial unmet need for anti-infective drugs that are effective against systemic infections. Predicting efficacy in animal models of acute systemic infections is a challenging task that has not yet been addressed by ML-driven approaches.
We anticipate that active areas to watch are those that combine experimental and computational approaches to address model predictive power and data scarcity. ML approaches that incorporate information from scarce training data, as well as more extensive search spaces, are likely to substantively augment anti-infective drug discovery. To guide experimental methods to augment search spaces, generative ML models will continue to propose chemical structures and peptide sequences de novo that can be synthesized and evaluated. Generative platforms such as GPT-4 and NVIDIA’s BioNeMo can also facilitate drug discovery by improving our understanding of the underlying biology and chemistry. Interpretable or explainable ML approaches (e.g., for graph neural networks) can offer powerful ways of inferring salient structural features or improving model learning from data. Computational pipelines that leverage structural predictions of proteins and other macromolecules provide a complementary way to improve model predictive power. Detailed molecular dynamics simulations and ML-augmented approaches to docking exemplify techniques that can better predict interactions between drugs and macromolecules (36,39). We anticipate that sequence-to-structure models, such as AlphaFold for proteins or FARFAR2 for RNAs (14,40), will also be useful for structure-guided design. Such models can be used to tune therapeutic candidates to achieve specific structures, bridging structural predictions with the productive augmentation of search spaces.
Artificial intelligence for infection biology and infection-related contexts
Bacterial, eukaryotic, and viral pathogens infect diverse hosts and trigger complex host responses. Pathogen load, host immunity, treatments administered, and other factors influence the course of the infection. Supervised ML models have been used to analyze structured and unstructured nucleic acid, protein, glycan, and cellular phenotypic datasets to identify critical features and molecular networks involved in host-pathogen interactions and immune responses (Fig. 2; 41–45). Various supervised and unsupervised ML models, including random forest classifiers and complex language models (models designed to understand or generate text), have been applied to identify genes and protein-protein interactions associated with host cell changes, predict immunogenicity, and evaluate pathogen killing, host cell adaptation, and virulence. Additionally, supervised models have been used to guide the development of vaccines and therapeutic drugs through the optimization of gene expression and antigen prediction and selection (46,47). Reverse vaccinology, which bases antigen prediction on immunologic and genomic information, has been facilitated by supervised ML approaches, including Vaxign-ML (47).
Fig. 2. Artificial intelligence can elucidate infection biology, facilitate vaccine design, and inform treatment strategies.
Examples of AI model inputs, model architectures or types, and model outputs focusing on infection biology (A), vaccine design (B), and anti-infective drug treatment strategies (C). Inputs, models, and outputs shown are representative, in part, of those discussed in refs (41–49,51–54).
In general, ML has made an outsized contribution to analyzing large and often convoluted datasets in infectious diseases research. While these examples illustrate the promise of using ML to elucidate key factors underlying infections and how infections progress within hosts, understanding host-pathogen interactions and immune responses remains a challenging biological problem. This problem can be addressed by integrating high-throughput datasets—including sequencing, structural, and microscopy data—with detailed mechanistic studies, experimentation, and infection models. Mechanistic and experimental studies, however, are typically low-throughput, constraining the generalizability of AI-guided approaches that rely on them. Experiments in which large datasets are systematically acquired and analyzed across different infection contexts, for instance through comprehensive CRISPR screens, RNA-seq, and mass spectrometry, would foster the development of AI models that extend beyond tools for data analysis and make generalizable hypotheses and inferences. Parameterizing these efforts with biological sequences or chemical structures, such as small molecules, guide RNAs, or amino acid sequences, would offer exciting, tunable approaches to investigating infection biology. As an example of a sequence-guided approach, a recent study developed unsupervised language models of influenza, HIV-1, and SARS-CoV-2 viral proteins based on amino acid sequence information and accurately predicted escape patterns that allow these pathogens to evade the human immune system (45). ML models that can make specific assumptions about biology, such as the relevance of syntax (grammar) and semantics (meaning) in biological sequences, or leverage structural information, have the potential to guide the generation of biological hypotheses and improve generalizability.
Additionally, ML has productively processed microscopy datasets relevant to infection biology. Various forms of microscopy, including light and electron microscopy, have been used to generate datasets underlying ML models that detect bacteria, fungi, parasites, and viruses in host cells. These analyses have led to insights in host-pathogen biology, for instance by elucidating the developmental morphologies of P. falciparum in human red blood cells using multi-color fluorescence microscopy (48) and identifying virulence factors involved in Mycobacterium abscessus pathogenesis from high-content imaging and phenogenomic data (49).
Beyond host-pathogen interactions, ML models have informed various aspects of vaccine development. Sequence-based ML approaches to mRNA and nucleic acid vaccines can accelerate design, and the turnaround times for the synthesis and experimental validation of these vaccines are short (50). Protein structure-based vaccine design (51) can also be augmented with computational predictions from AlphaFold or RoseTTAFold. Yet, the use of ML for vaccine development faces several challenges, including poor data quality, limited data availability and generalizability, and complicated testing procedures. Limited or only low-quality data may be available for certain populations or diseases, particularly for neglected tropical diseases, and these limitations can influence the choices of target antigens and constrain ML models that predict antigen presentation and vaccine targets. Different infections have different host contexts, and ML models predicting the efficacy of vaccines, which modulate immunity in host cells, may be less generalizable to biological contexts than those for anti-infective drugs. Furthermore, the validation of vaccine candidates can be time-consuming and expensive, requiring delivery to host cells and suitable immunogenicity assays. To begin addressing these challenges, comprehensive benchmarking datasets for antigen selection and vaccine efficacy will be needed. These datasets will help to standardize data quality and improve the predictive power of next-generation ML approaches to vaccine development.
Aside from infection biology and vaccine development, ML has also informed clinical decision-making in infection contexts. Of note, a recent study used regression models to implement personalized antibiotic recommendations that minimized the risk of urinary tract and wound infections (52). However, a general bottleneck in using ML to design treatment strategies is the need for data and models that are relevant to specific infection settings. An earlier study used support vector machines to analyze bacterial gene expression patterns in human patients, representing an important step toward showing that ML models can provide useful biological information relevant to clinical infections (53). Moving forward, multi-dimensional predictions of how anti-infective drugs and vaccines interact with model hosts and humans will help improve treatment strategies, anticipate adverse effects, and potentially increase success rates for new drugs in clinical trials.
As new datasets and models are needed to improve the application of ML to infection-related contexts, we anticipate that active areas to watch are those that make biology more “embeddable”—that is, able to be represented by low-dimensional features, such as sequences, vectors, or graphs. Integrating ML with next-generation systems and synthetic biology methods for cellular profiling will drive progress in this area. For instance, combining high-throughput screens and microscopy with precise methods for biological control, such as gene editing or optogenetics, would generate data relevant to key processes like host cell stress responses, enabling manipulation of these pathways to address infectious diseases. In ML, promising types of language models include large language models, which are trained on large amounts of text data, and fine-tuned language models, which are trained to perform a specific task. Fine-tuned large language models for biology, such as BioBERT (54), may unify information from diverse infection contexts and offer increased predictive power to help elucidate host-pathogen interactions, facilitate antigen selection, inform vaccine design, and design treatment strategies.
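A very simple instance of making biology "embeddable" is to map a protein sequence to a fixed-length numeric vector. The sketch below uses amino acid composition, a deliberately crude 20-dimensional embedding, and compares sequences by cosine similarity. The sequences are made up for illustration, and language models such as those discussed above learn far richer, order-aware representations; this example only shows the general idea of turning a sequence into features that a model can consume.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Embed a protein sequence as 20 residue frequencies (order is ignored)."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up sequences for illustration only.
seq_a = "MKKLLKRAAK"  # rich in cationic and hydrophobic residues
seq_b = "MKRLLKKAAR"  # similar composition to seq_a, different order
seq_c = "DDEEDGSTDE"  # acidic and polar residues, no overlap with seq_a

vec_a, vec_b, vec_c = map(composition_vector, (seq_a, seq_b, seq_c))
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```

Here `sim_ab` is much larger than `sim_ac`, so sequences with similar composition land close together in the embedding space. The same pattern (sequence in, vector out, distance or classifier on top) underlies more expressive embeddings.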
Artificial intelligence for diagnostics and synthetic biology
As large-scale testing efforts during the COVID-19 pandemic have illustrated, quick and accurate detection of infections and pathogen outbreaks remains paramount to controlling the spread of infectious diseases. Recent advances in combining AI with synthetic biology, gene expression analyses, mass spectrometry, and imaging have substantively expanded our ability to detect infections and predict drug resistance (Fig. 3; 55–60). ML is well-suited for catalyzing synthetic biology-based diagnostics because of the high programmability of biological elements, the routine generation of large or sequence-based datasets, and the ability of ML to extract meaningful information from biomolecular networks in disease biology (61).
Fig. 3. Artificial intelligence can facilitate synthetic biology research and diagnostics development.
Examples of AI model inputs, model architectures or types, and model outputs relevant to the development of synthetic biology-based diagnostics (A) and the development of other forms of diagnostics, including those based on sequencing, mass spectrometry, and imaging (B). Inputs, models, and outputs shown are representative, in part, of those discussed in refs (55–60,68–74).
Engineering genetic elements and understanding biomolecular networks remain critical to designs that harness biology. Synthetic biology approaches leveraging enzymatic reactions, toehold switches (RNAs that respond to specific nucleic acid sequences), or CRISPR-Cas enzymes have been used for the detection of malaria, Ebola, Zika, COVID-19, and other diseases (62–67). Supervised ML models have facilitated the design of toehold switches (68,69), CRISPR guide RNAs (70–72), and other biomolecules. Notably, large datasets are available for toehold switch function, CRISPR guide RNA activity, and other factors that are relevant to diagnostic design. While different types of neural networks, including feed-forward networks (neural networks in which information flows in one direction from input to output), convolutional neural networks (networks composed of convolutional layers), and long short-term memory models, have been commonly used to model these data, the same datasets can provide useful resources for testing newer and potentially more predictive or generative ML models, including transformers or variational autoencoders (Box 1), to more efficiently develop next-generation diagnostics.
Beyond synthetic biology, ML has been used for gene expression-, mass spectrometry-, and imaging-based diagnostics. Gene expression- and mass spectrometry-based diagnostics have been applied to antimicrobial susceptibility testing (AST). AST remains important for informing the use of anti-infective drugs, but typical, culture-based AST for bacteria, viruses, fungi, and parasites can take at least several days to complete. This turnaround time remains too long to adequately address clinical needs for acute systemic infections, such as those resulting in sepsis. Recent studies have combined gene expression and interaction profiling, structural mutation-mapping, and ML to identify genetic signatures of resistance that could be used as the basis of rapid molecular diagnostics (56,57). Supervised ML classifiers have predicted antibiotic resistance profiles correlated with clinical MALDI-TOF mass spectra of bacterial proteins, and these predictions could be completed within 24 hours after sample collection (58). Nevertheless, a potential limitation to this approach was that the areas under the receiver operating characteristic curve (AUROC) for different bacterial species were ~0.7, suggesting that improvements in classifier accuracy will be needed to make this approach useful (e.g., AUROC > 0.9) in clinical settings. ML has also informed more traditional ways of diagnosing infections, including microscopy, epitope profiling (73), chest radiographs and CT scans (59,60), and lateral flow tests (74). In each of these applications, the generation of large, multi-dimensional datasets combined with clear functional readouts, such as the presence or absence of a resistance profile or a disease, makes ML particularly useful for producing accurate predictions.
Nevertheless, there remain important challenges in applying ML to diagnosis, including low data quality or quantity for new or emerging pathogens, the limited generalizability of the current data and approaches used, and the need for highly accurate diagnostic predictions in clinical settings. Obtaining enough high-quality data relevant to new or emerging pathogens or strains, particularly in low-resource settings, remains a difficult problem that is exacerbated by a lack of scientific infrastructure and variable public health resources. ML models based on limited data may exhibit biases, promulgating inappropriate diagnostics, misdiagnoses, and greater health inequalities that make it more difficult to serve patient populations. These biases may also remain undetected, especially when black-box ML models, which do not provide any explanation or interpretation of their predictions, are used (30,61). Even when high-quality sequencing data from large infectious disease databases, such as the PATRIC database (75), are available, it remains to be seen whether antimicrobial resistance predictions based on these data are generalizable when applied to genetically diverse infections found worldwide. Furthermore, unlike for anti-infective discovery, where the stakes for false positives and false negatives predicted by ML models are lower because the predictions can be further tested, the consequences of an inaccurate diagnostic prediction can be severe. In fact, a recent survey suggested that no existing model for the diagnosis or prognosis of COVID-19 from chest radiographs and CT scans was of potential clinical use due to methodological flaws, biases, or both (60). Models with comparatively high AUROC values (e.g., 0.90) may still be too weak for clinical applications, as this value implies that, given a randomly chosen positive case and a randomly chosen negative case, the model ranks the negative case above the positive case 10% of the time.
Until more accurate ML models can be developed, AI-based diagnostics might play only a supporting role in clinical settings.
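The pairwise-ranking reading of AUROC given above can be checked directly: AUROC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score. The sketch below computes it this way on a small set of made-up scores; the function name and the data are illustrative assumptions, not taken from any of the cited studies.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive case
    receives the higher score; tied pairs count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores: 5 infected (label 1) and 4 uninfected (label 0) samples.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.55, 0.40, 0.60, 0.30, 0.20, 0.10]

value = auroc(scores, labels)
misranked_fraction = 1.0 - value
```

In this toy dataset, 18 of the 20 positive-negative pairs are ordered correctly, giving an AUROC of 0.90 and a misranked fraction of about 10%. Whether that error rate is acceptable depends on the clinical consequences of each kind of mistake, which AUROC alone does not capture.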
Moving forward, we anticipate that active areas to watch will include the ML-guided design and discovery of synthetic circuits enabling the development of low-cost and portable diagnostics, the application of AI to data generated from clinical and field-deployable diagnostics that improve accessibility and scope, and the development of ML models that provide accurate diagnoses in clinical settings. In particular, the application of sequence-to-function models, language models, and generative models to RNA switches, CRISPR-based tools, and other programmable elements will be exciting areas of growth due to the ability for rapid iteration and the precise, on-target activity of these synthetic biology approaches (67–71). By increasing the testing and reporting of infections, the development of low-cost, field-deployable diagnostics should also help produce more balanced datasets that better sample local infections and make ML models less biased. AI or ML models that can extract information from small or incomplete datasets, using tools such as transfer learning (which adapts models trained on a specific task to other tasks) and Bayesian networks (networks that allow for probabilistic inference), can play outsized roles in how infectious diseases are addressed, especially for overlooked populations in low-resource areas. Such models could lead to more personalized medicine, in which diagnoses or resistance profiles can be readily reported based on data from only a few infections and help guide the use of anti-infective drugs. On the other hand, the accuracy of ML models also needs to improve for practical use in clinical diagnoses. Future ML models will likely need to be optimized in architecture, thoroughly evaluated for biases, and trained on large amounts of robust data to achieve high accuracy. Transfer and multi-task learning, attention mechanisms, and other approaches can help these next-generation ML models provide more accurate diagnoses.
Conclusions and future outlook
Approaches combining systems and synthetic biology with ML models, including graph neural networks, sequence-to-function and sequence-to-structure frameworks, and generative models, are yielding new drug candidates and new methods for drug discovery. Supervised classifiers, unsupervised language models, and other ML models have produced biologically relevant insights into how pathogens interact with host cells and immune responses, informing antigen determination, vaccine design, and treatment strategies. These types of ML models have also informed the design of various diagnostic tools and improved system accuracy, helping clinicians to diagnose infections and detect antimicrobial resistance. Beyond medical and biotechnological approaches to infectious diseases, ML, and more broadly AI, has also led to substantive advances in epidemiology and our understanding of disease transmission. Better leveraging AI to address infectious diseases will require a collaborative effort among scientists, clinicians, and public health officials.
Developing AI models that generalize and avoid bias will require the acquisition and integration of comprehensive datasets. These datasets might include high-throughput therapeutic counter-screens and explorations of diverse chemical spaces for drug discovery, data from drug-target interactions and biomolecular interactions, and genetic sequencing information that is robustly and representatively sampled from all infections, including those occurring in low-resource or hard-to-access areas. Programmable modalities, such as nucleic acid and amino acid sequences, have been tractable and common starting points for ML models (such as those predicting structure from sequence), but advances in biology and chemistry are important for opening up search spaces and making biology more “embeddable”, that is, representable by low-dimensional features. Progress in this area will help to predict therapeutic efficacy and drug mechanisms of action, complex host-pathogen interactions and host responses, and interactions between small molecules, proteins, peptides, and nucleic acids. Advances in AI will include approaches, such as few-shot and multi-task models, that leverage more of the available scientific information to cope with limited or low-quality data. Furthermore, interpretable, explainable, and generative ML approaches will lead to specific biological insights. We anticipate that AI will continue to empower us to design next-generation drugs, vaccines, and diagnostics that address infectious diseases.
Acknowledgements
We thank Michael Funk and two anonymous reviewers for thorough feedback on the manuscript. We also thank Xiao Tan, Melis N. Anahtar, and Jacqueline A. Valeri for helpful comments on the manuscript.
Funding:
F.W. was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number K25AI168451. C.F.N. holds a Presidential Professorship at the University of Pennsylvania, is a recipient of the Langer Prize by the AIChE Foundation, and acknowledges funding from the IADR Innovation in Oral Care Award, the Procter & Gamble Company, United Therapeutics, a BBRF Young Investigator Grant, the Nemirovsky Prize, Penn Health-Tech Accelerator Award, the Dean’s Innovation Fund from the Perelman School of Medicine at the University of Pennsylvania, the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM138201, and the Defense Threat Reduction Agency (DTRA; HDTRA11810041, HDTRA1-21-1-0014, and HDTRA1-23-1-0001). J.J.C. was supported by the Defense Threat Reduction Agency (HDTRA12210032), the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number R01-AI146194, and the Broad Institute of MIT and Harvard. This work is part of the Antibiotics-AI Project, which is directed by J.J.C. and supported by the Audacious Project, Flu Lab, LLC, the Sea Grape Foundation, and Rosamund Zander and Hansjorg Wyss for the Wyss Foundation.
Footnotes
Competing interests: J.J.C. is scientific co-founder and scientific advisory board chair of EnBiotix, an antibiotic drug discovery company, and Phare Bio, a non-profit venture focused on antibiotic drug development. C.F.N. provides consulting services to Invaio Sciences and is a member of the scientific advisory boards of Nowture S.L. and Phare Bio. F.W. declares no competing interests.
References and Notes
- 1. Sterling T, Irwin JJ, ZINC 15 – ligand discovery for everyone. J. Chem. Inf. Model 55, 2324–2337 (2015).
- 2. Schneider G, Automating drug discovery. Nat. Rev. Drug. Discov 17, 97–113 (2018).
- 3. O’Shea R, Moser HE, Physicochemical properties of antibacterial compounds: implications for drug discovery. J. Med. Chem 51, 2871–2878 (2008).
- 4. Moffat JG, Vincent F, Lee JA, Eder J, Prunotto M, Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nat. Rev. Drug. Discov 16, 531–543 (2017).
- 5. Afrasiabi S, Pourhajibagher M, Raoofian R, Tabarzad M, Bahador A, Therapeutic applications of nucleic acid aptamers in microbial infections. J. Biomed. Sci 27, 6 (2020).
- 6. Torres MDT et al. Mining for encrypted peptide antibiotics in the human proteome. Nat. Biomed. Eng 6, 67–75 (2022).
- 7. Torres MDT, Cao J, Franco OL, Lu TK, de la Fuente-Nunez C, Synthetic biology and computer-based frameworks for antimicrobial peptide discovery. ACS Nano 15, 2143–2164 (2021).
- 8. Der Torossian Torres M, de la Fuente-Nunez C, Reprogramming biological peptides to combat infectious diseases. Chem. Commun 55, 15020–15032 (2019).
- 9. Stokes JM et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
- 10. Liu G et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol (2023).
- 11. Wong F et al. Discovering small-molecule senolytics with deep neural networks. Nat. Aging (2023).
- 12. Melo MCR, Maasch JRMA, de la Fuente-Nunez C, Accelerating antibiotic discovery through artificial intelligence. Commun. Biol 4, 1050 (2021).
- 13. Wan F, Kontogiorgos-Heintz D, de la Fuente-Nunez C, Deep generative models for peptide design. Digit Discov 1, 195–208 (2022).
- 14. Jumper J et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 15. Baek M et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
- 16. Karp PD et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform 20, 1085–1093 (2019).
- 17. Karp PD et al. The EcoCyc database. EcoSal Plus 8, 10.1128/ecosalplus.ESP-0006-2018 (2018).
- 18. Howe KL et al. Ensembl Genomes 2020—enabling non-vertebrate genomic research. Nucleic Acids Res. 48, D689–D695 (2020).
- 19. Fu C et al. Leveraging machine learning essentiality predictions and chemogenomic interactions to identify antifungal targets. Nat. Commun 12, 6497 (2021).
- 20. Espinoza JL et al. Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLoS Comput. Biol 17, e1008857 (2021).
- 21. Yang J et al. A white-box machine learning approach for revealing antibiotic mechanisms of action. Cell 177, 1649–1661 (2019).
- 22. Porto WF et al. In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat. Commun 9, 1490 (2018).
- 23. Maasch JRMA, Torres MDT, Melo MCR, de la Fuente-Nunez C, Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. bioRxiv, doi: 10.1101/2022.11.15.516443 (2022).
- 24. Jones SB, Simmons B, Mastracchio A, MacMillan DWC, Collective synthesis of natural products by means of organocascade catalysis. Nature 475, 183–188 (2011).
- 25. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model 52, 2864–2875 (2012).
- 26. Rössler SL, Grob NM, Buchwald SL, Pentelute BL, Abiotic peptides as carriers of information for the encoding of small-molecule library synthesis. Science 379, 939–945 (2023).
- 27. Quartararo AJ et al. Ultra-large chemical libraries for the discovery of high-affinity peptide binders. Nat. Commun 11, 3183 (2020).
- 28. Iwano N et al. Generative aptamer discovery using RaptGen. Nat. Comput. Sci 2, 378–386 (2022).
- 29. Altae-Tran H, Ramsundar B, Pappu AS, Pande V, Low data drug discovery with one-shot learning. ACS Cent. Sci 3, 283–293 (2017).
- 30. Jiménez-Luna J, Grisoni F, Schneider G, Drug discovery with explainable artificial intelligence. Nat. Mach. Intell 2, 573–584 (2020).
- 31. Das P et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng 5, 613–623 (2021).
- 32. Nagarajan D et al. Computational antimicrobial peptide design and evaluation against multidrug-resistant clinical isolates of bacteria. J. Biol. Chem 293, 3492–3509 (2018).
- 33. Jin W et al. Deep learning identifies synergistic drug combinations for treating COVID-19. Proc. Natl. Acad. Sci. USA 118, e2105070118 (2021).
- 34. Wong F et al. Cytoplasmic condensation induced by membrane damage is associated with antibiotic lethality. Nat. Commun 12, 2321 (2021).
- 35. Lowe D, Why AlphaFold won’t revolutionise drug discovery. In Chemistry World (2022). Accessed 21 March 2023 at https://www.chemistryworld.com/opinion/why-alphafold-wont-revolutionise-drug-discovery/4016051.article.
- 36. Wong F et al. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Mol. Syst. Biol 18, e11081 (2022).
- 37. Breidenstein EB, de la Fuente-Núñez C, Hancock RE, Pseudomonas aeruginosa: all roads lead to resistance. Trends Microbiol. 19, 419–426 (2011).
- 38. Vo AH, Van Vleet TR, Gupta RR, Liguori MJ, Rao MS, An overview of machine learning and big data for drug toxicity evaluation. Chem. Res. Toxicol 33, 20–37 (2020).
- 39. Palmer N, Maasch JRMA, Torres MDT, de la Fuente-Nunez C, Molecular dynamics for antimicrobial peptide discovery. Infect. Immun 89, e00703–20 (2021).
- 40. Watkins AM, Rangan R, Das R, FARFAR2: Improved de novo Rosetta prediction of complex global RNA folds. Structure 28, 963–976 (2020).
- 41. Wheeler NW, Gardner PG, Barquist L, Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica. PLoS Genet. 14, e1007333 (2018).
- 42. Chen H et al. Systematic evaluation of machine learning methods for identifying human–pathogen protein-protein interactions. Brief. Bioinform 22, bbaa068 (2021).
- 43. Bojar D et al. Deep-learning resources for studying glycan-mediated host-microbe interactions. Cell Host Microbe 29, 132–144 (2021).
- 44. Fisch D et al. Defining host-pathogen interactions employing an artificial intelligence workflow. eLife 12, e40560 (2019).
- 45. Hie B et al. Learning the language of viral evolution and escape. Science 371, 284–288 (2021).
- 46. Sample PJ et al. Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nat. Biotech 37, 803–809 (2019).
- 47. Ong E et al. Vaxign-ML: supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens. Bioinformatics 36, 3185–3191 (2020).
- 48. Ashdown GW et al. A machine learning approach to define antimalarial drug action from heterogeneous cell-based screens. Sci. Adv 6, aba9338 (2020).
- 49. Boeck L et al. Mycobacterium abscessus pathogenesis identified by phenogenomic analyses. Nat. Microbiol 7, 1431–1441 (2022).
- 50. Chaudhary N, Weissman D, Whitehead KA, mRNA vaccines for infectious diseases: principles, delivery and clinical translation. Nat. Rev. Drug Discov 20, 817–838 (2021).
- 51. Crank MC et al. A proof of concept for structure-based vaccine design targeting RSV in humans. Science 365, 505–509 (2019).
- 52. Stracy M et al. Minimizing treatment-induced emergence of antibiotic resistance in bacterial infections. Science 375, 889–894 (2022).
- 53. Cornforth DM et al. Pseudomonas aeruginosa transcriptome during human infection. Proc. Natl. Acad. Sci. USA 115, E5125–E5134 (2018).
- 54. Lee J et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
- 55. Metsky HC et al. Designing sensitive viral diagnostics with machine learning. Nat. Biotech 40, 1123–1131 (2022).
- 56. Khaledi A et al. Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics. EMBO Mol. Med 12, e10264 (2020).
- 57. Kavvas ES et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nat. Commun 9, 4306 (2018).
- 58. Weis C et al. Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nat. Med 28, 164–174 (2022).
- 59. Mei X et al. Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nat. Med 26, 1224–1228 (2020).
- 60. Roberts M et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell 3, 199–217 (2021).
- 61. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ, Next-generation machine learning for biological networks. Cell 173, 1581–1592 (2018).
- 62. de Lima LF, Ferreira AL, Torres MDT, de Araujo WR, de la Fuente-Nunez C, Minute-scale detection of SARS-CoV-2 using a low-cost biosensor composed of pencil graphite electrodes. Proc. Natl. Acad. Sci. USA 118, e2106724118 (2021).
- 63. Pardee K et al. Paper-based synthetic gene networks. Cell 159, 940–954 (2014).
- 64. Pardee K et al. Rapid, low-cost detection of Zika virus using programmable biomolecular components. Cell 165, 1255–1266 (2016).
- 65. Karlikow M et al. Field validation of the performance of paper-based tests for the detection of the Zika and chikungunya viruses in serum samples. Nat. Biomed. Eng 6, 246–256 (2022).
- 66. Lee RA et al. Ultrasensitive CRISPR-based diagnostic for field-applicable detection of Plasmodium species in symptomatic and asymptomatic malaria. Proc. Natl. Acad. Sci. USA 117, 25722–25731 (2020).
- 67. de Puig H et al. Minimally instrumented SHERLOCK (miSHERLOCK) for CRISPR-based point-of-care diagnosis of SARS-CoV-2 and emerging variants. Sci. Adv 7, abh2944 (2021).
- 68. Angenent-Mari NM, Garruss AS, Soenksen LR, Church G, Collins JJ, A deep learning approach to programmable RNA switches. Nat. Commun 11, 5057 (2020).
- 69. Valeri JA et al. Sequence-to-function deep learning frameworks for engineered riboregulators. Nat. Commun 11, 5058 (2020).
- 70. Chuai G et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 19, 80 (2018).
- 71. Kim HK et al. Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity. Nat. Biotech 36, 239–241 (2018).
- 72. Wang D et al. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat. Commun 10, 4284 (2019).
- 73. Shrock E et al. Viral epitope profiling of COVID-19 patients reveals cross-reactivity and correlates of severity. Science 370, abd4250 (2020).
- 74. Turbé V et al. Deep learning of HIV field-based rapid tests. Nat. Med 27, 1165–1170 (2021).
- 75. Wattam AR et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 42, D581–D591 (2014).