Abstract
The widespread adoption of high-throughput sequencing technologies and multi-omics approaches has led to rapid accumulation of genomic, transcriptomic, proteomic, and even single-cell multimodal datasets, resulting in an exponential growth of biological data. The massive scale and inherent complexity of these datasets pose significant challenges for data management, analysis, and interpretation in the field of bioinformatics. Concurrently, artificial intelligence (AI) techniques, particularly deep learning and reinforcement learning, have achieved groundbreaking advances in medical diagnostics, drug discovery, and genomic analyses, providing novel theoretical tools and analytical paradigms for bioinformatics research. AI techniques are now extensively applied to DNA, RNA, and protein sequence prediction and design, 3D structural elucidation, functional annotation, integrative analysis of multi-omics data, and personalized drug design for precision medicine, significantly advancing biological research. This review systematically summarizes recent research progress and representative applications of AI techniques in bioinformatics, specifically discussing suitable scenarios and advantages of traditional machine learning algorithms, deep learning models, and reinforcement learning methods. We highlight AI’s transformative impact with quantitative metrics from landmark achievements: accurate near-atomic protein structure prediction (median 0.96 Å on CASP14), robust single-cell modeling (AvgBIO 0.82), high protein design success rates (up to 92%), and sensitive cancer detection (area under the curve (AUC) 0.93). Furthermore, the paper provides an in-depth analysis of the latest advancements of AI in specific tasks, including biomedical text mining, multimodal omics integration, and single-cell analyses, while highlighting current challenges such as data noise and sparsity, difficulties in modeling long biological sequences, complexities in multimodal data integration, insufficient model interpretability, and ethical and privacy concerns. Finally, the paper outlines promising future research directions, emphasizing large-scale data mining, cross-domain model generalization, innovations in drug design and personalized medicine, and advocates for establishing an open and collaborative research ecosystem.
Keywords: artificial intelligence, bioinformatics, survey
Introduction
Over the past few decades, biological research paradigms have undergone profound transformations. The widespread adoption of high-throughput sequencing technologies and multi-omics approaches has resulted in an exponential accumulation of biological data, encompassing genomic, transcriptomic, proteomic, and even single-cell multimodal datasets [1–5]. Consequently, efficiently managing, analyzing, and extracting meaningful insights from these massive and complex datasets has emerged as a critical challenge in the field of bioinformatics [6–8].
Concurrently, breakthroughs in artificial intelligence (AI), particularly deep learning and reinforcement learning techniques, have shown remarkable successes across medical diagnostics, pharmaceutical discovery, and genomics, opening unprecedented opportunities for bioinformatics [9, 10]. Enabled by increased amounts of training data and computational power, AI applications in bioinformatics have evolved from simple machine-learning models leveraging limited sets of features to sophisticated multimodal, multitask deep neural networks trained on extensive biological datasets [11, 12]. These advances span various biological scales and tasks, including accurate prediction and design of DNA, RNA, and protein sequences, elucidation of their 3D structures and functional mechanisms, integrative analysis of multi-omics data, and tailored optimization for personalized medicine and drug design [10, 13, 14].
Critically, integrated AI frameworks, foundation models (FMs) and large language models (LLMs) continue to drive groundbreaking discoveries, offering biologists novel research paradigms and theoretical tools to probe biological systems [15]. At the sequence level, deep learning models are now capable of automatically identifying evolutionarily conserved regions, mutation patterns, and critical functional domains within biological sequences [16, 17]. At the structural level, next-generation Transformer architectures and their hybrid algorithms allow rapid inference of protein and nucleic acid 3D conformations and interaction interfaces directly from sequence information [10, 18]. At the functional level, AI models capture synergistic relationships among sequences, structures, epigenetic modifications, and cellular states, thereby mapping gene regulatory networks, functional modules, and disease-associated pathways, providing novel, data-driven strategies for disease diagnosis and therapeutic intervention [11, 19]. These interconnected domains, from input data to core biological problems, are illustrated in Fig. 1.
Figure 1.
AI in bioinformatics leverages machine learning, reinforcement learning, and deep learning to analyze multimodal inputs—such as genomic sequences, medical images, and chemical structures—in order to solve core biological problems ranging from protein structure prediction and gene regulation to single-cell analysis and the de novo design of molecules.
Moreover, the advent of single-cell omics and multimodal datasets has greatly expanded the complexity and potential scope of AI applications. Integration of single-cell transcriptomic and epigenomic data facilitates the generation of highly resolved cell lineage maps and cellular atlases [2, 11]. Multimodal integration further enables the synthesis of genomic sequences, protein structures, medical imaging, and clinical textual data, enhancing our capability to deeply reconstruct the dynamic interplay of biological systems across spatial–temporal scales and environmental contexts [20, 21]. With the continuous accumulation of patient cohorts and multi-omics data, AI holds immense potential in predicting rare diseases, optimizing personalized pharmacotherapy, and supporting intelligent diagnostic assistance [9, 12].
Nevertheless, significant challenges remain for widespread AI adoption in bioinformatics. Massive and heterogeneous datasets frequently contain inherent noise, biases, and class imbalance issues [22]. In addition, biological sequences, particularly those from higher organisms, are often exceedingly long, complicating effective modeling of long-range dependencies using current attention mechanisms or hierarchical strategies [23]. Furthermore, explainability and reproducibility of AI models face heightened scrutiny from the biological and medical communities, necessitating careful integration of prior biological knowledge and visualization methods [24]. Finally, privacy and ethical considerations become crucial when handling sensitive genomic and clinical patient data, demanding rigorous standards and frameworks for responsible AI deployment [25].
Despite these challenges, AI continues to provide transformative research approaches, spanning from microscopic molecular mechanisms to macroscopic biological processes, enabling comprehensive and precise characterization of life systems. The combination of traditional machine learning (ML) with contemporary deep learning, reinforcement learning, and large-scale FMs is rapidly propelling innovation in genomics, protein engineering, and precision medicine [10, 15]. Continued advancements in algorithms, data availability, and computational resources will undoubtedly strengthen the deep integration of AI within bioinformatics, providing incalculable benefits for life sciences and clinical practices, such as accelerating drug screening and optimization, enhancing diagnostic efficiency, and facilitating personalized therapeutic solutions [9, 10, 14].
This paper systematically reviews the progress and application of AI in bioinformatics. It highlights how rapid advancements in sequencing technologies and multi-omics methods have produced massive, complex biological datasets, posing challenges for analysis and interpretation. AI techniques, including ML, deep learning, and reinforcement learning, have effectively addressed these challenges, facilitating breakthroughs in sequence analysis, structural prediction, functional annotation, single-cell studies, and personalized medicine. We categorize AI techniques into ML (traditional methods such as support vector machines (SVM) and random forests (RF), suitable for well-defined feature spaces and smaller datasets), deep learning (Transformer-based models such as AlphaFold and DNABERT, excelling in sequence analysis and structural prediction), and reinforcement learning (notably effective in sequential decision-making, protein design, and drug discovery). The review details recent AI advancements in biomedical text mining, sequence and structure prediction, functional annotation, single-cell multi-omics analysis, and multimodal data integration, discussing representative models and methodologies. In addition, we summarize current challenges in applying AI, including data sparsity, noisy data, long-sequence modeling, multimodal integration complexities, model interpretability, and ethical concerns. Finally, the paper presents future research opportunities, emphasizing large-scale dataset utilization, cross-domain generalization, innovative drug design and personalized medicine, and encouraging open collaboration.
Artificial intelligence techniques in bioinformatics
AI was initially recognized for its applications in sectors such as gaming and finance [26–28], and has since expanded into healthcare [29], robotics, and various other applied science domains. In bioinformatics, AI systems, trained on extensive biological datasets, exhibit extraordinary predictive and generative capabilities through sophisticated learning algorithms. The applications of AI in bioinformatics can be categorized into three main areas: ML, deep learning, and reinforcement learning. A comprehensive summary and categorization of the methods discussed in this review are provided in Supplementary Table S1. ML models in bioinformatics are adept at interpreting and predicting biological phenomena using algorithms that analyze and learn from structured data. These models excel in classification and regression tasks, identifying patterns and making predictions from well-labeled datasets. Deep learning, an extension of ML, employs complex neural networks that perform feature extraction and transformation across multiple layers, effectively handling raw and unstructured data. Techniques such as convolutional neural networks and recurrent neural networks are crucial for capturing spatial and sequential dependencies in data, such as genomic sequences or protein structures. In particular, the emergence of Transformer-based models, such as the GPT series and LLaMA series, has further revolutionized bioinformatics. These models leverage self-attention mechanisms to improve the processing of sequential data, providing significant improvements in tasks such as protein folding prediction and genetic sequence analysis (Fig. 2). Reinforcement learning in bioinformatics is focused on decision-making processes, where algorithms iteratively learn to make a series of decisions in a dynamic environment to achieve specific goals. 
This approach is especially beneficial in drug discovery and genomics, where models adapt and optimize outcomes based on feedback from biological simulations or experimental data. The combined strengths of ML, deep learning, and reinforcement learning underline their extensive applicability in various facets of bioinformatics. This tripartite framework not only enhances the accuracy of biological analyses but also accelerates the pace of biomedical discoveries and therapeutic developments (Fig. 1).
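To make the self-attention mechanism mentioned above concrete, the sketch below computes scaled dot-product attention over a toy sequence in plain Python. The 2-dimensional token embeddings and the identity query/key/value projections are simplifying assumptions for illustration, not the configuration of any model discussed here:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    X is a list of token embedding vectors; the output is one context
    vector per token, i.e. an attention-weighted mix of all embeddings.
    """
    d = len(X[0])
    out = []
    for q in X:  # each token attends to every token, itself included
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        ctx = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        out.append(ctx)
    return out

# Toy "sequence" of three 2-d token embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts = self_attention(tokens)
for c in contexts:
    print([round(x, 3) for x in c])
```

Because the attention weights form a convex combination, each output coordinate stays within the range of the corresponding input coordinates; production Transformers add learned projections, multiple heads, and positional information on top of this core operation.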
Figure 2.
This timeline charts the evolution of AI in bioinformatics from 2004 to 2025, highlighting key milestones from early machine learning applications to advanced deep learning models like AlphaFold and BioBERT.
Literature search methodology
We conducted a structured literature search following PRISMA guidelines [30], with the screening workflow summarized in Fig. 3. Our corpus spans publications from 2004 to 2025; however, motivated by the profound impact of Transformer-based architectures (e.g. BERT) on AI, we placed analytical emphasis on works published between 2019 and 2025. Searches were performed in PubMed, arXiv, bioRxiv, Web of Science, and Google Scholar.
Figure 3.
PRISMA flow diagram for the literature selection.
Our inclusion criteria encompass: proposing AI or mobile computing methods in bioinformatics; conducting benchmarking or clinical evaluations of these methods; possessing clearly defined tasks, datasets, and metrics; or representing state-of-the-art comprehensive surveys and benchmarks. We excluded nonbioinformatics AI papers without biological applications, purely descriptive works lacking experimental results, and articles where full text was unavailable or entries were duplicated.
As shown in Fig. 3, our initial search identified a number of records, which were reduced after removing duplicates. Following title and abstract screening, a number of articles underwent full-text evaluation, with some excluded based on our criteria. Ultimately, the studies that met all inclusion criteria were incorporated into this review. For analytical purposes, we extracted data items including task/domain, dataset, metrics, headline results, and availability.
Based on the above systematic literature retrieval method, we identified and analyzed three main categories of AI technologies in bioinformatics. Each of these technologies has its unique advantages and applicable scenarios, which will be introduced in detail one by one below.
Machine learning techniques in bioinformatics
In bioinformatics, traditional ML methods still play a crucial role in scenarios where experimental conditions are relatively controlled, features are well defined, or the scale of the data is limited. If we denote the training set as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents biological data features (such as genomic sequencing fragments, physicochemical properties of proteins, or medical imaging signals) and $y_i$ is the corresponding category label or numerical target, then classification or regression tasks can be accomplished by learning a function $f(x; \theta)$ that minimizes the following objective:

$$\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \theta), y_i\big), \tag{1}$$

where $\theta$ represents the parameters and $\mathcal{L}$ denotes the loss function. Within this framework, SVM, RF, Decision Trees, $k$-Nearest Neighbors (KNN), and ensemble learning methods (such as XGBoost) [31, 32] are often employed for discriminative analysis of biomedical text, images, videos, and graph-structured data.
For instance, Orthogonal Transform + K-Means combines the Discrete Walsh Transform and Discrete Chebyshev Transform with vector quantization for biomedical image compression, achieving notable success in enhancing compression ratios and image quality. Taking DNA sequence analysis as an example, traditional ML methods are commonly used for classification and prediction based on sequencing data. OncoNPC combines large-scale custom sequencing panel data with polygenic risk scores (PRS) and uses the XGBoost model to classify cancers of unknown primary (CUPs) and predict posttreatment responses [33]. ELSA-seq employs SVM to identify optimized methylation sequencing patterns, achieving ultra-sensitive early detection of cancer (especially lung cancer) at extremely high dilutions (1/10,000) [34]. In RNA analysis, due to the technical variations in RNA sequencing data, researchers often use supervised learning methods from ML. The Procrustes model aligns RNA-seq gene expression matrices from different sequencing platforms through supervised learning techniques, effectively eliminating batch effects and enabling comparability of data across different studies or clinical samples [35]. For proteins, traditional ML algorithms continue to be used for protein function prediction and interaction analysis. pcPIP utilizes SVM to build a classification model from multiple structural properties of protein–protein complex interfaces, distinguishing between native and nonnative interaction interfaces, thus aiding researchers in validating experimental models [36].
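As a minimal illustration of this classical workflow, the sketch below pairs hand-crafted sequence features with a $k$-nearest-neighbor classifier using only the Python standard library. The GC-content and CpG-frequency features and the synthetic training fragments are illustrative assumptions, not data from the cited studies:

```python
import math
from collections import Counter

def features(seq):
    # Hand-crafted features x_i: GC content and "CG" dinucleotide frequency.
    gc = sum(seq.count(b) for b in "GC") / len(seq)
    cpg = seq.count("CG") / max(len(seq) - 1, 1)
    return (gc, cpg)

def knn_predict(train, x, k=3):
    # Predict the label of feature vector x from its k nearest
    # training points (Euclidean distance), by majority vote.
    dists = sorted((math.dist(features(s), x), label) for s, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy labeled set: CpG-island-like fragments vs AT-rich background (synthetic).
train = [
    ("CGCGGCGCGC", "island"), ("GCGCGGCCGC", "island"), ("CCGCGTGCGG", "island"),
    ("ATATTATAAT", "background"), ("TTATAATATA", "background"), ("AATTATATTA", "background"),
]
print(knn_predict(train, features("GCGCCGCGTA")))  # GC-rich query -> "island"
```

Real pipelines would use richer feature sets and library implementations (e.g. scikit-learn or XGBoost), but the division of labor is the same: domain experts define $x_i$, and the learner fits $f(x; \theta)$.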
Reinforcement learning techniques in bioinformatics
Reinforcement learning (RL) is a method that allows an agent to learn optimal strategies through trial-and-error interaction with the environment [37]. At each timestep $t$, the agent perceives the environment state $s_t$, takes an action $a_t$, and receives a reward $r_t$ from the environment. The goal is to maximize the cumulative discounted reward $R = \sum_{t=0}^{\infty} \gamma^{t} r_{t}$, where $\gamma$ is the discount factor.

Commonly used RL methods include Q-learning, which is based on value functions [38]. The core of Q-learning is to update the state-action value function $Q(s_t, a_t)$ using the Bellman equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning rate. Deep Q-Network (DQN) [39] introduces deep neural networks into Q-learning, using a neural network to approximate the $Q$ value function and thereby handle high-dimensional state spaces. Furthermore, policy gradient methods directly optimize the policy parameters $\theta$ to maximize the expected cumulative reward $J(\theta)$. The gradient is computed as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \right],$$

where $Q^{\pi_{\theta}}(s, a)$ is the state-action value function under the policy $\pi_{\theta}$ [40].
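The Bellman update above can be demonstrated in a few lines. The sketch below runs tabular Q-learning on a hypothetical three-state chain environment standing in for a biological simulation; the environment, rewards, and hyperparameters are illustrative assumptions, not a method from the cited works:

```python
import random

random.seed(0)

# Toy chain: states 0..2; action 0 = left, 1 = right. Reaching the
# rightmost state yields reward 1 and ends the episode.
N_STATES, ACTIONS = 3, (0, 1)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print([[round(q, 2) for q in row] for row in Q])
```

After training, $Q(1, \text{right})$ approaches 1 and $Q(0, \text{right})$ approaches $\gamma = 0.9$, so the greedy policy always moves right; DQN replaces the table `Q` with a neural network when the state space is too large to enumerate.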
In bioinformatics, reinforcement learning is increasingly applied to problems such as sequential decision-making, optimization of complex objective functions, and exploration of large solution spaces. Traditional bioinformatics methods often face limitations when dealing with high-dimensional and biologically complex data. With the rise of deep learning, models that combine deep neural networks with deep reinforcement learning provide new approaches to these challenges. In the processing of biomedical texts, images, videos, and graph-structured data, researchers combine reinforcement learning with deep learning to enhance model capabilities. For example, the Bi-Neighborhood Graph Neural Network (BN-GNN) utilizes dual DQN to automatically search for the optimal number of GNN layers for each brain network instance, raising the performance ceiling of brain network analysis tasks such as brain disease prediction [41]. MedDQN enhances attention to minority-class samples by designing reward functions, using DQN to address class imbalance in biomedical image classification [42]. In addition, REIN-NER combines bi-directional long short-term memory networks (Bi-LSTM) with DQN, enhancing named entity recognition (NER) by leveraging dependencies between sequence annotations [43]. In DNA and RNA sequence analysis, RL is used to optimize sequence alignment and design. Methods such as EdgeAlign and DQNAlign use DQN to allow agents to observe local sequence fragments and dynamically choose alignment actions (match, insert, or delete) to maximize cumulative rewards, achieving efficient sequence alignment [44–46]. DyNA-PPO models the sequence design problem as a Markov decision process, using proximal policy optimization (PPO) to optimize sequence generation strategies [47]. In RNA structure prediction, LEARNA combines Monte Carlo Tree Search (MCTS)
with deep learning, training a policy network to generate sequences that can fold into target structures [48]. In protein research, RL plays a crucial role in tasks such as protein design, structure prediction, and complex assembly. FoldingZero treats the 2D hydrophobic-polar (HP) model of protein folding as an RL problem, solving it using convolutional neural networks and MCTS [49]. DrlComplex is the first to combine predicted inter-chain contact information, using DQN to assemble protein complexes [50]. Additionally, ProteinGAN uses generative adversarial networks (GANs) and RL methods to generate protein sequences with specific functions [51]. In drug design, RL is one of the core tools for generating new molecular entities. Methods such as MolDQN and DrugEx use policy gradients, DQN, and other algorithms to optimize the generation process of SMILES strings or molecular graphs, ensuring that the generated molecules satisfy specific pharmacological properties [52–54].
Another RL-driven approach uses transcriptomic data to guide personalized drug design [55], while DRLinker optimizes the connection of molecular fragments [56]. DeepFMPO combines a molecular fragment replacement strategy with the A2C algorithm to achieve multiparameter optimization [57].
Deep learning techniques in bioinformatics
Compared to traditional ML, deep learning is highly favored primarily because it can automatically learn complex feature representations from large-scale, high-dimensional biological data, thus providing new perspectives for understanding life processes [58, 59]. If the training set is represented as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, deep neural networks typically learn the parameters $\theta$ by minimizing

$$\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \theta), y_i\big), \tag{2}$$

where $\mathcal{L}$ is the target loss function (such as cross-entropy loss), and $x_i$ and $y_i$ represent the inputs (e.g. genomic fragments, protein sequences, or multimodal medical images) and labels, respectively.
In the field of bioinformatics, models have evolved from traditional statistical and ML approaches to deep learning models, particularly those based on the Transformer architecture [60]. These models can automatically learn complex patterns and hierarchical features in biological data, providing new methods and perspectives for understanding life processes. For biomedical texts, images, videos, and graph-structured data, researchers have utilized the Transformer architecture in deep learning models to enhance the processing of biomedical information. For example, BioBERT [61] adapts the BERT model pretrained on large biomedical corpora such as PubMed and PMC, suitable for biomedical text mining tasks including NER, Relation Extraction (RE), and Question Answering (QA) [61]. Additionally, PubMedBERT emphasizes pretraining from scratch on biomedical texts to better capture domain-specific knowledge [62]. In DNA sequence analysis, deep learning models are used to interpret DNA sequences. DNABERT uses a pretrained bidirectional Transformer encoder to capture both global and local features of DNA sequences for predicting functional elements such as promoters and transcription factor binding sites (TFBS) [63]. Enformer introduces a Transformer architecture with a larger receptive field, enhancing predictions of gene expression levels and enhancer–promoter interactions [64]. Protein research is one of the most active areas for the application of deep learning. AlphaFold2 utilizes an innovative SE(3)-equivariant Transformer and attention mechanism to achieve atomic-level accuracy in predicting the 3D structures of protein monomers [10]. The evolutionary scale modeling (ESM) series [14, 65], such as ESM-1b, infers protein structure and function directly from sequence data through self-supervised learning on a large corpus of protein sequences [14]. In single-cell multi-omics analysis, deep learning models have transformed traditional data analysis methods. 
scBERT, using self-supervised learning on large-scale unlabeled single-cell RNA sequencing data, enhances model generalization, and overcomes batch effects [66].
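As an example of the preprocessing used by DNABERT-style genomic language models, the sketch below tokenizes a DNA sequence into overlapping k-mers and maps them to integer ids. The special tokens and vocabulary construction are illustrative assumptions rather than DNABERT's released vocabulary files:

```python
from itertools import product

def kmer_tokenize(seq, k=3):
    # Slide a window of width k over the sequence: overlapping k-mer tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(k=3):
    # Map special tokens plus all 4**k possible k-mers to integer ids.
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: i for i, tok in enumerate(specials + kmers)}

def encode(seq, vocab, k=3):
    # BERT-style [CLS] ... [SEP] framing; k-mers containing ambiguous
    # bases such as 'N' fall back to [UNK].
    toks = ["[CLS]"] + kmer_tokenize(seq, k) + ["[SEP]"]
    return [vocab.get(t, vocab["[UNK]"]) for t in toks]

vocab = build_vocab(k=3)
ids = encode("ACGTN", vocab, k=3)
print(kmer_tokenize("ACGTN"))  # ['ACG', 'CGT', 'GTN']
print(ids)
```

The resulting id sequence is what a Transformer encoder consumes; masked-token pretraining then hides a fraction of the k-mer ids and trains the model to recover them from context.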
The evolution from task-specific deep learning models to more generalized architectures has culminated in the emergence of FMs: large-scale, pretrained systems capable of adapting to diverse downstream tasks with minimal fine-tuning.
Foundation models and multimodal horizons
The evolution of deep learning in bioinformatics has culminated in FMs: large-scale, pretrained systems that serve as unified computational frameworks across diverse biological tasks and scales. These models represent a paradigm shift from task-specific architectures to generalizable systems capable of transfer learning across domains. For sequence analysis, protein and genomic language models provide transferable representations; for single-cell analysis, pretrained models enable cross-study generalization; and for clinical applications, multimodal frameworks integrate textual, imaging, and molecular data for comprehensive analysis. The specific implementations and performance of these approaches are detailed in the next section.
To facilitate understanding of the quantitative comparisons that follow, Table 1 provides the abbreviations and task-metric mappings used throughout this paper. Tables 2 and 3 then quantify this evolution, presenting performance metrics across representative methods in key domains including protein structure prediction, genomic modeling, single-cell analysis, and drug discovery. The data reveal distinct patterns: deep learning and FMs excel through end-to-end optimization and long-range dependency modeling, achieving state-of-the-art results on complex tasks; traditional ML retains advantages in data-limited scenarios with well-defined features; and reinforcement learning uniquely addresses sequential decision-making, despite its computational costs. The following section therefore benchmarks these methods across key bioinformatics domains using standardized metrics.
Table 1.
Abbreviations and task-metric mapping used in this paper.
| (A) Abbreviations used in method comparison tables | | | |
|---|---|---|---|
| Abbrev. | Meaning | Abbrev. | Meaning |
| $r$ | Pearson correlation coefficient | $\rho$ | Spearman rank correlation |
| CCC | Concordance Correlation Coefficient | | |
| AUROC | Area under the ROC curve | AUC-PR | Area under the Precision-Recall curve |
| F1 | F1 score | MCC | Matthews Correlation Coefficient |
| Accuracy (Acc) | Classification accuracy | AvgBIO | Composite score for single-cell tasks |
| RMSD | Root-mean-square deviation | pLDDT | Predicted local distance difference test |
| Q8 Acc | 8-state secondary structure accuracy | Contact precision | Top L/10 contact prediction accuracy |
| ROUGE-L | Longest common subsequence ROUGE | Pass rate | Success rate within samples |
| Success rate | Design success percentage | Globular domain % | Percentage of globular domain proteins |
| Molecular diversity | Diversity score of generated molecules | COV | Coverage (for conformer generation) |
| (B) Task-metric mapping | | | |
| Task category | Commonly used metrics | Task category | Commonly used metrics |
| Protein structure prediction | RMSD; pLDDT; Q8 Acc; Contact precision | Single-cell analysis | Accuracy; F1; AvgBIO |
| Protein design | pLDDT; Pass rate; Success rate; Globular domain % | Biomedical NLP | F1 score; ROUGE-L |
| Protein representation | Spearman $\rho$ | Drug discovery | AUROC; AUC-PR; Molecular diversity |
| Molecular property prediction | COV | Clinical diagnostics | AUROC; F1 score |
| Gene expression prediction | Pearson $r$ | RNA structure prediction | F1 score (2D); RMSD (3D) |
| Genomic classification | F1 score; Accuracy; MCC | | |
Table 2.
Representative AI methods with datasets, application domains, metrics, and headline results.
| Methods | Domain/Task | Training datasets | Metrics | Headline result | Code link |
|---|---|---|---|---|---|
| Protein structure prediction | | | | | |
| AlphaFold2 [10] | Protein 3D structure | PDB; UniClust30 | Cα RMSD | Median 0.96 Å (CASP14) | https://github.com/deepmind/alphafold |
| AlphaFold3 [67] | Protein-nucleic acid complexes | PDB; CCD | LDDT | – (proteins) | https://github.com/google-deepmind/alphafold3 |
| RGN2 [68] | Protein 3D structure | UniParc | Top L/10 contact precision | – (orphan proteins) | https://github.com/aqlaboratory/rgn2 |
| Protein design | | | | | |
| RFdiffusion [69] | De novo protein design | PDB | Success rate | 92% (motif scaffolding) | https://github.com/RosettaCommons/RFdiffusion |
| ESM3 [65] | Protein generation | UniRef90 | Pass rate | – (Pass@128) | https://github.com/evolutionaryscale/esm |
| ProtGPT2 [70] | De novo protein design | UniRef50 | Globular domain % | – | https://github.com/nferruz/ProtGPT2 |
| ProGen2 [71] | Protein generation | UniRef90; BFD30 | pLDDT | Median pLDDT – | https://github.com/enijkamp/progen2 |
| Protein representation learning | | | | | |
| ESM-1b [14] | Protein representation | UniParc | Q8 Accuracy | – | https://github.com/facebookresearch/esm |
| ProteinBERT [72] | Protein sequence | UniRef90 | Spearman $\rho$ | – (stability) | https://github.com/TuviaWeiss/ProteinBERT |
| xTrimoPGLM [73] | Unified protein modeling | UniRef90; ColabFoldDB | Spearman $\rho$ | – (fitness, with LoRA fine-tuning) | https://huggingface.co/proteinglm |
| Genomic sequence modeling | | | | | |
| Enformer [64] | Gene expression prediction | Human/mouse genomes | Pearson $r$ | – (human genes) | https://github.com/deepmind/deepmind-research/tree/master/enformer |
| DNABERT [63] | Genomic classification | ENCODE 690 TFBS | Accuracy; F1 | – (mean Accuracy and F1) | https://github.com/jerryji1993/DNABERT |
| DNABERT-2 [74] | Genomic classification | 135 species genomes | Average score (F1/MCC) | – (GUE benchmark) | https://huggingface.co/zhihan1996/DNABERT-2-117M |
| Nucleotide transformer [75] | Genomic tasks | 3202 human genomes | AUC | – (pathogenic variants) | https://github.com/instadeepai/nucleotide-transformer |
| HyenaDNA [76] | Long-range genomics | Human genome | Top-1 accuracy | – (species classification) | https://github.com/HazyResearch/hyena-dna |
| Single-cell analysis | | | | | |
| scGPT [77] | Cell-type annotation | CELLxGENE | AvgBIO | 0.821 (PBMC 10k, fine-tuning) | https://github.com/bowang-lab/scGPT |
| Geneformer [78] | Cell classification | Genecorpus-30M | Accuracy | – (cell state) | https://huggingface.co/ctheodoris/Geneformer |
| scFoundation [79] | Single-cell analysis | 50M cells (GEO, HCA) | Macro F1 score | – (cell type) | https://github.com/biomap-research/scFoundation |
| RNA structure prediction | | | | | |
| RNA-FM [80] | RNA secondary structure | RNAcentral | F1 score | 0.941 (ArchiveII600) | https://github.com/ml4bio/RNA-FM |
| E2Efold [81] | RNA secondary structure | RNAStralign | F1 score | – | https://github.com/ml4bio/E2Efold |
| RhoFold [82] | RNA 3D structure | PDB | RMSD | Mean RMSD – Å (RNA-Puzzles) | https://github.com/ml4bio/RhoFold |
| Clinical applications | | | | | |
| OncoNPC [33] | Oncology diagnostics | DFCI; MSK; VICC | F1 score | – | https://github.com/itmoon7/onconpc |
| ELSA-seq [34] | ctDNA detection | Lung cancer cohorts | AUROC | 0.93 | https://github.com/bnr-ed/mworkflow |
Table 3.
Representative AI methods in biomedical NLP and drug discovery.
| Methods | Domain/Task | Training datasets | Metrics | Headline result | Code link |
|---|---|---|---|---|---|
| Biomedical NLP | | | | | |
| BioBERT [61] | Biomedical NER | PubMed; PMC | F1 score | – (BC5-chem) | https://github.com/dmis-lab/biobert |
| PubMedBERT [62] | Biomedical NLP | PubMed abstracts | F1 score | – (BC5-chem) | microsoft/BiomedNLP |
| BioELECTRA [83] | Biomedical NLP | PubMed; PMC | F1 score | – (BC5-chem) | https://github.com/kamalkraj/BioELECTRA |
| BioBART [84] | Biomedical text generation | PubMed abstracts | ROUGE-L | – (MeQSum) | https://huggingface.co/IDEA-CCNL/Yuyuan-Bart-139M |
| Drug discovery and molecular design | | | | | |
| DrugEx v3 [54] | De novo drug design | ChEMBL | Molecular diversity | – | https://github.com/XuhanLiu/DrugEx |
| DeepDRA [85] | Drug response prediction | GDSC; CTRP | AUROC | – | https://git.dml.ir/taha.mohammadzadeh/DeepDRA |
| DFT_ANPD [86] | Anticancer prediction | NPACT; CancerHSP | AUROC | – | https://github.com/Rambono/DFT_ANPD |
| Uni-Mol [87] | Molecular property | 209M conformations | COV | – (GEOM-Drugs) | https://github.com/dptech-corp/Uni-Mol |
Notes: All metrics are reported on standard benchmark datasets for each domain. The training datasets listed focus on primary pretraining data. For each method, the most informative reported metric is shown whenever possible.
However, benchmark performance alone does not guarantee practical utility. The translation from research prototypes to production systems faces additional constraints, which we address next.
Quantitative evidence of artificial intelligence’s transformative impact
While the theoretical capabilities of FMs are compelling, their practical value ultimately depends on measurable performance improvements over established methods. Table 4 and Fig. 4 provide quantitative evidence of AI’s transformative impact across key bioinformatics domains. The data reveal not incremental refinements but order-of-magnitude improvements in several critical areas.
Table 4.
This benchmark provides a comprehensive performance comparison between traditional methods and modern deep learning approaches across key biomedical domains defined in Tables 2 and 3, using one representative example per domain.
| Task domain | Traditional method | Perf.(benchmark) | Modern DL method | Perf. | Metric |
|---|---|---|---|---|---|
| Protein structure prediction | I-TASSER [88] | 4.24 Å (CASP8) | AlphaFold2 [10] | 0.96 Å (CASP14) | Cα RMSD |
| Protein design | Rosetta Pipeline [89] | 0.07% (HA); 0.14% (IL-7Rα); 0.43% (PD-L1) | RFdiffusion [69] | 19% (HA; IL-7Rα; PD-L1; InsR; TrkA) | Exp. success |
| Protein representation learning | Alignment-based methods [90] | 9% (SCOP 1.75) | xTrimoPGLM [73] | 75.6% (SCOP 1.75) | Acc. |
| Genomic sequence modeling | deltaSVM [91] | 0.26 (CAGI5 competition dataset) | Enformer [64] | 0.62 (CAGI5 competition dataset) | Pearson r |
| Single-cell analysis | Seurat v3 [92] | 0.724 (PBMC 10k) | scGPT [77] | 0.821 (PBMC 10k) | AvgBIO |
| RNA structure prediction | RNAfold [93] | 0.592 (ArchiveII600) | RNA-FM [80] | 0.941 (ArchiveII600) | F1 |
| Clinical applications | Mutual nearest neighbors (MNN) [94] | 0.62 (MET500) | Procrustes [35] | 0.85 (MET500) | Median CCC |
| Biomedical NLP | TaggerOne [95] | 82.9% (NCBI Disease) | BioBERT [61] | 89.7% (NCBI Disease) | NER F1 |
| Drug discovery and molecular design | iANP-EC [96] | 0.837 (NPACT + CancerHSP) | DFT_ANPD [86] | 0.91 (NPACT + CancerHSP) | AUC-PR |
Notes: Some methods reported in this table may exhibit performance metrics that differ from those in Tables 2/3. This discrepancy arises because the primary purpose of this table is to enable direct performance comparisons between traditional and modern methods on as comparable a baseline as possible. Consequently, we deliberately selected tasks and datasets most suitable for direct comparison, whereas Tables 2/3 aim to showcase each method’s representative or optimal performance.
Figure 4.

This chart quantifies, on a logarithmic scale, the performance improvement multiplier of modern deep learning over traditional methods. Traditional performance is normalized to 1.0, and the multiplier is computed differently depending on whether the metric is "higher is better" (e.g. experimental success rate, Exp. success; accuracy, Acc.) or "lower is better" (e.g. Cα RMSD). Note that these inter-generational comparisons (such as CASP8 vs. CASP14) are valid only within a specific task and should not be compared directly across domains.
Most notably, protein structure prediction has witnessed a four-fold improvement in accuracy (from 4.24 Å with I-TASSER [88] to 0.96 Å with AlphaFold2 [10]), fundamentally changing our ability to understand protein function. Similarly dramatic gains appear in protein design, where success rates have increased from 0.07%/0.14%/0.43% with traditional Rosetta pipelines [89] to 19% with RFdiffusion [69]—a nearly 50-fold improvement. These advances represent qualitative shifts in capability, enabling previously impossible research directions.
Even in domains where improvements appear more modest on a linear scale, the log-scale visualization in Fig. 4 reveals consistent multiplicative gains: 1.37-fold for clinical applications, 1.08-fold for biomedical NLP, and 1.09-fold for drug discovery. These seemingly modest ratios translate to substantial practical benefits—for instance, improving clinical diagnostic performance from 62% to 85% can mean the difference between a screening tool and a clinically deployable system.
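The multipliers above can be reproduced directly from the Table 4 values. The sketch below is illustrative only (the function name is ours); it computes the fold improvement, inverting the ratio for "lower is better" metrics such as Cα RMSD:

```python
def fold_improvement(traditional, modern, higher_is_better=True):
    """Performance multiplier with traditional performance normalized to 1.0."""
    if higher_is_better:
        return modern / traditional      # e.g. accuracy, success rate, AUC
    return traditional / modern          # e.g. Cα RMSD, where lower is better

# Values taken from Table 4
structure = fold_improvement(4.24, 0.96, higher_is_better=False)  # I-TASSER vs AlphaFold2
clinical = fold_improvement(0.62, 0.85)                           # MNN vs Procrustes
nlp = fold_improvement(82.9, 89.7)                                # TaggerOne vs BioBERT

print(f"{structure:.1f}x, {clinical:.2f}x, {nlp:.2f}x")  # 4.4x, 1.37x, 1.08x
```

The clinical (1.37) and NLP (1.08) multipliers match those quoted in the text, confirming the simple ratio interpretation of Fig. 4.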
However, as Table 5 illustrates, these performance gains come with significant computational costs. FMs require 40–80 GB of GPU memory and weeks to months of training time, compared with hours and modest hardware for traditional ML approaches. This trade-off between capability and computational demand shapes the practical deployment considerations we examine next.
Table 5.
Computational requirements and scalability characteristics of AI methodologies in bioinformatics.
| Methodology | Training data requirements | Memory requirements | Training time | Inference speed | Hardware requirements | Scalability |
|---|---|---|---|---|---|---|
| Traditional ML | – samples | 1–16 GB | Minutes–Hours | Milliseconds | CPU sufficient | Limited by features |
| Deep learning | – samples | 8–32 GB | Hours–Days | Seconds | GPU recommended | Good with data size |
| Transformers | – samples | 16–80 GB | Days–Weeks | Seconds | GPU/TPU required | Quadratic with length |
| Reinforcement learning | – episodes | 8–32 GB | Days–Weeks | Variable | GPU recommended | Limited by action space |
| FMs | – samples | 40–80 GB | Weeks–Months | Seconds–Minutes | Multi-GPU/TPU | Linear to quadratic |
| Typical computational costs for representative tasks: |||||||
| Protein structure (AlphaFold2 [10]) | PDB + MSAs | 16 GB (<1300 residues) | 7–11 days | Minutes–Hours | 128 TPU v3 | complexity |
| Single-cell (Geneformer [78]) | cells | 32 GB | 2–3 days | – | 12 V100 GPU | complexity |
| Drug design (Uni-Mol [87]) | molecules | 32 GB | 1–3 days | Milliseconds–Seconds | 8 V100 GPU | complexity |
| Genomics (HyenaDNA [76]) | nucleotides | 40/80 GB | 1–2 h/1 month | Seconds | 1/8 A100 GPU | complexity |
Notes: Memory refers to GPU memory for deep learning methods and RAM for traditional ML. Training times assume standard academic computing environments. Inference speed is the per-sample processing time after model training. The summary rows (Traditional ML through FMs) give empirical recommendation ranges for reference only, not mandatory requirements; actual values vary with implementation and hardware configuration. The configurations and values in the representative-task rows originate from the original papers and reflect recommended settings in the relevant research rather than hard constraints (larger or smaller resource allocations may also be applicable).
In the scalability column, L denotes sequence length.
From prototype to practice: reproducibility, deployment, and governance
The deployment of AI models in production bioinformatics workflows requires addressing practical constraints beyond algorithmic performance. As Table 5 demonstrates, the computational demands vary dramatically across methodologies.
Technical infrastructure
Traditional ML methods can run on CPUs with 1–16 GB of memory, requiring only minutes to hours for training. Deep learning models, however, demand fundamentally different infrastructure: foundational models typically need 16–40 GB of GPU memory, while large-scale models require 40–80 GB and multi-GPU/TPU clusters. Complex dependency management (CUDA versions, deep learning frameworks) further challenges reproducibility. Solutions include containerization, environment locking, and standardized deployment pipelines.
Computational resources
Significant resource disparity exists: traditional ML generally achieves millisecond-level inference and trains on comparatively small sample sets, while foundational models require orders of magnitude more training samples and take seconds to minutes per inference. For example, AlphaFold2 processes a 256-residue protein in just 0.6 min on a V100 GPU, but requires 2.1 h for a 2500-residue protein [10]. Thus, parameter-efficient fine-tuning and model quantization offer practical trade-offs, reducing memory consumption while maintaining performance.
Training timescales
Time investment ranges from minutes for traditional ML to weeks or months for foundational models, and varies across tasks even for the same model. Pretraining a single-cell model such as Geneformer on 12 V100-32 GB GPUs takes about 3 days, while the Uni-Mol drug design framework requires 1–3 days [78, 87]. For genomics models like HyenaDNA, pretraining a small model for short-range tasks on a single A100-40 GB GPU takes only 80 min, whereas pretraining a large model with a context length of 1 million requires 4 weeks [76]. These timescales affect both research velocity and computational budgets.
Data governance and validation
Beyond computational constraints, clinical and genomic data require stringent privacy protection through federated learning and on-premise deployment. Robust benchmarking demands leak-free data splits and multicenter validation, while production systems need continuous monitoring for distribution shifts.
These quantified constraints—from gigabytes to terabytes of data, from CPUs to TPU clusters, from minutes to months of training—fundamentally shape how AI methods transition from research prototypes to clinical deployment.
Artificial intelligence-driven solutions for key bioinformatics problems
This section details the application of AI technologies in solving key bioinformatics problems across various domains. While the main text highlights representative models for each area, a more exhaustive list of methods, including their specific tasks, technologies, and key advancements, is available in Supplementary Table S1.
Biomedical text analysis
In the field of biomedical text mining, researchers typically need to extract and interpret key information from unstructured literature, such as PubMed abstracts and PubMed Central full texts. The main downstream tasks include NER and RE. NER aims to identify concepts such as genes, proteins, diseases, chemicals, and organisms, frequently using datasets like NCBI-Disease and BC5CDR. Meanwhile, RE focuses on discovering interaction relationships between entities (e.g. protein–protein interactions or gene–disease associations) and is often evaluated on data from BioCreative competitions or automatically/semi-automatically annotated PubMed corpora [97–100]. In addition, QA systems based on PubMedQA [101] and HealthSearchQA [102], as well as document classification and summarization tasks supported by BioASQ [77] competitions, play essential roles in supporting clinical diagnosis and academic retrieval. In recent years, bio-text FMs have further improved performance in complex tasks like entity recognition and relation extraction by leveraging fine-tuning on small-scale labeled data.
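As a concrete illustration of the NER task, models such as BioBERT typically emit per-token BIO labels that must then be decoded into entity spans. The helper below is a minimal, model-free sketch of that decoding step (the function name and example sentence are ours):

```python
def bio_to_spans(tokens, tags):
    """Decode per-token BIO tags into (entity text, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):          # a new entity begins
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                      # still inside the current entity
        else:                             # "O" tag or inconsistent "I-" closes the span
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:                 # flush an entity that runs to the end
        spans.append((start, len(tags), label))
    return [(" ".join(tokens[s:e]), lab) for s, e, lab in spans]

tokens = "BRCA1 mutations cause breast cancer".split()
tags = ["B-Gene", "O", "O", "B-Disease", "I-Disease"]
print(bio_to_spans(tokens, tags))  # [('BRCA1', 'Gene'), ('breast cancer', 'Disease')]
```

Evaluation on corpora such as NCBI-Disease and BC5CDR scores exactly these spans, which is why NER results in Tables 3 and 4 are reported as span-level F1.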
Given the challenges posed by domain-specific terminology and the coexistence of multimodal data (including genomic sequencing and medical imaging) in biomedical text, AI innovations often focus on domain-adaptive pretraining and multimodal fusion. On one hand, pretraining and fine-tuning methods for small-scale labeled data have continuously evolved. For instance, BioBERT [61], BioELECTRA [83], and PubMedBERT [62] learn specialized word representations from large-scale biomedical corpora and significantly enhance entity recognition and relation extraction. Generative architectures such as BioBART [84] also exhibit strong text-generation capabilities. On the other hand, Med-PaLM [103] and Med-PaLM2 [104] refine clinical dialogue through large-scale instruction tuning, while multimodal solutions like GMAI [102], ProtST [105], and Med-SA [106] embed text, imaging, and genomic data into a unified framework. Nonetheless, classic CNN and LSTM architectures remain valuable in specialized scenarios, such as mortality prediction from echocardiographic images [107].
At the tool level, traditional ML methods still hold significant importance. For example, Orthogonal Transform-K-Means [108] achieves efficient compression of biomedical images through mathematical transformations, whereas PubMiner [109] integrates SVM with NLP to extract protein–protein interactions. These feature-based pipelines continue to be valuable references for deep learning approaches. Meanwhile, deep models undergo rapid iteration for a variety of subtasks: not only do BioBERT [61], BioELECTRA [83], and PubMedBERT [62] excel in entity recognition, but BioBART [84], Med-PaLM [103], and Med-PaLM2 [104] advance text generation and clinical conversation tasks.
In addition, deep reinforcement learning has gradually emerged in biomedical data analysis. BN-GNN [41] adopts Double DQN to adaptively select optimal graph neural network layers for brain network analysis, and MedDQN [42] alleviates class imbalance in biomedical image classification by designing reward functions. Such strategies have also been explored in biomedical image classification [110] and named entity recognition [43].
Proteomics analysis
Current protein sequence analysis has expanded beyond straightforward alignment and homology detection to encompass diverse downstream tasks and integrative methodologies. Databases such as NCBI [111] and UniProt [112] provide annotated resources for predicting variant or mutation functional effects (e.g. using dbSNP, ClinVar, COSMIC, and deep mutational scanning experiments) on protein function [68]. Proteomics-focused predictions include post-translational modifications, signal peptides, subcellular localization, and mutation-induced stability changes, frequently validated with UniProtKB [113] or Swiss-Prot [114] and related experimental datasets [70, 71]. Novel protein sequence generation bridges computational and synthetic biology by designing amino acid sequences with specified functions or properties, as illustrated by protein language models in artificial protein design [98]. Beyond the sequence level, structural prediction tasks target protein secondary structures (e.g. α-helices and β-sheets) and full 3D conformations, leveraging references such as ProteinNet [80] and experimental data from the Protein Data Bank (PDB) [115], along with predicted structure repositories like AlphaFold DB and ESM Metagenomic Atlas [116]. Community challenges such as CASP [117] foster benchmarking for protein folding accuracy, while interface and interaction modeling illuminates binding regions between proteins, ligands, and other macromolecules, driving drug discovery and mechanistic insight. Furthermore, functional annotation tasks, covering Gene Ontology (GO), enzyme commission (EC) numbers, and subcellular localization, frequently draw on GOA and UniProt-GOA [72], whereas disease association predictions build on OMIM and DisGeNET for identifying potential therapeutic targets and interpreting phenotypic consequences. These versatile datasets and tasks enable deep learning models to capture intricate macromolecular topologies and structure–function relationships, highlighting the need for robust, domain-adaptive FMs capable of translating sequence-level patterns into high-precision function and structure predictions [64].
AI research in protein sciences focuses on domain-adaptive learning and multimodal strategies to tackle complex terminologies and structures. Transformers integrate sequence and structure via attention mechanisms and extensive pretraining, while diffusion models like denoised diffusion probabilistic models (DDPMs) refine 3D protein folding. Reinforcement learning frames protein design and folding as sequential decision-making, using feedback from docking scores and structural constraints to optimize predictions iteratively.
A wide array of tools and pipelines exemplify these innovations. Traditional ML algorithms have long played a pivotal role in predicting protein functions and analyzing protein interactions. For instance, the Ratiometric 3D DNA ML method [118] integrates KNN, SVM, and RF to effectively screen and stage early urinary diseases, leveraging fluorescence signals from target protein markers on urinary exosomes. PCPIP [119] applies SVM to multiple structural properties of protein–protein complex interfaces, enabling accurate distinction between real (native) and nonreal (nonnative) interfaces. In addition, PPH-ML [120] fuses supervised learning, active learning, and Bayesian optimization within an automated robotic platform to expedite the development of protein-stabilizing polymers while minimizing experiments in a broad design space.
Deep learning methods, especially Transformer-based architectures, have ushered in a new era of protein research. ProteinBERT [72] learns local and global sequence representations through tasks like gene ontology annotation predictions. ProtGPT2 [70] models the “language” of proteins to generate novel sequences, while ProGen [121] and ProGen2 [71] train on vast genome and macro-genome datasets, producing diverse functional proteins without requiring additional fine-tuning. ESM-1b [14] and ESM3 [65] employ large-scale unsupervised learning, with ESM3 integrating sequence, structure, and functional information via discrete structural tags to yield low-homology yet functionally active proteins. AlphaFold2 [10], AlphaFold3 [67], and Protenix [122] achieve near-atomic resolution for monomers and complexes by combining SE(3)-equivariant Transformers, attention mechanisms, and unified diffusion modeling. Meanwhile, RFdiffusion [69] leverages a DDPM to support monomer generation, symmetric oligomer design, and functional motif backbone construction. Other approaches, including RGN2 (merging Recurrent Geometric Networks (RGN) [68] and a Transformer-based module), Uni-Mol [87], xTrimoPGLM [73], CLAPE-DB [123], ProtST [105], Bingo [124], and idpGAN [125], highlight the breadth of DL’s capability—ranging from backbone generation to multimodal analysis and conformational sampling.
RL likewise advances protein research, particularly in design, structure prediction, and complex assembly. LSTM-DQN [126] and FoldingZero [49] apply RL to a 2D HP folding model using DQN and MCTS (with CNNs). Drlcomplex [50] employs DQN to assemble complexes, guided by predicted interstrand contact data. Additional RL methods include ESM-PF [127], LatProtRL [128], Protein Sequence-based RL [129], SQL [130], and A2CGAT [131], which leverage algorithms like PPO, Policy Gradient, Q-learning, and A2C to optimize or generate sequences using feedback from protein language models or docking scores. RL-DIF [132] refines diffusion models for back-folding consistency; MSAGPT [133] avoids multiple sequence alignments through AlphaFold2 feedback; GAPN [134] employs Policy Gradient for assembly order; Protein Backbone Design RL [135] integrates multi-objective constraints and MCTS for backbone geometry. PROTAC Design RL [136] reconstructs PROTAC molecules through multi-objective graph transformations; and Molformer [137] investigates RL at the protein sequence level.
Genomics analysis
DNA and genomics analysis has evolved beyond traditional sequence alignment and homology detection, progressively expanding to multifaceted tasks such as identification of functional elements and novel sequence design. Generic databases like NCBI [111] provide rich training and validation data for predicting genomic functional elements, including promoters, enhancers, TFBS [138], and splicing sites, often in conjunction with specialized projects such as ENCODE, JASPAR, and EPDnew [63, 64, 100]. Predictions on the impact of variants or functional mutations can also be comprehensively assessed using databases and deep mutational scanning experimental data from dbSNP, ClinVar, COSMIC, and others [68]. In the area of new sequence generation, some studies attempt to customize DNA sequences based on specific functional requirements, advancing the intersection of synthetic biology and computational biology [98]. Multi-modal integration and downstream clinical applications, such as tumor typing and personalized diagnostics, are also emerging, utilizing omics data, pathological imaging, and clinical texts to enhance the accuracy of disease prognosis and therapeutic decisions [64].
To address the complexities of DNA sequences, including high dimensionality, long-range dependencies, and epigenetic modifications, emerging AI technologies have made significant breakthroughs in multimodal fusion and domain adaptation. The Transformer architecture, through its self-attention mechanism, effectively models long-distance sequence interactions, as demonstrated by DNABERT [63, 74], Enformer [64], HyenaDNA [76], and the P-E-theorem–based Genotype-Fitness Transformer [139], which are tailored for the remote dependence characteristics of DNA. Domain-specific adaptive pretraining strategies, such as byte-pair encoding or cyclical learning rates, can significantly enhance model transferability to small-scale annotated data. Besides the sequences themselves, multimodal modeling of DNA–protein interactions, chromatin architecture, and epigenetic modifications is increasingly emphasized. In this process, the combination of deep learning frameworks and RL strategies can adaptively optimize models guided by feedback signals such as binding energy, expression levels, or structural prediction accuracy, achieving more efficient end-to-end predictive performance.
At the traditional ML level, OncoNPC [33] integrates large-scale sequencing panels with germline PRS and models them using XGBoost to effectively classify CUP origin and predict treatment responses. ELSA-seq [34] uses methylation sequencing and optimized SVM to detect cancer characteristics at very low dilutions, significantly improving early lung cancer detection rates. AI-Nanopore [140] interprets quantum tunneling signals from nanopore sequencing using XGBoost Regression (XGBR) and Support Vector Classification, achieving high-accuracy base identification without relying on extensive feature engineering. Multispectral 3D DNA ML [141] combines multispectral 3D DNA detection with multimodal ML methods (also using SVM), significantly enhancing the accuracy of noninvasive diagnostics for urological diseases. In the realm of DL, DNABERT and DNABERT-2 [63, 74] utilize bidirectional Transformer encoders to accurately predict functional elements (promoters, TFBS, splicing sites, etc.), with sub-byte encoding/decoding and long-sequence optimization techniques improving the modeling efficiency for very long DNA sequences. Enformer [64] optimizes gene expression and enhancer–promoter interaction predictions by extending the network’s receptive field, and HyenaDNA [76] employs a strategy combining MLP and CNN modules to systematically depict long-range genomic dependencies. In addition, Nucleotide Transformer [75] and the Evo series methods [142, 143] emphasize cross-scale multimodal fusion, integrating molecular-level and genomic-level information into a unified predictive framework. AlphaFold3 [67] also shows potential in DNA–protein homology modeling, while 2DNA [144] introduces new ideas for “rewritable” DNA information storage by combining CNN and GAN technologies.
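To make the tokenization step concrete: DNABERT represents a DNA sequence as overlapping k-mers before feeding it to the Transformer (DNABERT-2 instead adopts byte-pair encoding). A minimal sketch of that preprocessing, with illustrative parameter defaults:

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA string into overlapping k-mer tokens (DNABERT-style)."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCGT", k=3))  # ['ATG', 'TGC', 'GCG', 'CGT']
```

Overlapping k-mers let adjacent tokens share context, at the cost of a longer token sequence—one reason long-range models such as HyenaDNA operate on single nucleotides instead.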
Transcriptome analysis
RNA and transcriptomics analysis has evolved from early sequence alignment and homology search to encompass tasks such as functional element identification, structural prediction, and integration of high-dimensional data. General databases such as Rfam [145] and RNAcentral [146] provide extensive annotated information for identifying noncoding RNAs like miRNA, rRNA, and tRNA. RNA secondary structure prediction can leverage benchmarks such as RNA STRAND, bpRNA, and ProteinNet [80], assessing the accuracy of local structures like stems, loops, and base pairings. For 3D structural prediction, resources like AlphaFold DB and ESM Metagenomic Atlas [116], together with community competitions such as RNA-Puzzles [147] and RNAsolo [148], are used to compare algorithms beyond using experimental structures from the PDB [115]. Deeper functional annotations such as modifications, dynamic splicing, and expression levels under various biological conditions form a critical part of transcriptomic analysis [98]. Multimodal integration further links transcriptomic data with epigenomic and imaging data to predict patient treatment sensitivity and disease prognosis [64].
Faced with challenges such as the high-throughput, heterogeneity, and long sequence dependencies of RNA data, AI technology innovations are primarily reflected in domain-adaptive pretraining and multimodal fusion. On one hand, self-supervised or semi-supervised strategies enable models to learn effective representations from vast amounts of unlabeled RNA sequence data, thereby improving generalization on small-scale annotated data; models like RNA-FM [80], UNI-RNA [149], and RNA-MSM [150] utilize self-attention mechanisms to capture evolutionary relationships and local structural features. On the other hand, multimodal approaches integrate RNA sequences, secondary structures, epigenetic modifications, and even genome-level information: for instance, EMRNA [151] combines CNNs and Transformers to hierarchically model sequences and structures, accurately predicting RNA’s 3D atomic structures. Moreover, deep reinforcement learning is increasingly applied to RNA design, iteratively optimizing (e.g. binding energy, functional scores, or homology constraints) to generate RNA molecules with specific secondary structures or sequence features; methods like RNAinverse [152], LEARNA [48], SAMFEO [153], m2dRNAs [154], and libLEARNA [155] have been validated in various scenarios, while Ribodiffusion [156], RhoDesign [157], and RNAFlow [158] introduce diffusion models and GANs to offer new approaches for coordinated RNA structure-sequence design.
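The F1 scores used to benchmark secondary-structure predictors such as RNA-FM compare predicted and reference base-pair sets. A minimal sketch of that evaluation (treating pairs as unordered, the usual convention; the function name is ours):

```python
def base_pair_f1(predicted, reference):
    """F1 between predicted and reference base-pair sets for one RNA."""
    pred = {frozenset(p) for p in predicted}   # (i, j) and (j, i) count as the same pair
    ref = {frozenset(p) for p in reference}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                       # correctly predicted pairs
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# pairs are (i, j) indices of paired nucleotides
print(base_pair_f1([(1, 10), (2, 9), (3, 8)], [(1, 10), (2, 9), (4, 7)]))  # ≈ 0.667
```

Benchmark numbers such as the ArchiveII600 F1 in Table 4 are averages of this per-structure score over the test set.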
In processing transcriptomics data, batch effects and technical variations introduced by different sequencing platforms often result in inconsistencies. The Procrustes model [35] aligns gene expression matrices from multiplatform RNA-seq using supervised learning, significantly reducing technical noise in subsequent analyses while effectively preserving genuine biological differences. Models built on the deep learning paradigm, such as AlphaFold3 [67], RhoFold [82], Protenix [122], Evo [142], and Evo2 [143], not only focus on protein structure prediction but also incorporate considerations of RNA 3D structure and protein interactions, further enriching the multimodal ecosystem of biomolecules.
Single-cell analysis
Single-cell sequencing technologies encompass high-dimensional omics layers, such as scRNA-seq for transcriptome expression profiling, scATAC-seq for chromatin accessibility analysis, CITE-seq for simultaneous capture of protein and transcription data, and spatial transcriptomics for cellular localization within tissue structures. These data can be accessed through databases like the Human Cell Atlas, GEO, and ArrayExpress, providing reference and validation sets for subsequent analysis [102]. Typical tasks include: (i) Cell-type identification and clustering: automating the classification of known cell types or discovering new cell subpopulations using annotated markers and genes. (ii) Trajectory inference: reconstructing the dynamic evolutionary trajectories of cells during development or drug treatment and assessing the functional impact of differential genes. (iii) Cell–cell communication and environmental response modeling: Exploring how external stimuli (e.g. drugs and immune signals) regulate transcriptomic changes and inferring cellular signaling networks [77]. (iv) Batch effect correction and multimodal integration: correcting experimental biases in data generated on heterogeneous platforms or at different times, and integrating various omics measurements like transcriptomics and epigenomics to achieve multilayered characterizations at the single-cell level.
AI technology utilizes adaptive pretraining and multimodal fusion to handle the complexity of single-cell data. Transformer models, effective at processing large-scale unlabeled data, learn and refine features for enhanced task performance. Generative pretraining supports diverse downstream predictions by modeling transcriptomic distributions. Additionally, deep networks optimize the integration of various omics data, enabling detailed and comprehensive biological analysis.
Traditional methods commonly use clustering algorithms and differential analysis in single-cell data, but the introduction of deep learning has brought significant transformations for multi-omics integration and downstream task prediction. Geneformer [78] pioneered the use of a context-aware strategy based on Transformers for large-scale pretraining on single-cell data, achieving good transfer performance on various downstream tasks. scGPT [77] further extends this approach by generatively pretraining on millions of single-cell samples, possessing capabilities for batch and multi-omics integration, cell-type annotation, and gene perturbation prediction. scFoundation [79] pretrained on over 50 million single cells, demonstrating high accuracy and robustness in drug response prediction and transcriptional regulation analysis. For unannotated or sparsely annotated scenarios, scBERT [66] uses self-supervised learning to significantly mitigate batch effects and enhance model generalizability. Moreover, scHyena [159] integrates linear adaptation layers and bidirectional Hyena operations in the backbone network to represent full-length scRNA-seq data effectively, retaining the original data structure and building a flexible framework for batch integration and downstream multitask prediction.
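To illustrate how such models consume expression profiles, Geneformer's rank-value encoding replaces absolute expression values with a per-cell gene ranking. The sketch below is simplified (it omits Geneformer's normalization by gene-wise non-zero medians; gene symbols and values are illustrative):

```python
import numpy as np

def rank_value_encode(expression, gene_names):
    """Order genes by descending expression, dropping unexpressed genes."""
    order = np.argsort(-expression, kind="stable")
    return [gene_names[i] for i in order if expression[i] > 0]

# one toy cell: expression values for four genes
expr = np.array([5.0, 0.0, 2.0, 7.0])
genes = ["TP53", "ACTB", "GAPDH", "MYC"]
print(rank_value_encode(expr, genes))  # ['MYC', 'TP53', 'GAPDH']
```

The resulting gene sequence, rather than a raw count vector, is what the Transformer attends over—one way these models sidestep platform-specific count distributions.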
Pharmaceutical analysis
Drug design and discovery use diverse data from proteins, genomes, and compounds to identify drug targets and lead compounds, enhanced by structural biology and multi-omics. This approach leverages protein–ligand interactions, pharmacokinetic and pharmacodynamic data, and molecular graphs to develop accurate models for tasks such as lead compound screening, activity/toxicity prediction (QSAR/QSPR), molecular generation (de novo design), and multitarget optimization, focusing on efficacy, selectivity, and safety.
AI in drug design has progressed from simple feature engineering to incorporating deep neural networks and reinforcement learning, employing multimodal strategies. Technologies like Graph Convolutional Networks and Transformers extract and integrate molecular and protein structural data for complex interaction analysis. Domain-adaptive pretraining, GANs, and molecular dynamics simulations support the generation process, balancing factors like chemical synthesizability, pharmacokinetics, and clinical viability.
Although reinforcement learning dominates in de novo drug generation, recent studies have shown that well-designed neural network architectures also demonstrate significant prospects in drug discovery tasks, especially in drug reuse and natural product screening. Many recent works have well reflected this trend. For instance, DFT_ANPD [86] employs a dual-feature two-sided attention network that fuses complementary molecular descriptors and fingerprints for anticancer natural product detection, achieving robust classification performance. Similarly, DeepDRA [85] addresses drug repurposing challenges through an autoencoder-based framework that integrates multi-omics data, demonstrating how denoising and stacked encoders can bridge heterogeneous biochemical evidence to generate actionable therapeutic predictions.
De novo drug design is a core application of RL in pharmacology, optimizing generators through various RL algorithms (e.g. Policy Gradient, PPO, A2C, and DQN) to align generated molecular structures with specific design requirements. For instance, DrugEx v2 [53] incorporates evolutionary strategies to improve molecular generation, and DrugEx v3 [54] introduces backbone constraints to enhance molecular feasibility and diversity. MACDA [160] employs a multi-agent system to address multi-objective optimization challenges in drug design; another approach [55] leverages patient transcriptome data to provide deep feedback for personalized medication. DRlinker [56] focuses on optimizing linkages between molecular fragments, while DeepFMPO [57] combines fragment replacement with the A2C strategy, enhancing multiparameter optimization efficiency.
Subsequent studies introduced more complex graph generation models, such as ACEGEN [161] using various RL algorithms for multi-objective optimization of molecules, and 3D-MCTS [162] combining real-time energy functions with Monte Carlo tree search, integrating domain knowledge to simplify the drug design process. The cutting-edge Quantum-inspired RL [163] simulates quantum annealing mechanisms, providing new insights and breakthroughs for molecular generation and optimization.
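The recipe these systems share, a generator updated by policy gradients against a reward, can be sketched in a few lines. Everything below is a hypothetical stand-in (a position-wise softmax policy and a toy "aromatic carbon fraction" reward), not any cited method; a real system would fine-tune a pretrained SMILES or graph generator against a docking, QSAR, or property score.

```python
import numpy as np

# Toy REINFORCE loop illustrating how policy-gradient RL steers a sequence
# generator toward a reward, as in de novo molecular design.

VOCAB = ["C", "N", "O", "c", "=", "(", ")"]
SEQ_LEN = 8
rng = np.random.default_rng(0)
logits = np.zeros((SEQ_LEN, len(VOCAB)))   # position-wise policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_sequence():
    """Sample one token index per position from the current policy."""
    return [rng.choice(len(VOCAB), p=softmax(logits[t])) for t in range(SEQ_LEN)]

def reward(tokens):
    """Hypothetical objective: fraction of aromatic-carbon tokens 'c'."""
    return sum(VOCAB[k] == "c" for k in tokens) / len(tokens)

LR, baseline = 0.5, 0.0
for _ in range(1000):
    tokens = sample_sequence()
    advantage = reward(tokens) - baseline          # variance-reducing baseline
    baseline = 0.9 * baseline + 0.1 * reward(tokens)
    for t, k in enumerate(tokens):
        grad = -softmax(logits[t])                 # d log pi / d logits
        grad[k] += 1.0
        logits[t] += LR * advantage * grad         # REINFORCE ascent step

mean_r = np.mean([reward(sample_sequence()) for _ in range(200)])
print(f"mean reward after training: {mean_r:.2f}")
```

PPO, A2C, and DQN replace this plain REINFORCE update with clipped, critic-assisted, or value-based updates, but the generator-reward loop is the same.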
Immunomics analysis
Immunomics research studies immune system features such as the tumor immune microenvironment (TME), immune cell receptors [e.g. T cell receptor (TCR), B cell receptor (BCR)], and antigens like neoantigens and HLA status. Responses to immune checkpoint blockade (ICB) vary markedly across tumor types and individuals [164, 165]. Core tasks include: (i) Patient typing and efficacy prediction: identifying predictive biomarkers like tumor mutational burden (TMB) [166–168], microsatellite instability (MSI) [169, 170], and neoantigens [171] using multi-omics data. (ii) Immune microenvironment characterization: analyzing TME diversity through immune cell composition [172], functional states [173], TCR diversity [174], and interaction networks [175]. (iii) Neoantigen screening and vaccine design: predicting neoantigen immunogenicity for personalized therapies.
Integration of multimodal data is vital given the complexity of immune and tumor biology. DL and LLMs [176] effectively capture correlations in such data through self-attention and context-aware modeling [177, 178]. Domain adaptation ensures model consistency across different settings, facilitating analysis of data from sources like TCGA and GEO; this often involves fine-tuning with limited labeled data or semi-supervised pretraining to address data scarcity in clinical environments.
In predicting ICB response, traditional ML methods [179–181] use multivariate regression or classifiers based on features like TMB and MSI. Advanced DL and multimodal integration can further explore diverse biological characteristics, including radiomics, transcriptomics, and exosomes [58]. Many studies integrate immune cell interaction networks and external immune signals, with models incorporating single-cell sequencing data capturing dynamic changes in the TME [173, 174]. These DL models excel at processing high-dimensional data (e.g. RNA-seq and CITE-seq), assisting in neoantigen screening, and designing personalized vaccines and cell therapies. With the rise of LLMs and multimodal learning, some research attempts to integrate patient textual records, radiographic imaging, and omics features into a unified AI framework for more precise efficacy predictions and side-effect evaluations.
Antibody drug development
Antibody drug development focuses on identifying antibodies with high affinity and specificity for specific antigens. BCRs exhibit greater diversity than TCRs due to somatic hypermutation and class switching, enhancing their ability to recognize complex epitopes [182–184]. Immune receptor diversity, affected by mechanisms like V(D)J recombination and N-nucleotide insertion, correlates with immune responses to self-antigens [185] and pathogens [186]. Technologies like bulk and single-cell VDJ sequencing provide valuable data for identifying potential neutralizing antibodies [187–193], supported by databases such as IMGT [194–198], IEDB [199], and SabDab [200].
Traditional antibody development, often dependent on animal immunization and in vitro display [201], is transitioning to computational methods like molecular dynamics and Rosetta for virtual screening and structural optimization [202, 203]. Deep learning models such as AlphaFold [10] and RoseTTAFold [69] have revolutionized 3D antibody structure prediction, supporting advanced sequence design [204, 205]. Multimodal fusion integrates diverse data like antigen structures and antibody sequences into a predictive model, while domain adaptation ensures model performance across different biological systems, enhancing the transferability of findings.
In antibody drug design, computational methods facilitate everything from initial antibody scaffold generation to inverse folding, i.e. predicting amino acid sequences that fold into a designed backbone. For example, Baker’s lab proposed a two-step process starting with protein scaffold generation using diffusion models like RFDiffusion [69], followed by inverse-folding prediction of amino acid sequences using ProteinMPNN [206]. Subsequently, evolved RFDiffusion combined with yeast display technology has successfully designed nanobodies targeting specific antigenic epitopes [207]. Updated versions of RFDiffusion have even designed more complete single-chain variable fragments (scFv) with excellent performance against targets like influenza virus hemagglutinin and toxins produced by Clostridium difficile [208]. In addition, LLMs such as ESM2 [209], ProGen [121], and ProtGPT2 [70], which encode rich protein evolutionary information, help enhance antibody affinity and thermal stability through iterative fine-tuning, even in the absence of specific antigen and structural information [210]. The latest ESM3 [65] strives to integrate sequence, structure, and functional multimodal information into a unified framework, revealing deeper sequence–function relationships to improve antibody designability and stability [211].
Translational applications and real-world impact
To anchor methods in practice, we highlight representative translational examples from our survey: (i) Oncology diagnostics: panel- and cfDNA-based AI systems (e.g. OncoNPC and ELSA-seq) demonstrate robust performance for cancer-of-unknown-primary typing and early lung cancer detection under predefined operating points, illustrating how classical ML and deep models complement each other in real-world screening workflows. (ii) Cardiac imaging: deep learning on echocardiographic videos forecasts near-term outcomes when combined with electronic health record (EHR) features, pointing to pragmatic multimodal pipelines in routine cardiology practice. (iii) Protein/antibody design: structure-prediction and diffusion-based backbones accelerate candidate triage prior to wet-lab screening, reducing iteration cycles.
Challenges
The primary challenges and future opportunities in applying AI to bioinformatics are summarized in Fig. 5.
Figure 5. Challenges stem from complexities in data handling such as noise, long sequences, multimodal integration, interpretability, and privacy, while opportunities lie in leveraging high-throughput data, improving generalization, understanding biological functions, innovating drug discovery, and enabling personalized medicine and precision diagnostics.
Data noise and sparsity
Biological experimental data are often noisy and incomplete. In single-cell sequencing, for instance, technical errors and varying depths lead to sparse, noisy data [77, 78]. In addition, batch effects from different platforms or batches can obstruct genuine biological signal extraction [66]. Training on such datasets can cause overfitting and gradient instability. Although foundational models using self-supervised pretraining and data integration have made progress, further advancements in denoising, correction, and normalization are needed [63, 74]. Promising approaches include developing tailored denoising and anomaly detection algorithms, integrating deep phenotypic data for better generalization, and creating efficient representation learning techniques for sparse data, such as sparse attention or GNN-based methods.
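As a concrete illustration of the denoising idea, the sketch below imputes dropout zeros in simulated single-cell counts from each cell's k nearest neighbours. The archetype simulation, 40% dropout rate, and k=10 are illustrative assumptions in the spirit of kNN/diffusion smoothing methods, not any cited algorithm.

```python
import numpy as np

# Neighbour-based imputation for sparse single-cell counts: simulate a few
# cell-type archetypes, corrupt them with random dropout, then fill zeros
# with the nonzero neighbour mean.

rng = np.random.default_rng(1)
n_cells, n_genes, n_types, k = 200, 50, 4, 10

archetypes = rng.gamma(2.0, 1.0, size=(n_types, n_genes))
labels = rng.integers(0, n_types, size=n_cells)
noise = 0.1 * rng.normal(size=(n_cells, n_genes))
true_expr = np.clip(archetypes[labels] + noise, 0, None)
dropout = rng.random((n_cells, n_genes)) < 0.4        # 40% technical zeros
observed = np.where(dropout, 0.0, true_expr)

# 1. kNN graph built from the noisy profiles themselves
d2 = ((observed[:, None, :] - observed[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
knn = np.argsort(d2, axis=1)[:, :k]

# 2. Replace each zero with the mean of the neighbours' nonzero values
denoised = observed.copy()
for i in range(n_cells):
    neigh = observed[knn[i]]                          # (k, n_genes)
    counts = (neigh > 0).sum(0)
    means = neigh.sum(0) / np.maximum(counts, 1)      # neighbour nonzero mean
    zeros = observed[i] == 0
    denoised[i, zeros] = means[zeros]

err_raw = np.abs(observed - true_expr).mean()
err_den = np.abs(denoised - true_expr).mean()
print(f"mean abs error: raw={err_raw:.3f}  denoised={err_den:.3f}")
```

The same borrow-from-neighbours principle underlies graph- and attention-based denoisers; they learn the neighbourhood weights instead of fixing them with Euclidean kNN.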
Long sequences and multiscale representations
Biological sequences such as the human genome (~3 Gbp) are extremely long, which is computationally challenging for attention-based models due to their high cost and memory demands [76]. Traditional truncation methods may lose long-range context or cause gradient issues. Possible solutions include: (i) chunk-based training with hierarchical attention, splitting sequences into chunks and using cross-chunk attention to maintain context [64, 98]. (ii) Employing local sparse or linear attention to handle ultra-long sequences with reduced computational complexity [76]. (iii) Extracting features from key biological regions [e.g. transcription factor binding sites (TFBS)] using attention masks or focused methods for accurate modeling.
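The sparse-attention idea can be illustrated with a minimal windowed-attention sketch, where each position attends only to w neighbours on each side so cost grows as O(Lw) rather than O(L²). The shapes and window size below are illustrative, not taken from any cited model.

```python
import numpy as np

# Local (windowed) self-attention: each position attends to a window of
# neighbours only, reducing the quadratic cost of full attention.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(q, k, v, window=64):
    """Attention restricted to +/- `window` positions around each query."""
    L, d = q.shape
    out = np.empty_like(v)
    for i in range(L):
        lo, hi = max(0, i - window), min(L, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # only local scores computed
        out[i] = softmax(scores) @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
L, d = 1024, 32                                   # a "long" toy sequence
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))
y = local_attention(q, k, v, window=64)
print(y.shape)  # (1024, 32)
```

Setting `window >= L` recovers full attention exactly, which is a convenient correctness check; hierarchical schemes then add sparse cross-chunk links on top of such local windows to restore long-range context.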
Multimodal integration and data heterogeneity
Biological data come from various sources like DNA/RNA, protein structures, and health records [77, 102]. Integrating these multimodal data to capture inter-modal information faces challenges like modal misalignment and data heterogeneity, where different data types exhibit unique temporal or spatial characteristics [64, 66]. Another challenge is developing cross-modal representation spaces that reduce errors among modalities like images and sequences [105]. In addition, large-scale pretraining on unlabeled multimodal data imposes substantial computational and storage demands.
To address these challenges, emerging strategies include lightweight models and modality-specific adapters that fuse multimodal data efficiently while reducing dependence on large computational resources.
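One way to picture modality-specific adapters is the sketch below: each modality gets its own small projection into a shared latent space before fusion. The dimensions, random linear adapters, and mean-pooling fusion rule are illustrative assumptions, not any cited architecture.

```python
import numpy as np

# Modality-specific adapters: per-modality projections into one shared latent
# space, fused by mean pooling so missing modalities are tolerated.

rng = np.random.default_rng(0)
D_SHARED = 16

class Adapter:
    """Per-modality linear projection into the shared latent space."""
    def __init__(self, d_in, d_out=D_SHARED):
        self.W = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
    def __call__(self, x):
        return x @ self.W

adapters = {"rna": Adapter(100), "methylation": Adapter(300)}

def fuse(sample):
    """Project each available modality and mean-pool the latent vectors."""
    zs = [adapters[m](x) for m, x in sample.items()]
    return np.mean(zs, axis=0)

cell = {"rna": rng.normal(size=100), "methylation": rng.normal(size=300)}
z = fuse(cell)
print(z.shape)   # one shared representation, whatever the modality mix
```

In practice only the small adapters are trained while a large shared backbone stays frozen, which is what keeps the computational cost of multimodal fusion manageable.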
Interpretability and reproducibility of results
In biological research and clinical settings, models need to explain causal relationships and biological mechanisms, not just provide predictive accuracy [10, 68]. Deep learning models often lack interpretability, affecting their trustworthiness and reproducibility. Effective strategies include: (i) Using attention visualization and gene feature weighting to highlight significant genomic areas or networks influencing predictions [66]. (ii) Integrating biological knowledge bases like pathways and gene ontology to improve model interpretability [72]. (iii) Enhancing reproducibility and portability by providing open access to resources and ensuring validation across diverse datasets [98]. To facilitate this, we have compiled the availability and links to the code or models for many of the methods mentioned in this review in Supplementary Table S2.
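As a simple instance of gene feature weighting, the sketch below ranks input features by permutation importance, i.e. the accuracy drop when one feature's values are shuffled. The linear "model" and simulated data are hypothetical stand-ins for a trained predictor.

```python
import numpy as np

# Permutation-based feature weighting: shuffle one feature at a time and
# measure how much the model's accuracy degrades.

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[3] = 2.0                                # only feature 3 carries signal
y = (X @ true_w + 0.1 * rng.normal(size=n)) > 0

def model(X):                                  # stand-in "trained" classifier
    return (X @ true_w) > 0

def accuracy(pred, y):
    return (pred == y).mean()

base = accuracy(model(X), y)
importance = np.empty(p)
for j in range(p):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])       # destroy feature j's signal
    importance[j] = base - accuracy(model(Xp), y)

print("most important feature:", int(np.argmax(importance)))
```

The same recipe applies to genes in an expression matrix; attention-weight visualization serves an analogous purpose inside Transformer models, but permutation scores have the advantage of being model-agnostic.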
Ethical and privacy risks
Biomedical data, including patient genomic and phenotypic information, pose significant privacy and ethical risks, and pretraining with data from various institutions amplifies them [212]. Developing privacy-preserving technologies like differential privacy and federated learning, along with establishing ethical review procedures and compliant data governance frameworks, is crucial for responsible AI deployment in bioinformatics.
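The federated idea can be sketched as a FedAvg-style loop in which each institution computes a clipped, noised update on its private data and only model parameters travel to the server. The 1-D linear model and the clip/noise constants are illustrative assumptions, not calibrated differential-privacy parameters.

```python
import numpy as np

# Toy federated averaging with gradient clipping and Gaussian noise: raw
# patient data never leave the five simulated "hospitals".

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, clip=1.0, noise=0.01):
    grad = 2 * X.T @ (X @ w - y) / len(y)                 # local MSE gradient
    grad = grad / max(1.0, np.linalg.norm(grad) / clip)   # clip its norm
    grad = grad + noise * rng.normal(size=grad.shape)     # add noise
    return w - lr * grad

true_w = np.array([1.5, -0.5])
clients = []
for _ in range(5):                             # private dataset per hospital
    X = rng.normal(size=(100, 2))
    y = X @ true_w + 0.05 * rng.normal(size=100)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(200):
    # each client updates locally; the server only averages parameters
    w = np.mean([local_update(w, X, y) for X, y in clients], axis=0)

print(w)   # approaches true_w = [1.5, -0.5]
```

Formal differential-privacy guarantees additionally require calibrating the noise to the clipping bound and accounting for privacy loss over rounds; this sketch only shows where those mechanisms plug in.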
Opportunities
Comprehensive utilization of high-throughput biological data
Massive genomic datasets enhance training capabilities
FMs leverage deep learning for self-supervised training directly from genomic sequences and structures, improving understanding of gene regulation and protein functions [63]. Large genomic and single-cell transcriptomic databases provide extensive resources for developing advanced pretrained models [77, 213].
Advances in experimental technologies enrich data diversity
New technologies like spatial transcriptomics and single-cell multi-omics expand the diversity and complexity of bioinformatics data [64, 66]. Integrating these multimodal data sources enhances the understanding of biological processes across various dimensions.
Generalization and transferability across biological domains
FMs excel in learning generalizable biomolecular representations, enhancing performance across various species, genes, and conditions [72]. Examples include: (i) Models trained on RNA or protein data quickly adapt to tasks like RNA structure prediction or protein interactions with minimal fine-tuning [80]. (ii) By using genomic data from multiple species, models identify conserved evolutionary features, aiding in the discovery of conserved regions across species [64, 76].
Understanding biological functions and facilitating innovative drug discovery
Innovative drug design and protein engineering: AI streamlines drug design by screening compounds, predicting drug–protein interactions, and aiding in protein design and evolution [71]. It understands protein structure–function relationships and can generate new protein sequences [70, 71, 214].
Personalized medicine and precision diagnostics
Multi-omics integration for precision medicine: by integrating data from genomics, transcriptomics, proteomics, and metagenomics, FMs can precisely characterize disease mechanisms and inform personalized treatments. These models efficiently identify biomarkers and classify disease subtypes after learning generalized omics relationships [77, 79].
Digital healthcare and clinical decision support: FMs enhance digital healthcare by integrating EHRs, medical imaging, and disease knowledge bases, providing diagnostic and prognostic guidance, and aiding in personalized treatment decisions [102]. They also support conversational systems for interactive diagnostic and therapeutic processes.
Building ecosystems and promoting open collaboration for bioinformatics foundation models
Shared platforms and data resources
FMs depend on extensive, high-quality biological datasets. Promoting international platforms and integrated data lakes for training bioinformatics models can streamline efforts and hasten advancements, leveraging the biology community’s tradition of open data sharing [98].
Interdisciplinary collaboration and community-driven initiatives
AI in bioinformatics requires cooperation among biologists, computer scientists, and medical experts [64]. Large collaborative projects and specialized competitions, like protein structure prediction challenges, can enhance methodological development and expand the application of bioinformatics models.
Conclusion
This paper has systematically explored the integration of AI within bioinformatics, underlining the transformative influence of ML, deep learning, and reinforcement learning on the field. The burgeoning volume and complexity of biological data, spurred by advances in high-throughput sequencing and multi-omics technologies, present formidable challenges in data analysis and interpretation, which AI technologies are progressively overcoming.
Deep learning excels in tasks that require sequence prediction and structural modeling, notably through attention-based models like AlphaFold and ESM. Reinforcement learning is crucial for optimizing decision-making processes in protein engineering and drug discovery.
Challenges such as data sparsity, noise handling, and multimodal data integration remain. Future efforts should improve long-range dependency modeling, enhance interpretability, and ensure robust cross-domain generalization. Promoting an open, collaborative research ecosystem will be vital.
Key Points
The paper categorizes artificial intelligence (AI) techniques into three main pillars. Traditional machine learning is suitable for analysis tasks with well-defined features. Deep learning, particularly Transformer-based models like AlphaFold2, has achieved revolutionary breakthroughs in sequence analysis and structure prediction by automatically learning from massive datasets. Reinforcement learning optimizes strategies through trial-and-error, playing a key role in exploratory tasks such as de novo drug molecule design.
The paper showcases the broad application of AI in solving core bioinformatics problems. Examples include accurately identifying functional elements in genomics (e.g. DNABERT), achieving high-precision structure prediction and design in proteomics (e.g. AlphaFold), efficiently processing high-dimensional data in single-cell analysis (e.g. scGPT), and accelerating the entire pipeline from target discovery to candidate drug screening.
The paper highlights several formidable challenges facing AI applications. At the data level, biological data commonly suffer from noise, sparsity, and batch effects, which severely impact model performance. At the model and algorithm level, efficiently processing ultra-long biological sequences (such as the human genome) and effectively integrating multimodal heterogeneous data (such as genomics and imaging) are critical unresolved technical bottlenecks.
The paper envisions that the core future opportunity lies in building large-scale Foundation Models for bioinformatics. By pretraining on vast biological datasets, these models can learn generalizable and transferable biological principles, thereby greatly advancing innovative drug discovery and the development of precision personalized medicine.
Finally, the paper emphasizes that driving progress in the field depends on an open and collaborative research ecosystem. This requires sharing high-quality data and models and fostering close interdisciplinary collaboration among biologists, computer scientists, and clinicians to jointly accelerate scientific discovery and technological translation.
Supplementary Material
Acknowledgements
We acknowledge technical support from Data Science Platform of Guangzhou National Laboratory and Bio-medical big data Operating System (Bio-OS).
Contributor Information
Jiyue Jiang, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin District, New Territories, 999077 Hong Kong SAR, China.
Yunke Li, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; Guangzhou Medical University, Xinzao Town, Panyu District, Guangzhou, 511436 Guangdong Province, China.
Shiwei Cao, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China.
Yuheng Shan, National University of Singapore, 21 Lower Kent Ridge Road, 119077 Singapore, Singapore.
Yuexing Liu, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China.
Tianyi Fei, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China.
Yule Yu, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China.
Yi Feng, GMU-GIBH Joint School of Life Sciences, The Guangdong-Hong Kong-Macau Joint Laboratory for Cell Fate Regulation and Diseases, Guangzhou Medical University, 511436 Guangzhou Province, China.
Yu Li, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin District, New Territories, 999077 Hong Kong SAR, China; The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, Shenzhen, 518057 Guangzhou Province, China.
Yixue Li, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; GMU-GIBH Joint School of Life Sciences, The Guangdong-Hong Kong-Macau Joint Laboratory for Cell Fate Regulation and Diseases, Guangzhou Medical University, 511436 Guangzhou Province, China.
Jiao Yuan, Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, 510005 Guangzhou Province, China; GMU-GIBH Joint School of Life Sciences, The Guangdong-Hong Kong-Macau Joint Laboratory for Cell Fate Regulation and Diseases, Guangzhou Medical University, 511436 Guangzhou Province, China.
Conflict of interest
None declared.
Funding
This work was supported by the Major Project of Guangzhou National Laboratory (grant nos GZNL2024A01003, GZNL2023A02007, and GZNL2025C02028 to J.Y.; and grant nos SRPG22007 and GZNL2025C01013 to Yixue L.), National Natural Science Foundation of China (grant no. 32400547 to J.Y.), Pearl River Talent Recruitment Program (2023QN10Y296 to Jiao Yuan), Guangzhou Young Top Talent Program, National Key R&D Program of China (2023YFF1204701 to J.Y.), the Chinese University of Hong Kong (CUHK; award numbers 4937025, 4937026, 5501517, and 5501329 to Yu L.), Shenzhen Medical Research Fund (grant no. A2503002 to Yu L.), the IdeaBooster Fund (IDBF23ENG05 and IDBF24ENG06 to Yu L.), partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Hong Kong SAR), China (project no. CUHK 24204023 to Yu L.), a grant from the Innovation and Technology Commission of the Hong Kong SAR, China (project no. GHP/065/21SZ and ITS/247/23FP to Yu L.), and the Research Matching Grant Scheme at CUHK (award numbers 8601603 and 8601663 to Yu L.) from the Research Grants Council, Hong Kong SAR, China.
Data availability
No datasets have been utilized in this review paper.
References
- 1. Marx V. The big challenges of big data. Nature 2013;498:255–60. 10.1038/498255a [DOI] [PubMed] [Google Scholar]
- 2. Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet 2019;20:257–72. 10.1038/s41576-019-0093-7 [DOI] [PubMed] [Google Scholar]
- 3. Wang F-a, Zhuang Z, Gao F. et al. TMO-Net: an explainable pretrained multi-omics model for multi-task learning in oncology. Genome Biol 2024;25:149. 10.1186/s13059-024-03293-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Jiang J, Wang Z, Shan Y. et al. Biological sequence with language model prompting: a survey. arXiv preprint arXiv:2503.04135, 2025.
- 5. Wang Z, Wang Z, Jiang J. et al. Large language models in bioinformatics: a survey. arXiv preprint arXiv:2503.04490, 2025.
- 6. Angermueller C, Pärnamaa T, Parts L. et al. Deep learning for computational biology. Mol Syst Biol 2016;12:878. 10.15252/msb.20156651 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Greene CS, Tan J, Ung M. et al. Big data bioinformatics. J Cell Physiol 2014;229:1896–900. 10.1002/jcp.24662 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Jiang J, Chen P, Wang J. et al. Benchmarking large language models on multiple tasks in bioinformatics NLP with prompting. arXiv preprint arXiv:2503.04013, 2025.
- 9. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44–56. 10.1038/s41591-018-0300-7 [DOI] [PubMed] [Google Scholar]
- 10. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with alphafold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Eraslan G, Avsec Ž, Gagneur J. et al. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019;20:389–403. 10.1038/s41576-019-0122-6 [DOI] [PubMed] [Google Scholar]
- 12. Esteva A, Robicquet A, Ramsundar B. et al. A guide to deep learning in healthcare. Nat Med 2019;25:24–9. 10.1038/s41591-018-0316-z [DOI] [PubMed] [Google Scholar]
- 13. Senior AW, Evans R, Jumper J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020;577:706–10. 10.1038/s41586-019-1923-7 [DOI] [PubMed] [Google Scholar]
- 14. Rives A, Meier J, Sercu T. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118. 10.1073/pnas.2016239118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Bommasani R, Hudson DA, Adeli E. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- 16. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12:931–4. 10.1038/nmeth.3547 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26:990–9. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Baek M, DiMaio F, Anishchenko I. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6. 10.1126/science.abj8754 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 2018;34:i457–66. 10.1093/bioinformatics/bty294 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Ma A, McDermaid A, Jennifer X. et al. Integrative methods and practical challenges for single-cell multi-omics. Trends Biotechnol 2020;38:1007–22. 10.1016/j.tibtech.2020.02.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Lotfollahi M, Naghipourfar M, Luecken MD. et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022;40:121–30. 10.1038/s41587-021-01001-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Whalen S, Schreiber J, Noble WS. et al. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022;23:169–81. 10.1038/s41576-021-00434-9 [DOI] [PubMed] [Google Scholar]
- 23. Choromanski K, Likhosherstov V, Dohan D. et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- 24. Samek W, Montavon G, Lapuschkin S. et al. Explaining deep neural networks and beyond: a review of methods and applications. Proc IEEE 2021;109:247–78. [Google Scholar]
- 25. Nicholson Price W, Glenn Cohen I. Privacy in the age of medical big data. Nat Med 2019;25:37–43. 10.1038/s41591-018-0272-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Cao L. AI in finance: challenges, techniques, and opportunities. ACM Comput Surv 2022;55:1–38. [Google Scholar]
- 27. Kan X, Miao M, Cao L. et al. Stock price prediction based on artificial neural network. In 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), 182–185, Taiyuan, China, 2020.
- 28. Wurman PR, Stone P, Spranger M. Improving artificial intelligence with games. Science 2023;381:147–8. 10.1126/science.adh8135 [DOI] [PubMed] [Google Scholar]
- 29. Jiang J, Wang S, Li Q. et al. A cognitive stimulation dialogue system with multi-source knowledge fusion for elders with cognitive impairment. In: Rogers A, Boyd-Graber J, Okazaki N (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: long Papers), pp. 10628–40. Toronto, Canada: Association for Computational Linguistics, 2023. [Google Scholar]
- 30. Ong NY, Teo FJJ, Ee JZY. et al. Effectiveness of mindfulness-based interventions on the well-being of healthcare workers: a systematic review and meta-analysis. Gen Psychiatry 2024;37:e101115. 10.1136/gpsych-2023-101115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97. 10.1023/A:1022627411411 [DOI] [Google Scholar]
- 32. Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, San Francisco California USA, pp. 785–94, 2016.
- 33. Moon I, LoPiccolo J, Baca SC. et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat Med 2023;29:2057–67. 10.1038/s41591-023-02482-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Liang N, Li B, Jia Z. et al. Ultrasensitive detection of circulating tumour DNA via deep methylation sequencing aided by machine learning. Nat Biomed Eng 2021;5:586–99. 10.1038/s41551-021-00746-5 [DOI] [PubMed] [Google Scholar]
- 35. Kotlov N, Shaposhnikov K, Tazearslan C. et al. Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data. Commun Biol 2024;7:392. 10.1038/s42003-024-06020-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. John B, Sali A. Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res 2003;31:3982–92. 10.1093/nar/gkg460 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Sutton RS, Barto AG. et al. Reinforcement Learning: An Introduction, Vol. 1. Cambridge: MIT press, 1998. [Google Scholar]
- 38. Watkins CJCH, Dayan P. Q-learning. Mach Learn 1992;8:279–92. [Google Scholar]
- 39. Mnih V, Kavukcuoglu K, Silver D. et al. Human-level control through deep reinforcement learning. Nature 2015;518:529–33. 10.1038/nature14236 [DOI] [PubMed] [Google Scholar]
- 40. Sutton RS, McAllester D, Singh S. et al. Policy gradient methods for reinforcement learning with function approximation. In: Solla S, Leen T, Müller K. (eds), Advances in neural information processing systems 1999, MIT Press, 55 Hayward, St., Cambridge, MA, United States; Denver, Colorado, USA;12. [Google Scholar]
- 41. Zhao X, Jia W, Peng H. et al. Deep reinforcement learning guided graph neural networks for brain network analysis. Neural Netw 2022;154:56–67. 10.1016/j.neunet.2022.06.035 [DOI] [PubMed] [Google Scholar]
- 42. Jayaprakash SL, Sindhu KG, Chaitanya TVSS. et al. MedDQN: a deep reinforcement learning approach for biomedical image classification. In: 2023 Global Conference on Information Technologies and Communications (GCITC), pp. 1–7. India: IEEE, 2023. [Google Scholar]
- 43. Zhao Z, Zou Y, Wang M. et al. Biomedical named entity recognition through deep reinforcement learning. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 779–84. Istanbul, Turkey: IEEE, 2023. [Google Scholar]
- 44. Lall A, Tallur S. Deep reinforcement learning-based pairwise DNA sequence alignment method compatible with embedded edge devices. Sci Rep 2023;13:2773. 10.1038/s41598-023-29277-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Lall A, Tallur S. Deep reinforcement learning-based pairwise DNA sequence alignment method compatible with embedded edge devices. Sci Rep 2023;13:2773. 10.1038/s41598-023-29277-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Song Y-J, Ji DJ, Seo H. et al. Pairwise heuristic sequence alignment algorithm based on deep reinforcement learning. IEEE Open J Eng Med Biol 2021;2:36–43. 10.1109/OJEMB.2021.3055424 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Guanlin W, Fang W, Wang J. et al. Dyna-PPO reinforcement learning with Gaussian process for the continuous action decision-making in autonomous driving. Appl Intell 2023;53:16893–907. 10.1007/s10489-022-04354-x [DOI] [Google Scholar]
- 48. Runge F, Stoll D, Falkner S. et al. Learning to design RNA. arXiv preprint arXiv:1812.11951, 2018.
- 49. Li Y, Kang H, Ye K. et al. Foldingzero: protein folding from scratch in hydrophobic-polar model. arXiv preprint arXiv:1812.00967, 2018.
- 50. Soltanikazemi E, Roy RS, Quadir F. et al. DRLComplex: reconstruction of protein quaternary structures using deep reinforcement learning. arXiv preprint arXiv:2205.13594, 2022.
- 51. Repecka D, Jauniskis V, Karpus L. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 2021;3:324–33. 10.1038/s42256-021-00310-5 [DOI] [Google Scholar]
- 52. Zhou Z, Kearnes S, Li L. et al. Optimization of molecules via deep reinforcement learning. Sci Rep 2019;9:10752. 10.1038/s41598-019-47148-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Liu X, Ye K, van Vlijmen HWT. et al. DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. J Cheminform 2021;13:85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Liu X, Ye K, van Vlijmen HWT. et al. DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. J Cheminform 2023;15:24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Born J, Manica M, Oskooei A. et al. PaccMannRL: de novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning. iScience 2021;24:102269. 10.1016/j.isci.2021.102269 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Tan Y, Dai L, Huang W. et al. DRlinker: deep reinforcement learning for optimization in fragment linking design. J Chem Inf Model 2022;62:5907–17. 10.1021/acs.jcim.2c00982 [DOI] [PubMed] [Google Scholar]
- 57. Ståhl N, Falkman G, Karlsson A. et al. Deep reinforcement learning for multiparameter optimization in de novo drug design. J Chem Inf Model 2019;59:3166–76. 10.1021/acs.jcim.9b00325 [DOI] [PubMed] [Google Scholar]
- 58. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44. 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
- 59. Goodfellow I, Bengio Y, Courville A. et al. Deep Learning, Vol. 1. Cambridge: MIT press, 2016. [Google Scholar]
- 60. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 2017. [Google Scholar]
- 61. Lee J, Yoon W, Kim S. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40. 10.1093/bioinformatics/btz682 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Yu G, Tinn R, Cheng H. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare 2021;3:1–23. 10.1145/3458754 [DOI] [Google Scholar]
- 63. Ji Y, Zhou Z, Liu H. et al. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20. 10.1093/bioinformatics/btab083 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Avsec Ž, Agarwal V, Visentin D. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196–203. 10.1038/s41592-021-01252-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Hayes T. et al. Simulating 500 million years of evolution with a language model. Science 2025;387:850–58. 10.1126/science.ads0018 [DOI] [PubMed] [Google Scholar]
- 66. Yang F, Wang W, Wang F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66. 10.1038/s42256-022-00534-z [DOI] [Google Scholar]
- 67. Abramson J, Adler J, Dunger J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500. 10.1038/s41586-024-07487-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68. Chowdhury R, Bouatta N, Biswas S. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol 2022;40:1617–23. 10.1038/s41587-022-01432-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69. Watson JL, Juergens D, Bennett NR. et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620:1089–100. 10.1038/s41586-023-06415-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70. Ferruz N, Schmidt S, Höcker B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 2022;13:4348. 10.1038/s41467-022-32007-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71. Nijkamp E, Ruffolo JA, Weinstein EN. et al. ProGen2: exploring the boundaries of protein language models. Cell Systems 2023;14:968–978.e3. 10.1016/j.cels.2023.10.002 [DOI] [PubMed] [Google Scholar]
- 72. Brandes N, Ofer D, Peleg Y. et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10. 10.1093/bioinformatics/btac020 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73. Chen B, Cheng X, Pan L. et al. xTrimoPGLM: unified 100b-scale pre-trained transformer for deciphering the language of protein. arXiv preprint arXiv:2401.06199, 2024.
- 74. Zhou Z, Ji Y, Li W. et al. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
- 75. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J. et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods 2025;22:287–97. 10.1038/s41592-024-02523-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76. Nguyen E, Poli M, Faizi M. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S. (eds), Advances in Neural Information Processing Systems 2023; New Orleans, LA, USA; NeurIPS; 36:43177–201. [Google Scholar]
- 77. Cui H, Wang C, Maan H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 2024;21:1470–80. 10.1038/s41592-024-02201-0 [DOI] [PubMed] [Google Scholar]
- 78. Theodoris CV, Xiao L, Chopra A. et al. Transfer learning enables predictions in network biology. Nature 2023;618:616–24. 10.1038/s41586-023-06139-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79. Hao M, Gong J, Zeng X. et al. Large-scale foundation model on single-cell transcriptomics. Nat Methods 2024;21:1481–91. 10.1038/s41592-024-02305-7 [DOI] [PubMed] [Google Scholar]
- 80. Chen J, Hu Z, Sun S. et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. arXiv preprint arXiv:2204.00300, 2022.
- 81. Chen X, Yu L, Umarov R. et al. RNA secondary structure prediction by learning unrolled algorithms. arXiv preprint arXiv:2002.05810, 2020.
- 82. Shen T, Zhihang H, Sun S. et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat Methods 2024;21:2287–98. 10.1038/s41592-024-02487-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83. Kanakarajan KR, Kundumani B, Sankarasubbu M. BioELECTRA: pretrained biomedical text encoder using discriminators. In: Proceedings of the 20th Workshop on Biomedical Language Processing (BioNLP), Online, Association for Computational Linguistics, pp. 143–54, 2021.
- 84. Yuan H, Yuan Z, Gan R. et al. BioBART: pretraining and evaluation of a biomedical generative language model. arXiv preprint arXiv:2204.03905, 2022.
- 85. Mohammadzadeh-Vardin T, Ghareyazi A, Gharizadeh A. et al. DeepDRA: drug repurposing using multi-omics data integration with autoencoders. PLoS One 2024;19:e0307649. 10.1371/journal.pone.0307649 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86. Norouzi R, Norouzi R, Abbasi K. et al. DFT_ANPD: a dual-feature two-sided attention network for anticancer natural products detection. Comput Biol Med 2025;194:110442. 10.1016/j.compbiomed.2025.110442 [DOI] [PubMed] [Google Scholar]
- 87. Zhou G, Gao Z, Ding Q. et al. Uni-Mol: a universal 3D molecular representation learning framework. In: The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 2023.
- 88. Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 2010;5:725–38. 10.1038/nprot.2010.5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89. Jacobs TM, Kuhlman B. Using anchoring motifs for the computational design of protein–protein interactions. Biochem Soc Trans 2013;41:1141–5. 10.1042/BST20130108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90. Rao R, Bhattacharya N, Thomas N. et al. Evaluating protein transfer learning with TAPE. In: Advances in Neural Information Processing Systems, Vancouver, BC, Canada, NeurIPS, 2019;32. [PMC free article] [PubMed] [Google Scholar]
- 91. Lee D, Gorkin DU, Baker M. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 2015;47:955–61. 10.1038/ng.3331 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92. Stuart T, Butler A, Hoffman P. et al. Comprehensive integration of single-cell data. Cell 2019;177:1888–1902.e21. 10.1016/j.cell.2019.05.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93. Lorenz R, Bernhart SH, Siederdissen CHZ. et al. ViennaRNA Package 2.0. Algorithms Mol Biol 2011;6:26. 10.1186/1748-7188-6-26 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94. Haghverdi L, Lun ATL, Morgan MD. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421–7. 10.1038/nbt.4091 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95. Leaman R, Lu Z. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 2016;32:2839–46. 10.1093/bioinformatics/btw343 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96. Nguyen L, Vo T-HN, Trinh QH. et al. iANP-EC: identifying anticancer natural products using ensemble learning incorporated with evolutionary computation. J Chem Inf Model 2022;62:5080–9. 10.1021/acs.jcim.1c00920 [DOI] [PubMed] [Google Scholar]
- 97. Lu K, Yang K, Niyongabo E. et al. Integrated network analysis of symptom clusters across disease conditions. J Biomed Inform 2020;107:103482. 10.1016/j.jbi.2020.103482 [DOI] [PubMed] [Google Scholar]
- 98. Li Q, Hu Z, Wang Y. et al. Progress and opportunities of foundation models in bioinformatics. Brief Bioinform 2024;25:bbae548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99. Koroteev MV. BERT: a review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943, 2021.
- 100. Devlin J, Chang M-W, Lee K. et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–86. Minneapolis, MN: Association for Computational Linguistics, 2019. [Google Scholar]
- 101. Jin Q, Dhingra B, Liu Z. et al. PubMedQA: a dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
- 102. Moor M, Banerjee O, Abad ZSH. et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–65. 10.1038/s41586-023-05881-4 [DOI] [PubMed] [Google Scholar]
- 103. Singhal K, Azizi S, Tu T. et al. Large language models encode clinical knowledge. Nature 2023;620:172–80. 10.1038/s41586-023-06291-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104. Singhal K, Tu T, Gottweis J. et al. Toward expert-level medical question answering with large language models. Nat Med 2025;31:943–50. 10.1038/s41591-024-03423-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105. Xu M, Yuan X, Miret S. et al. ProtST: multi-modality learning of protein sequences and biomedical texts. In: International Conference on Machine Learning, pp. 38749–67. PMLR, 2023. [Google Scholar]
- 106. Wu J, Wang Z, Hong M. et al. Medical SAM adapter: adapting segment anything model for medical image segmentation. Med Image Anal 2025;102:103547. [DOI] [PubMed] [Google Scholar]
- 107. Ulloa AE, Cerna LJ, Good CW. et al. Deep-learning-assisted analysis of echocardiographic videos improves predictions of all-cause mortality. Nat Biomed Eng 2021;5:546–54. 10.1038/s41551-020-00667-9 [DOI] [PubMed] [Google Scholar]
- 108. Kouanou AT, Tchiotsop D, Tchinda R. et al. A machine learning algorithm for biomedical images compression using orthogonal transforms. Int J Image Graph Signal Process 2018;10:38–53. 10.5815/ijigsp.2018.11.05 [DOI] [Google Scholar]
- 109. Eom J-H, Zhang B-T. PubMiner: machine learning-based text mining for biomedical information analysis. Genomics Inform 2004;2:99–106. [Google Scholar]
- 110. Parihar S, Kukker A, Dhar S. et al. Biomedical image classification using deep reinforcement learning. In: 2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), pp. 1–8. Bhilai, India: IEEE, 2024. [Google Scholar]
- 111. Sherry ST, Ward M-H, Kholodov M. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11. 10.1093/nar/29.1.308 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47:D506–15. 10.1093/nar/gky1049 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Boutet E. et al. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. In: Edwards, D. (eds), Plant Bioinformatics. Methods in Molecular Biology, vol 1374. Humana Press, New York, NY. 2016. 10.1007/978-1-4939-3167-5_2 [DOI] [PubMed] [Google Scholar]
- 114. Boeckmann B, Bairoch A, Apweiler R. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31:365–70. 10.1093/nar/gkg095 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115. Protein Data Bank. Protein Data Bank. Nat New Biol 1971;233:223. 10.1038/newbio233223b0 [DOI] [Google Scholar]
- 116. Kieser S, Brown J, Zdobnov EM. et al. ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinform 2020;21:1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117. Kryshtafovych A, Schwede T, Topf M. et al. Critical assessment of methods of protein structure prediction (CASP)—round XIV. Proteins 2021;89:1607–17. 10.1002/prot.26237 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118. Na W, Zhang X-Y, Xia J. et al. Ratiometric 3D DNA machine combined with machine learning algorithm for ultrasensitive and high-precision screening of early urinary diseases. ACS Nano 2021;15:19522–34. 10.1021/acsnano.1c06429 [DOI] [PubMed] [Google Scholar]
- 119. Das S, Chakrabarti S. Classification and prediction of protein–protein interaction interface using machine learning algorithm. Sci Rep 2021;11:1761. 10.1038/s41598-020-80900-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120. Tamasi MJ, Patel RA, Borca CH. et al. Machine learning on a robotic platform for the design of polymer–protein hybrids. Adv Mater 2022;34. 10.1002/adma.202201809 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121. Madani A, Krause B, Greene ER. et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 2023;41:1099–106. 10.1038/s41587-022-01618-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122. ByteDance AML AI4Science Team, Chen X. et al. Protenix: advancing structure prediction through a comprehensive AlphaFold3 reproduction. bioRxiv 2025;2025.01.08.631967. [Google Scholar]
- 123. Liu Y, Tian B. Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform 2024;25:bbad488. 10.1093/bib/bbad488 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124. Ma J, Song J, Young ND. et al. ‘Bingo’—a large language model- and graph neural network-based workflow for the prediction of essential genes from protein data. Brief Bioinform 2024;25:bbad472. 10.1093/bib/bbad472 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125. Janson G, Valdes-Garcia G, Heo L. et al. Direct generation of protein conformational ensembles via machine learning. Nat Commun 2023;14:774. 10.1038/s41467-023-36443-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126. Yang K, Huang H, Vandans O. et al. Applying deep reinforcement learning to the HP model for protein structure prediction. Physica A 2023;609:128395. 10.1016/j.physa.2022.128395 [DOI] [Google Scholar]
- 127. Subramanian J, Sujit S, Irtisam N. et al. Reinforcement learning for sequence design leveraging protein language models. arXiv preprint arXiv:2407.03154, 2024.
- 128. Lee M, Vecchietti LF, Jung H. et al. Robust optimization in protein fitness landscapes using reinforcement learning in latent space. arXiv preprint arXiv:2405.18986, 2024.
- 129. Li Y, Li L, Xu Y. et al. Widely used and fast de novo drug design by a protein sequence-based reinforcement learning model. bioRxiv 2022;2022.08.18.504370. [Google Scholar]
- 130. Cowen-Rivers AI, Gorinski PJ, Sootla A. et al. Structured Q-learning for antibody design. arXiv preprint arXiv:2209.04698, 2022.
- 131. Wang Q, Xiaotong H, Wei Z. et al. Reinforcement learning-driven exploration of peptide space: accelerating generation of drug-like peptides. Brief Bioinform 2024;25:bbae444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132. Ektefaie Y, Viessmann O, Narayanan S. et al. Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding. arXiv preprint arXiv:2410.17173, 2024.
- 133. Chen B, Bei Z, Cheng X. et al. MSAGPT: neural prompting protein structure prediction via MSA generative pre-training. In: Advances in Neural Information Processing Systems 2024, Vol. 37, 37504–34. [Google Scholar]
- 134. Gao Z, Feng T, You J. et al. Deep reinforcement learning for modelling protein complexes. arXiv preprint arXiv:2405.02299, 2024.
- 135. Renard F, Courtot C, Reichlin A. et al. Model-based reinforcement learning for protein backbone design. arXiv preprint arXiv:2405.01983, 2024.
- 136. Nori D, Coley CW, Mercado R. De novo protac design using graph-based deep generative models. arXiv preprint arXiv:2211.02660, 2022.
- 137. Wu F, Radev D, Li SZ. et al. Molformer: motif-based transformer on 3D heterogeneous molecular graphs. In: Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023;37:5312–20. 10.1609/aaai.v37i4.25662 [DOI] [Google Scholar]
- 138. Lenhard B, Wasserman WW. TFBS: computational framework for transcription factor binding site analysis. Bioinformatics 2002;18:1135–6. 10.1093/bioinformatics/18.8.1135 [DOI] [PubMed] [Google Scholar]
- 139. Liu Y, Luo Y, Xin L. et al. Genotypic–phenotypic landscape computation based on first principle and deep learning. Brief Bioinform 2024;25:bbae191. 10.1093/bib/bbae191 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140. Jena MK, Pathak B. Development of an artificially intelligent nanopore for high-throughput DNA sequencing with a machine-learning-aided quantum-tunneling approach. Nano Lett 2023;23:2511–21. 10.1021/acs.nanolett.2c04062 [DOI] [PubMed] [Google Scholar]
- 141. Na W, Wong K-Y, Xin Y. et al. Multispectral 3D DNA machine combined with multimodal machine learning for noninvasive precise diagnosis of bladder cancer. Anal Chem 2024;96:10046–55. 10.1021/acs.analchem.4c01749 [DOI] [PubMed] [Google Scholar]
- 142. Nguyen E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 2024;386:eado9336. 10.1126/science.ado9336 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 143. Brixi G, Durrant MG, Ku J. et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv 2025;2025.02.18.638918. [Google Scholar]
- 144. Pan C, Tabatabaei SK, Tabatabaei Yazdi SMH. et al. Rewritable two-dimensional DNA-based data storage with machine learning reconstruction. Nat Commun 2022;13:2984. 10.1038/s41467-022-30140-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 145. Griffiths-Jones S, Bateman A, Marshall M. et al. Rfam: an RNA family database. Nucleic Acids Res 2003;31:439–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 146. The RNAcentral Consortium. RNAcentral: a hub of information for non-coding RNA sequences. Nucleic Acids Res 2019;47:D221–9. 10.1093/nar/gky1034 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 147. Miao Z, Adamiak RW, Antczak M. et al. RNA-puzzles round IV: 3D structure predictions of four ribozymes and two aptamers. RNA 2020;26:982–95. 10.1261/rna.075341.120 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148. Adamczyk B, Antczak M, Szachniuk M. RNAsolo: a repository of cleaned PDB-derived RNA 3D structures. Bioinformatics 2022;38:3668–70. 10.1093/bioinformatics/btac386 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 149. Wang X, Gu R, Chen Z. et al. Uni-RNA: universal pre-trained models revolutionize RNA research. bioRxiv 2023;2023.07.11.548588. [Google Scholar]
- 150. Zhang Y, Lang M, Jiang J. et al. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res 2024;52:e3. 10.1093/nar/gkad1031 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151. Li T, He J, Cao H. et al. All-atom RNA structure determination from cryo-EM maps. Nat Biotechnol 2025;43:97–105. 10.1038/s41587-024-02149-8 [DOI] [PubMed] [Google Scholar]
- 152. Hofacker IL, Fontana W, Stadler PF. et al. Fast folding and comparison of RNA secondary structures. Monatsh Chem 1994;125:167–88. 10.1007/BF00818163 [DOI] [Google Scholar]
- 153. Zhou T, Dai N, Li S. et al. RNA design via structure-aware multifrontier ensemble optimization. Bioinformatics 2023;39:i563–71. 10.1093/bioinformatics/btad252 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 154. Rubio-Largo Á, Escobar-Encinas L, Lozano-García N. et al. Evolutionary strategy to enhance an RNA design tool performance. IEEE Access 2024;12:15582–93. 10.1109/ACCESS.2024.3358426 [DOI] [Google Scholar]
- 155. Runge F, Franke J, Fertmann D. et al. Partial RNA design. Bioinformatics 2024;40:i437–45. 10.1093/bioinformatics/btae222 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 156. Huang H, Lin Z, He D. et al. Ribodiffusion: tertiary structure-based RNA inverse folding with generative diffusion models. Bioinformatics 2024;40:i347–56. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 157. Wong F, He D, Krishnan A. et al. Deep generative design of RNA aptamers using structural predictions. Nat Comput Sci 2024;4:829–39. 10.1038/s43588-024-00720-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 158. Lataretu M, Hölzer M. RNAflow: an effective and simple RNA-seq differential gene expression pipeline using nextflow. Genes 2020;11:1487. 10.3390/genes11121487 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 159. Oh G, Choi B, Jung I. et al. scHyena: foundation model for full-length single-cell RNA-seq analysis in brain. arXiv preprint arXiv:2310.02713, 2023.
- 160. Nguyen TM, Quinn TP, Nguyen T. et al. Counterfactual explanation with multi-agent reinforcement learning for drug target prediction. arXiv preprint arXiv:2103.12983, 2021.
- 161. Bou A, Thomas M, Dittert S. et al. ACEGEN: reinforcement learning of generative chemical agents for drug discovery. J Chem Inf Model 2024;64:5900–11. 10.1021/acs.jcim.4c00895 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 162. Hongyan D, Jiang D, Zhang O. et al. A flexible data-free framework for structure-based de novo drug design with reinforcement learning. Chem Sci 2023;14:12166–81. 10.1039/D3SC04091G [DOI] [PMC free article] [PubMed] [Google Scholar]
- 163. Wang D, Chen J, Liang Z. et al. Quantum-inspired reinforcement learning for synthesizable drug design. arXiv preprint arXiv:2409.09183, 2024.
- 164. Munn DH, Bronte V. Immune suppressive mechanisms in the tumor microenvironment. Curr Opin Immunol 2016;39:1–6. 10.1016/j.coi.2015.10.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 165. Kalbasi A, Ribas A. Tumour-intrinsic resistance to immune checkpoint blockade. Nat Rev Immunol 2020;20:25–39. 10.1038/s41577-019-0218-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 166. Samstein RM, Lee C-H, Shoushtari AN. et al. Tumor mutational load predicts survival after immunotherapy across multiple cancer types. Nat Genet 2019;51:202–6. 10.1038/s41588-018-0312-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 167. Goodman AM, Kato S, Bazhenova L. et al. Tumor mutational burden as an independent predictor of response to immunotherapy in diverse cancers. Mol Cancer Ther 2017;16:2598–608. 10.1158/1535-7163.MCT-17-0386 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 168. Hellmann MD, Callahan MK, Awad MM. et al. Tumor mutational burden and efficacy of nivolumab monotherapy and in combination with ipilimumab in small-cell lung cancer. Cancer Cell 2018;33:853–861.e4. 10.1016/j.ccell.2018.04.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 169. Guidoboni M, Gafà R, Viel A. et al. Microsatellite instability and high content of activated cytotoxic lymphocytes identify colon cancer patients with a favorable prognosis. Am J Pathol 2001;159:297–304. 10.1016/S0002-9440(10)61695-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 170. Le DT, Uram JN, Wang H. et al. PD-1 blockade in tumors with mismatch-repair deficiency. New Engl J Med 2015;372:2509–20. 10.1056/NEJMoa1500596 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 171. McGranahan N, Furness AJS, Rosenthal R. et al. Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade. Science 2016;351:1463–9. 10.1126/science.aaf1490 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 172. Miao Y-R, Zhang Q, Lei Q. et al. Immucellai: a unique method for comprehensive t-cell subsets abundance prediction and its application in cancer immunotherapy. Adv Sci 2020;7:1902880. 10.1002/advs.201902880 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 173. Zhang N, Yang M, Yang J-M. et al. A predictive network-based immune checkpoint blockade immunotherapeutic signature optimizing patient selection and treatment strategies. Small Methods 2024;8:2301685. 10.1002/smtd.202301685 [DOI] [PubMed] [Google Scholar]
- 174. Sidhom J-W, Oliveira G, Ross-MacDonald P. et al. Deep learning reveals predictive sequence concepts within immune repertoires to immunotherapy. Sci Adv 2022;8. 10.1126/sciadv.abq5089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 175. Lee J, Kim D, Kong JH. et al. Cell-cell communication network-based interpretable machine learning predicts cancer patient response to immune checkpoint inhibitors. Sci Adv 2024;10. 10.1126/sciadv.adj0785 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 176. Clusmann J, Kolbinger FR, Muti HS. et al. The future landscape of large language models in medicine. Commun Med 2023;3. 10.1038/s43856-023-00370-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 177. Schäfer PSL, Dimitrov D, Villablanca EJ. et al. Integrating single-cell multi-omics and prior biological knowledge for a functional characterization of the immune system. Nat Immunol 2024;25:405–17. 10.1038/s41590-024-01768-2 [DOI] [PubMed] [Google Scholar]
- 178. Vanguri RS, Luo J, Aukerman AT. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(l) 1 blockade in patients with non-small cell lung cancer. Nat Cancer 2022;3:1151–64. 10.1038/s43018-022-00416-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 179. Yang Y, Zhao Y, Liu X. et al. Artificial intelligence for prediction of response to cancer immunotherapy. Semin Cancer Biol 2022;87:137–47. 10.1016/j.semcancer.2022.11.008 [DOI] [PubMed] [Google Scholar]
- 180. Liu Y, Altreuter J, Bodapati S. et al. Predicting patient outcomes after treatment with immune checkpoint blockade: a review of biomarkers derived from diverse data modalities. Cell Genomics 2024;4:100444. 10.1016/j.xgen.2023.100444 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 181. Chen K, Ye H, Xiao-jie L. et al. Towards in silico prediction of the immune-checkpoint blockade response. Trends Pharmacol Sci 2017;38:1041–51. 10.1016/j.tips.2017.10.002 [DOI] [PubMed] [Google Scholar]
- 182. Liu H, Pan W, Tang C. et al. The methods and advances of adaptive immune receptors repertoire sequencing. Theranostics 2021;11:8945–63. 10.7150/thno.61390 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 183. Porciello N, Franzese O, D’Ambrosio L. et al. T-cell repertoire diversity: friend or foe for protective antitumor response? J Exp Clin Cancer Res 2022;41:356. 10.1186/s13046-022-02566-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 184. Akkaya M, Kwak K, Pierce SK. B cell memory: building two walls of protection against pathogens. Nat Rev Immunol 2020;20:229–38. 10.1038/s41577-019-0244-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 185. Robinson WH. Sequencing the functional antibody repertoire—diagnostic and therapeutic discovery. Nat Rev Rheumatol 2015;11:171–82. 10.1038/nrrheum.2014.220 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 186. Tong R, Luo L, Zhao Y. et al. Characterizing the cellular and molecular variabilities of peripheral immune cells in healthy recipients of BBIBP-CorV inactivated SARS-CoV-2 vaccine by single-cell RNA sequencing. Emerging Microbes Infect 2023;12:e2187245. 10.1080/22221751.2023.2187245 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 187. Oliveira G, Wu CJ. Dynamics and specificities of T cells in cancer immunotherapy. Nat Rev Cancer 2023;23:295–316. 10.1038/s41568-023-00560-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 188. Meng H, Zhang T, Wang Z. et al. High-throughput host–microbe single-cell RNA sequencing reveals ferroptosis-associated heterogeneity during acinetobacter baumannii infection. Angew Chem Int Ed 2024;63:e202400538. 10.1002/anie.202400538 [DOI] [PubMed] [Google Scholar]
- 189. Hagen M, Bucci L, Böltz S. et al. BCMA-targeted T-cell–engager therapy for autoimmune disease. New Engl J Med 2024;391:867–9. 10.1056/NEJMc2408786 [DOI] [PubMed] [Google Scholar]
- 190. McGrath JJC, Li L, Wilson PC. Memory B cell diversity: insights for optimized vaccine design. Trends Immunol 2022;43:343–54. 10.1016/j.it.2022.03.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 191. Setliff I, Shiakolas AR, Pilewski KA. et al. High-throughput mapping of B cell receptor sequences to antigen specificity. Cell 2019;179:1636–1646.e15. 10.1016/j.cell.2019.11.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 192. Gruell H, Vanshylla K, Weber T. et al. Antibody-mediated neutralization of SARS-CoV-2. Immunity 2022;55:925–44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 193. Cao Y, Bin S, Guo X. et al. Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients’ B cells. Cell 2020;182:73–84.e16. 10.1016/j.cell.2020.05.025 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 194. Manso T, Folch G, Giudicelli V. et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res 2022;50:D1262–72. 10.1093/nar/gkab1136 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 195. Giudicelli V, Duroux P, Ginestoux C. et al. IMGT/LIGM-DB, the IMGT® comprehensive database of immunoglobulin and T cell receptor nucleotide sequences. Nucleic Acids Res 2006;34:D781–4. 10.1093/nar/gkj088 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 196. Giudicelli V, Chaume D, Lefranc M-P. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res 2005;33:D256–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 197. Ehrenmann F, Kaas Q, Lefranc M-P. IMGT/3Dstructure-DB and IMGT/DomainGapAlign: a database and a tool for immunoglobulins or antibodies, T cell receptors, MHC, IgSF and MhcSF. Nucleic Acids Res 2010;38:D301–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 198. Lefranc M-P, Giudicelli V, Duroux P. et al. IMGT®, the international ImMunoGeneTics information system® 25 years on. Nucleic Acids Res 2015;43:D413–22. 10.1093/nar/gku1056 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 199. Vita R, Blazeska N, Marrama D. et al. The immune epitope database (IEDB): 2024 update. Nucleic Acids Res 2025;53:D436–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 200. Dunbar J, Krawczyk K, Leem J. et al. Sabdab: the structural antibody database. Nucleic Acids Res 2014;42:D1140–6. 10.1093/nar/gkt1043 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 201. Laustsen AH, Greiff V, Karatt-Vellatt A. et al. Animal immunization, in vitro display technologies, and machine learning for antibody discovery. Trends Biotechnol 2021;39:1263–73. 10.1016/j.tibtech.2021.03.003 [DOI] [PubMed] [Google Scholar]
- 202. Neamtu A, Mocci F, Laaksonen A. et al. Towards an optimal monoclonal antibody with higher binding affinity to the receptor-binding domain of SARS-CoV-2 spike proteins from different variants. Colloids Surf B Biointerfaces 2023;221:112986. 10.1016/j.colsurfb.2022.112986 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 203. Leaver-Fay A, Tyka M, Lewis SM. et al. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 2011;487:545–74. 10.1016/B978-0-12-381270-4.00019-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 204. Xie X, Lee JS, Kim D. et al. Antibody-SGM: antigen-specific joint design of antibody sequence and structure using diffusion models. In: ICML 2023 Workshop on Computational Biology, Honolulu, Hawaii, USA, 2023.
- 205. Eguchi RR, Choe CA, Parekh U. et al. Deep generative design of epitope-specific binding proteins by latent conformation optimization. bioRxiv 2022; 2022.12.22.521698. [Google Scholar]
- 206. Dauparas J, Anishchenko I, Bennett N. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378:49–56. 10.1126/science.add2187 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 207. Bennett NR, Watson JL, Ragotte RJ. et al. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv 2025; 2024.03.14.585103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 208. Baker Lab. Designing antibodies with RFdiffusion. https://www.bakerlab.org/2025/02/28/designing-antibodies-with-rfdiffusion/, 28 February 2025.
- 209. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574 [DOI] [PubMed] [Google Scholar]
- 210. Hie BL, Shanker VR, Duo X. et al. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol 2024;42:275–83. 10.1038/s41587-023-01763-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 211. Kyro GW, Qiu T, Batista VS. A model-centric review of deep learning for protein design. arXiv preprint arXiv:2502.19173, 2025.
- 212. Yin X, Zhu Y, Jiankun H. A comprehensive survey of privacy-preserving federated learning: a taxonomy, review, and future directions. ACM Comput Surv 2021;54:1–36. [Google Scholar]
- 213. Cui Z, Tongda X, Jia Wang Y. et al. Geneformer: learned gene compression using transformer-based context modeling. In: ICASSP 2024 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea; pp. 8035–9. IEEE, 2024. [Google Scholar]
- 214. Li Y, Jiang J, Wang Z. et al. DS-ProGen: a dual-structure deep language model for functional protein design. arXiv preprint arXiv:2505.12511, 2025.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
No datasets were used in this review.