Briefings in Bioinformatics
2025 Jul 25;26(4):bbaf357. doi: 10.1093/bib/bbaf357

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics

Anqi Lin 1,#, Junpu Ye 2,#, Chang Qi 3, Lingxuan Zhu 4, Weiming Mou 5,6, Wenyi Gan 7, Dongqiang Zeng 8,9, Bufu Tang 10, Mingjia Xiao 11, Guangdi Chu 12, Shengkun Peng 13, Hank Z H Wong 14, Lin Zhang 15,16, Hengguo Zhang 17, Xinpei Deng 18, Kailai Li 19, Jian Zhang 20, Aimin Jiang 21,, Zhengrui Li 22,, Peng Luo 23,24,
PMCID: PMC12289552  PMID: 40708223

Abstract

Large language models (LLMs), representing a breakthrough advancement in artificial intelligence, have demonstrated substantial application value and development potential in bioinformatics research, particularly showing significant progress in the processing and analysis of complex biological data. This comprehensive review systematically examines the development and applications of LLMs in bioinformatics, with particular emphasis on their advancements in protein and nucleic acid structure prediction, omics analysis, drug design and screening, and biomedical literature mining. This work highlights the distinctive capabilities of LLMs in end-to-end learning and knowledge transfer paradigms. Additionally, this paper thoroughly discusses the major challenges confronting LLMs in current applications, including key issues such as model interpretability and data bias. Furthermore, this review comprehensively explores the potential of LLMs in cross-modal learning and interdisciplinary development. In conclusion, this paper aims to systematically summarize the current research status of LLMs in bioinformatics, objectively evaluate their advantages and limitations, and provide insights and recommendations for future research directions, thereby positioning LLMs as essential tools in bioinformatics research and fostering innovative developments in the biomedical field.

Keywords: bioinformatics, LLMs, artificial intelligence, large language models

Introduction

Bioinformatics, which integrates biology, computer science, and statistics, has seen significant progress but continues to face persistent challenges [1]. The rapid growth of next-generation sequencing technologies has triggered an explosion in biological data, which has strained traditional processing and analysis methods [2, 3]. Heterogeneous data across domains such as protein structure prediction, multi-omics, and biomedical text mining overwhelms conventional statistical and machine learning (ML) approaches, which struggle to capture complex patterns in high-dimensional datasets. This situation underscores an urgent need for advanced deep learning tools to enhance data processing and facilitate knowledge discovery [4].

Large language models (LLMs) represent a pivotal breakthrough in artificial intelligence, demonstrating remarkable potential across diverse disciplines, and notably exhibiting distinctive applications in bioinformatics research [5–7]. LLMs are sophisticated natural language processing (NLP) models built upon deep neural network architectures, whose training methodology encompasses two critical phases: pre-training and fine-tuning. In the pre-training phase, models undergo self-supervised learning through the processing of extensive unlabeled textual data, implementing fundamental pre-training paradigms, specifically autoregressive language modeling and masked language modeling. During the fine-tuning phase, models undergo parameter optimization on task-specific labeled datasets of limited scale to enhance performance on specific tasks. In the field of NLP, LLMs represented by BERT (bidirectional encoder representations from transformers) [8] and the GPT series [9, 10] have achieved breakthrough progress, demonstrating significant performance advantages in practical applications such as intelligent customer service, machine translation, and human–computer interaction. Their core features include contextual understanding and knowledge transfer capabilities, which significantly enhance these models’ ability to process heterogeneous and complex data. Additionally, LLMs’ excellent capabilities in dialogue interaction and program writing make them valuable tools for advancing bioinformatics education and research [10]. Therefore, applying LLMs to bioinformatics not only effectively alleviates data processing challenges and improves information extraction and analysis efficiency, but also provides robust technical support for innovative developments in the field.
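The two pre-training paradigms named above can be made concrete: masked language modeling hides a fraction of input tokens and trains the model to recover them from bidirectional context, whereas autoregressive modeling predicts each token from its left context. A minimal, framework-free sketch of the masking step (a hypothetical helper for illustration, not taken from any cited model):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_rate=0.15, seed=0):
    """BERT-style masking: hide a fraction of tokens and record the
    originals as prediction targets (toy sketch of the objective)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# A protein sequence treated as a sequence of amino-acid tokens
seq = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
masked, targets = mask_tokens(seq)
```

During pre-training, the loss is computed only at the masked positions; fine-tuning then replaces this objective with a task-specific head trained on labeled data.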

In recent years, LLMs, as a state-of-the-art artificial intelligence technology, have demonstrated remarkable application value and potential across diverse research domains in bioinformatics, offering innovative solutions to address fundamental scientific challenges in the field [11, 12]. In biological sequence analysis and function prediction, LLM-based computational approaches have significantly enhanced both the accuracy of genomic information analysis and the predictive performance of protein function annotation [13]. In structural biology research, LLMs have exhibited transformative capabilities, particularly in advancing protein three-dimensional structure prediction and nucleic acid spatial conformation analysis. In multi-omics data integration analysis, LLM-based computational frameworks have substantially enhanced the processing efficiency of biological big data [14], thereby expediting the identification and validation of disease-associated biomarkers. In pharmaceutical development, LLM implementation has considerably accelerated the drug development pipeline while enhancing the precision of lead compound screening and drug molecule optimization [15]. In biomedical literature mining, LLMs have exhibited distinctive advantages [16], yielding substantial improvements in literature semantic retrieval, research trajectory monitoring, and automated experimental data extraction.

This review examines LLMs’ role in bioinformatics, assessing their strengths, limitations, and future directions. While LLMs show significant potential in areas like sequence analysis, omics integration, and drug discovery, challenges persist, including data complexity, model interpretability, computational demands, and privacy concerns. Future priorities involve multimodal learning, knowledge integration, efficient architectures, interpretable AI, cross-domain collaboration with experimental biology, and ethical data practices to advance bioinformatics research and applications.

Contemporary applications and advances of LLMs in bioinformatics

LLMs, as an emerging computational paradigm integrating NLP and deep learning architectures, have demonstrated significant potential for diverse applications in bioinformatics, particularly in biomolecular structure prediction and multi-omics data analysis. Over the past decade, researchers have developed and implemented numerous specialized LLMs for bioinformatics applications (Table 1, Fig. 1), and extensive studies have systematically evaluated their performance in biological data interpretation and analysis.

Table 1.

Application status of LLMs in bioinformatics

Bioinformatics task Model name Year Base models Research direction Advantages Limitations
Protein sequence analysis and functional prediction PTMGPT2 2024 GPT Accurate prediction of post-translational modification (PTM) sites PTMGPT2 outperforms existing deep-learning methods and tools in most cases The constrained exploration of prompt designs for certain PTM types
ProGen 2023 BERT Generate protein sequences with predictable functions Generates functional artificial proteins across protein families using only evolutionary sequence data Generation of functional artificial proteins without fine-tuning succeeds at only a low rate
ProGen2 2023 BERT Capture evolutionary sequence distributions, generate new active sequences, and predict protein fitness Capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning A degradation of performance was observed for narrower landscapes composed primarily of amino acid substitutions
ESM2 2023 ESM2 Assess protein sequence conservatism, identify conserved fragments in rapidly evolving sequences, and detect potential functional sites in long protein sequences Does not require a genomic database search, can parse multiple protein domains in the same run and can be accelerated by GPU Does not explain why the site is conserved, though embedding-based conservation analysis can identify conserved sites
ProtGPT-2 2022 GPT-2 De novo generation of protein sequences AlphaFold prediction of ProtGPT2-sequences yields well-folded non-idealized structures with embodiments and large loops and reveals topologies not captured in current structure databases The inclusion of conditional tags which will enable the controlled generation of specific functions
ProteinBERT 2022 BERT Rapidly training protein predictors Far more frugal than other leading protein language models in model size, compute, and memory
MSA Transformer 2021 MSA Sequence prediction, homology detection, coevolution, and structure–function association prediction The internal representations of the model enable state-of-the-art unsupervised structure learning with an order of magnitude fewer parameters than current protein language models Further scaling the approach in the number of parameters and input sequences
BERT4Bitter 2021 BERT Predict the biological properties of peptides BERT4Bitter makes use of raw peptide sequences without the need for the systematic design and selection of feature encodings
EpiBERTope 2021 BERT Extract information on the interaction between protein sequences EpiBERTope outperforms highly optimized machine learning models including random forest and SVM on linear epitope prediction tasks The heterogeneity of antigen sequences and scarcity of structural dataset
ELMo 2019 Protein sequence modeling Reduces the dependence on the time-consuming and computationally intensive calculation of protein profiles The ELMo embeddings alone (DeepSeqVec) did not surpass any of the best methods using evolutionary information tested on the same data set
DNA sequence analysis and functional prediction DNABERT-2 2024 BERT Genome sequence prediction Integrate several techniques such as Attention with Linear Biases (ALiBi) and Low-Rank Adaptation (LoRA) Effective modeling strategies for short and extra-long genome sequences and the introduction of training targets and data processing/augmentation methods that leverage the unique double-strand structure of DNA
DNABERT-S 2024 BERT Multi-species identification in a mixture of unlabeled genome sequences The capability of species differentiation and DNABERT-S can achieve slightly better performance with less amount of labeled samples High computational demands
HyenaDNA 2023 Hyena Genomics contextual learning HyenaDNA can learn generalizable features that can then be fine-tuned for downstream tasks
DNAGPT 2023 GPT Multiple tasks such as gene annotation, variant detection, and sequence prediction Effectively comprehends the underlying relationships and information within genomes Focus on DNA sequences and the incorporation of multimodal data
CombSAFE 2022 BERT Analyze the entire genome, clusters regions with similar functional elements Allows the comparison of a great number of genomic profiles of chromatin functional states in other conditions through HMMs, as well as the extraction of their specific variations in the different conditions Depends on the availability of multiple omics data about the same biological condition and the quality of the metadata provided in the input
DNABERT 2021 BERT Genome sequence prediction Great flexibility adapting to multiple situations, and enhanced performance with limited data Direct machine translation on DNA is not yet possible
BERT-Enhancer 2021 BERT DNA enhancer identification and DNA sequence information characterization iEnhancer-BERT consistently outperforms several other existing methods on the benchmark datasets Incapable of accurately predicting cell-specific enhancers of variable length
RoBERTa 2019 BERT Analysis of metagenomic data Achieves state-of-the-art results on GLUE, RACE, and SQuAD, without multi-task fine-tuning for GLUE or additional data for SQuAD
RNA sequence analysis and functional prediction RNABERT 2022 BERT Multiple structural alignment of RNA sequences, prediction of RNA interactions, and prediction of RNA secondary structure The base embeddings obtained by RNABERT apply to various fields in RNA informatics This study has not addressed RNA modification
GeoBoost2 2020 GeoBoost Extraction of infected host location information and augmentation of nucleotide sequence databases Not required when submitting a sequence to GenBank and provides a framework to automate this manual extraction process
Protein structure prediction ESM-AA 2024 ESM A unified approach to molecular modeling at the atomic and protein residue scales Effectively integrates molecular knowledge into the protein language model without sacrificing the understanding of proteins
ESM-IF1 2022 ESM Focus on the reverse folding problem Integrating backbone span masking into the inverse folding task and using a sequence-to-sequence transformer Evaluating single-point mutations
ESMFold 2022 ESM Direct prediction of atomic-scale protein structure from amino acid sequences Advances in speed puts far larger numbers of sequences within reach of accurate atomic-level prediction
BepiPred-3.0 2022 ESM-2 B-Cell epitope prediction Protein LMs can vastly improve B-cell epitope prediction and using only the antigen sequence as an input The current BepiPred-3.0 results are likely affected by the limited availability of experimental structures
TransPPMP 2022 ESM Protein pathogenicity analysis TransPPMP can capture functional sites that have a large influence on mutated residues Does not distinguish between the mutation type or the amino acid change
ESM-1v 2021 ESM Assess the effect of mutations on protein function Incorporating any available protein sequence and structure information
ESM-1b 2019 ESM Assess amino acid mutational effects and predict secondary structural characteristics Can be trained across evolutionary data to generalize and discover information that is not present in current state-of-the-art features The protein language models have not yet reached the limit of scale
rawMSA 2019 MSA Secondary structure prediction, relative solvent accessibility analysis, and interresidue contact map construction Automatically extracting any relevant feature from the raw data
RNA structure prediction RNABERT 2022 BERT Understand RNA secondary structure and perform structural alignment The base embeddings obtained by RNABERT apply to various fields in RNA informatics This study has not addressed RNA modification
Molecular docking and interaction prediction AraPathogen2.0 2024 ESM2 Plant pathogen PPI prediction The prediction performance for those PPIs with proteins unseen in training data has been considerably improved
PRECOGx 2022 ESM1b GPCR interaction prediction PRECOGx allows for the prediction of the effect of mutations at virtually any position within the sequence, and it can also handle larger variations Optimal outcomes for certain interactors are study specific
PepBCL 2022 BERT Predict protein–peptide binding residues and capture the conserved and non-conserved sequential characteristics of peptide-binding residues Demonstrating that protein sequences themselves contain sufficient information for the prediction of peptide-binding residues
BERT-RBP 2022 BERT Accurate prediction of RNA-binding protein (RBP) interactions Can distinguish both the transcript region type and RNA secondary structure using only sequence information as inputs Cannot overcome the difference in the frequency of tokens in contact
ProtBert + BiLSTM 2022 ProtBert Prediction of human leukocyte antigen (HLA)–binding peptides Extracted features may be further refined by feature selection algorithms for different prediction tasks
GeneBERT 2021 BERT Use histone modification data to predict the expression of differentially differentiated genes There is potential to use BERT models toward predicting gene expression, and the embeddings generated by RoBERTa were effective even though they were not trained on a large corpus Biased data, correlation versus causation, and black-box model
TCR-BERT 2021 BERT Recognition prediction between TCR and epitope Leverage unlabeled data to learn a more general representation of TCRs, before being applied to or fine-tuned on specific downstream tasks Does not leverage VDJ gene usage information in its design
Transcriptomics data analysis BERT-TFBS 2024 BERT Transcription factor binding site prediction and promoter sequence recognition Enhanced prediction ability of the CNN module and the CBAM and the generalization capability of BERT-TFBS in predicting TFBSs Do not combine DNA sequence information with DNA structural characteristics
scGPT 2024 GPT Cell type annotation, multi-batch integration, multi-omics integration, perturbation response prediction, and gene regulatory network prediction The learned gene networks in scGPT exhibit strong alignment with known functional groups The current pre-training does not inherently mitigate batch effects
GP-GPT 2024 GPT Genetic phenotypic representation and genomic relationship analysis Fine-tuning large language models on genomics-specific training data leads to improved hidden representations of genetic medical entities Dataset expansion, tokenization of genomics entities and model performance, and evaluation
DeepGene Transformer 2024 Identification of biomarkers for multiple cancer subtypes and classification of tumors and their subtypes Takes account of the mRNA expression of all genes for classification
Lomicsv1.0 2024 LLama-3 Biological transcriptome analysis and generation of pathway-related gene sets Allows researchers to generate pathways and gene sets using large language models for transcriptomic analysis
scELMo 2023 GenePT Cell sorting, batch effect correction, and cell type annotation scELMo can be used for detecting novel therapeutic targets by observing the change of embeddings corresponding to the removal of certain genes Can not generate meaningful information for genes that were recently discovered or analyzed
MuLan-Methyl 2022 BERT, DistilBERT, ALBERT, XLNet, ELECTRA Prediction of DNA methylation sites Contains multimodal data
iEnhancer-BERT 2022 BERT Improved ability to identify enhancers and their functional strength iEnhancer-BERT consistently outperforms several other existing methods on the benchmark datasets Incapable of accurately predicting cell-specific enhancers of variable length
scBERT 2022 BERT Cell type annotation, discovery of new cell types scBERT surpasses the existing advanced methods on diverse benchmarks Gene expression embedding, modeling gene interactions, and the masking strategy during the pre-training stage
iDNA-ABT 2021 BERT DNA methylation prediction in different species Have strong adaptability and robustness to different species Not pre-trained on a larger dataset
GeneBERT 2021 RoBERTa Histone modifications predict different gene expressions There is potential to use BERT models toward predicting gene expression, and the embeddings generated by RoBERTa were effective even though they were not trained on a large corpus Biased data, correlation versus causation, and black-box model
Proteomics data analysis pLM4Alg 2024 ESM2 Predictive tasks that involve residue deletion sequences and sequences containing non-standard amino acid residues The primary distinction between pLM4Alg and previous approaches lies in the utilization of pLM The length restriction of input protein sequences
POOE 2024 ProtTrans Prediction of oomycete effectors The embedding generated by the ProtTrans language model could capture rich semantic information regarding protein sequence–structure–function relationships and improve downstream prediction tasks
mtx-COBRA 2024 ESM Recognition of subcellular localization (SCL) of bacterial proteins The identification of protein SCL expands our understanding of protein function and potential interactions of bacterial pathogens with host cells All the ML models consistently underperformed for 1 gram-negative organism and 2 gram-positive organisms
TooT-PLM-ionCT 2024 ProtBERT, ProtBERT-BFD, ESM-1b, ESM-2 (650M parameter) and ESM-2 (15B parameter) Identification of ion channels and membrane proteins Enhanced generalization with new dataset and dataset balance and classifier performance Encountered certain limitations primarily rooted in computational resource constraints
UESolDS 2024 ProteinBERT, ESM2, and ProtTrans Prediction of protein solubility Provides accurate predictions of protein solubility based solely on protein sequences as input Evaluation of the experimental test set shows that there is still room to improve the performance of PLM_sol
n-gram 2016 n-gram Structural and functional prediction of missing proteins The human gene-coding proteins currently undetected are identified by using biological language models
Virtual screening and drug retargeting MRCF 2024 ChatGPT Drug retargeting analysis Outperforms the conventional ChatGPT workflow and manual annotations in terms of consistency, accuracy, speed, and cost Did not encompass all datasets within the GEO database
K-BERT 2022 BERT Extract chemical information from SMILES and understand molecular representations K-BERT can extract molecular features such as computational chemists and generate the general fingerprints K-BERT-FP
MolGPT 2022 GPT Virtual screening of drugs The generative process can be interpreted using saliency maps The constraining conditions of scaffold-based drug design
ChemBERTa 2020 RoBERTa Learn molecular representation as well as molecular property prediction MLM pre-training provides a boost in predictive power for models on selected downstream tasks from MoleculeNet With the possible exception of Tox21, ChemBERTa still performs below state-of-the-art
SMILES-BERT 2019 BERT Understand and predict molecular properties from chemical structure representations Includes more diversity into the input data to prevent over-fitting
Drug–target interaction prediction PharmBERT 2023 BERT Understand a wide range of bioinformatics information, including drug–target action PharmBERT can better handle domain-specific drug-related information than the vanilla BERT, ClinicalBERT, and BioBERT in the three tasks we tested The potential bias in the data and transparency
DTI-BERT 2022 BERT Identification of drug–target interactions based on target protein sequence information Without any help of prior knowledge and handcrafted feature engineering
TransDTI 2022 ESM, ProtBert Predict drug–target interactions and classify drug–target interactions into activity Using representations that try to capture the underlying order in sequential data Do not analyze the details of what the model is learning
Prediction of drug side effects PISTON 2018 Prediction of drug side effects Considered not only the frequency between drugs and genes in the literature but also their relationships
Named entity recognition Galactica 2024 Identify and extract relevant texts in networks involving protein interactions, pathways, and gene regulation The model can show better performances when contextual text is provided Used unconnected pairs in the datasets
BioBERT 2020 BERT An understanding of complex biomedical texts Requiring minimal task-specific architectural modification
ClinicalBERT 2019 BERT Improved accuracy and efficiency in the identification of biomedical entities such as gene expression regulatory networks and molecular interactions Clinical embeddings are superior to general-domain or BioBERT embeddings for non-de-identified tasks
Relationship extraction GENEVIC 2024 ChatGPT 3.5 Automatically analyze, retrieve, and visualize customized domain-specific knowledge The capability of cutting-edge generative AI to unify and streamline access to, navigation of, and automate analysis of biomedical databases and external web APIs Capabilities using a limited PGS rank database with data for only three phenotypes and a basic approach to ranking variants via PGS effect weights
MarkerGenie 2022 SciBERT Multidimensional correlation prediction is supported Introduced and tested with benchmark datasets and real-world case studies Cross-sentence relations extraction, improving negative samples selection, and handling ambiguities of short entity terms
LBERT 2021 BERT Classification of drug–drug interactions, protein–protein interactions, and protein–biological entity relationships A large variant of BioALBERT trained on PubMed outperforms previous state-of-the-art models on 5 out of 6 benchmark BioNLP tasks Pre-training of domain-specific LMs requires a large volume of domain-specific corpora and expensive computational resources
GenCLiP 3 2019 GenCLiP 2.0 Accurately identify keywords from the database Function enrichment with literature keywords, gene network analysis, and gene network search The susceptibility and specificity for recognition of molecular interaction needs to be improved
BioBERT 2019 BERT Biomedical relationship extraction Requiring minimal task-specific architectural modification

PTM, post-translational modification; GPCRs, G protein-coupled receptors; RBP, RNA-binding protein; HLA, human leukocyte antigen; SCL, subcellular localization; DTI, drug–target interaction

Figure 1.


Contemporary applications and advances of LLMs in bioinformatics. This figure systematically categorizes computational tools and models in bioinformatics and biomedicine into five main domains: 1. DNA/RNA sequence analysis, functional and structure prediction: includes tools for sequence analysis (e.g. HyenaDNA, DNAGPT), functional prediction (e.g. BERT-Enhancer, DNABERT), and structure-focused methods (e.g. RNABERT, GeoBoost2). 2. Protein sequence analysis, functional and structure prediction: covers protein sequence modeling (e.g. ESM, ProtGPT-2), post-translational modification prediction (e.g. EpiBERTope, TransPPMP), and structural analysis (e.g. ProteinBERT, MSA Transformer). 3. Multi-omics data analysis: features tools for genomics (e.g. scGPT, iDNA-ABT), epigenomics (e.g. scELMo, MuLan-Methyl), and integrative omics approaches (e.g. DeepGene Transformer, POOE). 4. Computational drug discovery and design: includes models for molecular design (e.g. MolGPT, ChemBERTa), drug–target interaction (e.g. DTI-BERT, TransDTI), and pharmaceutical applications (e.g. PharmBERT). 5. Biomedical literature mining: lists NLP models for biomedical text analysis (e.g. BioBERT, ClinicalBERT, Galactica). This figure was created based on the tools provided by Biorender.com (accessed on 15 May 2025).

Applications of LLMs in biological sequence analysis and function prediction

Protein sequence analysis and functional annotation

Protein sequence analysis and functional prediction constitute a fundamental cornerstone of bioinformatics research, playing pivotal roles in elucidating biological mechanisms, deciphering disease pathogenesis, and facilitating drug development. Recent studies have demonstrated that LLMs, leveraging advanced sequence information embedding techniques, substantially enhance protein sequence retrieval sensitivity while maintaining computational efficiency [17]. Deep learning–based LLM approaches for amino acid sequence embedding, characterized by robust feature extraction capabilities, have emerged as the predominant methodology for protein sequence annotation, demonstrating remarkable advantages in the clustering analysis and functional prediction of novel proteins. In 2019, Embeddings from Language Models (ELMo), a pioneering bidirectional language model, was successfully implemented in protein sequence modeling [18].

The BERT model has been extensively utilized in amino acid sequence analysis and protein function prediction due to its remarkable sequence processing capabilities [19]. As a bidirectional encoding representation model built upon the Transformer architecture, BERT exhibits exceptional performance across diverse sequence analysis tasks through its bidirectional context modeling and self-supervised pre-training approaches [20]. Leveraging the BERT architecture, researchers developed BERT4Bitter, which accurately predicts peptide biological properties exclusively from amino acid sequence information, thereby substantially streamlining the prediction process [21]. The BERT-Kcr model, introduced by Qiao et al., achieved high-precision identification of protein lysine crotonylation (Kcr) sites [22]. ProteinBERT, developed by the Brandes team, demonstrated unprecedented advances across multiple protein property prediction benchmarks [23]. EpiBERTope, developed by Park et al. based on the BERT framework, enables both the prediction of multiple epitopes in proteins and the effective extraction of protein sequence interaction information [24]. ProGen, a next-generation large language model, generates protein sequences with predictable functions within extensive protein families, substantially enhancing the controllability and accuracy of homologous family protein sequence and function prediction [25]. Building upon these advances, the ProGen2 model introduced by the Nijkamp team has overcome significant technical limitations. The model accurately captures evolutionary sequence distributions, generates novel active sequences, and predicts protein adaptability without additional fine-tuning, establishing state-of-the-art performance benchmarks [26]. Building upon the GPT architecture, researchers developed the ProtGPT2 model, which was trained on protein sequence space to efficiently generate de novo protein sequences adhering to natural sequence principles [27]. 
Furthermore, the interpretable protein language model PTMGPT2 achieved accurate prediction of post-translational modification (PTM) sites through prompt-based fine-tuning strategies [28]. The EMCBOW-GPCR model, leveraging NLP techniques, significantly enhanced the recognition accuracy of G protein-coupled receptors (GPCRs) [29]. ESM2 (Evolutionary Scale Modeling 2) exhibited superior performance in evaluating protein sequence conservation, identifying conserved segments within rapidly evolving sequences, and detecting potential functional sites in extended protein sequences [30]. The Multiple Sequence Alignment (MSA)–based MSA Transformer achieved state-of-the-art performance in predicting new sequences of functionally related protein families, homology detection, coevolution analysis, and structure–function relationship prediction [31]. These models have demonstrated significant breakthroughs in sequence information interpretation, protein sequence alignment, homologous protein recognition, and functional protein family prediction, not only substantially improving prediction accuracy but also achieving remarkable computational efficiency enhancements [27]. The rapid advancement of LLMs has provided researchers with robust analytical tools for understanding protein sequence–function relationships, thereby driving technological innovation and theoretical breakthroughs in bioinformatics. These advances have established new research directions and methodological approaches for future protein function studies and drug development, presenting promising opportunities for advancing the field.
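For context on the conservation analyses mentioned above: conservation at an alignment position is classically quantified by the Shannon entropy of the residues observed in that column, and embedding-based methods such as ESM2 aim to recover similar signals without an explicit alignment. The following is a toy classical baseline for comparison, not the ESM2 method itself:

```python
import math
from collections import Counter

def column_conservation(column):
    """Shannon-entropy-based conservation score for one alignment column:
    1.0 means fully conserved; values near 0 mean highly variable.
    (A classical baseline, not the embedding-based scoring in the text.)"""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_entropy = math.log2(20)  # 20 standard amino acids
    return 1.0 - entropy / max_entropy

print(column_conservation("AAAA"))  # fully conserved column -> 1.0
print(column_conservation("ACDE"))  # variable column -> lower score
```

Alignment-free language models are attractive precisely because they avoid building the multiple sequence alignment that such column statistics require.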

DNA/RNA sequence analysis and functional prediction

In bioinformatics research, DNA and RNA sequence analysis and functional prediction not only constitute fundamental research areas but also play pivotal roles in elucidating molecular mechanisms, decoding biological processes, and advancing the development of therapeutic strategies and diagnostic approaches. LLMs have demonstrated remarkable capabilities in genomics research, particularly in the identification and analysis of gene sequences, genetic variants, and sequence-specific features [32]. Recent advances in NLP techniques have enabled researchers to systematically decode biological information embedded within genomic sequences, yielding substantial progress [33]. BERT, a revolutionary language model based on the Transformer architecture, has demonstrated exceptional performance and exhibits extensive potential for diverse applications in this field.

DNABERT, pioneered by Ji et al., represents a state-of-the-art pre-trained bidirectional encoder architecture based on the BERT framework, specifically engineered to capture contextual patterns in nucleotide sequences and facilitate genomic sequence prediction. This model exhibits remarkable performance in promoter recognition, splice site prediction, and transcription factor binding site analysis [34]. DNABERT-2 demonstrated substantial enhancements in both performance metrics and computational efficiency compared to its predecessor through the implementation of refined pre-training strategies [35]. DNABERT-S, introduced by Zhou et al., marked a significant advancement by enabling accurate multi-species identification from complex, unlabeled genomic sequence mixtures [36].
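
The preprocessing step behind the original DNABERT can be illustrated with overlapping k-mer tokenization (DNABERT-2 replaces this with a learned byte-pair-encoding vocabulary); a minimal sketch, with k and stride as illustrative defaults:

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers, the token unit used by
    the original DNABERT (k between 3 and 6 is typical). stride=1 yields
    maximally overlapping tokens. Preprocessing sketch only, not the model."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

kmer_tokenize("ACGTACGT")  # -> ['ACGTAC', 'CGTACG', 'GTACGT']
```

Each k-mer is then mapped to a vocabulary index and fed to the Transformer encoder exactly as words are in natural-language BERT.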

In practical applications, Zhao et al. demonstrated that word embedding techniques derived from the DNABERT model could effectively extract semantic and feature information from 16S rRNA gene sequences, thereby substantially enhancing the accuracy and efficiency of microbial species identification [37]. Le et al. developed BERT-Enhancer, which demonstrated remarkable performance in both DNA enhancer recognition and sequence representation analysis [20]. Liu’s team developed the RoBERTa model, which substantially enhanced performance through optimized BERT pre-training protocols. Additionally, its derivative, MetaBERTa, specifically engineered for metagenomic data analysis, successfully addressed challenges related to data complexity and heterogeneity [38].

As a pioneering pre-trained foundation model for long-range genomic sequence modeling, operating at single-nucleotide resolution with context lengths of up to one million tokens, HyenaDNA has consistently demonstrated superior performance across comprehensive multi-dataset evaluations [39]. CombSAFE has effectively implemented genome-wide analysis of ChIP-seq data through the integration of advanced NLP techniques, thereby facilitating robust clustering and enrichment analysis of functionally similar genomic regions [40]. The GPT-based DNAGPT model represents a novel approach that processes nucleotide sequences as natural language, demonstrating significant advancements in multiple genomic tasks—including gene annotation, variant detection, and sequence prediction—through comprehensive pre-training on large-scale DNA datasets [41].

In the field of RNA sequence analysis, RNABERT, a BERT-based pre-training algorithm, has demonstrated remarkable efficacy in multiple aspects, including structural alignment of RNA sequences, RNA interaction prediction, and RNA secondary structure prediction [42]. GeoBoost2, through the integration of NLP techniques, has enabled the efficient extraction of infected host location information and facilitated the expansion of nucleotide sequence databases, thereby providing essential support for viral geography and genomic epidemiology research [43, 44].

Through comprehensive analysis of DNA and RNA sequences and their functional characteristics, researchers have not only elucidated the molecular mechanisms of gene expression regulation but also established new research directions for the systematic understanding of complex biological processes and the identification and targeted therapy of disease-related genes [45].

Structural biology and computational analysis

Computational approaches for protein structure prediction

Protein structure prediction, a fundamental research domain in bioinformatics, focuses on the computational determination of three-dimensional protein conformations from primary amino acid sequences. The intricate relationship between protein structure and function underscores the critical importance of accurate structure prediction in elucidating biological mechanisms, facilitating drug design, and understanding disease pathogenesis. In recent years, the innovative application of LLMs in protein structure prediction has substantially improved prediction accuracy and computational efficiency, yielding revolutionary breakthroughs in structural biology research. Recent advances in algorithmic approaches, including machine learning methods such as TopoFormer [46], coupled with breakthroughs in high-throughput structure alignment methodologies [47], have catalyzed unprecedented progress in analyzing protein structural dynamics and intermolecular interactions.

Evolutionary Scale Modeling (ESM) is a family of protein language models based on the Transformer deep learning architecture, specifically designed for protein structure prediction and functional annotation. ESM has markedly advanced protein understanding by leveraging large-scale evolutionary information, notably achieving breakthroughs in protein sequence-to-structure mapping. ESM-1b, a state-of-the-art protein language model based on the Transformer architecture, demonstrates multiple capabilities, including accurately predicting protein structure and functional characteristics, evaluating amino acid mutation effects, and predicting secondary structure features [48]. Building upon ESM-1b, researchers have subsequently developed improved models including ESM-1v [49] and ESM-IF1 [50]. ESM-1v employs deep attention mechanisms to identify key residue positions, evaluate the impact of mutations on protein function, and investigate the regulatory mechanisms of single amino acid changes on overall protein stability and activity [49]. ESM-IF1 focuses on the inverse folding problem, aiming to generate amino acid sequences that match specific target structures, thereby providing new approaches for protein design engineering [50]. ESMFold, developed by Lin et al., is a protein structure prediction tool based on deep language models that can directly predict atomic-level protein structures from amino acid sequences, achieving notable advances in both prediction accuracy and computational efficiency. Researchers have utilized ESMFold to predict structures for over 617 million metagenomic protein sequences, establishing a systematic metagenomic structural atlas that provides a crucial foundation for understanding protein diversity [51]. ESM-AA (ESM All-Atom) implements a unified modeling approach at both atomic and protein residue scales, demonstrating exceptional performance across diverse protein molecular-related tasks [52]. 
BepiPred-3.0, an ESM-2–based deep learning architecture, has demonstrated substantial advances in B-cell epitope prediction accuracy [53]. TransPPMP leverages ESM pre-trained models for sequence encoding, thereby facilitating comprehensive protein pathogenicity analysis, particularly for nonsense and frameshift mutations [54]. Beyond ESM, protein structure prediction models incorporating MSA [55], notably rawMSA [56], exhibit substantial advantages in downstream applications, including secondary structure prediction, relative solvent accessibility analysis, and residue contact map construction. The transformative advances in LLMs for protein structure prediction are predominantly characterized by their capacity to efficiently decode intricate correlation patterns between amino acid sequences and three-dimensional conformations. Through ongoing refinement of word embedding methodologies, the performance of LLMs in supervised protein analysis continues to improve, providing increasingly robust computational support for advanced structural biology investigations.
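
The value of MSA-derived input can be made concrete with per-column conservation, one of the simplest signals that MSA-consuming models learn to exploit. The sketch below scores conservation as one minus normalized Shannon entropy; this particular scoring choice is ours for illustration and is not taken from any of the cited models.

```python
import math
from collections import Counter

def column_conservation(msa):
    """Per-column conservation for a multiple sequence alignment, scored as
    1 - (Shannon entropy / max entropy for this alignment depth), so 1.0
    means a fully conserved column. Illustrative feature extraction only."""
    n_cols = len(msa[0])
    scores = []
    for c in range(n_cols):
        counts = Counter(row[c] for row in msa)
        total = sum(counts.values())
        entropy = -sum((v / total) * math.log2(v / total)
                       for v in counts.values())
        max_entropy = math.log2(len(msa))  # upper bound given alignment depth
        scores.append(1.0 - (entropy / max_entropy if max_entropy else 0.0))
    return scores

column_conservation(["MKV", "MKL", "MRV", "MKV"])
```

Column 0 (all methionine) scores 1.0, while the two partially conserved columns score roughly 0.59, mirroring the kind of positional signal an MSA-based network receives.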

Advances in RNA structure prediction

The prediction of RNA secondary and tertiary structures plays a crucial role in elucidating RNA functional mechanisms, decoding intermolecular interaction networks, and understanding post-transcriptional regulatory processes, thus providing critical theoretical foundations for both basic RNA research and clinical translational applications. The BERT-based RNABERT model has demonstrated remarkable proficiency in processing complex nucleotide sequences, yielding significant advances in RNA secondary structure analysis and structural alignment [42]. While LLMs have demonstrated substantial breakthroughs in protein structure prediction, their implementation in RNA structure prediction remains largely in an exploratory phase. As LLMs continue to evolve in the domain of RNA structure prediction, they show promise for offering novel research paradigms and methodological approaches for investigating RNA regulatory mechanisms in human physiological and pathological processes.
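
For context, the classical baseline for secondary-structure prediction is the Nussinov dynamic program, which maximizes the number of nested base pairs; learned models such as RNABERT aim to improve on signals of this kind. The sketch below uses deliberately simplified scoring (every allowed pair counts 1, hairpin loops span at least three unpaired bases), assumptions chosen for illustration.

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs via the Nussinov dynamic program.
    Simplified scoring: each Watson-Crick or G-U wobble pair counts 1, and a
    hairpin loop must contain at least `min_loop` unpaired bases."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # widen the subsequence [i, j]
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:  # case 2: pair base i with k
                    right = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best
    return dp[0][n - 1] if n else 0
```

For the hairpin "GGGAAACCC" this returns 3 (three G-C stem pairs around the AAA loop), the kind of ground-truth structure a learned predictor is evaluated against.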

Molecular docking and biomolecular interaction prediction

Molecular docking and interaction prediction serve as fundamental pillars in drug design and biomolecular behavior research. LLMs, leveraging their robust deep learning capabilities, effectively analyze complex molecular interaction networks, thereby substantially enhancing prediction accuracy and computational efficiency. Protein–protein interaction (PPI) is crucial for deciphering cellular signaling networks and biological functions, playing a significant role in elucidating biological processes and their involvement in disease development. NLP methodologies have been systematically employed to construct protein interaction network maps since their initial implementation in 2005 [57]. With the rapid development of deep learning technology, researchers have successively developed various LLM-based PPI prediction tools, which have significantly advanced the systematic study of complex molecular interaction networks in living organisms. PRECOGx integrates ESM 1b protein embeddings with established data from public repositories to investigate the interactions between GPCRs, G proteins, and β-arrestin, demonstrating superior performance in predicting diverse GPCR interaction patterns [58]. AraPathogen2.0 is a plant pathogen PPI predictor that incorporates sequence encodings from ESM2 and node representations from Arabidopsis PPI networks. Benchmark tests demonstrated that AraPathogen2.0 exhibited exceptional performance when processing test datasets with novel, unseen proteins [59]. In protein binding site prediction, PepBCL, a novel BERT-based language model, can predict protein–peptide binding residues based on amino acid sequences, capturing both conserved and non-conserved sequential features of peptide-binding residues while demonstrating the flexibility and adaptability of protein language models.
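
The general recipe behind embedding-based PPI predictors can be reduced to a few lines: mean-pool per-residue embeddings into one vector per protein, concatenate the two vectors for a candidate pair, and train a downstream classifier on the result. The two-dimensional "embeddings" below are invented toy values standing in for learned ESM or ProtBert representations.

```python
# Toy per-residue features (hydrophobicity proxy, charge) -- invented values
# standing in for learned per-residue embeddings from a protein language model.
TOY_EMB = {"K": (0.0, 1.0), "E": (0.0, -1.0), "L": (1.0, 0.0), "G": (0.2, 0.0)}

def protein_vector(seq):
    """Mean-pool per-residue embeddings into one fixed-length protein vector."""
    rows = [TOY_EMB[aa] for aa in seq]
    return tuple(sum(r[d] for r in rows) / len(rows) for d in range(2))

def pair_features(seq_a, seq_b):
    """Concatenate the two pooled vectors; a downstream classifier (e.g.
    logistic regression or an MLP) would be trained on features like these
    to score whether the two proteins interact."""
    return protein_vector(seq_a) + protein_vector(seq_b)

pair_features("KE", "GG")
```

Published tools differ in the pooling and classifier choices, but this embed-pool-concatenate-classify pattern underlies most of the sequence-based predictors cited above.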

Furthermore, protein–nucleic acid interaction prediction plays a fundamental role in structural biology research, as these interactions are essential for crucial biological processes, including gene expression regulation and DNA repair, and are indispensable for maintaining cellular functions and organismal homeostasis. GeneBERT, developed by Ruan et al., has demonstrated significant efficacy in predicting differential gene expression through the analysis of histone modification data. GeneBERT enables the investigation of complex interactions between histone modifications and genomic expression, thereby facilitating advances in understanding gene expression regulatory networks and epigenetic modifications [60]. In RNA-protein interaction research, BERT-RBP, a pre-trained model based on the BERT architecture, demonstrates the capability to identify transcription regions from RNA sequences and predict RNA secondary structures, ultimately enabling accurate predictions of RNA-binding protein (RBP) interactions in human genomic data. These findings provide substantial evidence supporting the application of BERT’s fine-tuning mechanism to diverse RNA-related challenges [61]. In the field of immunology research, LLMs have made significant breakthroughs in predicting the interactions between T-cell receptors (TCRs) and their corresponding binding epitopes. TCR-BERT, a BERT-based large language model, has demonstrated exceptional performance in predicting recognition between TCRs and antigenic epitopes, thereby significantly improving the accuracy of antigen–antibody binding analysis [62]. The unsupervised cascaded ProtBert + BiLSTM model integrates features extracted from the pre-trained ProtBert model with a bidirectional long short-term memory (BiLSTM) architecture for predicting human leukocyte antigen (HLA) binding peptides, demonstrating potential as a pre-training framework for other protein sequence prediction tasks [63]. 
In conclusion, LLMs have exhibited remarkable research achievements in molecular docking and interaction prediction, specifically demonstrating significant performance improvements in PPI prediction. With the ongoing expansion of data scales and continuous advancement in algorithmic optimization, this technology demonstrates substantial promise for precise modeling and practical applications in complex biological systems.

Multi-omics data analysis

Transcriptomic data analysis and integration

Transcriptomic analysis represents a comprehensive systematic approach that provides essential theoretical frameworks for researchers to elucidate cellular regulatory mechanisms, disease pathogenesis, and organismal responses to environmental stimuli through the detailed examination of dynamic gene expression patterns. LLMs, with their advanced feature extraction and pattern recognition capabilities, demonstrate significant advantages in analyzing gene expression profiles, performing functional annotations, and interpreting biological significance.

In epigenetic modification prediction, the BERT-based iDNA-ABT deep learning model effectively learns sequence features from multi-species data, demonstrating robust generalization capabilities in cross-species DNA methylation site prediction [64]. The GeneBERT model, which integrates BERT, RoBERTa, and XGBoost architectures, has demonstrated significant accuracy in predicting histone modification levels and gene expression regulation, thus enhancing our understanding of epigenetic modifications [60]. MuLan-Methyl represents a significant advancement in DNA methylation site prediction, exhibiting both superior performance on standard benchmark datasets and exceptional capability to identify distinct methylation patterns across diverse species [65].

Regarding regulatory element identification, LLM-based prediction tools have demonstrated significant technological advances. BERT-TFBS, which utilizes the pre-trained DNABERT-2 model framework, has substantially enhanced both the prediction accuracy of transcription factor binding sites and the identification efficiency of promoter sequences [66]. iEnhancer-BERT has achieved substantial advances in enhancer identification and functional strength prediction, offering robust computational tools for the comprehensive analysis of gene expression regulatory networks, thereby advancing the application of LLMs in transcriptomics research [67].

In the field of cell type annotation, LLMs have substantially enhanced the accuracy and computational efficiency of cell classification through the systematic extraction and analysis of cellular gene expression features. The GscBERT model, developed based on the BERT architecture, exhibits exceptional performance in cell type annotation and novel cell identification tasks through pre-training on large-scale unlabeled single-cell RNA sequencing (scRNA-seq) data, while simultaneously enhancing model stability and interpretability [68]. As a foundation model for single-cell biology, scGPT enables the systematic extraction of key biological features from genes and cells, and effectively accomplishes multiple downstream tasks, including cell type annotation, batch integration, and multi-omics integration through transfer learning [69]. Building upon these advances, the scELMo model developed by Liu et al. implements zero-shot learning, facilitating cell classification and batch effect correction without additional training, while achieving superior performance with reduced computational resource requirements [70].
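
A minimal zero-shot-style annotation step of the kind scELMo performs can be sketched as nearest-centroid assignment in embedding space. The centroid vectors below are invented toy values, and the cosine-similarity rule is a simplification of the published methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def annotate(cell_embedding, reference_centroids):
    """Assign a cell to the reference cell-type centroid with the highest
    cosine similarity in embedding space -- no additional training needed,
    which is the appeal of zero-shot annotation."""
    return max(reference_centroids,
               key=lambda t: cosine(cell_embedding, reference_centroids[t]))

centroids = {"T cell": (0.9, 0.1, 0.0), "B cell": (0.1, 0.9, 0.0)}
annotate((0.8, 0.2, 0.1), centroids)  # -> "T cell"
```

In practice the embeddings come from a pre-trained single-cell foundation model and the reference centroids from an annotated atlas, but the assignment step is this simple.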

LLMs have demonstrated remarkable capabilities in processing complex transcriptomic data, particularly in the interpretation of mRNA expression profiles and gene regulatory networks, thereby offering researchers robust analytical support for transcriptome information [71]. GP-GPT, pioneered by Lyu et al., represents the first large language model specifically engineered for genetic phenotype representation and genomic association analysis, constituting a significant advancement in medical genetics research and gene-phenotype association studies [72]. DeepGene Transformer, which implements an end-to-end deep learning approach, achieves precise tumor classification through biomarker identification across diverse cancer subtypes, thus effectively addressing the challenges inherent in high-dimensional gene expression data analysis [73]. Lomicsv1.0, a Python-based bioinformatics toolkit, harnesses the advantages of LLM technology to facilitate efficient and accurate transcriptome analysis and pathway gene set construction [74]. Moreover, LLMs, exemplified by ChatGPT, exhibit substantial application value and development potential in the interpretation of biological pathways and the construction of gene association networks [75].

In conclusion, LLMs have made unprecedented progress in transcriptomics analysis by leveraging deep learning and NLP technologies, thus enabling multidimensional, comprehensive interpretation of gene expression data, which has significantly enhanced our systematic understanding of gene regulatory networks and biological functions.

Advanced proteomics data analysis and applications

Proteomics research, a fundamental discipline in molecular biology, systematically investigates the structural characteristics of functional proteins, elucidates their biological mechanisms, and reveals the pathogenic pathways underlying various diseases. LLMs, powered by sophisticated deep learning algorithms, have revolutionized proteomics research by substantially improving the efficiency of complex protein sequence analysis, enhancing functional prediction accuracy, and providing robust computational frameworks for bioinformatics analysis and drug discovery.

In the field of protein recognition and classification, researchers initially adopted the n-gram model from NLP to predict the structural and functional properties of uncharacterized proteins [76]. Building upon ESM2, the innovative pLM4Alg model pioneered the prediction of sequences containing residue gaps and non-standard amino acid residues, demonstrating remarkable accuracy in allergenic protein/peptide prediction and achieving superior performance across multiple benchmark evaluations [77]. Within microbial and animal proteomics research, investigators have established numerous innovative pre-trained protein language models (PLMs), among which ProtTrans stands as a prominent example [78]. The POOE model, incorporating sequence embedding techniques derived from ProtTrans, exhibits exceptional accuracy in oomycete effector prediction, offering novel insights into the functional mechanisms of effectors in plant–pathogen interactions [78].

In the field of specific protein function prediction, investigators have developed several specialized computational models. The mtx-COBRA model, which is based on the ESM architecture, has significantly enhanced the prediction accuracy of bacterial protein subcellular localization (SCL) through comprehensive protein sequence analysis and can effectively classify bacterial proteins with unknown SCL [79]. Additionally, the TooT-PLM-ionCT framework systematically integrates the advantages of six distinct protein language models, including ProtBERT, ProtBERT-BFD, ESM-1b, ESM-2 (650M parameters), and ESM-2 (15B parameters), achieving substantial advances in ion channel and membrane protein recognition while demonstrating excellent robustness and generalization capabilities in bioinformatics-related tasks [80]. Furthermore, the UESolDS system incorporates three PLMs—specifically ProteinBERT, ESM2, and ProtTrans—to establish a multi-model ensemble system for protein solubility prediction, achieving significant predictive performance across multiple independent test sets [81].

In conclusion, LLMs have demonstrated remarkable breakthroughs in proteomics data analysis, fundamentally enhancing the efficiency and accuracy of bioinformatics research through the optimization of key processes, including protein classification, functional annotation, and homology detection.

Computational approaches in drug discovery and design

Advanced virtual screening and drug repositioning strategies

NLP technologies have emerged as powerful tools in contemporary drug development and optimization, substantially enhancing both the efficiency and success rates of drug discovery while simultaneously establishing themselves as a pivotal research direction in modern pharmaceutical development. LLMs leverage their robust computational capabilities to comprehensively analyze drug–target interaction (DTI) networks, thereby facilitating efficient predictive methodologies for novel drug discovery and optimization processes. Within molecular generation research, K-BERT, a BERT-based language model architecture, demonstrates significant capability in extracting chemical information from SMILES notation and interpreting molecular representations, thus exhibiting significant potential for molecular property prediction and practical pharmaceutical applications [82]. SMILES-BERT, an advanced BERT-based language model, effectively interprets and accurately predicts molecular properties through its comprehensive analysis of chemical structure representations [83]. The ChemBERTa model utilizes large-scale self-supervised pre-training on carefully curated SMILES datasets, thereby establishing an innovative framework for molecular representation learning and accurate property prediction [84].
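
Before any of these models see a molecule, the SMILES string must be split into tokens; a common atom-level scheme keeps two-character elements (Cl, Br) and bracketed atoms whole. The regular expression below is a simplified sketch, as each published model defines its own vocabulary.

```python
import re

# Simplified atom-level SMILES tokenizer: multi-character elements and
# bracket atoms are kept as single tokens; rings, bonds, and branches
# become one-character tokens. Illustrative, not any model's exact scheme.
SMILES_TOKEN = re.compile(
    r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|[=#$/\\().+\-]|\d|%\d{2}"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify nothing was lost
    assert "".join(tokens) == smiles, "untokenizable character in SMILES"
    return tokens

tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin -> 21 tokens
```

The resulting token sequence is then embedded and processed exactly as word tokens are in a natural-language Transformer.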

In the field of virtual drug screening, MolGPT, a GPT-based large language model, utilizes a transformer decoder architecture trained through next-token prediction tasks with self-attention mechanisms, thereby demonstrating its capability to selectively optimize multiple properties of generated molecular compounds [85]. Within the domain of medical data analysis, researchers have developed the Multi-Role ChatGPT Framework (MRCF), which has successfully implemented both a user-friendly database system and novel analytical tools to facilitate efficient drug repositioning studies [86]. LLMs have exhibited substantial promise in virtual screening and drug repositioning applications, and despite the nascent stage of related research, their implementation in drug development is expected to progressively evolve and broaden in parallel with advances in deep learning technologies.
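
Decoder-only generation of the kind MolGPT performs reduces to repeatedly emitting the next token given the current prefix. The lookup table below is a deterministic toy stand-in for the learned Transformer distribution; the token strings and the `<eos>` marker are illustrative.

```python
# Greedy next-token decoding from a prefix -> next-token table. MolGPT learns
# this mapping as a probability distribution with a Transformer decoder and
# samples from it; the table here is hand-written purely for illustration.
NEXT = {
    "": "C", "C": "C", "CC": "(", "CC(": "=",
    "CC(=": "O", "CC(=O": ")", "CC(=O)": "<eos>",
}

def generate(max_len=16):
    prefix = ""
    while len(prefix) < max_len:
        token = NEXT.get(prefix, "<eos>")  # unseen prefix: stop generating
        if token == "<eos>":
            break
        prefix += token
    return prefix

generate()  # -> "CC(=O)"
```

Replacing the table with a learned model and greedy selection with temperature sampling yields the property-conditioned molecular generation described above.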

DTI prediction and analysis

DTI prediction represents an advanced computational methodology that integrates bioinformatics approaches to systematically investigate and assess the binding characteristics and interaction mechanisms between drug molecules and their corresponding biological targets. Through sophisticated mining and comprehensive analysis of extensive genomic and proteomic datasets, LLMs can effectively identify and validate potential therapeutic targets, thus providing critical support for novel drug development [71]. DTI-BERT, a specialized BERT-based architecture, demonstrates superior capability in identifying DTIs through the analysis of target protein sequence information, achieving significant prediction accuracy across critical target families, including GPCRs, ion channels, enzymes, and nuclear receptors [87]. PharmBERT, developed by Valizadeh Aslani et al., represents an advanced large language model extensively pre-trained on pharmaceutical label data, capable of interpreting multidimensional bioinformatics knowledge, particularly DTIs, through sophisticated analysis of drug labels, thereby offering crucial evidence for clinical therapeutic decisions [88]. TransDTI, a sophisticated pre-trained language model based on the Transformer architecture, facilitates sequence-based DTI prediction and activity classification through extensive training on large-scale DTI datasets [89]. While the implementation of NLP technologies in DTI prediction remains in its developmental phase, LLMs have demonstrated remarkable potential in elucidating drug mechanisms and biological effects, showing promise to deliver unprecedented insights and revolutionary approaches for innovative drug development.

Prediction and assessment of adverse drug effects

Through systematic analysis of extensive medical literature and clinical data repositories, LLMs demonstrate capabilities not only in effectively identifying and predicting potential adverse drug reactions (ADRs) but also in providing a crucial theoretical framework for comprehensive drug safety assessments. Researchers have developed an innovative algorithmic framework for adverse drug effect prediction that integrates advanced ML models with NLP techniques. This computational framework facilitates the extraction of drug–gene association patterns from literature abstracts through sophisticated text mining algorithms, while simultaneously analyzing their interaction networks using state-of-the-art NLP technologies [90]. Subsequent investigations have successfully integrated large-scale observational healthcare databases with LLMs, establishing an automated high-throughput drug surveillance platform that enables precise identification of potential therapeutic interventions and ADRs [91]. While the application of LLMs in predicting ADRs remains in its nascent stage, their significant clinical utility and extensive potential applications suggest this research direction will continue to garner substantial academic interest, potentially yielding more robust theoretical frameworks and sophisticated decision support systems for clinical pharmacovigilance.

Biomedical literature mining

Named entity recognition

Named entity recognition (NER) is a fundamental task in NLP that aims to automatically identify and classify entities with specific semantic or referential meanings within textual data. Within the biomedical domain, these entities predominantly include specialized terminology, such as genes, proteins, diseases, and pharmaceutical compounds. During the initial phase of biomedical NER research, investigators developed the MicrO ontology system, which established systematic associations between microbial categories and biological entities through a network of logical axioms, providing essential ontological support for bioinformatics tools [92]. BioBERT, a BERT-based pre-trained language model, achieved significant advances in biomedical named entity recognition tasks through extensive training on comprehensive biomedical corpora, which substantially improved the interpretation of complex biomedical texts [93]. ClinicalBERT exhibited superior performance when optimized for specific annotation patterns in clinical texts, demonstrating enhanced accuracy and efficiency in identifying biomedical entities, particularly in gene expression regulatory networks and molecular interactions [94]. Galactica, a domain-specific large language model developed for the biomedical field, demonstrates superior capability in accurately identifying and extracting key information related to protein interactions, metabolic pathways, and gene regulatory networks from extensive biomedical literature. This addresses the limitations of existing databases in terms of biomedical data integrity and facilitates researchers’ comprehensive understanding of biological systems and disease mechanisms [95]. The development of LLM-based biomedical concept extraction methods has substantially enhanced the accuracy and reliability of entity recognition [96], establishing a novel technical paradigm in this field. 
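
Downstream of any of these NER models, token-level BIO tags must be decoded into entity spans; a standard post-processing sketch, not tied to any particular model, follows.

```python
def bio_to_entities(tokens, tags):
    """Decode token-level BIO tags (the usual output format of BioBERT-style
    NER heads) into (entity_text, entity_type) spans. Stray I- tags that do
    not continue an open entity are dropped."""
    entities, current, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new entity begins
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == cur_type:
            current.append(tok)                   # entity continues
        else:                                     # O tag or invalid I- tag
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [], None
    if current:                                   # flush a trailing entity
        entities.append((" ".join(current), cur_type))
    return entities

bio_to_entities(
    ["BRCA1", "mutations", "cause", "breast", "cancer"],
    ["B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"],
)  # -> [("BRCA1", "GENE"), ("breast cancer", "DISEASE")]
```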
In conclusion, the integration of LLMs into biomedical named entity recognition has not merely enhanced recognition accuracy and processing efficiency, but has also provided robust technical support for automated biomedical literature analysis and knowledge discovery, thereby accelerating the advancement of bioinformatics research.

Biological relation extraction

Relation extraction in biomedical literature encompasses the automated identification and extraction of semantic associations between biological entities from unstructured or semi-structured textual data, representing a fundamental task in biomedical text mining. LLMs have substantially enhanced the precision and robustness of relation extraction tasks through their advanced capabilities in deep semantic comprehension and contextual analysis of relationships. GenCLiP 3, serving as a sophisticated retrieval system, effectively identifies keywords from biological databases, thereby enhancing researchers’ understanding of functionally associated genes and their regulatory networks [97]. Across diverse biological tasks, including drug–drug interactions, PPIs, and protein–bioentity relationship classification, the biological relation extraction (BRE) model LBERT consistently outperforms conventional deep learning approaches [98]. MarkerGenie, an advanced deep learning–based NLP model designed for multidimensional association prediction, demonstrates superior performance in biomedical entity relationship recognition [99]. GENEVIC, as a comprehensive computational tool in biomedical research, not only facilitates automated analysis, retrieval, and visualization of domain-specific knowledge but also incorporates advanced functions for generating protein interaction networks, performing gene set enrichment analysis, and conducting systematic literature searches, thereby effectively bridging the gap between high-throughput gene data generation and biomedical knowledge discovery [100]. BioBERT likewise outperforms previous state-of-the-art models across multiple biomedical text mining tasks, particularly biomedical relationship extraction and entity recognition [93]. 
In conclusion, LLMs, by leveraging their sophisticated contextual understanding and advanced semantic analysis capabilities, offer significant advantages in biomedical literature parsing and complex entity relationship extraction, thereby providing robust and systematic support for the advancement of bioinformatics research.

Advantages and implications of LLM applications in bioinformatics

LLMs, representing a paradigm-shifting breakthrough in artificial intelligence, have demonstrated exceptional potential in advancing bioinformatics research. Recent studies have demonstrated that LLMs, leveraging self-supervised learning and end-to-end learning paradigms, effectively process high-dimensional biological data and precisely identify contextual relationships and complex semantic patterns within sequences, while simultaneously facilitating cross-modal learning and knowledge transfer, thereby substantially enhancing bioinformatics data processing efficiency (Fig. 2).

Figure 2.


Key advantages of large language models in bioinformatics research. The implementation of large language models (LLMs) in bioinformatics demonstrates several distinct advantages, primarily in their capability to process extended sequences and high-dimensional data, capture complex semantic and contextual information, perform cross-modal learning and knowledge transfer, reduce manual feature engineering through end-to-end learning, and leverage massive unlabeled data through self-supervised learning. The processing of extended sequences and high-dimensional data is facilitated through advanced sequence tokenization techniques, integrated dimensionality reduction technologies, autoencoder architectures, and multi-head attention mechanisms. Through self-supervised learning approaches and transformer-based architectures, particularly bidirectional encoder representations from transformers (BERT), LLMs demonstrate superior capability in capturing intricate semantic relationships and contextual information. Moreover, LLMs exhibit remarkable efficacy in integrating and processing multimodal data, including textual, visual, and audio inputs, while achieving efficient cross-corpus transfer. The self-supervised learning paradigm, leveraging vast quantities of unlabeled data, utilizes sophisticated multilayer neural network architectures to automatically process complex biological data. Additionally, end-to-end learning approaches significantly reduce the necessity for manual feature engineering, effectively addressing the limitations of traditional supervised learning’s dependence on manually annotated data, particularly in applications such as protein sequence prediction and nucleotide sequence analysis. This figure was created with BioRender.com (accessed on 15 May 2025). LLMs, large language models; BERT, bidirectional encoder representations from transformers.

Enhanced capabilities for processing long-sequence and high-dimensional biological data

LLMs possess distinctive advantages in processing biological sequence data, leveraging their superior contextual modeling capabilities and robust feature extraction mechanisms to establish novel paradigms for bioinformatics analysis. By utilizing advanced tokenization techniques, LLMs effectively decompose extensive biological sequences into processable fundamental units, thereby substantially enhancing both computational efficiency and model performance [101]. LLMs effectively integrate dimensionality reduction techniques and autoencoder architectures to extract refined low-dimensional data representations, significantly improving both analytical efficiency and accuracy. To address the pervasive noise and redundancy challenges in high-dimensional biological data, LLMs implement selective information extraction through multi-head attention mechanisms [102], facilitating the capture of long-range dependencies in biological sequences and enabling precise interpretation of complex sequence features.
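The tokenization step described above can be illustrated with the overlapping k-mer scheme commonly used for nucleotide sequence models; the function name and default parameters below are illustrative rather than those of any particular model:

```python
def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Decompose a nucleotide sequence into overlapping k-mer tokens,
    the processable fundamental units a sequence model operates on."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCGTAC")  # overlapping 3-mers
```

For "ATGCGTAC" with k = 3 and stride 1, this yields the six tokens ATG, TGC, GCG, CGT, GTA, TAC; a larger stride produces non-overlapping tokens and shorter input sequences for the model.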

Advanced semantic and contextual information processing in biological data

LLMs, through their sophisticated deep neural network architectures and advanced self-attention mechanisms, demonstrate exceptional capabilities in systematically analyzing complex semantic relationships and contextual information inherent in biological sequences and textual data. BERT, underpinned by the Transformer architecture, effectively captures complex semantic information within biological sequences and textual data through advanced contextualized word embedding methodologies [20]. These comprehensive capabilities have facilitated unprecedented advances in critical bioinformatics applications, specifically in nucleotide sequence prediction [103] and protein sequence prediction [54]. In diverse domains encompassing biological text mining, process interpretation, and disease mechanism inference, LLMs exhibit remarkable analytical prowess, providing researchers with sophisticated computational tools for elucidating novel biological regulatory mechanisms and identifying potential therapeutic targets.
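The self-attention mechanism underlying these contextual embeddings can be sketched in miniature. The toy example below omits the learned query/key/value projections of a real transformer layer (a deliberate simplification) but shows the essential property: each token's output depends on every other token in the sequence.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Toy single-head self-attention (no learned projections): each token
    embedding is rewritten as a weighted mix of all tokens, so identical
    tokens receive different, context-dependent representations."""
    d = len(X[0])
    out = []
    for query in X:
        # Scaled dot-product scores of this query against every key
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        # Output = attention-weighted combination of all value vectors
        out.append([sum(w * value[i] for w, value in zip(weights, X)) for i in range(d)])
    return out
```

Because the output row for each token is a convex combination of the whole sequence, the same amino acid or nucleotide token embeds differently depending on its context, which is precisely what BERT-style contextualized embeddings exploit.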

Cross-modal learning and knowledge transfer in bioinformatics

LLMs exhibit distinctive capabilities in cross-modal learning and knowledge transfer, particularly excelling at the efficient integration and comprehensive understanding of multimodal data. From an implementation standpoint, while single-modal data representation faces inherent limitations, LLMs demonstrate superior capability for integrating and processing multimodal data, including text, images, and audio, encompassing both structured and unstructured information [104]. In the context of knowledge transfer, pre-trained models facilitate efficient transfer learning across diverse corpora, simultaneously preserving the beneficial characteristics of the source corpus while substantially reducing data requirements and computational time, thereby optimizing model generalization and processing efficiency. The synergistic integration of cross-modal learning and knowledge transfer methodologies has substantially enhanced model interpretability and facilitated significant advances in drug development, disease diagnosis, and personalized medicine, thereby establishing a robust theoretical framework and technical foundation for the comprehensive application of LLMs in bioinformatics.
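As a minimal illustration of multimodal integration, a late-fusion scheme simply concatenates fixed-length embeddings produced by separate per-modality encoders into one joint representation for a downstream predictor; the modality names and vectors below are hypothetical:

```python
def late_fusion(modalities: dict[str, list[float]]) -> list[float]:
    """Concatenate per-modality embedding vectors in a fixed (sorted) order,
    yielding a single joint representation of structured and unstructured inputs."""
    return [v for name in sorted(modalities) for v in modalities[name]]

joint = late_fusion({"sequence": [0.2, 0.7], "image": [0.1], "text": [0.9]})
```

Real multimodal LLMs use far richer fusion (e.g. cross-attention between modalities), but even this sketch shows why a shared representation space is the prerequisite for transferring knowledge across modalities.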

End-to-end learning reduces manual feature engineering

End-to-end learning represents an advanced machine learning (ML) paradigm that leverages a single deep learning model to achieve direct mapping from raw input to target output, thereby eliminating the need for complex feature engineering and intermediate processing steps inherent in traditional approaches. This automated feature learning mechanism significantly enhances research efficiency and minimizes human resource requirements while, crucially, reducing potential subjective biases inherent in manual feature selection, thus strengthening model objectivity and reliability. In summary, the end-to-end learning framework establishes a robust technical foundation for LLM applications in bioinformatics, facilitating substantial advances in automated and intelligent biological data analysis through iterative model optimization and performance enhancement.
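A minimal sketch of the end-to-end idea, assuming a toy binary sequence classifier: the raw sequence is the only input, and a one-hot encoding plus learned per-position weights replace any hand-crafted features. All names and the weight values are illustrative.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode the raw sequence directly; no manually engineered features."""
    return [[1.0 if base == symbol else 0.0 for symbol in ALPHABET] for base in seq]

def predict(seq, weights, bias=0.0):
    """Map a raw sequence straight to a probability: one-hot encoding,
    a per-position linear layer, and a sigmoid, trained jointly end to end."""
    x = one_hot(seq)
    z = bias + sum(w_row[i] * row[i]
                   for row, w_row in zip(x, weights)
                   for i in range(len(ALPHABET)))
    return 1.0 / (1.0 + math.exp(-z))
```

In a real end-to-end model the weights (here passed in by hand) would be learned by backpropagation from labeled outcomes, with no human deciding which sequence motifs constitute "features".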

Advancing bioinformatics through self-supervised learning with large-scale unlabeled data

Self-supervised learning, as an emerging machine learning paradigm, inherently generates supervisory signals from the intrinsic structure of data, thereby facilitating efficient model optimization. Within the bioinformatics domain, self-supervised learning demonstrates remarkable capability in extracting embedded linguistic features and semantic patterns from unlabeled data, thus enabling comprehensive interpretation of biological sequences. Deep learning frameworks based on self-supervised learning have achieved breakthrough progress in areas such as protein sequence prediction and nucleotide sequence prediction [103]. These frameworks effectively overcome technical challenges in biological sequence analysis, including context dependency, low signal-to-noise ratio, and scarcity of annotated data [103], while simultaneously achieving leading performance across multiple evaluation metrics. In summary, the application of self-supervised learning techniques has enabled LLMs to overcome the constraints of annotated data, thereby demonstrating excellent transfer learning capabilities in diverse tasks, significantly improving performance in specific bioinformatics tasks, and highlighting their adaptability and computational efficiency in processing complex biological data.
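The self-generated supervisory signal described here can be sketched as masked-token prediction: labels come from the data itself, with no human annotation. The 15% default and the `[MASK]` token follow BERT-style convention; the function itself is illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Build a masked-LM training pair from unlabeled tokens: a corrupted
    input sequence, plus labels recording what the model must reconstruct."""
    rng = random.Random(seed)  # seeded for reproducible corruption
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(token)   # supervisory signal taken from the data itself
        else:
            inputs.append(token)
            labels.append(None)    # position excluded from the loss
    return inputs, labels
```

Training the model to recover the hidden tokens from their context is what forces it to internalize the statistical "grammar" of protein or nucleotide sequences without any annotated examples.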

Current challenges and limitations of LLMs in bioinformatics applications

Inherent specificity and complexity of biological data analysis

While LLMs demonstrate considerable promise in bioinformatics, they encounter numerous technical challenges and practical limitations in real-world applications. Bioinformatics data exhibit substantial heterogeneity, comprising multidimensional information such as genomic sequences, protein structures, and metabolic pathways, which presents significant challenges for model integration and processing capabilities. Empirical studies have revealed that LLMs exhibit significant limitations in their interpretational accuracy, particularly when processing quantitative analyses involving visual features and color perception [105]. Therefore, in practical bioinformatics applications, despite the significant advantages offered by LLMs, careful handling of the specificity and complexity of biological data remains essential to ensure the reliability and utility of analytical results.

Challenges in model interpretability and transparency

Model interpretability refers to the capacity to analyze and validate the reasoning processes and predictive outcomes of deep learning systems in a human-comprehensible and scientifically rigorous manner. LLMs inherently exhibit “black box” characteristics, which are primarily manifested through opaque decision-making processes and limited interpretability of feature representations. This opacity substantially limits their capability to generate reliable mechanistic explanations for biological phenomena and clinical outcomes. Consequently, enhancing the interpretability of LLMs in bioinformatics applications is imperative for validating scientific findings, facilitating clinical translation, and establishing robust, traceable decision-making frameworks.

Computational resource constraints and infrastructure challenges

The implementation of LLMs in bioinformatics faces substantial technical constraints due to their extensive computational resource requirements. Multiple factors, including pre-training dataset limitations, fine-tuning data quality concerns, and insufficient cross-modal datasets, collectively impede the advancement of LLMs in bioinformatics applications [106]. These models, which comprise hundreds of millions to billions of parameters, necessitate robust computing clusters and extensive storage infrastructure for both training and inference operations. Consequently, achieving optimal resource utilization while maintaining model performance remains a critical technical challenge for advancing the widespread adoption of LLMs in bioinformatics.

Quality control and bias mitigation in training data

In the implementation of LLMs within bioinformatics, the management of training data quality and the mitigation of inherent biases represent fundamental technical challenges that demand immediate attention. Biological data exhibit substantial heterogeneity and originate from diverse sources, encompassing multiple laboratories and database systems across the global research community. These multi-source heterogeneous characteristics result in considerable variations in data quality, which manifest primarily as data noise, information gaps, systematic biases, and limited data portability [107]. Quality-related challenges predominantly manifest as inconsistency in global sequencing standards, incomplete sequence metadata, and systematic biases stemming from inadequate quality control protocols [108]. Empirical studies have demonstrated that inter-database biases substantially influence the accuracy of gene prediction algorithms [109]. Consequently, the establishment of standardized, high-quality, diverse databases and the development of robust bias correction methodologies are imperative for optimizing the reliability and efficacy of LLMs in bioinformatics applications.

Emerging directions and future perspectives

As discussed in Section “Current Challenges and Limitations of LLMs in Bioinformatics Applications”, LLMs have exhibited remarkable potential and extensive applications in bioinformatics; however, when applied to various bioinformatics tasks, substantial challenges persist that require systematic investigation. This section systematically delineates the future developmental trajectories of LLMs (Fig. 3), encompassing multimodal fusion learning, knowledge-driven architectural design, model optimization and efficient inference, explainable artificial intelligence frameworks, seamless integration with experimental biology, strengthening of ethical and privacy protection mechanisms, and advancement of interdisciplinary collaboration and open science initiatives.

Figure 3.

Future development directions of large language models in bioinformatics. The future development and applications of large language models (LLMs) in bioinformatics encompass several key areas: multimodal fusion learning, knowledge-guided architectural design, model optimization for lightweight deployment, efficient inference, development of explainable artificial intelligence systems, deep integration with experimental biology, enhancement of ethical and privacy protection mechanisms, and promotion of interdisciplinary collaboration and open science. Specifically, multimodal fusion learning can be advanced through multidimensional deep analysis, systematic integration of multi-omics data, and enhancement of model generalization capabilities. Regarding knowledge-guided architectural design, integration of biomedical ontologies and knowledge graphs into model frameworks is essential, alongside the development of knowledge distillation techniques for constructing adaptive learning systems with automatic knowledge base updating capabilities. Model optimization and efficient inference can be achieved through specialized attention mechanisms, model distillation techniques, and implementation of federated learning strategies. In the development of explainable artificial intelligence (AI) systems, advanced visualization techniques and counterfactual explanation methods warrant investigation, coupled with the development of interactive interpretation systems to enhance model transparency. To facilitate deep integration between LLMs and experimental biology, intelligent experimental design systems should be developed, incorporating experimental feedback mechanisms and comprehensive evaluation frameworks that bridge computational predictions with experimental outcomes. 
For strengthening ethical and privacy protection mechanisms, advanced technologies including federated learning, bias mitigation, and differential privacy should be explored, while establishing robust ethical review systems and standardized regulatory frameworks at the institutional level. In the context of interdisciplinary collaboration and open science, development of cross-disciplinary research tools should be prioritized, along with the establishment of open-access biological databases, standardized evaluation benchmarks, and comprehensive open-source platforms. This figure was created with BioRender.com (accessed on 15 May 2025).

Advances in multimodal fusion learning for biological data integration

The evolution of LLMs is increasingly focusing on the sophisticated integration of multimodal data, aiming to comprehensively characterize and understand the intrinsic complexity of biological systems. Researchers are advancing through the development of comprehensive end-to-end intelligent systems that seamlessly integrate diverse data types, including biological sequences, molecular structures, medical images [105], and clinical information, thereby enabling multi-scale and hierarchical analysis of complex biological datasets. Based on these advances, the application of multimodal fusion techniques will substantially improve the accuracy of biomarker discovery and disease diagnosis, thereby driving innovative developments in the field of bioinformatics.

Knowledge-guided architectural design for LLMs

The next generation of LLMs will prioritize the comprehensive integration of biological domain knowledge systems and the establishment of knowledge-driven intelligent analytical frameworks. Through the systematic incorporation of biomedical ontologies and knowledge graphs into model architectures [96], these systems can leverage enriched semantic information to substantially enhance the precision of biological data analysis. Concurrently, a comprehensive investigation of knowledge distillation methodologies facilitates the efficient transfer of domain expertise to lightweight models, optimizing their operational efficiency in practical applications [110]. Ultimately, the implementation of adaptive learning systems with autonomous knowledge base updating and expansion capabilities will ensure that LLMs continuously evolve in parallel with cutting-edge biological research, maintaining both the relevance and sophistication of their analytical capabilities.
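The knowledge distillation step mentioned above can be sketched as training a lightweight student against the teacher's temperature-softened output distribution, a common (Hinton-style) formulation; the logits and temperature below are hypothetical values for illustration:

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: higher temperatures expose the teacher's
    'dark knowledge' about relative similarities among the non-top classes."""
    m = max(l / temperature for l in logits)
    exps = [math.exp(l / temperature - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened targets; minimized when the two distributions match."""
    p = soften(teacher_logits, temperature)
    q = soften(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

Minimizing this loss (usually mixed with the ordinary hard-label loss) transfers the large model's learned behavior into a compact student suitable for practical deployment.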

Lightweight model architecture and efficient inference strategies

The development of lightweight and efficient model architectures has emerged as a critical frontier in bioinformatics research, particularly when addressing the demands of practical applications. Through the comprehensive investigation of model distillation techniques and knowledge transfer methodologies, researchers can develop models that integrate high performance with lightweight characteristics [111], thereby optimizing computational efficiency without compromising predictive accuracy. Moreover, the exploration of federated learning paradigms and distributed computing frameworks offers promising approaches to enhancing large-scale biological data processing capabilities, thus facilitating robust computational support for massive-scale data analytics.

Advances in explainable artificial intelligence for bioinformatics

Enhancing model interpretability, a fundamental direction for future research, aims to strengthen life sciences researchers’ comprehension of and confidence in model-generated predictions. Model interpretability encompasses transparent explanatory mechanisms for decision-making processes and predictions, in which LLMs demonstrate substantial advantages that have markedly enhanced their practical applications in bioinformatics research [112]. The advancement of innovative visualization techniques facilitates the intuitive representation of model decision mechanisms, consequently enhancing researchers’ comprehension of underlying model principles [113]. Additionally, the development of interactive interpretation systems, particularly Application Programming Interface (API)-based localized bioinformatics assistance platforms [114], enables researchers to actively explore and validate model reasoning processes, optimize the user experience of bioinformatics tools [115], enhance human–computer interaction efficiency, and promote the comprehensive development of bioinformatics research.

Synergistic integration of LLMs with experimental biology

The synergistic integration of LLMs with experimental biology establishes comprehensive closed-loop research systems, facilitating a paradigm shift from traditional experience-driven methodologies to data-informed hypothesis-driven approaches. Initially, the implementation of AI-driven experimental design and optimization systems enables enhanced experimental efficiency and analytical precision while facilitating the development of robust, standardized experimental protocols. Ultimately, the development of integrated evaluation frameworks that combine computational predictions with experimental outcomes enhances the bidirectional relationship between in silico modeling and experimental validation, advancing bioinformatics research toward more systematic and standardized methodologies.

Advancing ethical guidelines and privacy safeguards in bioinformatics

As LLMs continue to proliferate and evolve within bioinformatics, the establishment of robust ethical frameworks and privacy protection protocols has emerged as an urgent imperative. At the technical level, privacy-preserving federated learning architectures present a promising methodological framework [116], ensuring the security and confidentiality of sensitive biomedical data through distributed computation that keeps data processing local. Furthermore, through systematic investigation of algorithmic fairness and bias mitigation techniques, researchers can develop quantifiable fairness metrics, comparative analyses of AI model categories for clinical applications, and bias mitigation strategies [117]. These approaches can effectively minimize performance disparities across diverse demographic groups, ensuring equitable technology deployment while fostering public trust and support for bioinformatics research.
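The federated learning idea can be sketched as FedAvg-style aggregation: each site trains locally on its own sensitive data, and only parameter vectors (weighted by local sample counts) leave the site. This is a simplified sketch; real deployments add secure aggregation and differential privacy on top.

```python
def federated_average(client_params, client_sizes):
    """Aggregate per-client parameter vectors into a global model as a
    sample-size-weighted average; raw patient data never leaves a client."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
            for i in range(dim)]

# Two hypothetical hospitals contribute locally trained weights:
global_model = federated_average([[1.0, 0.0], [3.0, 2.0]], client_sizes=[3, 1])
```

The server sees only the averaged parameters, which is why this architecture is attractive for multi-institutional biomedical collaborations where data sharing is restricted.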

Interdisciplinary collaboration and Open Science initiatives in LLM-based bioinformatics

In the context of rapidly evolving bioinformatics, interdisciplinary collaboration and open science principles have emerged as fundamental driving forces for innovative LLM applications. LLMs, serving as innovative interdisciplinary research tools, facilitate comprehensive collaboration among experts across multiple domains, including computer science [118], biology, and medicine, thus accelerating the development of novel methodologies and technological innovations through the synthesis of diverse disciplinary perspectives and knowledge frameworks. Currently, many key tasks in bioinformatics lack unified evaluation standards [119], which significantly limits the comparability of research outcomes. Therefore, developing standardized evaluation benchmarks (such as BioCoder [120], developed by Tang et al.) will facilitate objective comparisons between methods and help ensure the credibility and reproducibility of research results. Systematically advancing the development of open-source tools and platforms will lower technical barriers, broaden researcher participation, accelerate the innovation and adoption of bioinformatics technologies, and thereby promote the sustainable development of the discipline.

Conclusions

LLMs have revolutionized key bioinformatics fields, including sequence analysis, structural biology, omics, drug discovery, and literature mining through five core strengths: (1) handling complex biological data, (2) capturing semantic patterns, (3) cross-modal learning, (4) minimizing manual feature engineering, and (5) leveraging unlabeled data via self-supervised learning. Persistent challenges in this domain encompass data complexity, model interpretability, computational costs, and data quality issues. Future progress in the field hinges on multimodal integration, knowledge-enhanced architectures, efficient inference, explainable AI, experimental collaboration, ethical safeguards, and interdisciplinary partnerships. These advancements have the potential to transform bioinformatics research and precision medicine, necessitating coordinated efforts across technical, ethical, and social domains for responsible development.

Key Points

  • Large language models (LLMs) demonstrate significant potential across critical bioinformatics domains, including protein/DNA/RNA sequence analysis, structural biology, multi-omics data integration, drug discovery, and biomedical literature mining.

  • LLMs excel at processing long-sequence and high-dimensional biological data through advanced tokenization, attention mechanisms, and dimensionality reduction techniques.

  • Key limitations of LLMs include the complexity and heterogeneity of biological data, insufficient model interpretability, high computational resource requirements, biases in training data, and significant ethical and privacy concerns.

  • Future research directions for LLMs focus on multimodal fusion learning, knowledge-guided architectures, lightweight models for efficient deployment, explainable AI frameworks, and robust ethical safeguards.

Abbreviations

LLMs, Large language models

ELMo, Embeddings from language models

BERT, Bidirectional encoder representations from transformers

Kcr, protein lysine crotonylation

PTM, post-translational modification

GPCRs, G protein-coupled receptors

MSA, Multiple sequence alignment

ML, Machine learning

ESM, Evolutionary scale modeling

ESM-AA, ESM all-atom

PPI, Protein–protein interaction

NLP, Natural language processing

RBP, RNA-binding protein

TCRs, T-cell receptors

BiLSTM, Bidirectional long short-term memory

HLA, Human leukocyte antigen

scRNA-seq, single-cell RNA sequencing

SCL, subcellular localization

PLMs, pre-trained protein language models

MRCF, Multi-role ChatGPT framework

DTI, Drug–target interaction

ADRs, Adverse drug reactions

NER, Named entity recognition

BRE, Biological relation extraction

BioBERT, BERT for biomedical text mining

API, Application Programming Interface

Contributor Information

Anqi Lin, Donghai County People’s Hospital (Affiliated Kangda College of Nanjing Medical University); Department of Oncology, Zhujiang Hospital, Southern Medical University, Lianyungang 222000, China.

Junpu Ye, Donghai County People’s Hospital (Affiliated Kangda College of Nanjing Medical University); Department of Oncology, Zhujiang Hospital, Southern Medical University, Lianyungang 222000, China.

Chang Qi, Institute of Logic and Computation, Vienna University of Technology, Vienna, Austria.

Lingxuan Zhu, Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou 510282, China.

Weiming Mou, Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou 510282, China; Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Wenyi Gan, Department of Joint Surgery and Sports Medicine, Zhuhai People’s Hospital (Zhuhai hospital affiliated with Jinan University), Guangdong, China.

Dongqiang Zeng, Department of Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China; Cancer Center, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Foshan, 528000, China.

Bufu Tang, Department of Radiation Oncology, Zhongshan Hospital Affiliated to Fudan University, Shanghai, China.

Mingjia Xiao, Hepatobiliary Surgery Department, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People’s Hospital, China.

Guangdi Chu, Department of Urology, The Affiliated Hospital of Qingdao University, Qingdao, China.

Shengkun Peng, Department of Radiology, Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China, Chengdu, 610072, China.

Hank Z H Wong, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.

Lin Zhang, The School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3000, Australia; Suzhou Industrial Park Monash Research Institute of Science and Technology, Suzhou, Jiangsu 215000, China.

Hengguo Zhang, College & Hospital of Stomatology, Anhui Medical University, Key Laboratory of Oral Diseases Research of Anhui Province, Hefei, 230032, China.

Xinpei Deng, Department of Urology, State Key Laboratory of Oncology in Southern China, Sun Yat-sen University Cancer Center, Guangdong Provincial Clinical Research Center for Cancer, Guangzhou, 510060, China.

Kailai Li, Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou 510282, China.

Jian Zhang, Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou 510282, China.

Aimin Jiang, Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China.

Zhengrui Li, Department of Oral and Cranio-Maxillofacial Surgery, Shanghai Ninth People’s Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Oral Diseases, Shanghai Key Laboratory of Stomatology and Shanghai Research Institute of Stomatology, Shanghai 200011, China.

Peng Luo, Donghai County People’s Hospital (Affiliated Kangda College of Nanjing Medical University); Department of Oncology, Zhujiang Hospital, Southern Medical University, Lianyungang 222000, China; Department of Microbiology, State Key Laboratory of Emerging Infectious Diseases, Carol Yu Centre for Infection, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.

Author contributions

Writing—original draft, J.P.Y., A.Q.L.; Conceptualization, A.M.J., Z.R.L., P.L.; Investigation, A.Q.L., J.P.Y., A.M.J., Z.R.L., P.L.; Writing—review and editing, A.Q.L., J.P.Y., C.Q., L.X.Z., W.M.M., W.Y.G., D.Q.Z., B.F.T., M.J.X., G.D.C., S.K.P., H.Z.W., L.Z., H.G.Z., X.P.D., K.L.L., J.Z., A.M.J., Z.R.L., P.L.; Visualization, J.P.Y., A.Q.L. All authors have read and agreed to the published version of the manuscript.

Conflict of interest: None declared.

Funding

None declared.

Data availability

Not applicable.

References

  • 1. Chaussabel  D. Biomedical literature mining: challenges and solutions in the ‘omics’ era. Am J Pharmacogenomics  2004;4:383–93. 10.2165/00129785-200404060-00005. [DOI] [PubMed] [Google Scholar]
  • 2. Zhang  J, Li  H, Tao  W. et al.  GseaVis: an R package for enhanced visualization of gene set enrichment analysis in biomedicine. Med Research  1:131–5. 10.1002/mdr2.70000. [DOI] [Google Scholar]
  • 3. Chen  J, Lin  A, Jiang  A. et al.  Computational frameworks transform antagonism to synergy in optimizing combination therapies. NPJ Digit Med  2025;8:44. 10.1038/s41746-025-01435-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Eraslan  G, Avsec  Ž, Gagneur  J. et al.  Deep learning: new computational modelling techniques for genomics. Nat Rev Genet  2019;20:389–403. 10.1038/s41576-019-0122-6. [DOI] [PubMed] [Google Scholar]
  • 5. Hacking  S. ChatGPT and medicine: together we embrace the AI renaissance. JMIR Bioinform Biotechnol  2024;5:e52700. 10.2196/52700. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Chen  J, Zhu  L, Mou  W. et al.  STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability. iMetaOmics 2024;1:e7. 10.1002/imo2.7. [DOI] [Google Scholar]
  • 7. Lin  A, Zhu  L, Mou  W. et al.  Advancing generative artificial intelligence in medicine: recommendations for standardized evaluation. Int J Surg  2024;110:4547–51. 10.1097/JS9.0000000000001583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Naseem  U, Dunn  AG, Khushi  M. et al.  Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT. BMC Bioinformatics  2022;23:144. 10.1186/s12859-022-04688-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Schaefer  M, Reichl  S, ter Horst  R. et al.  GPT-4 as a biomedical simulator. Comput Biol Med  2024;178:108796. 10.1016/j.compbiomed.2024.108796. [DOI] [PubMed] [Google Scholar]
  • 10. Shue  E, Liu  L, Li  B. et al.  Empowering beginners in bioinformatics with ChatGPT. bioRxiv  2023;11:105–8. 10.15302/J-QB-023-0327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Simon  E, Swanson  K, Zou  J. Language models for biological research: a primer. Nat Methods  2024;21:1422–9. 10.1038/s41592-024-02354-y. [DOI] [PubMed] [Google Scholar]
  • 12. Zhu  L, Mou  W, Lai  Y. et al.  Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)’s ability to interpret radiological images. Int J Surg  2024;110:4096–102. 10.1097/JS9.0000000000001359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Liu  J, Dong  H, Wang  X. et al.  Large language models in bioinformatics: applications and perspectives. ArXiv  2024. [Google Scholar]
  • 14. Tran  C, Khadkikar  S, Porollo  A. Survey of protein sequence embedding models. Int J Mol Sci  2023;24:3775. 10.3390/ijms24043775. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Hornick  T, Mao  C, Koynov  A. et al.  In silico formulation optimization and particle engineering of pharmaceutical products using a generative artificial intelligence structure synthesis method. Nat Commun  2024;15:9622. 10.1038/s41467-024-54011-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Zeng  Z. et al.  Survey of natural language processing techniques in bioinformatics. Comput Math Methods Med  2015;2015:674296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Johnson  SR, Peshwa  M, Sun  Z. Sensitive remote homology search by local alignment of small positional embeddings from protein language models. Elife  2024;12. 10.7554/eLife.91415.3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Heinzinger  M, Elnaggar  A, Wang  Y. et al.  Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics  2019;20:723. 10.1186/s12859-019-3220-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Taju  SW, Shah  SMA, Ou  YY. Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders. Anal Biochem  2021;633:114416. 10.1016/j.ab.2021.114416. [DOI] [PubMed] [Google Scholar]
  • 20. Le  NQK. et al.  A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform  2021;22:bbab005. 10.1093/bib/bbab005. [DOI] [PubMed] [Google Scholar]
  • 21. Charoenkwan  P, Nantasenamat  C, Hasan  MM. et al.  BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics  2021;37:2556–62. 10.1093/bioinformatics/btab133. [DOI] [PubMed] [Google Scholar]
  • 22. Qiao  Y, Zhu  X, Gong  H. BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics  2022;38:648–54. 10.1093/bioinformatics/btab712. [DOI] [PubMed] [Google Scholar]
  • 23. Brandes  N, Ofer  D, Peleg  Y. et al.  ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics  2022;38:2102–10. 10.1093/bioinformatics/btac020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Park  M. et al.  EpiBERTope: a sequence-based pre-trained BERT model improves linear and structural epitope prediction by learning long-distance protein interactions effectively. bioRxiv  2022;2022.02.27.481241.
  • 25. Madani  A, Krause  B, Greene  ER. et al.  Large language models generate functional protein sequences across diverse families. Nat Biotechnol  2023;41:1099–106. 10.1038/s41587-022-01618-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Nijkamp  E, Ruffolo  JA, Weinstein  EN. et al.  ProGen2: exploring the boundaries of protein language models. Cell Syst  2023;14:968–978.e3. 10.1016/j.cels.2023.10.002. [DOI] [PubMed] [Google Scholar]
  • 27. Ferruz  N, Schmidt  S, Hocker  B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun  2022;13:4348. 10.1038/s41467-022-32007-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Shrestha  P, Kandel  J, Tayara  H. et al.  Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model. Nat Commun  2024;15:6699. 10.1038/s41467-024-51071-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Qiu  W, Lv  Z, Xiao  X. et al.  EMCBOW-GPCR: a method for identifying G-protein coupled receptors based on word embedding and wordbooks. Comput Struct Biotechnol J  2021;19:4961–9. 10.1016/j.csbj.2021.08.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Yeung  W, Zhou  Z, Li  S. et al.  Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings. Brief Bioinform  2023;24:bbac599. 10.1093/bib/bbac599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Sgarbossa  D, Lupo  U, Bitbol  AF. Generative power of a protein language model trained on multiple sequence alignments. Elife  2023;12:12. 10.7554/eLife.79854. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Murmu  A, Gyorffy  B. Artificial intelligence methods available for cancer research. Front Med  2024;18:778–97. 10.1007/s11684-024-1085-3. [DOI] [PubMed] [Google Scholar]
  • 33. Nguyen  TTD, Trinh  VN, le  NQK. et al.  Using k-mer embeddings learned from a skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model. Plant Mol Biol  2021;107:533–42. 10.1007/s11103-021-01204-1. [DOI] [PubMed] [Google Scholar]
  • 34. Ji  Y, Zhou  Z, Liu  H. et al.  DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics  2021;37:2112–20. 10.1093/bioinformatics/btab083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Zhou  Z. et al.  DNABERT-2: efficient foundation model and benchmark for multi-species genome. arXiv  2023;arXiv:2306.15006. 10.48550/arXiv.2306.15006. [DOI]
  • 36. Zhou  Z. et al.  DNABERT-S: pioneering species differentiation with species-aware DNA embeddings. arXiv  2024;arXiv:2402.08777. 10.48550/arXiv.2402.08777. [DOI] [PMC free article] [PubMed]
  • 37. Zhao  H, Zhang  S, Qin  H. et al.  DSNetax: a deep learning species annotation method based on a deep-shallow parallel framework. Brief Bioinform  2024;25:bbae157. 10.1093/bib/bbae157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Refahi  MS, Sokhansanj  BA, Rosen  GL. Leveraging large language models for metagenomic analysis. In: 2023 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2023.
  • 39. Nguyen  E. et al.  HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. arXiv  2023;arXiv:2306.15794. 10.48550/arXiv.2306.15794. [DOI]
  • 40. Leone  M, Galeota  E, Masseroli  M. et al.  Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions. Bioinformatics  2022;38:1183–90. 10.1093/bioinformatics/btab815. [DOI] [PubMed] [Google Scholar]
  • 41. Zhang  D. et al.  DNAGPT: a generalized pre-trained tool for versatile DNA sequence analysis tasks. arXiv  2023;arXiv:2307.05628. 10.48550/arXiv.2307.05628. [DOI]
  • 42. Akiyama  M, Sakakibara  Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genom Bioinform  2022;4:lqac012. 10.1093/nargab/lqac012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Magge  A, Weissenbacher  D, O’Connor  K. et al.  GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography. Bioinformatics  2020;36:5120–1. 10.1093/bioinformatics/btaa647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Sadad  T, Aurangzeb  RA, Safran  M. et al.  Classification of highly divergent viruses from DNA/RNA sequence using transformer-based models. Biomedicines  2023;11:1323. 10.3390/biomedicines11051323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Dudley  J, Butte  AJ. Enabling integrative genomic analysis of high-impact human diseases through text mining. Pac Symp Biocomput  2008;580–91. [PMC free article] [PubMed] [Google Scholar]
  • 46. Chen  D, Liu  J, Wei  GW. TopoFormer: multiscale topology-enabled structure-to-sequence transformer for protein-ligand interaction predictions. Nat Mach Intell  2024;6:799–810. 10.1038/s42256-024-00855-1. [DOI] [Google Scholar]
  • 47. Bordin  N, Dallago  C, Heinzinger  M. et al.  Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci  2023;48:345–59. 10.1016/j.tibs.2022.11.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Rives  A, Meier  J, Sercu  T. et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A  2021;118:e2016239118. 10.1073/pnas.2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Mansoor  S, Baek  M, Juergens  D. et al.  Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold. Protein Sci  2023;32:e4780. 10.1002/pro.4780. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Hsu  C. et al.  Learning inverse folding from millions of predicted structures. bioRxiv  2022;2022.04.10.487779.
  • 51. Lin  Z, Akin  H, Rao  R. et al.  Evolutionary-scale prediction of atomic-level protein structure with a language model. Science  2023;379:1123–30. 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]
  • 52. Chowdhury  R, Bouatta  N, Biswas  S. et al.  Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol  2022;40:1617–23. 10.1038/s41587-022-01432-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Clifford  JN, Høie  MH, Deleuran  S. et al.  BepiPred-3.0: improved B-cell epitope prediction using protein language models. Protein Sci  2022;31:e4497. 10.1002/pro.4497. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Nie  L, Quan  L, Wu  T. et al.  TransPPMP: predicting pathogenicity of frameshift and non-sense mutations by a transformer based on protein features. Bioinformatics  2022;38:2705–11. 10.1093/bioinformatics/btac188. [DOI] [PubMed] [Google Scholar]
  • 55. Wu  F, Jing  X, Luo  X. et al.  Improving protein structure prediction using templates and sequence embedding. Bioinformatics  2023;39:btac723. 10.1093/bioinformatics/btac723. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Mirabello  C, Wallner  B. rawMSA: end-to-end deep learning using raw multiple sequence alignments. PloS One  2019;14:e0220182. 10.1371/journal.pone.0220182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Santos  C, Eggle  D, States  DJ. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics  2005;21:1653–8. 10.1093/bioinformatics/bti165. [DOI] [PubMed] [Google Scholar]
  • 58. Matic  M, Singh  G, Carli  F. et al.  PRECOGx: exploring GPCR signaling mechanisms with deep protein representations. Nucleic Acids Res  2022;50:W598–610. 10.1093/nar/gkac426. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Lei  C, Zhou  K, Zheng  J. et al.  AraPathogen2.0: an improved prediction of plant-pathogen protein-protein interactions empowered by the natural language processing technique. J Proteome Res  2024;23:494–9. 10.1021/acs.jproteome.3c00364. [DOI] [PubMed] [Google Scholar]
  • 60. Ruan  AMA, George  C. GeneBERT: BERT for predicting differential gene expression from histone modifications. GitHub repository, 2021. https://github.com/ZovcIfzm/GeneBERT.
  • 61. Yamada  K, Hamada  M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinform Adv  2022;2:vbac023. 10.1093/bioadv/vbac023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Wu  H, Z  Y, Wang  W. TCR-BERT: a deep learning approach for T-cell receptor sequence analysis. In: International Conference on Learning Representations (ICLR), 2021.
  • 63. Zhang  Y, Zhu  G, Li  K. et al.  HLAB: learning the BiLSTM features from the ProtBert-encoded proteins for the class I HLA-peptide binding prediction. Brief Bioinform  2022;23:bbac173. 10.1093/bib/bbac173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Yu  Y, He  W, Jin  J. et al.  iDNA-ABT: advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization. Bioinformatics  2021;37:4603–10. 10.1093/bioinformatics/btab677. [DOI] [PubMed] [Google Scholar]
  • 65. Zeng  W, Gautam  A, Huson  DH. MuLan-methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience  2023;12:giad054. 10.1093/gigascience/giad054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Wang  K, Zeng  X, Zhou  J. et al.  BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform  2024;25:bbae195. 10.1093/bib/bbae195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Huang  D-S, Jo  K-H, Zhang  X-L (eds). Intelligent Computing Theories and Application: 14th International Conference, ICIC 2018, Wuhan, China, August 15–18, 2018, Proceedings, Part II, Vol. 10955. Springer, 2018. [Google Scholar]
  • 68. Yang  F, Wang  W, Wang  F. et al.  scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell  2022;4:852–66. 10.1038/s42256-022-00534-z. [DOI] [Google Scholar]
  • 69. Cui  H, Wang  C, Maan  H. et al.  scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods  2024;21:1470–80. 10.1038/s41592-024-02201-0. [DOI] [PubMed] [Google Scholar]
  • 70. Liu  T. et al.  scELMo: embeddings from language models are good learners for single-cell data analysis. bioRxiv  2023;2023.12.07.569910.
  • 71. Zheng  Y. et al.  Large language models in drug discovery and development: from disease mechanisms to clinical trials. arXiv  2024;arXiv:2409.04481.
  • 72. Lyu  Y. et al.  GP-GPT: large language model for gene-phenotype mapping. arXiv  2024;arXiv:2409.09825.
  • 73. Khan  A, Lee  B. DeepGene transformer: transformer for the gene expression-based classification of cancer subtypes. Expert Syst Appl  2023;226:120047. [Google Scholar]
  • 74. Wong  C-K. et al.  Lomics: generation of pathways and gene sets using large language models for transcriptomic analysis. arXiv  2024;arXiv:2407.09089 [q-bio.MN]. 10.48550/arXiv.2407.09089. [DOI]
  • 75. Chen  Y. et al.  Iterative prompt refinement for mining gene relationships from ChatGPT. bioRxiv  2023;2023.12.23.573201. 10.1101/2023.12.23.573201. [DOI]
  • 76. Dong  Q, Wang  K, Liu  X. Identifying the missing proteins in human proteome by biological language model. BMC Syst Biol  2016;10:113. 10.1186/s12918-016-0352-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Du  Z. et al.  pLM4Alg: protein language model-based predictors for allergenic proteins and peptides. J Agric Food Chem  2024;72:752–60. 10.1021/acs.jafc.3c07143. [DOI] [PubMed] [Google Scholar]
  • 78. Zhao  M, Lei  C, Zhou  K. et al.  POOE: predicting oomycete effectors based on a pre-trained large protein language model. mSystems  2024;9:e0100423. 10.1128/msystems.01004-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Arora  I, Kummer  A, Zhou  H. et al.  Mtx-COBRA: subcellular localization prediction for bacterial proteins. Comput Biol Med  2024;171:108114. 10.1016/j.compbiomed.2024.108114. [DOI] [PubMed] [Google Scholar]
  • 80. Ghazikhani  H, Butler  G. Exploiting protein language models for the precise classification of ion channels and ion transporters. Proteins  2024;92:998–1055. 10.1002/prot.26694. [DOI] [PubMed] [Google Scholar]
  • 81. Zhang  X, Hu  X, Zhang  T. et al.  PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset. Brief Bioinform  2024;25:bbae404. 10.1093/bib/bbae404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Wu  Z, Jiang  D, Wang  J. et al.  Knowledge-based BERT: a method to extract molecular features like computational chemists. Brief Bioinform  2022;23:bbac131. 10.1093/bib/bbac131. [DOI] [PubMed] [Google Scholar]
  • 83. Rong  Y. et al.  DropEdge: towards deep graph convolutional networks on node classification. arXiv preprint, 2019.
  • 84. Chithrananda  S, Grand  G, Ramsundar  B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint, 2020.
  • 85. Bagal  V, Aggarwal  R, Vinod  PK. et al.  MolGPT: molecular generation using a transformer-decoder model. J Chem Inf Model  2022;62:2064–76. 10.1021/acs.jcim.1c00600. [DOI] [PubMed] [Google Scholar]
  • 86. Chen  H, Zhang  S, Zhang  L. et al.  Multi role ChatGPT framework for transforming medical data analysis. Sci Rep  2024;14:13930. 10.1038/s41598-024-64585-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Zheng  J, Xiao  X, Qiu  WR. DTI-BERT: identifying drug-target interactions in cellular networking based on BERT and deep learning method. Front Genet  2022;13:859188. 10.3389/fgene.2022.859188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. ValizadehAslani  T, Shi  Y, Ren  P. et al.  PharmBERT: a domain-specific BERT model for drug labels. Brief Bioinform  2023;24:bbad226. 10.1093/bib/bbad226. [DOI] [PubMed] [Google Scholar]
  • 89. Kalakoti  Y, Yadav  S, Sundar  D. TransDTI: transformer-based language models for estimating DTIs and building a drug recommendation workflow. ACS Omega  2022;7:2706–17. 10.1021/acsomega.1c05203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90. Jang  G, Lee  T, Hwang  S. et al.  PISTON: predicting drug indications and side effects using topic modeling and natural language processing. J Biomed Inform  2018;87:96–107. 10.1016/j.jbi.2018.09.015. [DOI] [PubMed] [Google Scholar]
  • 91. Xu  S. et al.  Foundational model aided automatic high-throughput drug screening using self-controlled cohort study. medRxiv  2024. [Google Scholar]
  • 92. Blank  CE, Cui  H, Moore  LR. et al.  MicrO: an ontology of phenotypic and metabolic characters, assays, and culture media found in prokaryotic taxonomic descriptions. J Biomed Semantics  2016;7:18. 10.1186/s13326-016-0060-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93. Lee  J, Yoon  W, Kim  S. et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics  2020;36:1234–40. 10.1093/bioinformatics/btz682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94. Alsentzer  E. et al.  Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, 2019. [Google Scholar]
  • 95. Park  G. et al.  Automated extraction of molecular interactions and pathway knowledge using large language model, galactica: opportunities and challenges. In: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023.
  • 96. Remy  F, Demuynck  K, Demeester  T. BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights. J Am Med Inform Assoc  2024;31:1844–55. 10.1093/jamia/ocae029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97. Wang  JH, Zhao  LF, Wang  HF. et al.  GenCLiP 3: mining human genes’ functions and regulatory networks from PubMed based on co-occurrences and natural language processing. Bioinformatics  2019;36:1973–5. 10.1093/bioinformatics/btz807. [DOI] [PubMed] [Google Scholar]
  • 98. Warikoo  N, Chang  YC, Hsu  WL. LBERT: lexically aware transformer-based bidirectional encoder representation model for learning universal bio-entity relations. Bioinformatics  2021;37:404–12. 10.1093/bioinformatics/btaa721. [DOI] [PubMed] [Google Scholar]
  • 99. Gu  W, Yang  X, Yang  M. et al.  MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction. Bioinform Adv  2022;2:vbac035. 10.1093/bioadv/vbac035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100. Nath  A, Mwesigwa  S, Dai  Y. et al.  GENEVIC: GENetic data exploration and visualization via intelligent interactive console. Bioinformatics  2024;40:btae500. 10.1093/bioinformatics/btae500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101. Dotan  E, Jaschek  G, Pupko  T. et al.  Effect of tokenization on transformers for biological sequences. Bioinformatics  2024;40:btae196. 10.1093/bioinformatics/btae196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102. Lai  PT, Lu  Z. BERT-GT: cross-sentence n-ary relation extraction with BERT and graph transformer. Bioinformatics  2021;36:5678–85. 10.1093/bioinformatics/btaa1087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103. Ligeti  B. et al.  ProkBERT family: genomic language models for microbiome applications. Front Microbiol  2023;14:1331233. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104. Balabin  H, Hoyt  CT, Birkenbihl  C. et al.  STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs. Bioinformatics  2022;38:1648–56. 10.1093/bioinformatics/btac001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105. Wang  J, Ye  Q, Liu  L. et al.  Scientific figures interpreted by ChatGPT: strengths in plot recognition and limits in color perception. NPJ Precis Oncol  2024;8:84. 10.1038/s41698-024-00576-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106. Zhang  Q. et al.  Scientific large language models: a survey on biological & chemical domains. arXiv  2024;arXiv:2401.14656. 10.48550/arXiv.2401.14656. [DOI]
  • 107. Huang  MS, Lai  PT, Lin  PY. et al.  Biomedical named entity recognition and linking datasets: survey and our recent development. Brief Bioinform  2020;21:2219–38. 10.1093/bib/bbaa054. [DOI] [PubMed] [Google Scholar]
  • 108. Sokhansanj  BA, Rosen  GL. Mapping data to deep understanding: making the most of the deluge of SARS-CoV-2 genome sequences. mSystems  2022;7:e0003522. 10.1128/msystems.00035-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109. Kim  J, Wang  K, Weng  C, Liu  C. Assessing the utility of large language models for phenotype-driven gene prioritization in rare genetic disorder diagnosis. Am J Hum Genet  2024;111:2190–202. 10.1016/j.ajhg.2024.08.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110. Pourreza Shahri  M, Kahanda  I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics  2021;22:500. 10.1186/s12859-021-04421-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111. Giorgi  JM, Bader  GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics  2018;34:4087–94. 10.1093/bioinformatics/bty449. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112. Zhang  S, Fan  R, Liu  Y. et al.  Applications of transformer-based language models in bioinformatics: a survey. Bioinform Adv  2023;3:vbad001. 10.1093/bioadv/vbad001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113. Ferruz  N, Höcker  B. Controllable protein design with language models. arXiv  2022;arXiv:2201.07338. 10.48550/arXiv.2201.07338. [DOI]
  • 114. Wang  L, Ge  X, Liu  L. et al.  Code interpreter for bioinformatics: are we there yet?  Ann Biomed Eng  2024;52:754–6. 10.1007/s10439-023-03324-9. [DOI] [PubMed] [Google Scholar]
  • 115. Bai  J, Kamatchinathan  S, Kundu  DJ. et al.  Open-source large language models in action: a bioinformatics chatbot for PRIDE database. Proteomics  2024;24:e2400005. 10.1002/pmic.202400005. [DOI] [PubMed] [Google Scholar]
  • 116. Li  X, Peng  L, Wang  YP. et al.  Open challenges and opportunities in federated foundation models towards biomedical healthcare. BioData Min  2025;18:2. 10.1186/s13040-024-00414-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117. Goktas  P, Grzybowski  A. Shaping the future of healthcare: ethical clinical challenges and pathways to trustworthy AI. J Clin Med  2025;14:1605. 10.3390/jcm14051605. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118. Rahman  CR, Wong  L. How much can ChatGPT really help computational biologists in programming?  J Bioinform Comput Biol  2024;22:2471001. 10.1142/S021972002471001X. [DOI] [Google Scholar]
  • 119. Fenoy  E, Edera  AA, Stegmayer  G. Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks. Brief Bioinform  2022;23:bbac232. 10.1093/bib/bbac232. [DOI] [PubMed] [Google Scholar]
  • 120. Tang  X, Qian  B, Gao  R. et al.  BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics  2024;40:i266–76. 10.1093/bioinformatics/btae230. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Data Availability Statement

Not applicable.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
