Fundamental Research. 2024 Dec 27;6(1):6–10. doi: 10.1016/j.fmre.2024.12.007

Biological sequence analysis: Advances, medical applications, and challenges

Hang Wei a, Jiangyi Shao b, Bin Liu b
PMCID: PMC12869733  PMID: 41647572

Abstract

With continuous developments in biotechnology and information technology, we are entering the era of biological sequence big data. Extracting valuable insights from large-scale data to decipher life activities presents a major challenge. Artificial intelligence, particularly big data analysis and natural language processing technologies, has emerged as a crucial tool in biological sequence analysis. These technologies facilitate pattern detection, offering new perspectives on complex biological processes. Notably, AlphaFold and ESM models have achieved significant strides in protein structure prediction and functional annotation. These breakthroughs not only accelerate fundamental biological research but also provide innovative tools and strategies for disease diagnosis and drug discovery. In this perspective, we discuss the advancements in biological sequence analysis and focus on their extensive medical applications. Additionally, we highlight relevant challenges and propose further directions, emphasizing the need for ongoing innovation in sequence analysis to fully realize its potential in biomedical research.

Keywords: Biological sequence analysis, Big data analysis, Natural language processing, Biological language, Artificial intelligence

1. Introduction

The advancement of research in biotechnology and information technology has laid the foundation for the era of biological sequence big data. This big data encompasses biomolecular sequence and property data, structural data, and functional data. The accessible biomolecular sequences have grown significantly, along with corresponding expansions in their structural and functional annotations (Fig. 1A). Analyzing biological sequence big data greatly improves our understanding of biological mechanisms and has profound implications for pathogenesis research.

Fig. 1. Advancements in biological sequence data, analysis methods, and medical applications.

Big data analysis and natural language processing (NLP) technologies act as both pioneers and interpreters in constructing comprehensive functional maps of biomolecules. AlphaFold leverages sequence big data and deep learning to predict protein structures, bridging the gap between sequence and structure and offering a panoramic view of the protein universe [1]. AlphaFold 3 adopts a diffusion-based architecture to predict the joint structure of complexes containing nearly all biomolecular types with unprecedented accuracy [2]. These advances have set a new precedent for biological sequence analysis and mark a breakthrough in the field. Biological sequence analyses are increasingly employed in biomedical research, including genetic variation detection, biomarker identification, and drug discovery, significantly improving disease diagnosis and therapy.

2. Advances in biological sequence analysis

Existing biological sequence analysis methods can be grouped into three categories: methods based on biological sequence patterns and statistics, methods based on biological sequence big data, and methods based on natural language processing (Fig. 1B).

2.1. Methods based on biological sequence patterns and statistics

Methods based on biological sequence patterns and statistics rely primarily on prior knowledge and the statistical properties of biological sequences. They typically employ specific rules or statistical models to capture features within the sequences. For example, weight matrices characterize sequence features by tabulating the frequency of each nucleotide or amino acid at specific positions. Amino acid indices encode various physicochemical and biochemical properties of amino acids. Local k-mer patterns and global sequence properties, such as physicochemical and nucleotide correlations, can be integrated to represent nucleic acid sequences. DPCfam captures the positional distribution specificity of amino acids through sequence alignment methods, automatically generating protein families [3]. DeepSoluE constructs hybrid sequence features, combining physicochemical patterns and distributed representations of amino acids, to predict protein solubility [4].
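The two simplest features named above, k-mer counts and position-specific weight matrices, can be sketched in a few lines. This is a minimal illustration and not the procedure of any cited tool: the function names, the uniform background distribution, the pseudocount of 1, and the toy binding sites are all assumptions made for the example.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def position_weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sequences.

    Assumption: the background distribution is uniform over the alphabet.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        column = Counter(s[pos] for s in aligned)
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in alphabet
        })
    return pwm


def pwm_score(pwm, seq):
    """Sum the per-position log-odds of a candidate sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical aligned sites, invented for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
pwm = position_weight_matrix(sites)
consensus_score = pwm_score(pwm, "TATAAT")
unrelated_score = pwm_score(pwm, "GCGCGC")
dimer_counts = kmer_counts("ACGTAC", 2)
```

A sequence resembling the aligned sites scores higher than an unrelated one, which is the property that weight-matrix methods exploit when scanning for motifs.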

2.2. Methods based on biological sequence big data

Methods based on biological sequence big data depend mainly on the diverse relationships within large-scale datasets to capture core features of biological sequences and compute quantifiable continuous representations. AlphaFold 2 takes as input an amino acid sequence, together with multiple sequence alignments and structural templates retrieved through database searches, and leverages the attention mechanism in deep learning to generate representations. Its predictions describe structural features for nearly all known proteins. The success of AlphaFold 2, and subsequently AlphaFold 3, demonstrates how deep learning techniques can automatically extract complex features from massive biological sequence data, thereby achieving high-precision predictions of protein structures and biomolecular interactions [1,2]. Building on this, ColabFold combines a fast homology search strategy with AlphaFold 2, significantly accelerating the prediction of protein structures and complexes [5]. This combination not only enhances computational efficiency but also makes high-precision structure prediction more accessible, providing powerful tools for biological research and drug development.

2.3. Methods based on natural language processing

Given the similarity between biological sequences and natural language, methods based on NLP interpret the deep patterns within biological sequences as their high-order semantic content. By learning the linguistic properties of biological sequences, these methods project sequences into semantic spaces to obtain accurate sequence representations. Biomolecular language models learn the linguistic properties of molecular contexts and output semantic representations. The Genomic Pre-trained Network (GPN) is a DNA language model designed to learn genome-wide variant effects, gene structure, and DNA motifs through unsupervised pretraining on genomic DNA sequences [6]. RNAErnie is an RNA language model that includes motif-aware pretraining and type-guided fine-tuning phases [7]. The ESM-2 language model enables rapid, end-to-end atomic-resolution structure prediction directly from sequences, revealing regions of the metagenomic space that extend beyond current knowledge [8]. Applying NLP techniques to biological sequence analysis allows biomolecular language models to uncover the meanings of the ‘book of life’.
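Unsupervised pretraining of such language models is commonly framed as masked-token recovery: some positions are hidden and the model learns to predict them from context. The sketch below shows only the data-preparation step under stated assumptions. Each residue is treated as one token, 15% of positions are masked, and the names `tokenize`, `mask_tokens`, and `MASK` are hypothetical; none of this is taken from the cited models, which may tokenize and mask differently.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(seq):
    """Treat every residue as one token (one possible choice among several)."""
    return list(seq)


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token a model would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


# Toy protein fragment, invented for illustration only.
tokens = tokenize("MKTAYIAKQRQISFVKSHFSRQ")
masked, targets = mask_tokens(tokens)

# Putting the targets back reproduces the original sequence.
restored = [targets.get(i, tok) for i, tok in enumerate(masked)]
```

Because the training signal comes from the sequence itself, no functional labels are needed, which is what makes this setup suitable for large unannotated sequence collections.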

2.4. General tools for biological sequence analysis

Alongside task-specific sequence analysis models, universal models that can be applied to various types of sequences are also highly valued. For example, iLearnPlus is a comprehensive machine-learning platform for nucleic acid and protein sequence analysis, providing over twenty machine-learning algorithms and automating feature extraction, model construction, statistical analysis, and data visualization [9]. BioSeq-BLM is a biological language model system for DNA, RNA, and protein sequences. It automates the construction of models, selection of prediction methods, performance evaluation, and result analysis [10]. BioSeq-Diabolo can automatically construct predictors based on spatial distributions, sequence representations or local interactions, and analyze both homogeneous and heterogeneous biological sequence similarity [11]. DeepBIO is an interpretable deep-learning tool for biological sequence functional analysis, including sequence-level functional prediction and base-wise functional annotation [12].

3. Applications for medicine

Biological sequence analysis is valuable across a range of medical applications (Fig. 1C). Its advancement has improved the interpretation of life processes and benefited biomedical research.

3.1. Mutation analysis

Biological sequence variations are crucial for understanding genetic diversity and disease mechanisms within the human genome, making their identification essential for the progress of precision medicine. To date, a wide variety of genetic mutations have been found to be associated with a broad range of disorders [13]. To complement sequencing technologies, numerous computational models have been developed that leverage available sequences and deep learning to identify various mutations, including single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions or deletions (Indels), and structural variants (SVs) [14,15].

3.2. Drug discovery and development

Biological sequence analysis is widely employed in drug discovery and development including de novo drug design and drug repurposing. Sequence generation models have been developed to design small molecule or protein/peptide drugs that meet specific biochemical property requirements [16,17]. Drug repurposing can bypass several early-stage testing phases by identifying new indications for approved or well-established clinical drugs, leveraging their existing safety and efficacy data. To overcome limitations in protein/target structural data and high costs of molecular docking simulations, researchers have introduced multi-dimensional sequence-based representation methods, enhancing the prediction of drug–target interactions, binding affinity, therapeutic peptides, and drug responses [18,19].

3.3. Disease-associated molecule identification

Identifying disease-associated molecules is crucial for discovering new therapeutic targets and molecular biomarkers. Early computational methods focused on analyzing individual diseases by comparing expression profiles under different biological conditions to identify significantly differentially expressed genes. With the advent of biological big data analysis, it is now possible to simultaneously identify related molecules across multiple diseases. Various types of prior knowledge, including sequences, expression profiles, variants, and molecular networks, are integrated into graph representation learning to capture the hidden static properties and dynamic interaction characteristics of molecules [20,21]. Downstream analyses including pathway enrichment, drug sensitivity analysis, and experimental validation further enhance the systematic exploration of molecular pathological mechanisms.

3.4. Pathogen detection and tracking

Owing to their ubiquity and their symbiotic and pathogenic roles, microbial communities form a highly functional ecosystem essential for human health. Closely monitoring and comprehensively understanding host–microbiome and inter-community microbial interactions is therefore crucial. To overcome the low sensitivity and limited multiplexing capacity of traditional culture-based methods, biological sequence analysis offers promising advances in pathogen testing and tracking. These computational methods not only detect causative pathogens from metagenomic data through sequence encoding [22] but also facilitate the surveillance of infection responses [23] and viral escape patterns [24].

3.5. Epigenetic modification detection

Epigenetic modifications are heritable changes in gene expression patterns without alterations to the DNA sequence. Abnormal epigenetic modifications are associated with disrupted gene regulation, contributing to disease occurrence and progression. Currently, several tumor-specific epigenetic biomarkers are clinically used for early cancer screening and diagnosis. To improve the sensitivity and specificity of detecting epigenetic modifications and enhance the accuracy of disease diagnosis, biological sequence analyses primarily focus on predicting epigenetic modification sites, uncovering novel modification patterns, and identifying epigenetic signatures [25,26].

3.6. Single-cell sequence analysis

Moving biological sequence analysis in biomedical research from the tissue level to the single-cell level reveals cell-specific expression and regulatory patterns, providing novel insights into tumor heterogeneity. Single-cell sequence analysis encounters challenges such as increased dimensionality, sparsity, and heterogeneity. Consequently, computational methods developed for bulk RNA data need to be adapted and refined. Existing single-cell sequence analyses include cell type and state annotation, cell-drug response prediction, intercellular communication inference, target identification, and detection of dynamic regulatory patterns in developmental and pathological processes [27]. Additionally, foundation models are being developed to decipher single-cell languages, distilling critical cell semantics for further optimization in diverse downstream applications [28,29].

3.7. Immunotherapy enhancement

Immunotherapy leverages patients’ immune systems to prevent, control, and eradicate diseases using methods such as immune checkpoint blockade and adoptive cellular therapy. Artificial intelligence (AI) has been integrated to enhance immunotherapy, addressing challenges like limited universality and low overall response rates. Specifically, biological sequence analysis aims to advance several critical areas such as neoantigen prediction, immunogenic peptide screening, antibody design, and immunotherapy response prediction [30,31]. With the accumulation of multi-omics data in immuno-oncology, AI-driven big data analysis is paving the way for identifying meta-biomarkers and tailoring personalized immunotherapy [32].

4. Challenges in biological sequence analysis

Despite significant breakthroughs in biological sequence analysis, several challenges remain to be addressed as follows:

Data incompleteness and imbalance. Although we have entered the era of biological sequence big data, annotating molecular functions such as splicing variants, regulatory elements, and interaction sites is costly and requires specialized knowledge. This often leads to insufficient labeling of certain sequences, hindering the development of sequence analysis models. Moreover, small sample sizes for rare cell types or diseases, coupled with limitations in sequencing depth and significant disparities across categories, result in incomplete or imbalanced datasets, making it difficult to extract meaningful biological insights. Advanced data augmentation techniques can be used to generate synthetic data, while cross-domain knowledge transfer, semi-supervised learning, and active learning can help to leverage unannotated data efficiently.
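Two lightweight responses to class imbalance can be sketched as follows: augmenting the minority class and reweighting classes during training. The example is illustrative and makes strong assumptions. Reverse-complement augmentation is only label-preserving when the property being predicted does not depend on strand, and the function names and toy data are hypothetical rather than drawn from the cited work.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def augment_minority(samples, minority_label):
    """Append the reverse complement of every minority-class sequence.

    Assumption: the label is strand-independent, so the new example
    can safely inherit the label of the original.
    """
    extra = [
        (reverse_complement(seq), label)
        for seq, label in samples
        if label == minority_label
    ]
    return samples + extra


def inverse_frequency_weights(labels):
    """Class weights n / (n_classes * count): rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Toy labelled sequences, invented for illustration only.
samples = [("ATGC", 1), ("AAAA", 0), ("CCCC", 0), ("GGGG", 0)]
augmented = augment_minority(samples, minority_label=1)
weights = inverse_frequency_weights([label for _, label in samples])
```

Reweighting leaves the data untouched and changes only the loss, whereas augmentation adds examples; the two can be combined, though neither substitutes for collecting more annotated data.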

Long-distance dependencies and complex interactions. Biological sequences exhibit complex and interdependent patterns. First, critical information within sequences may be distributed across distant positions, such as interactions between amino acid residues that are far apart in the protein sequence yet influence its structure and function. Second, the nonlinear dependencies and interactions between multiple regions and segments make it challenging for traditional models to detect these intricate patterns. Building on biological sequence big data and language models from natural language processing, it is worth designing effective biological language models that discover hidden semantics according to the characteristics of specific biological sequence analysis problems.
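One reason attention-based language models suit this problem is that self-attention scores every pair of positions directly, so the interaction strength does not decay with sequence distance. The sketch below implements plain scaled dot-product self-attention with identity projections, which is a deliberate simplification since trained models learn separate query, key, and value matrices. The three-dimensional residue embeddings are invented for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with identity projections.

    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[c] for w, value in zip(row, embeddings)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical embeddings: two cysteines separated by eight alanines.
EMBED = {"C": [4.0, 0.0, 0.0], "A": [0.0, 4.0, 0.0]}
sequence = "CAAAAAAAAC"
outputs, weights = self_attention([EMBED[r] for r in sequence])

# The first residue attends to the distant matching residue as strongly
# as to itself, and far more strongly than to its immediate neighbour.
far_weight = weights[0][9]
near_weight = weights[0][1]
```

The cost of this all-pairs comparison grows quadratically with sequence length, which is one practical obstacle when applying such models to very long genomic sequences.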

Multi-omics data integration. Integrating diverse omics data provides a more comprehensive understanding of molecular regulatory mechanisms. However, systematic modeling of multi-omics data presents significant challenges due to high heterogeneity and dimensionality, low signal-to-noise ratios, batch effects, and other unique attributes of omics datasets. Harmonizing and connecting different omics layers is crucial for deriving meaningful insights. To address this issue, data correction is necessary to reduce noise, and deep learning techniques such as graph learning, deep multi-view learning, and contrastive learning can be integrated to extract patterns, ensuring compatibility across omics data and capturing correlations between different molecular levels.

Model robustness and practicability. Many existing sequence analysis methods are tailored to specific datasets and tasks but often lack robustness and practicality. Several methods struggle with overfitting and data bias, making it difficult to analyze new sequences effectively. Additionally, multi-task analysis and consistent outcomes are hindered by limited transferability and generalizability. The complexity of these computational processes also poses challenges for biological researchers. Therefore, regularization techniques and ensemble frameworks are suggested to prevent overfitting and enhance generalization. Universal and fundamental sequence analysis models, along with their user-friendly tools, are expected to improve robustness and usability for biological researchers.

Biological interpretability. Despite the effectiveness of current computational prediction methods, biological sequence analysis models are often perceived as black boxes with obscure internal mechanisms and complex decision-making processes. This opacity may undermine the credibility and traceability of models, thereby limiting their practical application in biomedical research. While some interpretability methods are available, they are typically task-specific and focus more on enhancing predictive performance than on providing comprehensive, universal biological interpretability. Explainable AI techniques such as SHAP values [33] and GNNExplainer [34] are promising for elucidating model decisions. Collaborating closely with biologists during model development is essential to ensure that interpretability aligns with biological context and practical applications.
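SHAP values are grounded in Shapley values from cooperative game theory, which attribute a prediction to individual input features. For very small inputs they can be computed exactly by enumerating feature subsets, as sketched below. This is not the API of the `shap` package, which relies on approximations to scale. The toy motif scorer, the `N` baseline, and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets.

    Features outside a subset take their baseline value. The cost grows
    as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                chosen = set(subset)
                phi[i] += weight * (value(chosen | {i}) - value(chosen))
    return phi


def toy_motif_model(seq):
    """Hypothetical scorer: one point per match to TATA, plus a bonus
    of two when the first two positions match together."""
    motif = "TATA"
    score = sum(1.0 for a, b in zip(seq, motif) if a == b)
    if seq[0] == "T" and seq[1] == "A":
        score += 2.0
    return score


x = list("TATA")
baseline = list("NNNN")  # assumed uninformative reference input
phi = shapley_values(toy_motif_model, x, baseline)
total_effect = toy_motif_model(x) - toy_motif_model(baseline)
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction bonus is split evenly between the two positions that jointly produce it. Mapping such per-position scores back to known motifs or domains is where collaboration with biologists becomes important.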

5. Conclusion and future perspective

In this perspective, we review significant advancements in biological sequence analysis and explore its applications in modern medicine. Despite the progress made, we highlight several challenges that still need to be addressed. Future research can concentrate on the following key areas: enhancing self-supervised and generative models for data completion and augmentation in biological sequence analysis; developing large language models of biological sequences to capture long-range dependencies and intricate interactions; designing innovative AI algorithms for multi-omics data integration and for the design of functional biomolecules with custom-tailored properties; applying explainable AI to biological sequence analysis and developing practical tools to improve model transparency and real-world applicability; and refining patterns from biological sequence big data to support diversified and personalized medical research.

Looking ahead, advanced biological sequence analyses are poised to deepen our understanding of cellular processes and diseases, leading the way for more effective treatments.

Declaration of competing interest

The authors declare that they have no conflicts of interest in this work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62325202, 62372041 and U22A2039).

Biographies

Hang Wei (BRID: 00370.00.76892) is an associate professor at the School of Computer Science and Technology, Xidian University, Xi'an, China. She received her Ph.D. degree in computer application technology from the Harbin Institute of Technology, Shenzhen, China, in 2022, and her M.S. degree from the University of Science and Technology of China, Hefei, China, in 2016. Her research interests include bioinformatics, data mining, computational intelligence, and their applications to biomedical data.

Bin Liu (BRID: 08326.00.98258) received his Ph.D. degree from the Harbin Institute of Technology, China, in 2010. From 2010 to 2012, he was a postdoctoral researcher at The Ohio State University, USA. He is a professor at the Beijing Institute of Technology. His research interests include bioinformatics, machine learning, and natural language processing. He currently focuses on exploring language models of biological sequences and on proposing computational predictors, based on natural language processing techniques, for important tasks in bioinformatics.

References

  • 1. Jumper J., Evans R., Pritzel A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. doi: 10.1038/s41586-021-03819-2.
  • 2. Abramson J., Adler J., Dunger J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w.
  • 3. Russo E.T., Barone F., Bateman A., et al. DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS Comput. Biol. 2022;18(10). doi: 10.1371/journal.pcbi.1010610.
  • 4. Wang C., Zou Q. Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE. BMC Biol. 2023;21(1):12. doi: 10.1186/s12915-023-01510-8.
  • 5. Mirdita M., Schütze K., Moriwaki Y., et al. ColabFold: Making protein folding accessible to all. Nat. Methods. 2022;19(6):679–682. doi: 10.1038/s41592-022-01488-1.
  • 6. Benegas G., Batra S.S., Song Y.S. DNA language models are powerful predictors of genome-wide variant effects. Proc. Natl. Acad. Sci. USA. 2023;120(44). doi: 10.1073/pnas.2311219120.
  • 7. Wang N., Bian J., Li Y., et al. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nat. Mach. Intell. 2024:1–10.
  • 8. Lin Z., Akin H., Rao R., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–1130. doi: 10.1126/science.ade2574.
  • 9. Chen Z., Zhao P., Li C., et al. iLearnPlus: A comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021;49(10):e60. doi: 10.1093/nar/gkab122.
  • 10. Li H.L., Pang Y.H., Liu B. BioSeq-BLM: A platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Res. 2021;49(22):e129. doi: 10.1093/nar/gkab829.
  • 11. Li H., Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput. Biol. 2023;19(6). doi: 10.1371/journal.pcbi.1011214.
  • 12. Wang R., Jiang Y., Jin J., et al. DeepBIO: An automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res. 2023;51(7):3017–3029. doi: 10.1093/nar/gkad055.
  • 13. Logsdon G.A., Vollger M.R., Eichler E.E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 2020;21(10):597–614. doi: 10.1038/s41576-020-0236-x.
  • 14. Popic V., Rohlicek C., Cunial F., et al. Cue: A deep-learning framework for structural variant discovery and genotyping. Nat. Methods. 2023;20(4):559–568. doi: 10.1038/s41592-023-01799-x.
  • 15. Zhou J., Park C.Y., Theesfeld C.L., et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 2019;51(6):973–980. doi: 10.1038/s41588-019-0420-0.
  • 16. Chen L., Fan Z., Chang J., et al. Sequence-based drug design as a concept in computational drug design. Nat. Commun. 2023;14(1):4217. doi: 10.1038/s41467-023-39856-w.
  • 17. Goles M., Daza A., Cabas-Mora G., et al. Peptide-based drug discovery through artificial intelligence: Towards an autonomous design of therapeutic peptides. Brief. Bioinformatics. 2024;25(4):bbae275. doi: 10.1093/bib/bbae275.
  • 18. Pan X., Lin X., Cao D., et al. Deep learning for drug repurposing: Methods, databases, and applications. Wiley Interdisciplin Rev. 2022;12(4):e1597.
  • 19. Zhang H., Saravanan K.M., Wei Y., et al. Deep learning-based bioactive therapeutic peptide generation and screening. J. Chem. Inf. Model. 2023;63(3):835–845. doi: 10.1021/acs.jcim.2c01485.
  • 20. Schulte-Sasse R., Budach S., Hnisz D., et al. Integration of multiomics data with graph convolutional networks to identify new cancer genes and their associated molecular mechanisms. Nat. Mach. Intell. 2021;3(6):513–526.
  • 21. Zhang W., Wei H., Zhang W., et al. Multiple types of disease-associated RNAs identification for disease prognosis and therapy using heterogeneous graph learning. Sci. China Inf. Sci. 2024;67(8).
  • 22. Miao Y., Liu F., Hou T., et al. Virtifier: A deep learning-based identifier for viral sequences from metagenomes. Bioinformatics. 2022;38(5):1216–1222. doi: 10.1093/bioinformatics/btab845.
  • 23. di Iulio J., Bartha I., Spreafico R., et al. Transfer transcriptomic signatures for infectious diseases. Proc. Natl. Acad. Sci. USA. 2021;118(22). doi: 10.1073/pnas.2022486118.
  • 24. Hie B., Zhong E.D., Berger B., et al. Learning the language of viral evolution and escape. Science. 2021;371(6526):284–288. doi: 10.1126/science.abd7331.
  • 25. Yuan T., Edelmann D., Fan Z., et al. Machine learning in the identification of prognostic DNA methylation biomarkers among patients with cancer: A systematic review of epigenome-wide studies. Artif. Intell. Med. 2023;143. doi: 10.1016/j.artmed.2023.102589.
  • 26. Ao C., Yu L., Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Brief. Funct. Genomics. 2021;20(1):1–18. doi: 10.1093/bfgp/elaa023.
  • 27. Qi R., Zou Q. Trends and potential of machine learning and deep learning in drug study at single-cell level. Research (Washington D C). 2023;6:0050. doi: 10.34133/research.0050.
  • 28. Hao M., Gong J., Zeng X., et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods. 2024;21(8):1481–1491. doi: 10.1038/s41592-024-02305-7.
  • 29. Cui H., Wang C., Maan H., et al. scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods. 2024;21(8):1470–1480. doi: 10.1038/s41592-024-02201-0.
  • 30. Xia H., McMichael J., Becker-Hapak M., et al. Computational prediction of MHC anchor locations guides neoantigen identification and prioritization. Sci. Immunol. 2023;8(82):eabg2200. doi: 10.1126/sciimmunol.abg2200.
  • 31. Wang M., Patsenker J., Li H., et al. Language model-based B cell receptor sequence embeddings can effectively encode receptor specificity. Nucleic Acids Res. 2024;52(2):548–557. doi: 10.1093/nar/gkad1128.
  • 32. Prelaj A., Miskovic V., Zanitti M., et al. Artificial intelligence for predictive biomarker discovery in immuno-oncology: A systematic review. Ann. Oncol. 2024;35(1):29–65. doi: 10.1016/j.annonc.2023.10.125.
  • 33. Lundberg S.M., Lee S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process Syst. 2017;30:4765–4774.
  • 34. Ying Z., Bourgeois D., You J., et al. GNNExplainer: Generating explanations for graph neural networks. Adv. Neural Inf. Process Syst. 2019;32. https://proceedings.neurips.cc/paper_files/paper/2019/hash/d80b7040b773199015de6d3b4293c8ff-Abstract.html

Articles from Fundamental Research are provided here courtesy of The Science Foundation of China Publication Department, The National Natural Science Foundation of China
