Dear Editor,
Artificial intelligence (AI) presents immense opportunities for solving complex problems across many fields, including the biological sciences. By combining automation with AI, complex tasks can be performed at, and sometimes beyond, human-level performance, which is why the use of AI is increasing steadily. At the same time, transparency is essential when reporting AI-based work. In this direction, a recent article by Agha et al set out and enumerated transparency benchmarks for reporting AI[1]. Similarly, Haibe-Kains et al discussed the reproducibility and transparency of AI[2]. Together, these articles highlight two vital topics, transparent and reproducible AI research, and are therefore significant and timely.
Large language models (LLMs), a subset of AI, have recently gained massive worldwide attention following the release of ChatGPT on 30 November 2022 in San Francisco, USA. ChatGPT was launched as a free online chatbot built on GPT-3.5, a model in the GPT (Generative Pretrained Transformer) series, and it has emerged as one of the fastest-growing tools globally: within 2 months of its launch, it had reached about 100 million users[3]. The model is widely used in different spheres of medical science[4,5] and is also applied in other fields, including drug discovery[6,7], nucleic acid research[8], biomedical engineering[9], and law[10]. Given this extensive uptake, researchers anticipate that work in fields ranging from medical science to chemistry to law will attain a new dimension. LLMs such as GPT-4, the fourth-generation model in the series, can process image inputs and generate human-like text from any prompt[11]. LLMs can also address complex problems involving nucleic acids, and scientists have started exploring their use in this area. CodonBERT, an LLM platform, can help design and optimize mRNA. mRNA-based vaccines and therapeutics have recently emerged as highly effective molecules, and sequence optimization is necessary to obtain stable mRNA and thereby produce low-cost, high-potency, and safe vaccines and therapeutics; CodonBERT helps develop such optimized mRNA sequences[12]. Another recently developed LLM platform, LitSumm, helps summarize the literature on non-coding RNAs (ncRNAs)[13]. Furthermore, Jorapur et al reported that LLMs can design primers for diagnostic polymerase chain reaction (PCR)[14]. LLMs, which are built on neural network architectures, are capable of answering questions and supporting accurate decision-making.
LLMs such as GPT-4 can take text and image inputs and provide text outputs, showing human-level performance in academic and various other professional settings. GPT-4 can solve a broader range of problems than its predecessor, so scientists are keen to use it. GPT-4 has also created a next-generation platform for researchers known as the multimodal large language model (MLLM). With the release of GPT-4, OpenAI set a new milestone in the chatbot landscape and marked the beginning of a new era for AI-powered LLMs. For medical applications, Lee et al found that GPT-4 may be a powerful tool that provides valuable responses[15]. The model can also generate marker gene-related material from a typical single-cell RNA-seq study, serving as a cost-effective, automated cell-type annotation method[16].
The traditional transformer model utilizes attention and self-attention mechanisms, enabling the model to capture information from preceding tokens. The GPT models use a transformer framework but add a remarkable generative pretraining method: whereas traditional transformers rely on labeled datasets for training, GPTs are pretrained on an enormous corpus of unlabeled data and support task-specific fine-tuning[17]. Different versions of the GPT models have been introduced; the third generation, GPT-3, was trained on a remarkably large number of tokens (0.3 trillion) with 0.175 trillion parameters. At this scale, fine-tuning is often not essential for many tasks. The recently launched fourth generation is GPT-4, which Brynjolfsson et al noted was trained on around 13 trillion tokens with 1.8 trillion parameters[18]. A notable feature of GPT-4 compared to GPT-3 is that GPT-4 can accept image inputs. Other significant LLMs are Gemini 2.0, developed by Google DeepMind[19], and DeepSeek[20], developed by the Chinese company High-Flyer. These generative AI-based LLMs have become significant analytical models in biological science and medical research, including nucleic acid analysis.
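The attention computation described above can be sketched in a few lines. This is a minimal illustration only; the matrix sizes and random weights are arbitrary assumptions, not values from any GPT model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: each output row is a
    value-weighted mixture of all token representations, which is how the
    transformer captures information from other tokens in the sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8                                                  # toy embedding width
X = rng.normal(size=(4, d))                            # 4 tokens
out = self_attention(X, rng.normal(size=(d, d)),
                     rng.normal(size=(d, d)),
                     rng.normal(size=(d, d)))
print(out.shape)                                       # (4, 8)
```

GPT-style decoders additionally mask future positions so that each token attends only to preceding ones; the mask is omitted here for brevity.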
LLMs are now at the forefront of providing effective solutions in medical science, especially in education, clinical practice, and research[4,21], and they can provide solutions in many fields beyond the medical domain. White described LLMs as potentially the future of chemistry, capable of initiating a new era of developments in the field[22]. Recently, Chatterjee et al demonstrated that LLMs aid in solving various nucleic acid research problems and argued that LLMs will revolutionize nucleic acid research[8]. Indeed, LLMs are capable of transforming different areas of nucleic acid research, and genomic research has also been explored with them. DNABERT, a nucleic-acid domain-specific LLM, can aid in evaluating global and transferable genomic DNA sequences while accounting for the contexts of upstream and downstream nucleotides; it has the potential to predict genome-wide regulatory elements and reveal their usage frequency[23]. Additionally, the DNABERT platform helps identify long non-coding RNAs (lncRNAs) in plant genome assemblies: Danilevicz et al predicted lncRNAs from genomic sequences, with the highest accuracy in Zea mays (83.4%) and the lowest in Brassica rapa (57.9%)[24]. In another LLM study, Sultan et al analyzed cancer predisposition genes (CPGs) in a cohort of 53 patients with pathogenic CPG mutations and noted that RB1 was the most commonly mutated gene in the cohort; other predicted genes were VHL, NF1, and TP53, and 93% of the predictions were correct. They concluded that LLMs show promise in predicting CPG mutations[25]. Recently, Elsborg and Salvatore used LLMs and explainable machine learning (ML) to analyze biomarkers at the single-cell level and gain a deeper understanding of the disease landscape.
They found that the retrieved gene signatures were symbolic and straightforward in nature, which may enable researchers to elucidate the underlying molecular causes of diseases[26]. Similarly, a large-scale pretrained deep language model, scBERT, has been developed to annotate cell types in single-cell RNA-seq (scRNA-seq) data; it can be fine-tuned on supervised, user-specific scRNA-seq data, and its performance has been validated in terms of robustness to batch effects, novel cell type discovery, and cell type annotation, among other tasks[27]. Yamada and Hamada developed a nucleotide language model using tokens similar to those employed in GPT models; here, a token is a word-like unit of the sequence anchored at a particular nucleotide position. They also considered k-mers of the sequence, specifically 3-mers and 4-mers: the 3-mers used were TAC, GTA, and CGT, and the 4-mers were GTAC, CGTA, and ACGT. Their nucleotide language model was used to predict RNA-protein interactions[10]. Zhang et al developed RNA-MSM, an unsupervised multiple-sequence-alignment-associated RNA language model, which can map direct base-pairing information and also provide solvent accessibility knowledge[28]. Another significant problem in this research domain is identifying nucleic-acid-binding sites: recently, Song et al developed GLMSite, which accurately identifies DNA- and RNA-binding sites using a geometric graph learning (GGL) model[29]. GGL is an ML-based method that uses graphs to represent and analyze data. In addition, protein language models (pLMs), a type of LLM, are among the significant deep learning (DL)-based models that utilize natural language processing (NLP) techniques to analyze and comprehend protein sequences. Using pLMs, Roche et al developed the EquiPNAS model for improved prediction of protein-nucleic acid binding sites.
The model combines the strengths of symmetry-aware deep graph learning and pretrained pLMs[30]. Nucleic acid research is therefore progressing quickly toward solving diverse problems in this area. Recently, different DNA language models have been developed to comprehend sequence context in the human genome. In this direction, the DNA language model GROVER learns the context of the human genome sequence using next-k-mer prediction with Byte-Pair Encoding (BPE) and a tokenization-based model architecture; it illustrates how DNA codes for proteins and transcripts[31].
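The overlapping k-mer tokenization used by such nucleotide language models can be sketched as follows. The example sequence "ACGTAC" is an illustrative assumption, chosen here because it yields exactly the 3-mers and 4-mers quoted above; it is not taken from the cited papers.

```python
def kmer_tokens(seq, k):
    """Split a nucleotide sequence into overlapping k-mers (stride 1),
    each anchored at one position, analogous to word tokens in an LLM."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokens("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
print(kmer_tokens("ACGTAC", 4))  # ['ACGT', 'CGTA', 'GTAC']
```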
Recently, scientists have utilized LLM techniques to develop nucleic acid language models (NALMs), which can be trained on nucleic acid sequences such as RNA or DNA and are capable of comprehending and predicting the behavior of these biological molecules. In NALMs, sequences are treated as text, and tokenization is employed to break them down into smaller units known as tokens. A NALM is also referred to as a genomic language model (gLM)[32].
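Byte-Pair Encoding, mentioned above in connection with GROVER and widely used to tokenize sequences in such models, can be sketched minimally. The toy DNA string and merge count below are illustrative assumptions; real vocabularies are learned over whole genomes with thousands of merges.

```python
from collections import Counter

def bpe_train(seq, n_merges):
    """Learn Byte-Pair-Encoding merges over a DNA sequence: repeatedly
    fuse the most frequent adjacent token pair into a new, longer token."""
    tokens = list(seq)            # start from single nucleotides
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):    # re-tokenize with the new merge applied
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

toks, merges = bpe_train("ATATATGCGC", 2)
print(merges[0])  # 'AT' -- the most frequent pair is fused first
```

The learned tokens are data-driven subsequences rather than fixed-length k-mers, which is what lets BPE-based models adapt their vocabulary to the genome being modeled.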
Just like LLMs, NALMs aim to comprehend the meaning of sequences in terms of their structure and function. This will offer new possibilities for investigating problems in biological science and developing solutions to them, thereby bridging the gap between medical science and biological science[32].
Scientists are attempting to understand various aspects of nucleic acids through LLMs, NALMs, or gLMs. Most importantly, these efforts clarify crucial aspects of applying LLMs to nucleic acids, including evaluation metrics, species specificity, nucleotide sequence interpretation, and model complexity. A proper tokenization process for converting DNA/RNA sequences into tokens, however, remains essential.
State-of-the-art LLMs, such as GPT-4, Gemini, and DeepSeek, have emerged as leaders in AI innovation in nucleic acid research. In addition, several LLM-derived models are helping to solve critical problems in nucleic acid research by handling the massive amounts of data in this domain. Overall, LLMs can potentially revolutionize nucleic acid research by training on large datasets that incorporate text, images, and videos, which opens up a promising avenue for the field. However, the potential replacement of humans and pressing ethical issues are considerable concerns, even as progress in scientific development must continue. In the future, LLMs are expected to keep improving, with next-generation, upgraded models being introduced, and the future research language of nucleic acids may depend on MLLMs. It is therefore time for researchers to rethink nucleic acid-related research experiments, tools, and experimental design, and to utilize advanced next-generation technologies such as LLMs or MLLMs to make rapid progress in nucleic acid research.
Footnotes
Sponsorships or competing interests that may be relevant to content are disclosed at the end of this article.
Published online 20 June 2025
Contributor Information
Chiranjib Chakraborty, Email: drchiranjib@yahoo.com.
Manojit Bhattacharya, Email: mbhattacharya09@gmail.com.
Arpita Das, Email: arpita-84das@yahoo.co.in.
Md. Aminul Islam, Email: aminulmbg@gmail.com.
Ethical approval
None.
Consent
None.
Sources of funding
None.
Author contributions
C.C.: conceptualization, data curation, investigation, writing – original draft, writing – review & editing; M.B., A.D., and A.I.: validation, formal analysis. All authors critically reviewed and approved the final version of the manuscript.
Conflicts of interest disclosure
The authors declare no competing interests.
Research registration unique identifying number (UIN)
None.
Guarantor
Aminul Islam.
Provenance and peer review
Not commissioned, externally peer-reviewed.
Data availability statement
The authors confirm that the data supporting the findings of this study are available within the article.
References
- [1].Agha RA, Mathew G, Rashid R, et al. Transparency in the reporting of artificial intelligence – the TITAN guideline. Prem J Sci 2025;10:100082. [Google Scholar]
- [2].Haibe-Kains B, Adam GA, Hosny A, et al. Transparency and reproducibility in artificial intelligence. Nature 2020;586:E14–E16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Menon D, Shilpa K. “Chatting with ChatGPT”: analyzing the factors influencing users’ intention to use the OpenAI’s ChatGPT using the UTAUT model. Heliyon 2023;9:e20962. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Chatterjee S, Bhattacharya M, Pal S, Lee SS, Chakraborty C. ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop 2023;10:128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].Chakraborty C, Pal S, Bhattacharya M, Islam MA. ChatGPT or LLMs can provide treatment suggestions for critical patients with antibiotic-resistant infections: a next-generation revolution for medical science? Int J Surg [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Pal S, Bhattacharya M, Islam MA, Chakraborty C. ChatGPT or LLM in next-generation drug discovery and development: pharmaceutical and biotechnology companies can make use of the artificial intelligence (AI)-based device for a faster way of drug discovery and development. Int J Surg 2023;109:4382–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Chakraborty C, Bhattacharya M, Lee SS. Artificial intelligence enabled ChatGPT and large language models in drug target discovery, drug discovery, and development. Mol Ther Nucleic Acids 2023;33:866–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Chatterjee S, Bhattacharya M, Lee SS, Chakraborty C. Can artificial intelligence-strengthened ChatGPT or other large language models transform nucleic acid research? Mol Ther Nucleic Acids 2023;33:205–07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Pal S, Bhattacharya M, Lee S-S, Chakraborty C. A domain-specific next-generation large language model (LLM) or ChatGPT is required for biomedical engineering and research. Ann Biomed Eng 2023;52:451–54. [DOI] [PubMed] [Google Scholar]
- [10].Yamada K, Hamada M, Arighi C. Prediction of RNA-protein interactions using a nucleotide language model. Bioinform Adv 2022;2:vbac023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Graham F. Daily briefing: what scientists think of GPT-4, the new AI chatbot. Nature 2023. doi: 10.1038/d41586-023-00839-y. [DOI] [PubMed] [Google Scholar]
- [12].Li S, Moayedpour S, Li R, et al. CodonBERT: large language models for mRNA design and optimization. bioRxiv 2023;2023:09.09.556981. [Google Scholar]
- [13].Green A, Ribas CE, Ontiveros-Palacios N, et al. LitSumm: large language models for literature summarization of noncoding RNAs. Database (Oxford). 2025;2025:baaf006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Jorapur S, Srivastava A, Kulkarni S. Evaluating the usefulness of a large language model as a wholesome tool for de novo polymerase chain reaction (PCR) primer design. Cureus 2023;15:e47711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233–39. [DOI] [PubMed] [Google Scholar]
- [16].Hou W, Ji Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat Methods 2024;21:1462–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Wang TT. GPT: Origin, Theory, Application, and Future. ASCS CIS498/EAS499 Project and Thesis. https://www.cis.upenn.edu/wp-content/uploads/2021/10/Tianzheng_Troy_Wang_CIS498EAS499_Submission.pdf University of Pennsylvania; Philadelphia, PA, USA; 2021. [Google Scholar]
- [18].Brynjolfsson E, Li D, Raymond LR. Generative AI at Work, No. W31161. Cambridge, MA: National Bureau of Economic Research; 2023. [Google Scholar]
- [19].Gemini Team, Anil R, Borgeaud S, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805. 2023. [Google Scholar]
- [20].Bi X, Chen D, Chen G, et al. DeepSeek LLM: scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954. 2024. [Google Scholar]
- [21].Chakraborty C, Pal S, Bhattacharya M, Dash S, Lee SS. Overview of chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science. Front Artif Intell 2023;6:1237704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].White AD. The future of chemistry is language. Nat Rev Chem 2023;7:457–58. [DOI] [PubMed] [Google Scholar]
- [23].Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Danilevicz MF, Gill M, Fernandez CGT, et al. DNABERT-based explainable lncRNA identification in plant genome assemblies. Comput Struct Biotechnol J 2023;21:5676–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Sultan I, Al-Abdallat H, Alnajjar Z, et al. Using ChatGPT to predict cancer predisposition genes: a promising tool for pediatric oncologists. Cureus 2023;15:e47594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Elsborg J, Salvatore M. Using LLMs and explainable ML to analyze biomarkers at single-cell level for improved understanding of diseases. Biomolecules 2023;13:1516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Yang F, Wang W, Wang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66. [Google Scholar]
- [28].Zhang Y, Lang M, Jiang J, et al. Multiple sequence-alignment-based RNA language model and its application to structural inference. bioRxiv 2023;2023:15.532863. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Song Y, Yuan Q, Zhao H, Yang Y. Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Briefings Bioinf. 2023;24. Epub 2023/10/12. [DOI] [PubMed] [Google Scholar]
- [30].Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Res 2024;52:e27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Sanabria M, Hirsch J, Joubert PM, Poetsch AR. DNA language model GROVER learns sequence context in the human genome. Nat Mach Intell 2024;6:911–23. [Google Scholar]
- [32].Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025;41:286–302. [DOI] [PubMed] [Google Scholar]
