Abstract
The year 2023 marked a significant surge in the exploration of applying large language model (LLM) chatbots, notably ChatGPT, across various disciplines. We surveyed the applications of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.
Keywords: ChatGPT, Bioinformatics, Biomedical Informatics
1. INTRODUCTION
In recent years, artificial intelligence (AI) has attracted tremendous interest across various disciplines, emerging as an innovative approach to tackling scientific challenges [1]. The surge in data generated from both public and private sectors, combined with the rapid advancement in AI technologies, has facilitated the development of innovative AI-based solutions and accelerated scientific discoveries [1, 2, 3]. The launch of the Chat Generative Pre-trained Transformer (ChatGPT) to the public towards the end of 2022 marked a new era in AI. The biomedical research community has embraced this new tool with immense enthusiasm. In 2023 alone, at least 2,074 manuscripts were indexed in PubMed when searching with the keyword “ChatGPT”. These studies demonstrate that ChatGPT and similar models have great potential to transform many aspects of education, biomedical research, and clinical practice [4, 5, 6, 7].
The core of ChatGPT is a large language model (LLM) trained on a vast corpus of text and image materials from the internet, including biomedical literature and code [8]. Its ability to comprehend and respond in natural language positions ChatGPT as a valuable tool for biomedical text-based inquiry [9]. Particularly noteworthy is its potential in assisting bioinformatics analysis, enabling scientists to conduct data analyses via verbal instructions [10, 11, 12]. Surprisingly, a search on PubMed using the keywords “ChatGPT” and “bioinformatics” returned only 30 publications. While this number is likely an underestimate because the search was limited to PubMed, and a few hundred related articles are probably archived as preprints or under review, it still suggests that the application of this innovative tool in bioinformatics is relatively underexplored compared to other areas of biomedical research.
In this review, we summarize recent advancements, predominantly within 2023, in the application of ChatGPT across a broad spectrum of bioinformatics and biomedical informatics topics, including omics, genetics, biomedical text mining, drug discovery, biomedical images, bioinformatics programming, and bioinformatics education (Figure 1). As the topics are relatively new, this survey included not only publications in journals but also preprints in various archive platforms. Our objective is to encapsulate recurring themes from independent works within the same topic or across multiple topics, pinpointing prospective avenues for further exploration. Additionally, this review allows us to identify challenges in integrating chatbots into bioinformatics, such as inefficiency in prompt generation, uncertainty in responses, and concerns over data privacy [13, 14, 15]. The insights from this analysis are anticipated to benefit other domains where the integration of chatbot technology is actively pursued.
Figure 1: Areas Explored in this Review for ChatGPT’s Use in Bioinformatics and Biomedical Informatics in its Year One.
2. LITERATURE SELECTION
We searched Google Scholar, PubMed, and various preprint servers (arXiv, bioRxiv, medRxiv, ChemRxiv, and Research Square) using keywords such as “ChatGPT” in combination with “Bioinformatics”, “Computational Biology”, “Genetics”, “Text mining”, or “Drug Discovery.” We then reviewed titles and abstracts to select papers that use ChatGPT in bioinformatics and biomedical informatics. We also utilized backward and forward citation tracking for each identified publication to expand the pool, resulting in 65 manuscripts. Lastly, we excluded poster manuscripts and manuscripts lacking in-depth analysis. In the end, we narrowed the pool down to 62 research articles (Supplementary Table S1).
Most of the works reviewed (72.6%) are about performance assessment, while 19.3% lean towards direct applications of GPT (Supplementary Table S1). A significant proportion of the works (45 out of 62) were initially released as preprints, reflecting the emergent nature of this field. In addition to highlighting findings from individual works, we also identified findings supported across multiple independent studies. Out of the 22 preprints that were not formally published at the time of writing, 17 preprints contribute to this direction (Supplementary Table S1). This cross-validation among studies strengthens the reliability of observed trends and shared insights, especially for findings described in preprints.
3. OMICS
Omics techniques are extensively employed in biomedical research, generating vast amounts of data that necessitate careful analysis to uncover significant discoveries. A novel application of GPT-4 is to annotate cell types in single-cell RNA sequencing (scRNA-Seq) data [16], traditionally a labor-intensive and expertise-demanding task. Leveraging the wealth of online texts that offer detailed descriptions of signature genes for various cell types, GPT-4 can efficiently identify cell types based on a tissue name and a list of as few as ten marker genes identified from standard single-cell analysis pipelines such as Seurat (Figure 2). When evaluated across ten datasets encompassing hundreds of tissues and cell types, GPT-4 demonstrates strong concordance with manual annotations and surpasses several competing methods including CellMarker 2.0, ScType, and SingleR [16]. GPT-4 achieves this impressive performance with basic prompts and does not require biological expertise, reference datasets, or coding experience, thus making cell type annotation easily accessible to general biomedical researchers for scRNA-Seq data analysis. However, given the undisclosed nature of GPT’s training data and the potential for AI-generated errors, expert validation is recommended before leveraging its annotations in further research, especially for tissues and cell types that are not widely studied.
Figure 2: ChatGPT-Powered Cell Type Annotation for scRNA-Seq Data Analysis.
In this application, marker genes for each cell cluster are identified using standard pipelines such as Seurat. These markers, along with the corresponding tissue name, are then incorporated into a prompt template, slightly modified from the GPTCelltype tool [16]. The prompts are submitted to ChatGPT to predict the cell type for each cluster.
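The prompt-assembly step described above can be sketched in a few lines. This is a minimal illustration only: the function name `build_annotation_prompt`, the `top_n` parameter, and the prompt wording are assumptions made for this example and are not the exact GPTCelltype template.

```python
from typing import Dict, List


def build_annotation_prompt(
    tissue: str,
    cluster_markers: Dict[str, List[str]],
    top_n: int = 10,
) -> str:
    """Build one prompt asking a chatbot to name the cell type of each cluster.

    The wording is illustrative and is NOT the exact GPTCelltype template.
    """
    header = (
        f"Identify cell types of {tissue} cells using the following markers, "
        "listed separately for each row. Only provide the cell type name."
    )
    # One row per cluster, keeping only the first top_n markers in the given order.
    rows = [
        f"{cluster}: {', '.join(genes[:top_n])}"
        for cluster, genes in cluster_markers.items()
    ]
    return "\n".join([header, *rows])


# Toy input: marker genes per cluster, e.g. exported from a Seurat analysis.
markers = {
    "cluster0": ["CD3D", "CD3E", "IL7R"],
    "cluster1": ["MS4A1", "CD79A", "CD79B"],
}
prompt = build_annotation_prompt("human PBMC", markers, top_n=2)
print(prompt)
```

The resulting string would then be submitted to the chatbot, and the returned names would still need expert review, as noted above.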
Evaluating GPT models in genomics necessitates benchmark datasets with established ground truths. GeneTuring [17] serves this role with 600 questions related to gene nomenclature, genomic locations, functional characterization, sequence alignment, etc. When tested on this dataset, GPT-3 excels in extracting gene names and identifying protein-coding genes, while ChatGPT (GPT-3.5) and New Bing show marked improvements. Nevertheless, all models face challenges with SNP and alignment questions [17]. This limitation is effectively addressed by GeneGPT [18], which utilizes Codex to consult the National Center for Biotechnology Information (NCBI) database.
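To make the idea of consulting NCBI concrete, the sketch below builds the kind of NCBI E-utilities search request that a tool-using model might issue. It is not GeneGPT's actual code: the helper name `build_esearch_url` and the example query term are assumptions for illustration, and no network request is made.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities web API.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Return an E-utilities esearch URL; this function only builds the string."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: look up a gene symbol in the Gene database.
url = build_esearch_url("gene", "BRCA1[sym] AND human[orgn]")
print(url)
```

Grounding answers in a database lookup of this kind, rather than in the model's memory, is the design choice that lets such tools handle questions where free-text recall is unreliable.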
4. GENETICS
In North America, 34% of genetic counselors incorporate ChatGPT into their practice, especially in administrative tasks [19]. This integration marks a significant shift towards leveraging AI for genetic counseling and underscores the importance of evaluating its reliability. Duong and Solomon [20] analyzed ChatGPT’s performance on multiple-choice questions in human genetics sourced from Twitter. The chatbot achieved a 70% accuracy rate, comparable to human respondents, and performed better on tasks requiring memorization than on those requiring critical thinking. Further analysis by Alkuraya, I. F. [21] revealed ChatGPT’s limitations in calculating recurrence risks for genetic diseases. A notable instance involving cystic fibrosis testing showcases the chatbot’s ability to derive correct equations while faltering in computation, raising concerns over its potential to mislead even professionals [21]. The risk posed by such plausible-sounding but incorrect responses is also identified as significant by genetic counselors [19].
These observations have profound implications for the future education of geneticists. They indicate a shift from memorization tasks to a curriculum that emphasizes critical thinking in varied, patient-centered scenarios, scrutinizing AI-generated explanations rather than accepting them at face value [22]. Moreover, they stress the importance of understanding the operational mechanisms, limitations, and ethical considerations of AI tools in genetics [20]. This shift better prepares geneticists for AI use, ensuring they remain informed about the benefits and risks of the technology.
5. BIOMEDICAL TEXT MINING
For biomedical text mining with ChatGPT, we first summarize works that evaluate the performance of ChatGPT in various biomedical text mining tasks and compare it to state-of-the-art (SOTA) models. We then explore how ChatGPT has been used to reconstruct biological pathways and the prompting strategies used to improve its performance.
5.1. PERFORMANCE ASSESSMENTS ACROSS TYPICAL TASKS
Biomedical text mining tasks typically include named entity recognition, relation extraction, sentence similarity, document classification, and question answering. Chen, Q., et al. [23] assessed ChatGPT-3.5 across 13 publicly available benchmarks. While its performance in question answering closely matched SOTA models like PubMedBERT [24], ChatGPT-3.5 showed limitations in other tasks, with similar observations made for ChatGPT-4 [7, 25, 26]. Extensions to sentence classification and reasoning revealed that ChatGPT was inferior to SOTA pretrained models like BioBERT [27]. These studies highlight the limitations of ChatGPT in some specific domains of biomedical text mining where domain-optimized language models excel. Nevertheless, when training sets with task-specific annotations are insufficient, zero-shot LLMs, including ChatGPT-3.5, outperform SOTA finetuned biomedical models [28]. A compilation of performance metrics for ChatGPT and other baseline models on various biomedical text mining tasks is listed in Supplementary Table S2.
Biomedical Knowledge Graphs (BKGs) have emerged as a novel paradigm for managing large-scale, heterogeneous biomedical knowledge from expert-curated sources. Hou, Y., et al. [29] evaluated ChatGPT’s capability on question-answering tasks using topics collected from the “Alternative Medicine” sub-category on “Yahoo! Answers” and compared it to the Integrated Dietary Supplements Knowledge Base (iDISK) [30]. While ChatGPT-3.5 showed comparable performance to iDISK, ChatGPT-4 was superior to both. However, when tasked to predict drugs or dietary supplements that could be repositioned for Alzheimer’s Disease, ChatGPT primarily responded with candidates already in clinical trials or existing literature. Moreover, ChatGPT’s efforts to establish associations between Alzheimer’s Disease and hypothetical substances were less than impressive. This highlights ChatGPT’s limitations in making novel discoveries or establishing new entity relationships within BKGs.
ChatGPT’s underperformance in some specific text mining tasks against SOTA models or BKGs identifies areas for enhancement. Finetuning LLMs, although beneficial, remains out of reach for most users due to the high computational demand. Therefore, techniques like prompt engineering, including one/few-shot in-context learning and Chain-of-Thought (CoT; see Table 1 for terminologies cited in this review), can be more practical to improve LLM efficiency in text mining tasks [23, 25, 27, 31]. For instance, incorporating examples with CoT reasoning enhances the performance of ChatGPT over both zero-shot (no example) and plain-example prompting in sentence classification and reasoning tasks [27] as well as knowledge graph reconstruction from literature titles [32]. However, simply increasing the number of examples does not always yield better performance [25, 27]. This underscores another challenge in optimizing LLMs for specialized text mining tasks, necessitating more efficient prompting strategies to ensure consistent reliability and stability.
Table 1.
Terminologies cited in this review.
| Term | Definition |
|---|---|
| Prompt engineering | The practice of designing and refining input prompts (natural language instruction) to elicit desired responses from a language model chatbot. |
| Zero-shot | A way of prompting where instruction to the chatbot contains no example of a specified task. |
| One-shot | A way of prompting where instruction to the chatbot contains one example of a specified task. |
| Few-shot | A way of prompting where instruction to the chatbot contains more than one example of a specified task. |
| Chain of Thought (CoT) | A way of prompting asking the chatbot to think step by step. This approach helps in enhancing the model’s ability to solve complex problems by breaking them down into simpler, sequential steps. For one/few-shot, if an example includes details of step-by-step reasoning, the example is called CoT example. |
| Tree of Thought (ToT) | An extension of the Chain of Thought approach, where the model generates a tree-like structure of reasoning steps instead of a linear chain. |
| In-Context Learning (ICL) | A learning paradigm where a model leverages the context provided within the input to adapt and respond to new tasks or information without explicit retraining. |
| Retrieval-Augmented Generation (RAG) | A technique that combines a retriever model, which fetches relevant documents or data, with a generator model, which uses the retrieved information to generate responses or complete tasks. This approach is useful for tasks that require external knowledge or context. |
| Fine-tuning | The process of further training a pre-trained model on a specific dataset or task to improve its performance in that area. |
| Instruction tuning | The process of fine-tuning a pre-trained model to better understand and follow natural language instructions, improving its applicability across different tasks. |
| Task tuning | The process of fine-tuning a pre-trained model on a specific task to enhance its performance on that task. |
| AI hallucination | The phenomenon where a generative AI model produces false or misleading information not supported by the input data or its training. |
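
The zero-shot, one/few-shot, and CoT entries in Table 1 differ only in how the prompt is assembled, which a small sketch can make concrete. The `Example` type, the `build_prompt` function, the cue phrases, and the toy sentences are all assumptions made for illustration and are not taken from any of the reviewed studies.

```python
from typing import NamedTuple, Optional, Sequence


class Example(NamedTuple):
    text: str
    answer: str
    reasoning: Optional[str] = None  # step-by-step rationale, used for CoT examples


def build_prompt(
    task: str,
    query: str,
    examples: Sequence[Example] = (),
    cot: bool = False,
) -> str:
    """Assemble a zero-shot (no examples), one/few-shot, or CoT prompt."""
    parts = [task]
    for ex in examples:
        parts.append(f"Input: {ex.text}")
        if cot and ex.reasoning:
            # A CoT example shows the reasoning before the answer.
            parts.append(f"Reasoning: {ex.reasoning}")
        parts.append(f"Answer: {ex.answer}")
    parts.append(f"Input: {query}")
    parts.append("Let's think step by step, then give the answer." if cot else "Answer:")
    return "\n".join(parts)


task = "Decide whether the sentence states a drug-drug interaction. Reply yes or no."
examples = [
    Example(
        "Aspirin may increase the anticoagulant effect of warfarin.",
        "yes",
        "The sentence says one drug changes the effect of another drug.",
    )
]
query = "Metformin is usually taken with meals."

zero_shot = build_prompt(task, query)
one_shot = build_prompt(task, query, examples)
cot_one_shot = build_prompt(task, query, examples, cot=True)
print(cot_one_shot)
```

Adding more entries to `examples` turns the one-shot prompt into a few-shot prompt without any other change.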
5.2. BIOLOGICAL PATHWAY MINING
Another emerging application of biomedical text mining from LLMs is to build biological pathways. Azam, M., et al. [33] conducted a broader assessment of mining gene interactions and biological pathways across 21 LLMs, including seven Application Programming Interface (API)-based and 14 open-source models. ChatGPT-4 and Claude-Pro emerged as leaders, though they only achieved F1 scores less than 50% for gene relation predictions and a Jaccard index less than 0.3 for pathway predictions. Another evaluation work on retrieving protein-protein interaction (PPI) from sentences reported a modest F1 score for both GPT-3.5 and GPT-4 with base prompts [34]. All the studies underscore the inherent challenges generic LLMs face in delineating gene relationships and constructing complex biological pathways from biomedical text without prior knowledge or specific training.
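For readers less familiar with the two metrics quoted above, the following sketch shows how an F1 score for predicted gene relations and a Jaccard index for predicted pathway membership can be computed. The matching rule (exact match on undirected gene pairs) and the placeholder identifiers `G1` to `G6` are assumptions for illustration; the cited studies may define matches differently.

```python
from typing import FrozenSet, Set


def f1_score(predicted: Set, reference: Set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def jaccard_index(a: Set, b: Set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge(gene_a: str, gene_b: str) -> FrozenSet[str]:
    # Undirected relation: (G1, G2) and (G2, G1) count as the same edge.
    return frozenset((gene_a, gene_b))


# Placeholder identifiers, not real genes.
predicted_edges = {edge("G1", "G2"), edge("G2", "G3"), edge("G4", "G5")}
reference_edges = {edge("G2", "G1"), edge("G2", "G3"), edge("G3", "G4"), edge("G1", "G5")}
relation_f1 = f1_score(predicted_edges, reference_edges)

predicted_pathway = {"G1", "G2", "G3", "G6"}
reference_pathway = {"G1", "G2", "G4", "G5"}
pathway_jaccard = jaccard_index(predicted_pathway, reference_pathway)

print(round(relation_f1, 3), round(pathway_jaccard, 3))
```

In this toy case two of three predicted edges are correct and two of four reference edges are recovered, so the F1 score is 4/7; the gene sets share two of six distinct members, so the Jaccard index is 1/3.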
The capabilities of ChatGPT in knowledge extraction and summarization present promising avenues for pathway database curation support. Tiwari, K., et al. [35] explored its utility in the Reactome curation process, notably in identifying potential proteins for established pathways and generating comprehensive summaries. For the case study on the circadian clock pathway, ChatGPT proposed 13 new proteins, five of which were supported by the literature but overlooked in traditional manual curation. When summarizing a pathway from multiple literature extracts, ChatGPT struggled to resolve contradictions, but its performance improved when inputs contained in-text citations. Similarly, the use of ChatGPT for annotating long non-coding RNAs in the EVLncRNAs 3.0 database [36] faces issues with inaccurate citations. Both works urge caution against the direct use of ChatGPT in database curation.
Supplementing ChatGPT with domain knowledge or literature has been shown to mitigate some of its intrinsic limitations. The inclusion of a protein dictionary in prompts improves the performance of GPT-3.5 and GPT-4 in the PPI task [34]. Chen, X., et al. [37] augmented ChatGPT with literature abstracts to identify genes involved in arthrofibrosis pathogenesis. Similarly, Fo, K., et al. [38] supplied GPT-3.5 with plant biology abstracts to uncover over 400,000 functional relationships among genes and metabolites. This domain knowledge/literature-backed approach enhances the reliability of chatbots in text generation by reducing AI hallucination [39, 40].
Addressing LLMs’ intrinsic limitations can also involve sophisticated prompt engineering. Chen, Y., et al. [41] introduced an iterative prompt optimization procedure to boost ChatGPT’s accuracy in predicting gene-gene interactions, utilizing the KEGG pathway database as a benchmark. Initial tests without prompt enhancements showed a performance decline as ChatGPT was upgraded from March to July 2023, but the strategic use of role and few-shot prompts significantly countered this trend. The iterative optimization process, which employed the tree-of-thought methodology [42], achieved notable improvements in precision and F1 scores [41]. These experiments demonstrate the value of strategic prompt engineering in aligning LLM outputs with complex biological knowledge for better performance.
6. DRUG DISCOVERY
Drug discovery is a complex and failure-prone process that demands significant time, effort, and financial investment. The emerging interest in ChatGPT’s potential to facilitate drug discovery has captivated the pharmaceutical community [43, 44, 45, 46]. Recent studies have showcased the chatbot’s proficiency in addressing tasks related to drug discovery; a compilation of performance metrics for ChatGPT and other baseline models is listed in Supplementary Table S3. GPT-3.5, for example, has been noted for its respectable accuracy in identifying associations between drugs and diseases [47]. Furthermore, GPT models exhibit strong performance in tasks related to textual chemistry, such as generating molecular captions, but face challenges in tasks that require accurate interpretation of Simplified Molecular-Input Line-Entry System (SMILES) strings [48]. Research by Juhi, A., et al. [49] highlighted ChatGPT’s partial success in predicting and elucidating drug-drug interactions (DDIs). When benchmarked against two clinical tools, GPT models achieved an accuracy rate of 50–60% in DDI prediction and improved further by 20–30% with internet search through Bing; a comparison to SOTA methods was not conducted [50]. When evaluated using the DDI corpus [51], ChatGPT achieved a micro F1 score of 52%, lower than SOTA BERT-based models [23]. In more rigorous assessments, ChatGPT was unable to pass various pharmacist licensing examinations [52, 53, 54]. It also shows limitations in patient education and in recognizing adverse drug reactions [55]. These findings suggest that, although ChatGPT offers valuable support in drug discovery, its capacity to tackle complex challenges remains limited and necessitates close human oversight.
In the following sections, we review three important aspects of using LLM chatbots such as ChatGPT in drug discovery (Figure 3). We first focus on examples and tools that facilitate a human-in-the-loop approach for reliable use of ChatGPT in drug discovery. We then highlight the advances brought by strategic prompting using in-context learning with examples to increase the response accuracy of ChatGPT. Lastly, we summarize the progress in using task and/or instruction finetuning to adapt a foundation model to specific tasks, which has been demonstrated mostly with open-source models but could be extended to GPT-3.5 and GPT-4.
Figure 3: Key Themes from the Application of GPTs and Other LLMs in Drug Discovery Tasks.
The human-in-the-loop section highlights a case study and three interactive tools that facilitate communication between users and chatbots. The in-context learning section emphasizes the use of ad-hoc examples or examples sourced by retrieval-augmented generation to guide chatbots for better performance. The fine-tuning section demonstrates examples on task and/or instruction tuning, primarily with open large language models. Works focusing on the use of GPTs are highlighted in red.
6.1. HUMAN-IN-THE-LOOP
The application of AI in drug development necessitates substantial expertise from human specialists for result refinement. This collaborative approach is illustrated in a case study focusing on the development of anti-cocaine addiction drugs aided by ChatGPT [56]. Throughout this process, GPT-4 assumes three critical roles in sparking new ideas, clarifying methodologies, and providing coding assistance. To enhance its performance, the chatbot is equipped with various plugins at each phase to ensure deeper understanding of context, access to the latest information, improved coding capabilities, and more precise prompt generation. The responses generated by the chatbot are critically evaluated with existing literature and expert domain knowledge. Feedback derived from this evaluation is then provided to the chatbot for further improvement. This iterative, human-in-the-loop methodology led to the identification of 15 promising multi-target leads for anti-cocaine addiction [56]. This example underscores the synergistic potential of human expertise and AI in advancing drug discovery efforts.
Several tools leveraging LLMs offer interactive interfaces to enhance molecule description and optimization. ChatDrug [57] is a framework that can use the GPT API or other open-source LLMs to streamline the process of editing small molecules, peptides, or proteins (Figure 4). It features a prompt design module equipped with a collection of template prompts customized for different types of editing tasks. The core of ChatDrug is a retrieval and domain feedback module that ensures the response is grounded in real-world examples and safeguarded through expert scrutiny: the retrieval sub-module selects examples from external databases, while the domain feedback sub-module integrates feedback from domain experts through iteration. Additionally, ChatDrug includes a conversational module dedicated to further interactive refinement. Similar tools based on other LLMs have been developed. DrugChat [58], based on Vicuna-13b, offers interactive question-and-answer and textual explanations starting from drug graph representations. DrugAssist [59], based on Llama-2-7B, utilizes external database retrieval for hints and allows iterative refinement with expert feedback. This process of iterative refinement, supported by expert feedback and by example retrieval from external databases as contextual hints (also known as retrieval-augmented generation, RAG), enhances the model’s accuracy and relevance to practical applications.
Figure 4: Illustration of ChatDrug for Conversational Drug Editing with GPT.
In ChatDrug [57], initial prompts are derived from a Prompt Design for Domain-Specific (PDDS) module, which provides tailored templates for specific drug editing tasks. If the response from the chatbot (using GPT-4 as an example) is unsatisfactory, a Retrieval and Domain Feedback (ReDF) module leverages domain knowledge to refine the prompts. Sample prompts, shown in red boxes, are extracted from Liu, S., et al. [57] for a small molecule editing task. In this case, the initial prompts did not yield satisfactory responses (first try), prompting updates from the ReDF module, which subsequently led to satisfactory outcomes (second try).
6.2. IN-CONTEXT LEARNING
In-context learning (ICL) enhances chatbots’ responses by leveraging examples from a domain knowledgebase through prompting without finetuning a foundation model [60]. This approach utilizes examples closely aligned with the subject matter to ground the responses of ChatGPT with relevant domain knowledge [57, 61]. Evaluating GPTs’ capabilities across various chemistry-related tasks has shown that including contextually similar examples results in superior outcomes compared to approaches that use no example or employ random sampling; the performance of these models improves progressively with the inclusion of additional examples [48, 61, 62]. ICL also boosts the accuracy in more complex regression tasks, rendering GPT-4 competitive with dedicated machine learning models [63, 64]. Lastly, instead of using specific examples, enriching the context with related information, such as disease backgrounds and synonyms in a fact-check task on drug-disease associations [47], also improves response accuracy. These examples, with in-context learning and context enrichment, underscore the critical role of domain knowledge in improving the quality and reliability of GPTs’ responses in drug discovery tasks.
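The contrast between similar and randomly sampled examples can be illustrated with a small selection routine. This is a sketch under loud assumptions: character-bigram overlap is used here only as a crude stand-in for a chemistry-aware similarity measure, the function names are hypothetical, and the labels are placeholders rather than measured properties.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of overlapping character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-grams (a crude stand-in, not a chemical metric)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def select_examples(
    query: str, pool: List[Tuple[str, str]], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k pool entries most similar to the query; ties keep pool order."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex[0]), reverse=True)
    return ranked[:k]


# (SMILES string, placeholder label) pairs; the labels carry no real data.
pool = [
    ("c1ccccc1", "label_A"),
    ("CCO", "label_B"),
    ("CC(=O)O", "label_C"),
    ("CCCO", "label_D"),
]
chosen = select_examples("CCCCO", pool, k=2)
print(chosen)
```

The selected pairs would then be placed in the prompt as few-shot examples ahead of the query molecule.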
6.3. INSTRUCTION FINETUNING
Task-tuning language models for specific tasks within drug discovery has shown considerable promise, as evidenced by two recent projects. ChatMol [65] is a chatbot based on the T5 model [66], finetuned with experimental property data and molecular spatial knowledge to improve its capabilities in describing and editing target molecules. Task-tuning GPT-3 has demonstrated notable advantages over traditional machine learning approaches, particularly in tasks where training data is small [62]. Task-tuning also significantly improves GPT-3 in extracting DDI triplets, showcasing a substantial F1 score enhancement over GPT-4 with few-shots [67]. These projects demonstrate that task-tuning of foundation models can effectively capture the complex knowledge at the molecule level relevant to drug discovery.
Instruction tuning diverges from task tuning by training an LLM across a spectrum of tasks using instruction-output pairs, which enables the model to address new, unseen tasks [68]. DrugAssist [59], a Llama-2-7B-based model, achieved competitive results when simultaneously optimizing multiple properties after being instruction-tuned on data covering individual molecule properties. Similarly, DrugChat [58], a Vicuna-13b-based model instruction-tuned with examples from databases like ChEMBL and PubChem, effectively answered open-ended questions about graph-represented drug compounds. Mol-Instructions [69], a large-scale instruction dataset tailored for the biomolecular domain, demonstrated its effectiveness in finetuning models like Llama-7B on a variety of tasks, including molecular property prediction and biomedical text mining.
Task-tuning may be combined with instruction tuning to combine the strengths of each. ChemDFM [70], pre-trained on LLaMa-13B with a chemically rich corpus and further enhanced through instruction tuning, excelled in a range of chemical tasks, particularly in molecular property prediction and reaction prediction, outperforming models like GPT-4 with in-context learning. InstructMol [71] is a multi-modality instruction-tuning-based LLM featuring a two-stage tuning process: first instruction tuning with molecule graph-text caption pairs to integrate molecule knowledge, and then task-specific tuning for three drug discovery-related molecular tasks. Applied to Vicuna-7B, InstructMol surpassed other leading open-source LLMs and narrowed the performance gap with specialized models [71]. These developments underscore the effectiveness of both task and instruction tuning as strategies for enhancing generalized foundation models with domain-specific knowledge to address specific challenges in drug discovery.
It is important to note that the significant improvements observed through task-tuning and/or instruction-tuning primarily involve open-source large language models. These techniques have shown great promise in enhancing model performance in various drug discovery tasks. We noticed that fine-tuning of GPT-3.5 is still in its infancy, but encouraging preliminary results have recently been documented in chemical text mining [72]. Unlike its predecessors, GPT-4’s fine-tuning capabilities are currently under exploration in an experimental program by OpenAI. As these options become more broadly available, they are expected to significantly advance the field of drug discovery through task/instruction fine-tuning.
7. BIOMEDICAL IMAGE UNDERSTANDING
In recent advancements, multimodal AI models have garnered significant attention in biomedical research [73]. Released in late September 2023, GPT-4V(ision) has been the subject of numerous studies that explored its application in image-related tasks across various biomedical topics [74, 75, 76, 77, 78, 79, 80]. For biomedical images, GPT-4V exhibits performance rivaling professionals in Medical Visual Question Answering [78, 79] and rivals traditional image models in biomedical image classification [81]. For scientific figures, GPT-4V can proficiently explain various plot types and apply domain knowledge to enrich interpretations [82].
Despite the impressive performance, current evaluations reveal significant limitations. OpenAI acknowledges the limitations of GPT-4V in differentiating closely located text and its tendency to make factual errors in an authoritative tone [83]. The model is not competent in perceiving visual patterns’ colors, quantities, and spatial relationships in bioinformatics scientific figures [82]. Image interpretation with domain knowledge from GPT-4V may risk “confirmation bias” [84]: either the observation or conclusion is incorrect, but the supporting knowledge is valid by itself in another, irrelevant context [82], or the observation or conclusion is correct, but the supporting knowledge is invalid or irrelevant [85]. Such biases are particularly concerning as users without requisite expertise might be easily misled by these plausible responses.
Prompt engineering has been instrumental in enhancing AI responses to text inputs. The emergence of GPT-4V emphasizes the need to develop equivalent methodologies for visual inputs to refine chatbots’ comprehension across modalities. The field of computer vision has already witnessed some progress in this direction [86]. Yang, Z., et al. [87] proposed visual referring prompting (VRP), which sets visual pointer references by directly editing input images to augment textual prompts with visual cues. VRP has proven effective in preliminary case studies, leading to the creation of benchmarks such as VRPTEST [88] to evaluate its efficacy. Yet, a thorough, quantitative assessment of VRP’s impact on GPT-4V’s understanding of biomedical images remains to be explored.
8. BIOINFORMATICS PROGRAMMING
ChatGPT enables scientists who may not possess advanced programming skills to perform bioinformatics analysis. Users can articulate data characteristics, analysis details, and objectives in natural language, prompting ChatGPT to respond with executable code. In this context, we define “prompt bioinformatics”: the use of natural language instructions (prompts) to guide chatbots for reliable and reproducible bioinformatics data analysis through code generation [13]. This concept differs from the bioinformatics chatbots developed before the GPT era, such as DrBioRight [89] and RiboChat [90]. In prompt bioinformatics, the code is generated on the fly by the chatbot in response to a data analysis description. In addition, the generated code inherently varies across different chat sessions even for the same instruction, making result reproducibility a challenge that new methods will need to address. Lastly, the concept covers a broad range of bioinformatics topics, particularly those in applied bioinformatics, where data analysis methods are relatively mature.
Early case studies showcase ChatGPT’s versatility in addressing diverse bioinformatics coding tasks, from aligning sequencing reads to constructing evolutionary trees [10], and excelling in introductory course exercises [11]. ChatGPT excels at writing short scripts that call existing functions with specific instructions. However, it shows limitations in writing longer, workable code for more complex data analysis, and its errors often require domain-specific knowledge to spot and correct [91].
8.1. APPLICATION IN APPLIED BIOINFORMATICS
In applied bioinformatics, established data analysis methods are widely used, which increases the likelihood that they are included in LLM training datasets. Thus, applied bioinformatics emerges as a fertile ground for practicing prompt bioinformatics and evaluating its effectiveness. AutoBA [12], a Python package powered by LLMs, streamlines applied bioinformatics for multi-omics data analysis by autonomously designing analysis plans, generating code, managing package installations, and executing the code. In tests across 40 varied sequencing-based analysis scenarios, AutoBA with GPT-4 attained a 65% success rate in end-to-end automation [12]. Feeding error messages back for code correction significantly enhanced this success rate. In addition, AutoBA utilizes retrieval-augmented generation to increase the robustness of code generation [12].
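The error-message feedback described above can be illustrated with a minimal sketch. It is not AutoBA's implementation: `run_code`, `generate_with_feedback`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that simply returns a corrected script once it sees a traceback in the prompt. A real setup would replace it with a call to a chatbot API and would run generated code in a sandbox rather than with `exec`.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_code(code: str) -> Optional[str]:
    """Execute code in a fresh namespace; return None on success, else the traceback text."""
    try:
        exec(code, {})
        return None
    except Exception:
        return traceback.format_exc()


def generate_with_feedback(
    task: str,
    ask_llm: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for code, run it, and feed any error message back into the next prompt."""
    prompt = task
    for round_no in range(1, max_rounds + 1):
        code = ask_llm(prompt)
        error = run_code(code)
        if error is None:
            return code, round_no
        prompt = (
            f"{task}\n\nThe previous code was:\n{code}\n\n"
            f"It failed with:\n{error}\nPlease return corrected code."
        )
    return None, max_rounds


# Hypothetical stand-in for a chatbot: the first answer has a bug,
# the second answer (given after the traceback is shown) is fixed.
def fake_llm(prompt: str) -> str:
    if "It failed with" in prompt:
        return "gc = sum(b in 'GC' for b in 'ATGC') / len('ATGC')"
    return "gc = sum(b in 'GC' for b in seq) / len(seq)"  # NameError: seq is undefined


code, rounds = generate_with_feedback("Compute the GC content of ATGC.", fake_llm)
```

In this toy run the loop succeeds on the second round. Capping the number of rounds keeps the loop from cycling indefinitely when the error cannot be fixed from the message alone.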
Mergen [92] is an R package that automates data analysis through LLM utilization. It crafts, executes, and refines code based on user-provided textual descriptions. The inclusion of file headers in prompts and error message feedback notably improves coding efficacy. The evaluation tasks for Mergen, while relevant to bioinformatics, cater to a general-purpose scope, covering machine learning, statistics, visualization, and data wrangling. Interestingly, the adoption of role-playing does not yield significant enhancements [92], possibly due to the general nature of the tasks and the mismatch between the assumed bioinformatician role and the task requirements.
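Including a file header in the prompt can be sketched as follows. This is an illustrative assumption about how such a prompt might be assembled, not Mergen's actual code: `build_prompt` is a hypothetical helper, and the table content is made up.

```python
import csv
import io


def build_prompt(task: str, table_text: str, n_rows: int = 3, delimiter: str = ",") -> str:
    """Prepend the column names and a few example rows to a task description."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    header, preview = rows[0], rows[1 : 1 + n_rows]
    lines = [
        "You are given a delimited text file.",
        "Columns: " + ", ".join(header),
        "First rows:",
    ]
    lines += [delimiter.join(r) for r in preview]
    lines += ["", "Task: " + task, "Return only R code."]
    return "\n".join(lines)


# Made-up example table; gene names and values are illustrative only.
example = "gene,log2FC,padj\nTP53,-1.2,0.001\nMYC,2.4,0.0004\nGAPDH,0.1,0.9\nACTB,0.0,0.95\n"
prompt = build_prompt("Draw a volcano plot of log2FC against padj.", example, n_rows=2)
```

Showing only the header and a few rows gives the model the column names and value types it needs without sending the whole dataset.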
LLMs exhibit inherent limitations in coding with tools beyond their training datasets. Bioinformaticians typically consult user manuals and source code to master new tools, a process LLMs could emulate. The BioMANIA framework [93] exemplifies this approach by creating conversational chatbots for open-source, well-documented Python tools. By understanding APIs from source code and user manuals, it employs GPT-4 to generate instructions for API usage. These instructions inform a BERT-based model that suggests the most appropriate APIs for a user’s query, with GPT-4 predicting parameters and executing API calls. Evaluation of the method identifies areas for improvement, such as tutorial documentation and API design, guiding the future development of chatbot-compatible tools [93].
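The API-suggestion step can be illustrated with a deliberately simplified sketch. BioMANIA uses a BERT-based model for this step; the code below swaps in a plain bag-of-words cosine similarity so that it runs without any model download. The API names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_apis(query: str, api_docs: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Rank API descriptions by similarity to the query and keep the best k."""
    q = tokenize(query)
    scored = [(name, cosine(q, tokenize(doc))) for name, doc in api_docs.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]


# Hypothetical API names and one-line descriptions.
api_docs = {
    "toolkit.read_counts": "read a gene expression count matrix from a file",
    "toolkit.normalize": "normalize counts per cell to a target total",
    "toolkit.cluster": "cluster cells into groups using a neighborhood graph",
}
ranked = top_apis("cluster the cells into groups", api_docs, k=2)
```

Here `toolkit.cluster` ranks first. A learned encoder would additionally match paraphrases that share no words with the description, which this word-overlap version cannot do.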
8.2. BIOMEDICAL DATABASE ACCESS
Structured Query Language (SQL) serves as a pivotal tool for navigating bioinformatics databases. Mastering SQL requires users to have both programming skills and a deep understanding of the database’s data schema—prerequisites that many biomedical scientists find challenging. Recent advancements have seen LLM-chatbots like ChatGPT stepping in to translate natural language questions into SQL queries [94], significantly easing database access for non-programmers.
The work by Sima, A.-C. and de Farias, T. M. [95] explored ChatGPT-4’s ability to explain and generate SPARQL queries for public biological and bioinformatics databases. Faced with explaining a complex SPARQL query that identifies human genes linked to cancer and their orthologs in rat brains, which requires combining data from the UniProt, OMA, and Bgee databases, ChatGPT adeptly broke down the query’s elements. However, its attempt to craft a SPARQL query from a natural language description for the same database search revealed inaccuracies that required specific human feedback for correction. Notably, prompts augmented with semantic clues such as variable names and inline comments substantially improved performance in translating questions into the corresponding SPARQL queries when evaluated on a fine-tuned OpenLlama LLM [96].
Another work by Chen, C. and Stadler, T. [97] applied GPT-3.5 and GPT-4 to convert user inputs into SQL queries for accessing a database of SARS-CoV-2 genomes and their annotations. Through systematic prompting and learning from numerous examples, the chatbot shows proficiency in understanding the database structure and generates accurate queries for 90.6% and 75.2% of the requests with GPT-4 and GPT-3.5, respectively. In addition, the chatbot initiates a new session to explain each query so that users can cross-reference the explanation with their own inputs, minimizing the risk of misunderstandings.
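A minimal sketch of schema-aware prompting for SQL generation is given below. The schema, the few-shot example, and the helper names are all hypothetical; the actual database layout and prompts used in [97] are not described here. The sketch also adds a cheap validity check, compiling the returned query against an empty in-memory SQLite copy of the schema, which is our own illustrative addition.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical, simplified schema; the real database layout is not given in the text.
SCHEMA = """
CREATE TABLE sequences (
    accession TEXT PRIMARY KEY,
    country   TEXT,
    lineage   TEXT,
    collected DATE
);
"""

# Hypothetical few-shot example pairing a question with a query.
EXAMPLES: List[Tuple[str, str]] = [
    ("How many sequences are from Switzerland?",
     "SELECT COUNT(*) FROM sequences WHERE country = 'Switzerland';"),
]


def build_sql_prompt(question: str) -> str:
    """Combine the schema, worked examples, and the user's question into one prompt."""
    shots = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in EXAMPLES)
    return f"Database schema:\n{SCHEMA}\nExamples:\n{shots}\n\nQ: {question}\nSQL:"


def check_sql(query: str) -> Optional[str]:
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.execute("EXPLAIN " + query)  # compiles the statement without needing data
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    finally:
        conn.close()
```

For example, `check_sql("SELECT region FROM sequences;")` reports a missing column, which could be fed back to the chatbot. Such a check catches only syntax and schema errors; a query that compiles may still answer a different question than the one asked, which is why an independent explanation of the query remains useful.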
8.3. ONLINE TOOLS FOR CODING WITH CHATGPT
Shortly after the release of ChatGPT in November 2022, RTutor.AI emerged as a pioneering web-server powered by the GPT technology dedicated to data analysis. This R-based platform simplifies the process for users to upload a single tabular dataset and articulate their data analysis requirements in natural language. RTutor.AI proficiently manages data importing and type conversion, subsequently leveraging OpenAI’s API for R code generation. It executes the generated code and produces downloadable HTML reports including figure plots. A subsequent application, Chatlize.AI, developed by the same team, adopts the tree-of-thought methodology [42] to enhance data analysis exploration. This approach, extending to Python, enables the generation of multiple code versions for a given analysis task, their execution, and comprehensive documentation of the results. Users benefit from the flexibility to select a specific code for further analysis. This feature is particularly valuable for exploratory data analysis, making Chatlize.AI a flexible solution for practicing prompt bioinformatics.
The Code Interpreter, which was officially integrated into ChatGPT-4 during the summer of 2023 and became a default option in GPT-4o in May 2024, represents a significant advancement in streamlining computational tasks. This feature facilitates a wide array of operations, including data upload, specification of analysis requirements, generation and execution of Python code, visualization of results, and data download, all through natural language instructions. It stands out for its ability to dynamically adapt code in response to runtime errors and self-assess the outcomes of code execution. Despite its broad applicability for general-purpose tasks such as data manipulation and visualization, its utility in bioinformatics data analysis is limited by the absence of bioinformatics-specific packages and the inability to access external databases [98].
8.4. BENCHMARKS FOR BIOINFORMATICS CODING
A thorough assessment of bioinformatics coding necessitates comprehensive benchmarks that cover a broad range of topics in the field. Writing individual functions is a fundamental skill in the development of advanced bioinformatics algorithms. BIOCODER [99] is a benchmark to evaluate language models’ proficiency in function writing. This benchmark encompasses over 2,200 Python and Java functions derived from authentic bioinformatics codebases, in addition to 253 functions sourced from the Rosalind project. Comparative analyses have shown that GPT-3.5 and GPT-4 significantly outperform smaller, coding-specific language models on this benchmark. Interestingly, integrating topic-specific context, such as imported objects, into the baseline task descriptions markedly enhances accuracy. However, even the most adept models, namely the GPT series, hit an accuracy ceiling, with GPT-4 reaching 60%. A significant proportion of the failures are attributed to syntax or runtime errors [99], suggesting that ChatGPT’s effectiveness in bioinformatics coding can be further enhanced through human feedback on error messages.
Execution success is crucial, yet it represents only one facet of evaluating bioinformatics code quality. Sarwal, V., et al. [100] proposed a comprehensive evaluation framework that encompasses seven metrics, assessing both subjective and objective dimensions of code writing. These dimensions include readability, correctness, efficiency, simplicity, error handling, code examples, and clarity of input/output specifications. Each metric is scaled from 1 to 10 and normalized independently post-evaluation across models. When applied to a variety of common bioinformatics tasks, this framework highlighted GPT-4’s superior performance over alternatives such as BARD and LLaMA. However, the current evaluation remains narrowly focused on a limited number of tasks [100]. Expanding these evaluations to encompass a broader range of bioinformatics domains calls for community-led efforts toward a comprehensive appraisal of these language models.
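The distinction between syntax errors, runtime errors, and wrong answers can be made concrete with a small sketch of function-level evaluation. This is not the BIOCODER harness: `evaluate` is a hypothetical helper, and the four candidate implementations of a reverse-complement function are made up to trigger each outcome.

```python
from typing import Any, List, Tuple


def evaluate(code: str, func_name: str, cases: List[Tuple[tuple, Any]]) -> str:
    """Return 'pass', 'syntax_error', 'runtime_error', or 'wrong_answer'."""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        exec(compiled, namespace)  # defines the candidate function
        for args, expected in cases:
            if namespace[func_name](*args) != expected:
                return "wrong_answer"
    except Exception:
        return "runtime_error"
    return "pass"


# Test cases for a reverse-complement function: (arguments, expected result).
cases = [(("ATGC",), "GCAT"), (("AAC",), "GTT")]

# Made-up candidate answers, one per outcome.
candidates = {
    "ok": "def revcomp(s):\n    return s[::-1].translate(str.maketrans('ACGT', 'TGCA'))",
    "syntax": "def revcomp(s)\n    return s",
    "runtime": "def revcomp(s):\n    return s.reverse()",
    "wrong": "def revcomp(s):\n    return s[::-1]",
}
results = {name: evaluate(code, "revcomp", cases) for name, code in candidates.items()}
accuracy = sum(r == "pass" for r in results.values()) / len(results)
```

Syntax and runtime failures come with an error message that can be returned to the model, whereas a wrong answer gives no such signal and needs test cases or expert review to detect.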
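Per-metric normalization across models can be sketched as below. The text does not state which normalization the framework uses, so min-max scaling is an assumption made for illustration; the model names and scores are likewise made up and do not reproduce any reported result.

```python
from typing import Dict

METRICS = ["readability", "correctness", "efficiency", "simplicity",
           "error_handling", "code_examples", "io_clarity"]


def normalize_per_metric(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Min-max scale each metric across models to [0, 1]; constant metrics map to 0.0."""
    out: Dict[str, Dict[str, float]] = {model: {} for model in scores}
    for metric in METRICS:
        values = [scores[m][metric] for m in scores]
        lo, hi = min(values), max(values)
        for m in scores:
            out[m][metric] = (scores[m][metric] - lo) / (hi - lo) if hi > lo else 0.0
    return out


# Made-up raw scores on the 1-10 scale for three hypothetical models.
raw = {
    "model_a": dict.fromkeys(METRICS, 9.0),
    "model_b": dict.fromkeys(METRICS, 6.0),
    "model_c": dict.fromkeys(METRICS, 3.0),
}
normalized = normalize_per_metric(raw)
```

Normalizing each metric independently puts the metrics on a common scale, but the result is relative: a normalized score of 1.0 only means the best of the models compared, not a perfect score.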
9. CHATBOTS IN BIOINFORMATICS EDUCATION
The potential of integrating LLMs into bioinformatics education has attracted significant discussion. ChatGPT-3.5 achieves impressive performance on Python programming exercises in an entry-level bioinformatics course [11]. Beyond mere code generation, the utility of chatbots extends to proposing analysis plans, enhancing code readability, elucidating error messages, and facilitating language translation in coding tasks [101]. The effectiveness of a chatbot’s response depends on the precision of human instructions, or prompts. In this context, Shue et al. [10] introduced the OPTIMAL model, a framework for prompt refinement through iterative interactions with a chatbot, mirroring the learning curve of bioinformatics beginners assisted by such technologies. To navigate this evolving educational landscape, it becomes imperative to establish guidelines that enable students to critically assess outcomes and articulate constructive feedback to the chatbot for code improvement. Error messages, as one form of such feedback, turn out to be an effective way to boost the coding efficiency of ChatGPT across various studies [10, 12, 92].
The convenience of using chatbots for coding exercises poses a risk of fostering AI overreliance, which can lead to a superficial understanding of the underlying concepts [11, 13, 102]. This AI reliance could undermine students’ performance in summative assessments [11]. Innovative evaluation strategies, such as generating multiple-choice questions from student-submitted code to gauge their understanding [103], are needed to counteract this challenge. Such methodologies should aim to deepen students’ grasp of the material, ensuring an in-depth understanding of coding concepts.
The art of crafting effective prompts emerges as a critical skill that complements traditional programming competencies. General guidelines are well summarized in a recent commentary [104]. In the context of bioinformatics tasks, these include breaking down a complex task into sub-tasks, enriching context with details (e.g., spelling out package names in code-generation tasks and tissue names for cell type annotation in scRNA-Seq analysis), illustrating intent through examples (e.g., supplying a volcano plot for a data visualization task in differentially expressed gene analysis), and specifying the output format to facilitate downstream data processing (e.g., when mining gene relationships from literature abstracts). It is important to note that effective prompting is not formulaic. As with coding in bioinformatics and experimental skills in bench work, experience is gained through repeated experimentation [104]. Intriguingly, feedback from a pilot study involving graduate students interacting with ChatGPT for coding highlights the challenges in generating impactful prompts [105]. This prompt-related psychological strain may discourage students from using the chatbot [13]. In this context, the development of a repository featuring carefully crafted prompts for specific bioinformatics analyses, accompanied by quality metrics, reference code, and outcomes, could serve as a valuable resource for students learning bioinformatics and biomedical informatics with the aid of chatbots [10, 13].
In conclusion, while chatbots demonstrate potential as educational tools, their efficacy and effectiveness have not yet been systematically evaluated in classroom settings with controlled experiments. The use of chatbots should be viewed as supplementary to traditional education methodologies [10, 11, 13]. Meanwhile, new assessment methodologies are needed to measure the pedagogical value of chatbots in enhancing bioinformatics learning without diminishing the depth of understanding of concepts and analytical skills.
10. DISCUSSION AND FUTURE PERSPECTIVES
The year 2023 marked significant progress in leveraging ChatGPT for bioinformatics and biomedical informatics. Early studies affirmed its capability in drafting workable code for basic bioinformatics data analysis [10, 11]. The chatbot has also demonstrated competitiveness with SOTA models in other bioinformatics areas, including identifying cell types from single-cell RNA-Seq data [106], performing question-answering tasks in biomedical text mining [107], and generating molecular captions in drug discovery [48]. These achievements underscore ChatGPT’s proficiency in text-generative tasks. Meanwhile, other LLMs are catching up. For example, Google developed Gemini and the open-source LLM Gemma, which delivered impressive performance in various tasks. Although their applications in bioinformatics and medical informatics have not been reported, their potential offers users a viable alternative to ChatGPT.
Current chatbots exhibit limitations in performing biomedical tasks that require reasoning and quantitative analysis, such as regression and classification [27, 29, 63, 64, 100]. Though not yet widely adopted in bioinformatics [72], OpenAI’s fine-tuning APIs, such as those for GPT-3.5 and GPT-4, hold great potential for performance improvements when the training dataset is large. Nevertheless, the accuracy of ChatGPT’s responses can be significantly improved through a strategic design of its input instructions with prompt engineering. Incorporating examples into prompts and employing CoT reasoning have proven to be effective strategies, as evidenced in various bioinformatics applications [32, 41, 57, 63, 64, 97]. While examples in prompts are sometimes hardcoded, they can also be dynamically and strategically sourced from external knowledge bases or knowledge graphs [57, 59, 61, 108]. This approach, known as retrieval-augmented generation, improves ChatGPT’s reliability by sourcing facts from domain-specific knowledge and represents a promising avenue for future development in bioinformatics with chatbots.
Another significant limitation of ChatGPT, like all other LLMs, is hallucination [39, 40]. This occurs when ChatGPT fabricates non-factual content. Instances in bioinformatics applications include inventing functions that do not exist in coding [10], generating false positives when mining gene relationships from biomedical text [41], and fabricating molecular functions for gene annotation [36]. While hallucination in code-generation tasks may be detected through code execution and partially corrected through error-message feedback, other types require expert knowledge to detect, posing significant risks to general users. To reduce hallucination, one can condition the chatbot with relevant context, such as through retrieval-augmented generation (RAG), or supplement it with external tools such as task-specific APIs [18]. Despite these strategies, evaluation and remediation techniques that detect hallucinations in LLMs such as ChatGPT with the accuracy of human experts and the efficiency of computational programs are urgently needed, and developing them remains an ongoing challenge for bioinformatics applications with chatbots.
In this rapidly evolving domain, ChatGPT has experienced several significant upgrades within its first year alone. We acknowledge that not every upgrade enhances performance across the board [109]. Consequently, prompts that are highly effective with the current version for specific tasks may not maintain the same level of efficacy following future updates. The technique of prompt engineering, which includes strategies like role prompting and in-context learning, offers a way to partially counteract this variability [41]. An innovative approach, rather than manually adjusting the prompts, involves instructing ChatGPT to autonomously optimize prompts to align with its latest model iteration. This strategy has shown promise in tasks such as mining gene relationships [41] but remains largely unexplored in other bioinformatics topics and therefore warrants further exploration to fully leverage ChatGPT’s capabilities in the field.
Numerous studies have shown that using ChatGPT with human augmentation significantly improves performance. Iterative human-AI communication plays a pivotal role in this process, where feedback from a human operator grounds the chatbot’s responses for improved accuracy. This human-in-the-loop methodology is particularly evident in prompt optimization [10] and molecular optimization [56, 59]. For code generation tasks, runtime error messages are a commonly used form of feedback that has been automated in several GPT-based tools [12, 92, 98]. Conversely, the chatbot can also be instructed to provide feedback to human operators. As demonstrated by Chen, C. and Stadler, T. [97], ChatGPT can produce textual descriptions for the generated code through an inverse generation process. Comparing these descriptions with the original instructions from the human operator helps ensure that the chatbot’s output aligns closely with the intended task requirements. This iterative exchange of feedback between AI and human operators enhances the overall quality of the bioinformatics tasks being addressed.
The assessment of ChatGPT’s capabilities across various bioinformatics tasks has illuminated both its strengths and weaknesses. Importantly, the reliability of these evaluations largely hinges on the quality of the benchmarks used and the methodologies applied in these assessments. Currently, many benchmarks are available for biomedical text mining and chemistry-related tasks. The development of benchmarks designed specifically for assessing ChatGPT’s capability in other bioinformatics tasks, including multimodality, is still in its infancy. It’s important to recognize that in generative tasks like coding, producing expected results is not the sole criterion for gauging effectiveness and efficiency. Factors such as the readability of the code and the inclusion of code examples also play crucial roles [100]. Similarly, on prediction or classification tasks, an extension of the evaluation to inspect the text explanations behind the prediction/classification is equally important, as this will facilitate the detection of hidden flaws [85]. Nonetheless, conducting such comprehensive evaluations can be resource-intensive, underscoring the need for community efforts. While alternatives exist for automation, such as transforming tasks into multiple-choice questions or verifying responses against reference texts, for example through lexical overlap or semantic similarity, each method comes with its own set of limitations [7]. Consequently, there is a pressing need to develop new, scalable, and accurate evaluation metrics and benchmark datasets that can accommodate a wide range of bioinformatics tasks, ensuring that assessments are both meaningful and reflective of real-world and cutting-edge applicability.
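Verifying a response against a reference text through lexical overlap can be sketched with a token-level F1 score. This is one common way to measure overlap, chosen here for illustration; `token_f1` is a hypothetical helper and the example sentences are illustrative.

```python
import re
from collections import Counter


def token_f1(response: str, reference: str) -> float:
    """Token-level F1 between a response and a reference (multiset overlap)."""
    resp = Counter(re.findall(r"\w+", response.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((resp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "TP53 activates CDKN1A"
paraphrase_score = token_f1("CDKN1A is induced by TP53", reference)
reversed_score = token_f1("CDKN1A activates TP53", reference)
```

The example shows the limitation of this kind of metric: the paraphrase, which keeps the meaning, scores 0.5, while the sentence with the direction of regulation reversed scores 1.0 because word order is ignored.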
While aiming for comprehensiveness, our review does not encompass areas that, although outside the direct scope of bioinformatics and biomedical informatics, are closely related and significant. These areas include the management of electronic health records [110, 111], emotion analysis through social media [112], and medical consultation [113, 114]. To mitigate transparency and security concerns, fine-tuning locally deployed open-source language models for specific tasks presents a practical approach. Our review has spotlighted such advancements for drug discovery. However, we refer our readers to additional reviews for an expansive understanding of similar developments in other bioinformatics topics, as well as the ethical and legal issues involved [7, 8, 9, 115, 116]. Looking ahead, we envision a future where both online proprietary models such as ChatGPT and open-source, locally deployable fine-tuned language models coexist for bioinformatics and biomedical informatics, providing users with the most suitable tools to address their specific needs.
ACKNOWLEDGEMENTS
This work was partially supported by NIH-NIGMS grants P20 GM103434 and U54 GM-104942 and NSF grant 2125872 (to GH), NIH-NLM grant R01LM013438 and NIDDK grant T32 DK137525 (to LL), and NLM grant R01LM013392 (to DX). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NSF. The writing was polished by ChatGPT. We thank the following scholars for their comments on the manuscript: Tarcisio Mendes de Farias from the Swiss Institute of Bioinformatics (Switzerland), Juexiao Zhou and Xin Gao from King Abdullah University of Science and Technology (Kingdom of Saudi Arabia), and Tanja Stadler from ETH Zürich (Switzerland).
Footnotes
In press with Quantitative Biology (https://onlinelibrary.wiley.com/journal/20954697)
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest and have no financial conflicts to disclose.
ETHICS STATEMENT
There was no sample from human subjects or animals collected for this work.
REFERENCES
- 1.Wang H., Fu T., Du Y., Gao W., Huang K., Liu Z., Chandak P., Liu S., Van Katwyk P., Deac A., et al. (2023) Scientific discovery in the age of artificial intelligence. Nature. 620, 47–60 [DOI] [PubMed] [Google Scholar]
- 2.Xu Y., Liu X., Cao X., Huang C., Liu E., Qian S., Liu X., Wu Y., Dong F., Qiu C. W., et al. (2021) Artificial intelligence: A powerful paradigm for scientific research. Innovation (Camb). 2, 100179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Van Noorden R., and Perkel J. M. (2023) Ai and science: What 1,600 researchers think. Nature. 621, 672–675 [DOI] [PubMed] [Google Scholar]
- 4.Milano S., McGrane J. A., and Leonelli S. (2023) Large language models challenge the future of higher education. Nat Mach Intell. 5, 333–334 [Google Scholar]
- 5.van Dis E. A. M., Bollen J., Zuidema W., van Rooij R., and Bockting C. L. (2023) Chatgpt: Five priorities for research. Nature. 614, 224–226 [DOI] [PubMed] [Google Scholar]
- 6.Lee P., Bubeck S., and Petro J. (2023) Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. New Engl J Med. 388, 1233–1239 [DOI] [PubMed] [Google Scholar]
- 7.Tian S., Jin Q., Yeganova L., Lai P. T., Zhu Q., Chen X., Yang Y., Chen Q., Kim W., Comeau D. C., et al. (2023) Opportunities and challenges for chatgpt and large language models in biomedicine and health. Brief Bioinform. 25, [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Liu J., Yang M., Yu Y., Xu H., Li K., and Zhou X. (2024) Large language models in bioinformatics: Applications and perspectives. arXiv:2401.04155 [Google Scholar]
- 9.Xu D., Chen W., Peng W., Zhang C., Xu T., Zhao X., Wu X., Zheng Y., and Chen E. (2023) Large language models for generative information extraction: A survey. arXiv:2312.17617 [Google Scholar]
- 10.Shue E., Liu L., Li B., Feng Z., Li X., and Hu G. (2023) Empowering beginners in bioinformatics with chatgpt. Quantitative Biology. 11, 105–108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Piccolo S. R., Denny P., Luxton-Reilly A., Payne S. H., and Ridge P. G. (2023) Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course. PLoS Comput Biol. 19, e1011511. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Zhou J., Zhang B., Chen X., Li H., Xu X., Chen S., He W., Xu C., and Gao X. (2024) An ai agent for fully automated multi-omic analyses. bioRxiv. 2023.2009.2008.556814 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Hu G., Liu L., and Xu D. (2024) On the responsible use of chatbots in bioinformatics. Genomics, Proteomics, and Bioinformatics. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Murdoch B. (2021) Privacy and artificial intelligence: Challenges for protecting health information in a new era. BMC Med Ethics. 22, 122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Karim M. R., Islam T., Shajalal M., Beyan O., Lange C., Cochez M., Rebholz-Schuhmann D., and Decker S. (2023) Explainable ai for bioinformatics: Methods, tools and applications. Brief Bioinform. 24, [DOI] [PubMed] [Google Scholar]
- 16.Hou W., and Ji Z. (2024) Assessing gpt-4 for cell type annotation in single-cell rna-seq analysis. Nat Methods. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hou W., and Ji Z. (2023) Geneturing tests gpt models in genomics. bioRxiv. [Google Scholar]
- 18.Jin Q., Yang Y., Chen Q., and Lu Z. (2024) Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics. 40, [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Ahimaz P., Bergner A. L., Florido M. E., Harkavy N., and Bhattacharyya S. (2024) Genetic counselors’ utilization of chatgpt in professional practice: A cross-sectional study. Am J Med Genet A. 194, [DOI] [PubMed] [Google Scholar]
- 20.Duong D., and Solomon B. D. (2024) Analysis of large-language model versus human performance for genetics questions. Eur J Hum Genet. 32, 466–468 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Alkuraya I. F. (2023) Is artificial intelligence getting too much credit in medical genetics? Am J Med Genet C. 193, [DOI] [PubMed] [Google Scholar]
- 22.Emmert-Streib F. (2024) Can chatgpt understand genetics? European Journal of Human Genetics. 32, 371–372 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Chen Q., Sun H., Liu H., Jiang Y., Ran T., Jin X., Xiao X., Lin Z., Chen H., and Niu Z. (2023) An extensive benchmark study on biomedical text generation and mining with chatgpt. Bioinformatics. 39, [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Gu Y., Tinn R., Cheng H., Lucas M., Usuyama N., Liu X., Naumann T., Gao J., and Poon H. (2021) Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare. 3, Article 2 [Google Scholar]
- 25.Chen Q., Du J., Hu Y., Kuttichi Keloth V., Peng X., Raja K., Zhang R., Lu Z., and Xu H. (2023) Large language models in biomedical natural language processing: Benchmarks, baselines, and recommendations. arXiv:2305.16326 [Google Scholar]
- 26.Ateia S., and Kruschwitz U. (2023) Is chatgpt a biomedical expert? -- exploring the zero-shot performance of current gpt models in biomedical tasks. arXiv:2306.16108 [Google Scholar]
- 27.Chen S., Li Y., Lu S., Van H., Aerts H., Savova G. K., and Bitterman D. S. (2024) Evaluating the chatgpt family of models for biomedical reasoning and classification. J Am Med Inform Assoc. 31, 940–948 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Jahan I., Laskar M. T. R., Peng C., and Huang J. X. (2024) A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Computers in Biology and Medicine. 108189 [DOI] [PubMed] [Google Scholar]
- 29.Hou Y., Yeung J., Xu H., Su C., Wang F., and Zhang R. (2023) From answers to insights: Unveiling the strengths and limitations of chatgpt and biomedical knowledge graphs. Res Sq. [Google Scholar]
- 30.Rizvi R. F., Vasilakes J., Adam T. J., Melton G. B., Bishop J. R., Bian J., Tao C., and Zhang R. (2020) Idisk: The integrated dietary supplements knowledge base. J Am Med Inform Assoc. 27, 539–548 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Zhao Q., Zhou X., Wu J., Cai J., Bao X., Tang L., Wang C., Liu C., Wang Y., Teng Y., et al. (2024) Biotreasury: A community-based repository enabling indexing and rating of bioinformatics tools. Sci China Life Sci. 67, 221–229 [DOI] [PubMed] [Google Scholar]
- 32.Wu X., Zeng Y., Das A., Jo S., Zhang T., Patel P., Zhang J., Gao S. J., Pratt D., Chiu Y. C., and Huang Y. (2024) Regulogpt: Harnessing gpt for knowledge graph construction of molecular regulatory pathways. bioRxiv. [Google Scholar]
- 33.Azam M., Chen Y., Arowolo M., Liu H., Popescu M., and Xu D. (2024) A comprehensive evaluation of large language models in mining gene interactions and pathway knowledge. Quantitative Biology. in press, [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Rehana H., Bengisu Çam N., Basmaci M., Zheng J., Jemiyo C., He Y., Özgür A., and Hur J. (2023) Evaluation of gpt and bert-based models on identifying protein-protein interactions in biomedical text. arXiv:2303.17728 [Google Scholar]
- 35.Tiwari K., Matthews L., May B., Shamovsky V., Orlic-Milacic M., Rothfels K., Ragueneau E., Gong C., Stephan R., Li N., et al. (2023) Chatgpt usage in the reactome curation process. bioRxiv. 2023.2011.2008.566195 [Google Scholar]
- 36.Zhou B., Ji B., Shen C., Zhang X., Yu X., Huang P., Yu R., Zhang H., Dou X., Chen Q., et al. (2024) Evlncrnas 3.0: An updated comprehensive database for manually curated functional long non-coding rnas validated by low-throughput experiments. Nucleic Acids Res. 52, D98–D106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Chen X., Li C., Wang Z., Zhou Y., and Chu M. (2024) Computational screening of biomarkers and potential drugs for arthrofibrosis based on combination of sequencing and large nature language model. Journal of Orthopaedic Translation. 44, 102–113 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Fo K., Chuah Y. S., Foo H., Davey E. E., Fullwood M., Thibault G., and Mutwil M. (2023) Plantconnectome: Knowledge networks encompassing >100,000 plant article abstracts. bioRxiv. 2023.2007.2011.548541 [Google Scholar]
- 39.Rawte V., Sheth A., and Das A. (2023) A survey of hallucination in large foundation models. arXiv:2309.05922 [Google Scholar]
- 40.Zhang Y., Li Y., Cui L., Cai D., Liu L., Fu T., Huang X., Zhao E., Zhang Y., Chen Y., et al. (2023) Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv:2309.01219 [Google Scholar]
- 41.Chen Y., Gao J., Petruc M., Hammer R. D., Popescu M., and Xu D. (2024) Iterative prompt refinement for mining gene relationships from chatgpt. International Journal of Artificial Intelligence and Robotics Research. in press, [Google Scholar]
- 42.Yao S., Yu D., Zhao J., Shafran I., Griffiths T. L., Cao Y., and Narasimhan K. (2023) Tree of thoughts: Deliberate problem solving with large language models. In: Neural Information Processing Systems 36 (NeurIPS 2023) 35, 11809–11822. Curran Associates, Inc. [Google Scholar]
- 43.Savage N. (2023) Drug discovery companies are customizing chatgpt: Here’s how. Nat Biotechnol. 41, 585–586 [DOI] [PubMed] [Google Scholar]
- 44.Chakraborty C., Bhattacharya M., and Lee S. S. (2023) Artificial intelligence enabled chatgpt and large language models in drug target discovery, drug discovery, and development. Mol Ther Nucleic Acids. 33, 866–868 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Zhao A., and Wu Y. (2023) Future implications of chatgpt in pharmaceutical industry: Drug discovery and development. Front Pharmacol. 14, 1194216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Hu G., and Xie Z. (2023) The artificial intelligence pharma era after “chat generative pre-trained transformer”. Medical Review. 3, 198–199 [Google Scholar]
- 47.Gao Z., Li L., Ma S., Wang Q., Hemphill L., and Xu R. (2023) Examining the potential of chatgpt on biomedical information retrieval: Fact-checking drug-disease associations. Ann Biomed Eng. [DOI] [PubMed] [Google Scholar]
- 48.Guo T., Guo K., Nan B., Liang Z., Guo Z., Chawla N. V., Wiest O., and Zhang X. (2023) What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In: Advances in Neural Information Processing Systems. 36, 59662–59688. Curran Associates, Inc.
- 49.Juhi A., Pipil N., Santra S., Mondal S., Behera J. K., and Mondal H. (2023) The capability of chatgpt in predicting and explaining common drug-drug interactions. Cureus. 15, e36272.
- 50.Al-Ashwal F. Y., Zawiah M., Gharaibeh L., Abu-Farha R., and Bitar A. N. (2023) Evaluating the sensitivity, specificity, and accuracy of chatgpt-3.5, chatgpt-4, bing ai, and bard against conventional drug-drug interactions clinical tools. Drug Healthc Patient Saf. 15, 137–147
- 51.Herrero-Zazo M., Segura-Bedmar I., Martinez P., and Declerck T. (2013) The ddi corpus: An annotated corpus with pharmacological substances and drug-drug interactions. J Biomed Inform. 46, 914–920
- 52.Wang Y. M., Shen H. W., and Chen T. J. (2023) Performance of chatgpt on the pharmacist licensing examination in taiwan. J Chin Med Assoc. 86, 653–658
- 53.Kunitsu Y. (2023) The potential of gpt-4 as a support tool for pharmacists: Analytical study using the japanese national examination for pharmacists. JMIR Medical Education. 9
- 54.Zong H., Li J., Wu E., Wu R., Lu J., and Shen B. (2024) Performance of chatgpt on Chinese national medical licensing examinations: A five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. 24, 143.
- 55.Huang X., Estau D., Liu X., Yu Y., Qin J., and Li Z. (2024) Evaluating the performance of chatgpt in clinical pharmacy: A comparative study of chatgpt and clinical pharmacists. Br J Clin Pharmacol. 90, 232–238
- 56.Wang R., Feng H., and Wei G. W. (2023) Chatgpt in drug discovery: A case study on anticocaine addiction drug development with chatbots. J Chem Inf Model. 63, 7189–7209
- 57.Liu S., Wang J., Yang Y., Wang C., Liu L., Guo H., and Xiao C. (2024) Conversational drug editing using retrieval and domain feedback. The Twelfth International Conference on Learning Representations.
- 58.Liang Y., Zhang R., Zhang L., and Xie P. (2023) Drugchat: Towards enabling chatgpt-like capabilities on drug molecule graphs. arXiv:2309.03907
- 59.Ye G., Cai X., Lai H., Wang X., Huang J., Wang L., Liu W., and Zeng X. (2023) Drugassist: A large language model for molecule optimization. arXiv:2401.10334
- 60.Dong Q., Li L., Dai D., Zheng C., Wu Z., Chang B., Sun X., Xu J., Li L., and Sui Z. (2022) A survey on in-context learning. arXiv:2301.00234
- 61.Li J., Liu Y., Fan W., Wei X.-Y., Liu H., Tang J., and Li Q. (2023) Empowering molecule discovery for molecule-caption translation with large language models: A chatgpt perspective. IEEE Transactions on Knowledge and Data Engineering. In press
- 62.Jablonka K. M., Schwaller P., Ortega-Guerrero A., and Smit B. (2024) Leveraging large language models for predictive chemistry. Nat Mach Intell. 6, 161–169
- 63.Cai X., Lai H., Wang X., Wang L., Liu W., Wang Y., Wang Z., Cao D., and Zeng X. (2024) Comprehensive evaluation of molecule property prediction with chatgpt. Methods. 222, 133–141
- 64.Caldas Ramos M., Michtavy S. S., Porosoff M. D., and White A. D. (2023) Bayesian optimization of catalysts with in-context learning. arXiv:2304.05341
- 65.Zeng Z., Yin B., Wang S., Liu J., Yang C., Yao H., Sun X., Sun M., Xie G., and Liu Z. (2023) Interactive molecular discovery with natural language. arXiv:2306.11976
- 66.Raffel C., Shazeer N., Roberts A., Lee K., Narang S., Matena M., Zhou Y. Q., Li W., and Liu P. J. (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 21
- 67.Hu H., Yang A. J., Deng S., Wang D., Song M., and Shen S. (2023) A generative drug–drug interaction triplets extraction framework based on large language models. Proceedings of the Association for Information Science and Technology. 60, 980–982
- 68.Wei J., Bosma M., Zhao V. Y., Guu K., Yu A. W., Lester B., Du N., Dai A. M., and Le Q. V. (2021) Finetuned language models are zero-shot learners. arXiv:2109.01652
- 69.Fang Y., Liang X., Zhang N., Liu K., Huang R., Chen Z., Fan X., and Chen H. (2023) Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv:2306.08018
- 70.Zhao Z., Ma D., Chen L., Sun L., Li Z., Xu H., Zhu Z., Zhu S., Fan S., Shen G., et al. (2024) Chemdfm: Dialogue foundation model for chemistry. arXiv:2401.14818
- 71.Cao H., Liu Z., Lu X., Yao Y., and Li Y. (2023) Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv:2311.16208
- 72.Zhang W., Wang Q., Kong X., Xiong J., Ni S., Cao D., Niu B., Chen M., Zhang R., Wang Y., et al. (2024) Fine-tuning large language models for chemical text mining. Chemical Science. In press
- 73.Acosta J. N., Falcone G. J., Rajpurkar P., and Topol E. J. (2022) Multimodal biomedical ai. Nature Medicine. 28, 1773–1784
- 74.Truhn D., Weber C. D., Braun B. J., Bressem K., Kather J. N., Kuhl C., and Nebelung S. (2023) A pilot study on the efficacy of gpt-4 in providing orthopedic treatment recommendations from mri reports. Sci Rep. 13
- 75.Liu Z., Jiang H., Zhong T., Wu Z., Ma C., Li Y., Yu X., Zhang Y., Pan Y., Shu P., et al. (2023) Holistic evaluation of gpt-4v for biomedical imaging. arXiv:2312.05256
- 76.Wu C., Lei J., Zheng Q., Zhao W., Lin W., Zhang X., Zhou X., Zhao Z., Zhang Y., Wang Y., and Xie W. (2023) Can gpt-4v(ision) serve medical applications? Case studies on gpt-4v for multimodal medical diagnosis. arXiv:2310.09909
- 77.Yan Z., Zhang K., Zhou R., He L., Li X., and Sun L. (2023) Multimodal chatgpt for medical applications: An experimental study of gpt-4v. arXiv:2310.19061
- 78.Buckley T., Diao J. A., Rodman A., and Manrai A. K. (2023) Accuracy of a vision-language model on challenging medical cases. arXiv:2311.05591
- 79.Yang Z., Yao Z., Tasmin M., Vashisht P., Jang W. S., Ouyang F., Wang B., Berlowitz D., and Yu H. (2023) Performance of multimodal gpt-4v on usmle with image: Potential for imaging diagnostic support with explanations. medRxiv. 2023.10.26.23297629
- 80.Li Y., Liu Y., Wang Z., Liang X., Liu L., Wang L., Cui L., Tu Z., Wang L., and Zhou L. (2023) A comprehensive study of gpt-4v’s multimodal capabilities in medical imaging. medRxiv. 2023.11.03.23298067
- 81.Hou W., and Ji Z. (2024) Gpt-4v exhibits human-like performance in biomedical image classification. bioRxiv.
- 82.Wang J., Ye Q., Liu L., Guo N. L., and Hu G. (2024) Scientific figures interpreted by chatgpt: Strengths in plot recognition and limits in color perception. NPJ Precis Oncol. 8, 84.
- 83.OpenAI. (2023) Gpt-4v(ision) system card. 1–18
- 84.Nickerson R. S. (1998) Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology. 2, 175–220
- 85.Jin Q., Chen F., Zhou Y., Xu Z., Cheung J. M., Chen R., Summers R. M., Rousseau J. F., Ni P., Landsman M. J., et al. (2024) Hidden flaws behind expert-level accuracy of gpt-4 vision in medicine. arXiv:2401.08396
- 86.Wang J., Liu Z., Zhao L., Wu Z., Ma C., Yu S., Dai H., Yang Q., Liu Y., Zhang S., et al. (2023) Review of large vision models and visual prompt engineering. arXiv:2307.00855
- 87.Yang Z., Li L., Lin K., Wang J., Lin C., Liu Z., and Wang L. (2023) The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv.
- 88.Li Z., Wang C., Liu C., Ma P., Wu D., Wang S., and Gao C. (2023) Vrptest: Evaluating visual referring prompting in large multimodal models. arXiv:2312.04087
- 89.Li J., Chen H., Wang Y., Chen M. M., and Liang H. (2021) Next-generation analytics for omics data. Cancer Cell. 39, 3–6
- 90.Xie M., Yang L., Chen G., Wang Y., Xie Z., and Wang H. (2022) Ribochat: A chat-style web interface for analysis and annotation of ribosome profiling data. Brief Bioinform. 23
- 91.Merow C., Serra-Diaz J. M., Enquist B. J., and Wilson A. M. (2023) Ai chatbots can boost scientific coding. Nat Ecol Evol. 7, 960–962
- 92.Jansen J. A., Manukyan A., Khoury N. A., and Akalin A. (2023) Leveraging large language models for data analysis automation. bioRxiv. 2023.12.11.571140
- 93.Dong Z., Zhong V., and Lu Y. Y. (2023) Biomania: Simplifying bioinformatics data analysis through conversation. bioRxiv. 2023.10.29.564479
- 94.Liu A., Hu X., Wen L., and Yu P. S. (2023) A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability. arXiv:2303.13547
- 95.Sima A.-C., and de Farias T. M. (2023) On the potential of artificial intelligence chatbots for data exploration of federated bioinformatics knowledge graphs. In: SeWebMeDa’23: 6th Workshop on Semantic Web Solutions for Large-Scale Biomedical Data Analytics. Vol-3466, CEUR-WS.org
- 96.Rangel J. C., de Farias T. M., Sima A. C., and Kobayashi N. (2024) Sparql generation: An analysis on fine-tuning openllama for question answering over a life science knowledge graph. In: WAT4HCLS 2024: 15th International Semantic Web Applications and Tools for Health Care and Life Sciences Conference. In press
- 97.Chen C., and Stadler T. (2023) Genspectrum chat: Data exploration in public health using large language models. arXiv:2305.13821
- 98.Wang L., Ge X. J., Liu L., and Hu G. Q. (2024) Code interpreter for bioinformatics: Are we there yet? Ann Biomed Eng. 52, 754–756
- 99.Tang X., Qian B., Gao R., Chen J., Chen X., and Gerstein M. (2023) Biocoder: A benchmark for bioinformatics code generation with large language models. Bioinformatics. In press
- 100.Sarwal V., Munteanu V., Suhodolschi T., Ciorba D., Eskin E., Wang W., and Mangul S. (2023) Biollmbench: A comprehensive benchmarking of large language models in bioinformatics. bioRxiv. 2023.12.19.572483
- 101.Lubiana T., Lopes R., Medeiros P., Silva J. C., Goncalves A. N. A., Maracaja-Coutinho V., and Nakaya H. I. (2023) Ten quick tips for harnessing the power of chatgpt in computational biology. PLoS Comput Biol. 19, e1011319.
- 102.Chen M., Tworek J., Jun H., Yuan Q., Ponde de Oliveira Pinto H., Kaplan J., Edwards H., Burda Y., Joseph N., Brockman G., et al. (2021) Evaluating large language models trained on code. arXiv:2107.03374
- 103.Lehtinen T., Haaranen L., and Leinonen J. (2023) Automated questionnaires about students’ javascript programs: Towards gauging novice programming processes. In: ACE ’23: Proceedings of the 25th Australasian Computing Education Conference. 49–58. Association for Computing Machinery
- 104.Lin Z. C. (2024) How to write effective prompts for large language models. Nat Hum Behav.
- 105.Denny P., Leinonen J., Prather J., Luxton-Reilly A., Amarouche T., Becker B. A., and Reeves B. N. (2023) Promptly: Using prompt problems to teach learners how to effectively utilize ai code generators. arXiv:2307.16364
- 106.Hou W., and Ji Z. (2023) Assessing gpt-4 for cell type annotation in single-cell rna-seq analysis. bioRxiv.
- 107.Chen Q., Sun H., Liu H., Jiang Y., Ran T., Jin X., Xiao X., Lin Z., Chen H., and Niu Z. (2023) An extensive benchmark study on biomedical text generation and mining with chatgpt. Bioinformatics.
- 108.Soman K., Rose P. W., Morris J. H., E Akbas R., Smith B., Peetoom B., Villouta-Reyes C., Cerono G., Shi Y., Rizk-Jackson A., et al. (2023) Biomedical knowledge graph-enhanced prompt generation for large language models. arXiv:2311.17330
- 109.Chen L., Zaharia M., and Zou J. (2023) How is chatgpt’s behavior changing over time? arXiv:2307.09009
- 110.Wang G., Yang G., Du Z., Fan L., and Li X. (2023) Clinicalgpt: Large language models fine tuned with diverse medical data and comprehensive evaluation. arXiv:2306.09968
- 111.Peng C., Yang X., Chen A., Smith K. E., PourNejatian N., Costa A. B., Martin C., Flores M. G., Zhang Y., Magoc T., et al. (2023) A study of generative large language model for medical research and healthcare. NPJ Digit Med. 6, 210.
- 112.Lai T., Shi Y., Du Z., Wu J., Fu K., Dou Y., and Wang Z. (2023) Psy-llm: Scaling up global mental health psychological services with ai-based large language models. arXiv:2307.11991
- 113.Liu J. M., Li D., Cao H., Ren T., Liao Z., and Wu J. (2023) Chatcounselor: A large language models for mental health support. arXiv:2309.15461
- 114.Han T., Adams L. C., Papaioannou J.-M., Grundmann P., Oberhauser T., Löser A., Truhn D., and Bressem K. K. (2023) Medalpaca -- an open-source collection of medical conversational ai models and training data. arXiv:2304.08247
- 115.Zhang S., Fan R., Liu Y., Chen S., Liu Q., and Zeng W. (2023) Applications of transformer-based language models in bioinformatics: A survey. Bioinform Adv. 3, vbad001.
- 116.Qiu J. N., Li L., Sun J. K., Peng J. C., Shi P. L., Zhang R. Y., Dong Y. Z., Lam K., Lo F. P. W., Xiao B., et al. (2023) Large ai models in health informatics: Applications, challenges, and the future. IEEE J Biomed Health. 27, 6074–6087