Briefings in Bioinformatics. 2025 Feb 6;26(1):bbaf045. doi: 10.1093/bib/bbaf045

Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics

Olivier Cinquin
PMCID: PMC11798674  PMID: 39910777

Abstract

Large language models (LLMs) leverage factual knowledge from pretraining. Yet this knowledge remains incomplete and sometimes challenging to retrieve—especially in scientific domains not extensively covered in pretraining datasets and where information is still evolving. Here, we focus on genomics and bioinformatics. We confirm and expand upon issues with plain ChatGPT functioning as a bioinformatics assistant. Poor data retrieval and hallucination lead ChatGPT to err, as do incorrect sequence manipulations. To address this, we propose a system basing LLM outputs on up-to-date, authoritative facts and facilitating LLM-guided data analysis. Specifically, we introduce NagGPT, a middleware tool to insert between LLMs and databases, designed to bridge gaps in LLM knowledge and usage of database application programming interfaces. NagGPT proxies LLM-generated database queries, with special handling of incorrect queries. It acts as a gatekeeper between query responses and the LLM prompt, redirecting large responses to files but providing a synthesized snippet and injecting comments to steer the LLM. A companion OpenAI custom GPT, Genomics Fetcher-Analyzer, connects ChatGPT with NagGPT. It steers ChatGPT to generate and run Python code, performing bioinformatics tasks on data dynamically retrieved from a dozen common genomics databases (e.g. NCBI, Ensembl, UniProt, WormBase, and FlyBase). We implement partial mitigations for encountered challenges: detrimental interactions between code generation style and data analysis, confusion between database identifiers, and hallucination of both data and actions taken. Our results identify avenues to augment ChatGPT as a bioinformatics assistant and, more broadly, to improve factual accuracy and instruction following of unmodified LLMs.

Keywords: large language model (LLM), Generative Pretrained Transformer (GPT), retrieval-augmented generation (RAG), database query correction, LLM factual accuracy, LLM steering, bioinformatics

Introduction

Generative large language models (LLMs) offer a revolutionary tool to assist user-guided data retrieval and analysis. Yet substantial challenges must be addressed, particularly when it comes to retrieving information underrepresented in model training materials [1, 2], incorporating updates to the information, avoiding hallucinations [3], and ensuring veridicity—i.e. faithful reflection of objective reality. These challenges are acutely relevant to the biomedical field [4].

A number of approaches are being actively researched to address LLM shortcomings, including retrieval-augmented generation (RAG) [5], model editing [6, 7], and prompt engineering [8]. Particularly relevant to science overall and biology specifically, trained LLMs can be supplemented with knowledge graphs [9, 10] or, on smaller data scales, be directly taught new facts [11, 12]. Yet, despite these significant advances, LLMs have yet to fully realize their potential as tools widely adopted in biologists’ daily work.

Focusing on bioinformatics, potential uses of ChatGPT have garnered much interest [13–17], but a number of issues limit ChatGPT’s ability to retrieve relevant data and to process it, as reported by others and as we further document. Even if these issues are addressed, substantial questions remain as to whether and how ChatGPT should be used in scientific workflows. To address these questions, it is critical to characterize both the explicit failures and, even more importantly, the subtle failures that lead to erroneous results without obvious warning signs.

Here, we start by characterizing issues a user would initially encounter when attempting to use ChatGPT as a basic bioinformatics assistant. These issues comprise not only explicit failures, in large part because of ChatGPT’s limitations in retrieving data and running code, but also silent errors. As a first step in addressing these issues, we then report an addition to ChatGPT, in the form of Genomics Fetcher-Analyzer (GFA), a “custom GPT” with which users can converse from chatgpt.com, and NagGPT, a backend server with which GFA communicates. This addition addresses some of the most immediate and prominent hurdles ChatGPT users face with bioinformatics analyses. It is purposefully minimalist in that it does not seek to be exhaustive or to take over control of analyses from the user but rather to provide basic functionality enabling user-guided analyses: transparent data retrieval from common genomics databases and a framework to run basic bioinformatics software that is normally precluded by ChatGPT’s “Code interpreter and data analysis” environment. This leaves users fully in control with the regular ChatGPT website interface and leverages the remarkable pretrained knowledge and generalist abilities of ChatGPT 4 [18] as well as its data visualization capabilities [16, 19].

We find that, even with the relatively basic set of tasks on which we test GFA/NagGPT, and despite the strong capabilities of ChatGPT-4o and the often-successful injection of correct genomics database responses, silent flaws in GFA responses are still a distinct possibility. User redirection is often required for the analysis to reach completion, and frequently entails inspection of the generated Python code. Thus, although GFA substantially improves the bioinformatics capabilities of ChatGPT, it is not a tool to be trusted in a scientific workflow. Rather, it helps identify a potential way forward in developing a reliable bioinformatics assistant.

Materials and methods

The overall architecture of the software components and user interaction is outlined in Fig. 1. Full instructions defining the OpenAI custom GPT “GFA” are shown in Table S1. The “knowledge” files, as well as the schema defining the “actions” available for GFA to submit requests to NagGPT, are available at https://naggpt.com/NagGPT_source_code_repos/1/GFA. The models underlying GFA used for this manuscript were gpt-4o-2024-05-13 or gpt-4o-2024-08-06, accessed chiefly between June and September 2024. NagGPT is implemented in Java, using the Javalin web framework and the Apache HttpComponents Client. To maximize throughput, multiple request URLs can be included in a single query sent to NagGPT by GFA. These requests are processed and forwarded in parallel (similarly, e.g. to the parallelism implemented in [20]). Request forwarding is throttled on an application-wide basis, grouping incoming requests irrespective of their parent GFA conversation but considering each target Application Programming Interface (API) host independently, with adjustable rates. Large request results are uploaded to the user’s Google Drive, after combining them into a single archive for multiple-URL requests (this adds a small GFA Python code generation overhead as the archive must be decompressed but minimizes effort on the user’s part to make the files available). Table 1 lists the databases that NagGPT accepts as forwarding targets (a strong filter on outgoing requests helps ensure safety). The full, open-source code is available at https://naggpt.com/NagGPT_source_code_repos/1/NagGPT.
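
As an illustration of this forwarding scheme, the sketch below mimics the per-host throttling and parallel fetching of the URLs contained in a single GFA query. It is written in Python for consistency with the other examples in this article, whereas the actual NagGPT implementation is in Java (Javalin and Apache HttpComponents); all class names, rates, and parameters here are hypothetical.

```python
# Illustrative sketch (hypothetical names and rates) of per-host throttled,
# parallel forwarding of the request URLs contained in one GFA query.
import time
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


class HostThrottle:
    """Application-wide throttle: one minimum interval per target API host."""

    def __init__(self, min_interval_s: float = 0.4):
        self.min_interval_s = min_interval_s
        self.last_request: dict[str, float] = {}
        self.lock = threading.Lock()

    def wait_turn(self, host: str) -> None:
        # Reserve the next available slot for this host, then sleep until it arrives.
        with self.lock:
            elapsed = time.monotonic() - self.last_request.get(host, 0.0)
            delay = max(0.0, self.min_interval_s - elapsed)
            self.last_request[host] = time.monotonic() + delay
        if delay:
            time.sleep(delay)


def forward_all(urls: list[str], throttle: HostThrottle) -> list[tuple[str, bytes]]:
    """Fetch all URLs of one GFA query in parallel, throttled per host."""

    def fetch(url: str) -> tuple[str, bytes]:
        host = urllib.parse.urlparse(url).netloc
        throttle.wait_turn(host)
        with urllib.request.urlopen(url, timeout=30) as response:
            return url, response.read()

    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))
```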

Figure 1.

Architecture of NagGPT/GFA highlighting information flow. The user (1) directs requests at GFA, a custom GPT that uses general ChatGPT knowledge in addition to its local documentation of analysis strategies and API endpoints (2) to generate requests sent to NagGPT as a list of API URLs. NagGPT examines the requests. Requests known a priori to be invalid are automatically reformulated when possible, or otherwise returned with an informative error message. Requests are then forwarded to public databases (3), which is the only conduit for internet access (globe symbols) other than Google Drive storage. When NagGPT identifies in the response failures that match a known pattern, it may autonomously issue updated requests. Full responses are stored in Google Drive (4) when they are large, in which case the response to GFA includes snippets and an inferred schema for JSON responses (5). The response to GFA also includes potential warnings, as well as directions for subsequent analysis (6). GFA then uses ChatGPT knowledge in addition to its specific instructions, scripts, and program binaries (7) to generate custom Python code that performs the analysis in OpenAI’s sandbox (8).

Table 1.

Databases and corresponding Internet hosts that NagGPT accepts as forwarding targets

Platform/Database Host Notes on API/Query method
Ensembl rest.ensembl.org
FlyBase api.flybase.org; chado.flybase.org REST-like / SQL (Chado)
gnomAD gnomad.broadinstitute.org REST-like / GraphQL
HGNC (Gene Names) rest.genenames.org
KEGG rest.kegg.jp
MyVariant.info myvariant.info
NCBI (General API) api.ncbi.nlm.nih.gov
NCBI E-Utilities eutils.ncbi.nlm.nih.gov Complex RPC-like, stateful queries not explicitly targeted by GFA
OMA Browser omabrowser.org
OrthoDB www.orthodb.org
STRING Database string-db.org
UCSC Genome Browser api.genome.ucsc.edu
UniProt rest.uniprot.org

The more significant aspects of request processing by NagGPT are described in Results. In addition, a number of smaller transformations to the requests generated by ChatGPT/GFA were found empirically to be necessary to work around defects in request generation, or idiosyncrasies of major public databases. For example, the Uniform Resource Identifier (URI) scheme must be corrected from https to http for databases that silently ignore secure connection requests, and unescaped space characters must be fixed to “+” or “%20” in database-dependent fashion. Database-specific rules address frequent (but often not systematic) mistakes in query parameter names (e.g. “organism_id” versus “organism”) or query parameter values—e.g. Ensembl [21] denotes the wolf species “Canis lupus familiaris” and not “Canis familiaris”; ChatGPT “knows” that the former is the more accurate designation but favors the latter in Ensembl queries. Search queries for, e.g., the NCBI [22] or UniProt [23] databases are adjusted when they generate unexpected responses; an example pitfall is that NCBI queries narrowed using a nonexistent field selector return results that ignore the field restriction altogether, which GFA does not “realize,” often leading to incorrect analysis if left unaddressed. More specialized databases, such as WormBase [24] and FlyBase [25], also need specific adjustments to identify orthologs. For example, some FlyBase gene query results contain human matches, which are filtered out by NagGPT as they were found to confuse GFA.
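
The sketch below illustrates the flavor of these rule-based corrections; the forced-HTTP host, parameter-name substitution, and species-name fix shown are examples chosen for illustration rather than NagGPT's actual rule set.

```python
# Illustrative sketch of rule-based corrections applied to GFA-generated URLs.
# Hosts, parameter fixes, and their direction are examples, not NagGPT's real rules.
from urllib.parse import urlsplit, urlunsplit

FORCE_HTTP_HOSTS = {"example-legacy-db.org"}          # hypothetical host that silently drops HTTPS
PARAM_FIXES = {"organism": "organism_id"}             # frequent parameter-name mistake (direction illustrative)
VALUE_FIXES = {"canis_familiaris": "canis_lupus_familiaris"}  # Ensembl species naming


def correct_request(url: str) -> str:
    scheme, host, path, query, fragment = urlsplit(url)
    if host in FORCE_HTTP_HOSTS:
        scheme = "http"
    # Escape bare spaces ("%20" shown here; "+" is needed for some databases).
    path = path.replace(" ", "%20")
    query = query.replace(" ", "%20")
    for wrong, right in PARAM_FIXES.items():
        query = query.replace(f"{wrong}=", f"{right}=")
    for wrong, right in VALUE_FIXES.items():
        path = path.replace(wrong, right)
    return urlunsplit((scheme, host, path, query, fragment))


# Example: a query for the wolf TP53 gene corrected to Ensembl's expected species name.
print(correct_request(
    "https://rest.ensembl.org/lookup/symbol/canis_familiaris/TP53?content-type=application/json"
))
```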

ChatGPT was used as a code generation assistant (chiefly for utility classes) to write NagGPT and for light editing of the present manuscript. Google Gemini and OpenAI Dall-E 3 image generation tools were used for parts of Fig. 1, in combination with public domain glyphs.

Results

Analysis obstacles in “plain” ChatGPT

We first offer empirical observations of limitations of ChatGPT as a bioinformatics assistant, based on simple examples that we next use as a basis to introduce the remedies provided by NagGPT.

Consistent with previous reports, ChatGPT is far from suitable for bioinformatics out of the box. Two major types of limitations are immediately apparent. First, information retrieval is inconsistent and incomplete. ChatGPT can either rely on its own internal knowledge or perform Web searches. Both behaviors are readily apparent in conversations but both can show severe limitations in terms of accuracy, exhaustiveness, and reflecting current knowledge. This is illustrated, for example, by simple queries on otoferlin (OTOF), a gene for which an exact symbol search generates 9430 search hits in Google Scholar as of this writing and that ranks in the top ~4% for frequency of linkage to PubMed publications among human genes in the gene2pubmed dataset [26], suggesting that information should be readily available (press attention related to OTOF gene therapy [27] may have occurred just past the training cutoff dates of current ChatGPT models). When asked about the protein length of the various human isoforms, ChatGPT gives incorrect answers as to the longest and shortest known isoforms (Fig. S1) and only partially corrects itself upon leading follow-up prompts. ChatGPT displays its intrinsic knowledge, e.g. in that it spontaneously generates the corresponding UniProt accession ID “Q9HC10,” and even attempts to access a corresponding, well-chosen Web page from the UniProt website that contains the correct answers [28] (Supplemental File S1), but fails to extract those correct answers. The cause of this failure is unclear. It could be that, contrary to its claims, ChatGPT does not successfully retrieve the page contents, as a UniProt query to the human-facing website with an automated tool, which does not have the JavaScript engine expected of a human-operated Web browser, leads to an error response that does not contain data on OTOF (Supplemental File S2). Consistent with this, although ChatGPT refuses to simply reproduce the fetched content (probably to avoid copyright issues), targeted queries suggest that its claims to have retrieved the content displayed to regular Web browsers are false and that it hallucinates (Fig. S2). Irrespective of the exact problems at play, this illustrates how, in its current state, ChatGPT can not only fail to leverage the best sources of authoritative information but also, much worse, fail to recognize that failure.

Web search results may be haphazardly available from sites (other than databases) that do serve the expected contents to the ChatGPT web retrieval tool. But despite their potential to inject well-targeted pieces of information into the prompt, these results often come up short. For example, they may be germane to the topic without actually containing the specific required information, even when they are derived from the subset of scientific manuscripts that are searchable by ChatGPT; they may also be outdated or not exhaustive, or may mix up pieces of information that are distinct but sufficiently related to cause confusion. Indeed, Fig. S1 shows a ChatGPT query to the Bing search engine whose results may have contained relevant snippets from various websites, as well as a scientific manuscript pertaining to OTOF, but which was insufficient to generate correct answers. In addition, although ChatGPT reasonably inferred the request to be for human OTOF, it also attempted to fetch data for the mouse ortholog.

The second major limitation of ChatGPT in performing bioinformatics is that it does not suffice to retrieve data: those data must be analyzed. While ChatGPT is capable of sophisticated reasoning, the only practical way for it to perform bioinformatics analyses is to use ad hoc computer code running on the retrieved data. As just one example that may be surprising to beginners, ChatGPT readily makes mistakes on simple sequence manipulation tasks (Fig. S3) that can be averted by GFA instructions (Fig. S4). But generated code, in turn, often needs to call upon pre-existing software packages, and while ChatGPT’s built-in “Code interpreter and data analysis” tool is remarkably proficient at writing Python code to analyze user-supplied data, the runtime environment and the code generation style impose significant constraints. Specifically, the runtime environment has no Internet connectivity (and thus cannot directly query databases) and only has a limited set of pre-installed Python packages that do not cover bioinformatics. ChatGPT does not have reliable knowledge of which packages it can call upon (Fig. S5) and, more worryingly, can hallucinate having run code when it in fact merely generated the code and hallucinated an incorrect output (Fig. S6). Furthermore, stereotypical errors are often present in generated code, such as running on “toy” data (Fig. S7)—a good practice for first-pass testing while writing programs but a potentially disastrous practice when end users do not realize such data was the basis of the analysis output.
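
As an illustration of the sequence manipulation point above, the minimal Biopython snippet below shows the kind of explicit, code-based manipulation that avoids the "in-prompt" errors of Fig. S3; the sequence is an arbitrary example, and the snippet is not taken from GFA's instruction files.

```python
# Minimal illustration of delegating sequence manipulation to code rather than
# performing it "in the prompt" (the practice motivating Fig. S3/S4).
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print("Reverse complement:", dna.reverse_complement())
print("Translation:", dna.translate(to_stop=True))
```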

NagGPT/GFA design

GFA-specific persistent knowledge and software extras

Our approach inserts itself in the process of data retrieval and analysis in three ways (Fig. 1). First, our “GFA” custom GPT comprises instructions (Table S1), automatically applied without user intervention, that fall into three loose categories. A first category guides code generation and the overall conversation flow, e.g. by requesting that GFA state the analysis plan before executing it and that it iteratively fix runtime errors itself, without user intervention. A second category helps define analysis strategies, notably by requesting systematic querying of everything that can be queried rather than reliance on intrinsic ChatGPT knowledge, by favoring the use of APIs and the broad-stroke implementation strategies suggested in local files, and by promoting rigor and double-checking. Importantly, the regular ChatGPT web retrieval tool is not made available to GFA, so that all data retrieval takes place through calls placed to an external NagGPT server via a generated OpenAI “action.” Finally, a third category of instructions is tailored to specific tools and to recurrent patterns of model misbehavior. For example, instructions and local binary archives (Table S2) allow GFA to install Python modules such as BioPython [29], or binaries such as the multiple sequence alignment program kAlign [30], in a manner allowed by the sandboxed runtime environment. Instructions that GFA did not consistently follow in its earlier versions are repeated in emphatic and dramatic language.
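
The sketch below illustrates how such local archives can be installed inside the offline sandbox. The file names and paths are hypothetical (the actual archives are listed in Table S2), and the kalign flags shown should be checked against the installed version's help output.

```python
# Sketch of installing software inside OpenAI's offline sandbox from files shipped
# with the custom GPT. File names and paths below are hypothetical.
import subprocess
import os
import stat

# Install a locally provided wheel without network access.
subprocess.run(
    ["pip", "install", "--no-index", "--find-links", "/mnt/data", "biopython"],
    check=True,
)

# Unpack a statically linked kalign binary and make it executable.
subprocess.run(["tar", "xJf", "/mnt/data/kalign_linux_x86_64.tar.xz", "-C", "/tmp"], check=True)
kalign = "/tmp/kalign"
os.chmod(kalign, os.stat(kalign).st_mode | stat.S_IEXEC)

# Align previously written FASTA sequences (flags indicative; check `kalign -h`).
subprocess.run([kalign, "-i", "/tmp/sequences.fasta", "-o", "/tmp/alignment.afa"], check=True)
```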

NagGPT transformation of GFA requests

Second, before forwarding queries received from GFA to upstream servers, NagGPT repeatedly applies various transformations until no further rules can be matched. A first category is based on pre-established patterns of invalid queries observed to have been generated by GFA. Frequent errors include incorrect API endpoints or parameters, often hallucinated reproducibly in a “logical” way, similar to what a human programmer might plausibly assume to be correct usage; they also include confusion between different kinds of identifiers. This confusion is pervasive and applies not only when two databases have different numbers as the primary identifier to access related entities (e.g. the numerical identifier for a sequence in NCBI’s “nuccore” nucleotide database is not the same as the identifier for a corresponding entry in NCBI’s “gene” database; see Fig. S8) but even when the identifiers have a structure that naturally distinguishes them: for example, ChatGPT can try to access an Ensembl gene entry using a numerical identifier corresponding to an NCBI gene, despite that identifier not having the expected “ENS” prefix. NagGPT detects patterns it knows to be erroneous and automatically rewrites the requests when a substitute is obvious, or otherwise blocks them and sends GFA an informative error message. Requests that could be correct but appear suspicious are processed, with a warning directed at GFA appended to their results.
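
A minimal sketch of one such check is shown below; the regular expression and the message returned to GFA are illustrative rather than NagGPT's actual rules.

```python
# Sketch of the identifier-confusion check described above (regex and message illustrative).
import re

ENSEMBL_LOOKUP_ID = re.compile(r"https?://rest\.ensembl\.org/lookup/id/([^/?]+)")


def check_ensembl_lookup(url: str):
    """Return (forward?, message_for_GFA). A purely numerical ID cannot be an Ensembl ID."""
    match = ENSEMBL_LOOKUP_ID.search(url)
    if match and match.group(1).isdigit():
        return False, (
            f"'{match.group(1)}' looks like an NCBI Gene ID, not an Ensembl ID "
            "(expected an 'ENS...' accession). First obtain the Ensembl identifier, "
            "e.g. from the NCBI gene record or an Ensembl cross-reference lookup."
        )
    return True, None
```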

A second category of GFA request transformation applies when the target database is not accessible through the kind of Representational State Transfer (REST) API on which GFA is inclined to rely. This applies to the Genome Aggregation Database (gnomAD) [31], which has a GraphQL [32] interface, or to the Drosophila genetics and molecular biology database FlyBase, which can be queried through a Structured Query Language (SQL) client. In those cases, NagGPT takes a REST-like request (as detailed in GFA’s internal documentation) and translates it internally to the appropriate scheme, which is more reliable than GFA attempting to craft a query following that scheme.
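
The sketch below illustrates this translation for gnomAD. The GraphQL query shape and field names are assumptions made for illustration and would need to be checked against the live gnomAD schema; NagGPT's actual translation follows the scheme documented in GFA's internal files.

```python
# Sketch of translating a REST-like request from GFA into a GraphQL POST for gnomAD.
# The query shape and field names are assumptions for illustration only.
import json
import urllib.request
from urllib.parse import urlsplit, parse_qs


def translate_gnomad_request(rest_like_url: str) -> dict:
    """e.g. '.../gnomad/gene?gene_symbol=OTOF&reference_genome=GRCh38' -> GraphQL payload."""
    params = parse_qs(urlsplit(rest_like_url).query)
    symbol = params["gene_symbol"][0]
    genome = params.get("reference_genome", ["GRCh38"])[0]
    query = (
        '{ gene(gene_symbol: "' + symbol + '", reference_genome: ' + genome + ') '
        '{ gene_id symbol chrom start stop } }'
    )
    return {"query": query}


# Hypothetical REST-like URL as GFA might generate it, per GFA's internal documentation.
payload = translate_gnomad_request(
    "https://naggpt.example/gnomad/gene?gene_symbol=OTOF&reference_genome=GRCh38"
)
request = urllib.request.Request(
    "https://gnomad.broadinstitute.org/api",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(request) as response: print(response.read())
```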

A third and final kind of request transformation relies on examining the results before they are returned to GFA. When a search returns no matches, it can be because the criteria were more restrictive than expected by GFA—e.g. because the search required an exact match on the whole field or because the search did not consider name aliases. Thus, when no matches are present and the search formulation is known to be associated with false negatives, NagGPT issues a follow-up request with loosened criteria.
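
A sketch of this loosen-and-retry behavior is shown below, using UniProt's gene_exact and gene query fields as one example; the actual fallback rules differ by database.

```python
# Sketch of "loosen and retry": if an exact-match search returns no hits, reissue it
# with broader criteria and warn GFA. UniProt fields shown as one illustrative rule.
import json
import urllib.parse
import urllib.request

SEARCH = "https://rest.uniprot.org/uniprotkb/search?query={query}&format=json&size=5"


def fetch_hits(query: str) -> list:
    with urllib.request.urlopen(SEARCH.format(query=urllib.parse.quote(query))) as r:
        return json.load(r).get("results", [])


def search_with_fallback(symbol: str, taxon: str = "9606"):
    hits = fetch_hits(f"gene_exact:{symbol} AND organism_id:{taxon}")
    if hits:
        return hits, None
    # No exact match: retry with the broader 'gene' field and warn GFA that the
    # results may have matched gene aliases rather than the primary symbol.
    hits = fetch_hits(f"gene:{symbol} AND organism_id:{taxon}")
    warning = "No exact gene-name match; results below matched gene aliases." if hits else None
    return hits, warning
```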

Large language model prompt and behavior management

A third, important, role of NagGPT is to manage the text response it sends to GFA and that is inserted in the prompt to continue the conversation. Three different components are integrated as applicable: (i) the full response from the upstream server, when below a threshold length, or else an abbreviated “sneak peek” along with an inferred document schema for JavaScript Object Notation (JSON) [33] responses (the precise length of the sneak peek and schema may gain from optimization in the future, as high values are likely detrimental to LLM performance even when the context window is not overwhelmed [16]); (ii) comments on the query or its upstream response (e.g. potential reasons for an absence of search hits or warnings that results may not be exactly as expected); and (iii) instructions as to how GFA should proceed. These “just-in-time” reminders, targeted to the local context, aim to enhance instruction following. GFA is repeatedly encouraged to perform all the relevant NagGPT requests as part of the same conversation turn so that all the responses are collected as part of a single archive made available to GFA through the ChatGPT web interface. GFA is also repeatedly encouraged to leverage the sneak peeks and schemas to generate its code, in an attempt to avoid multiple trial-and-error rounds of Python code generation to access the correct fields.
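
The sketch below illustrates this gatekeeping of responses; the snippet length, schema format, and injected reminder are illustrative values rather than NagGPT's actual parameters.

```python
# Sketch of response gatekeeping: short responses pass through verbatim; long ones are
# replaced by a "sneak peek" plus an inferred JSON schema and a targeted reminder.
import json

SNIPPET_CHARS = 1500  # illustrative threshold


def infer_schema(value):
    """Small structural summary: dict keys, the first element of lists, and leaf types."""
    if isinstance(value, dict):
        return {k: infer_schema(v) for k, v in value.items()}
    if isinstance(value, list):
        return [infer_schema(value[0])] if value else []
    return type(value).__name__


def summarize_response(body: str) -> str:
    if len(body) <= SNIPPET_CHARS:
        return body
    parts = [
        f"Response is {len(body)} characters; the full copy was saved to Google Drive.",
        "Sneak peek:",
        body[:SNIPPET_CHARS],
    ]
    try:
        parts += ["Inferred schema:", json.dumps(infer_schema(json.loads(body)), indent=1)]
    except json.JSONDecodeError:
        pass
    parts.append("Generate your analysis code against this schema; do not re-request the data.")
    return "\n".join(parts)
```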

Another channel for enhanced control of GFA behavior is provided by Python scripts or code snippets included as part of GFA local files and instructions. Some execute tasks that GFA is not able to reliably code independently, even by trial and error. Importantly, in addition to performing the desired actions, the scripts can output instruction reminders—e.g. instructions on correct invocation of a program that go counter to ChatGPT’s propensities. Another kind of script modifies the environment in the Python runtime to cause instruction-violating code generated by GFA to fail early with a useful error message (see, e.g., comment #5 in Table S1).
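
One illustrative possibility for such an early-failure guard is sketched below; this is an assumption about the general mechanism, not necessarily what comment #5 in Table S1 implements.

```python
# Illustrative guard: patch the runtime so that a forbidden action fails immediately
# with an instruction reminder, rather than a cryptic low-level error that GFA may
# misdiagnose. (Hypothetical example of the mechanism, not GFA's actual script.)
import socket


def _no_network(*args, **kwargs):
    raise RuntimeError(
        "This sandbox has no internet access. Do NOT fetch data from code; "
        "all required data has already been retrieved via NagGPT and saved to local files."
    )


# Any socket creation (urllib, requests, Entrez, ...) now fails with the reminder above.
socket.socket = _no_network
socket.create_connection = _no_network
```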

Altogether, NagGPT does not just forward query responses to GFA but also uses the specific context of each query to give GFA targeted, repeated warnings and instruction reminders—ergo the name NagGPT.

Analysis enhancements with NagGPT/GFA

The combination of engineered GFA instructions and knowledge files on the one hand and programmed NagGPT rules on the other substantially enhances bioinformatics analysis capabilities. This is demonstrated, e.g., by GFA responding correctly to the example user query on OTOF isoform lengths illustrated earlier, on which plain ChatGPT failed (Fig. S9). As more sophisticated examples, GFA and NagGPT make it possible to retrieve the sequences for a gene in multiple species, align them, compute similarities, and compute a phylogenetic tree (Fig. 2, Fig. S10) or to retrieve a set of sequences that match a combination of user-defined criteria (Table 2, Table S3, Fig. S11). GFA is able to rely on ChatGPT’s knowledge of Biopython and general bioinformatics analysis techniques, even in domains not covered by its instructions or knowledge files. For example, a prompt to identify transmembrane domains triggers an analysis of hydrophobicity (Fig. 3, Fig. S12), essentially re-implementing a well-known prediction technique [34].
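
A minimal sketch of that hydropathy-based approach is shown below, using Biopython's ProtParam module with the Kyte-Doolittle scale; the sequence is a synthetic toy example and the cutoff is illustrative, not taken from GFA's generated code.

```python
# Minimal sketch of the hydropathy analysis GFA arrived at in Fig. 3/Fig. S12:
# a GRAVY score plus a Kyte-Doolittle sliding window, flagging hydrophobic windows
# as candidate transmembrane segments. Sequence and cutoff are placeholders.
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.SeqUtils import ProtParamData

# Synthetic toy sequence: hydrophilic ends flanking a hydrophobic stretch.
protein = "MDEKRNSTQE" + "LLVVAAILFFVVLLIAAVV" + "KRDEQNSTPG"
analysis = ProteinAnalysis(protein)
print("GRAVY:", round(analysis.gravy(), 3))

window = 19  # typical transmembrane helix length
scores = analysis.protein_scale(ProtParamData.kd, window)
candidates = [i for i, score in enumerate(scores) if score > 1.6]  # illustrative cutoff
print(f"{len(candidates)} window position(s) above the hydropathy cutoff")
```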

Figure 2.

Phylogenetic tree for the P53 gene, generated by GFA, for a prompt starting with “Create a phylogenetic tree from the alignments of TP53 sequences from zebrafish, rat, mouse, pig, cow, monkey, chimpanzee, human, and gorilla” (full conversation in Fig. S10). The sequence retrieval, alignment, and subsequent processing were performed by GFA de novo, in the OpenAI code analysis sandbox environment.

Table 2.

Example subset of sequences retrieved by GFA in response to the prompt “Retrieve all the reviewed mammalian insulin sequences from UniProt.” The full list is shown in Table S3, and the conversation in Figure S11.

Primary Accession Organism Sequence
P01308 Homo sapiens MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
P01317 Bos taurus MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQLENYCN
P01315 Sus scrofa MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN
Q91XI3 Ictidomys tridecemlineatus MALWTRLLPLLALLALLGPDPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRREVEEQQGGQVELGGGPGAGLPQPLALEMALQKRGIVEQCCTSICSLYQLENYCN
P01329 Cavia porcellus MALWMHLLTVLALLALWGPNTGQAFVSRHLCGSNLVETLYSVCQDDGFFYIPKDRRELEDPQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN
P17715 Octodon degus MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLEAGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP

Figure 3.

Grand average of hydropathicity score computed by GFA on an insulin receptor sequence, in a conversation starting with the prompt “Download the human insulin receptor protein and identify whether it contains transmembrane domains” (full conversation, displaying GFA self-correction as well as a critical silent bug, in Fig. S12).

Interestingly, GFA often corrects itself without user intervention. It does so, sometimes multiple times in the same conversation turn, on the basis of the “nagging” NagGPT responses detailed above, runtime errors generated by its Python code (e.g. Fig. S11 or S13), or simply an examination of the results it produced (Fig. S14).

But in spite of GFA’s self-correction ability, critical issues remain that affect result correctness. First, basic algorithmic mistakes—not specific to the bioinformatics nature of the task—can occur in generated code without causing runtime errors and must be identified through careful user review (Fig. S12). Second, some problematic ChatGPT behaviors are not fully suppressed despite GFA instructions targeting them. For example, GFA can still rely on intrinsic ChatGPT knowledge for information such as gene IDs, instead of performing a database lookup (this does not always lead to errors, as that intrinsic knowledge is often correct). GFA sometimes also struggles with performing exhaustive analyses when its queries are too broad and generate many hits that it does not analyze thoroughly (Fig. S11). Third, retrieved data hallucination is infrequent but does occur, perhaps particularly so when data retrieval fails (Fig. S15; GFA being interrupted during response generation, on a long DNA sequence, can be indicative of hallucination). At present, these issues can only be mitigated by a human operator who has the ability to detect and correct errors and to maintain sustained vigilance.

Discussion

The GFA custom GPT and NagGPT backend server we report here substantially enhance ChatGPT’s bioinformatics capabilities. They expand on the idea of explicitly teaching a coding model about a database’s request API as, e.g. GeneGPT for NCBI [35]. They also build on the idea of using an LLM to run computational analyses (e.g. [36]). Together, they cover the APIs of multiple databases that have complementary information, block access to nonvalidated external sources of information, and help steer analyses run with a powerful generalist model.

Despite the improvements achieved by GFA/NagGPT over plain ChatGPT, GFA must not be trusted to single-handedly produce correct analysis results, even with a well-crafted prompt. This is evident from the frequent need for the user to redirect or correct GFA for the analysis to complete, a need that stems not only from the errors inherent to LLM code generation and data processing but also, more specifically, from the issues we report here. Critical issues are occasional hallucinations and the generation of incorrect code that does not trigger runtime errors and that is not self-corrected. GFA users must therefore be willing and able to help GFA along in its analysis and, crucially, to review the generated Python code even when no issues are readily apparent.

Use cases of AI bioinformatics assistants in light of their shortcomings

It is vital to carefully consider the dangers posed by all AI tools, including GFA. Chief among these dangers are undetected errors introduced by the tools, which lead to incorrect analysis results and conclusions. Errors in bioinformatics software and ensuing incorrect analyses can cascade into serious downstream consequences, in science and beyond [37, 38]. Importantly, this danger applies not only to inexperienced users, who are prone to developing an illusion of competence as a result of AI assistance [39], but also to experienced users, who may develop AI overreliance [40]. Another danger is for users with insufficient expertise to draw incorrect conclusions—even from correct assistant output—because of a limited grasp of field-specific concepts, methodologies, and pitfalls. These dangers must inform the use of GFA or similar tools, as well as future directions for tool development.

Given the potential for inaccurate outputs, what use can be made of GFA in its current state of development and by which kind of user? We suggest two possibilities. First and foremost, GFA will serve as a basis to explore avenues to further increase the reliability of AI bioinformatics assistants. Second, GFA may act as a teaching assistant of sorts, helping students develop skills and apply concepts in real-world contexts without requiring them to complete extensive, formal prerequisites [16]. Despite its limitations, ChatGPT has been reported to serve as a useful programming “learning companion” [41] that can enhance computational skill acquisition [42] without becoming a crutch [43]. ChatGPT can assist coding and statistical analysis in R [44, 45], at least for users with prior skills, and has strengths in generating code to access and process bioinformatics data [46]. Moreover, GFA’s level of technical sophistication may be suitable for a learning companion. For example, the attempt at transmembrane segment prediction shown in Fig. S12 relies on a method that has been superseded [47]—but that is perhaps of greater pedagogical value to a bioinformatics beginner than more sophisticated and less accessible counterparts.

In the future, bioinformatics experts may also benefit from AI assistants with enhanced reliability. Such tools would likely not be useful alternatives to data processing pipelines that apply a well-established workflow at scale. Rather, their primary strength may lie in interactive exploration, quick prototyping, and live, assisted programming [48]. A live programming environment, supported in the context of bioinformatics by live genomics data retrieval as provided by GFA, facilitates debugging and code validation [49, 50]. Reliable code-writing assistance may boost the productivity of seasoned bioinformaticians, as ChatGPT does for general coding [51], and similarly to effects observed on writing tasks [52]. Furthermore, AI may help follow best practices sometimes neglected in bioinformatics [38], perhaps in part because of “end-user programming” [53–55]. For example, automated generation of unit tests may increase their prevalence [56, 57].

Further into the future, AI bioinformatics assistants may lower entry barriers for biologists wanting to acquire bioinformatics skills [37] and may thereby boost interdisciplinarity and research creativity [58, 59]. However, safely achieving this goal will require AI tool reliability well beyond current levels.

Future directions

What are avenues to enhance tools such as GFA? We first address mitigations of AI risks. Numerous possibilities exist [40]. Well-crafted information presented upon initial use of an AI assistant, as well as periodic reminders, “nudges,” and “cognitive forcing functions,” can help users form a correct mental model of the limitations of an AI assistant and stay vigilant. As another example, AI-generated code and outputs can be automatically labeled to help preserve risk information upon reuse in other projects. Both strategies are readily applicable to GFA. Another mitigation is continuous monitoring by development and hosting teams, to detect behaviors or failures indicative of decreased output trustworthiness [60]. In this vein, NagGPT logs failures and follow-up requests that may have resolved them. Finally, formal code verification is arduous, and even partial validation is challenging [38, 61]. However, multi-agent verification may enhance reliability, with an integrated AI companion monitoring GFA’s output for mistakes, focusing on error patterns reported here.

Beyond AI risk mitigation, multiple avenues exist for future improvements. First, we made the initial choice of keeping computations within OpenAI’s sandboxed data analysis environment, avoiding the need for users to provide their own. We worked around the lack of Internet connectivity and limitations in pre-installed software. Yet important limitations remain, notably in the length of time any individual call to the runtime can take before being forcibly interrupted (on the order of minutes), the maximum storage capacity, and the ephemerality of user files. These limitations can be lifted by sending the generated code to a custom runtime environment hosted independently of OpenAI [62] or by generating function calls to a scheduler that invokes shell commands and connects their inputs and outputs [36]. A custom environment would also simplify the running of code in different languages, such as R, in which ChatGPT is proficient and for which a rich set of bioinformatics packages exist [63], and would allow for finer control of execution feedback loops during which generated code is improved based on runtime errors and, optionally, human feedback [64].

Second, we chose ChatGPT because of its strong abilities and because of its data analysis environment. However, any generative LLM could be used, as long as it inserts the appropriate function calls in its output, its generation is managed by an orchestrator capable of handling those calls, and its context window is sufficiently long; we expect that NagGPT would work, e.g. with open-weights models hosted by together.ai or with Anthropic’s Claude models. The correctness of generated code, prior to human intervention, varies strongly by LLM [19, 65], and our approach could be readily adapted to future LLMs with enhanced code generation performance (an interesting question is whether models other than ChatGPT-4o would require the same corrections made by NagGPT to their generated queries). It would also be possible to insert fine-tuned models that learned to invoke external tools or APIs on a larger scale and with more autonomy [66–72] and to optimize their use [73, 74].

Third, GFA’s knowledge base and capabilities can be expanded with extra natural language API documentation and data analysis guidance, with extra Python code or with extra binary executables or Python modules. GFA benefits from the transparent retrieval-augmented generation feature of OpenAI’s custom GPTs. It may gain from explicit control of the RAG process, for code generation [75] or to better leverage extensive documentation of specific APIs. For example, GeneGPT [35] translates natural-language queries to a set of calls to the NCBI API by prompting the code-generating Codex model with examples of NCBI API usage.

Fourth, greater control could be exerted on the behavior of the model, the queries it generates, and the processing of the queries back into the model’s prompt. In addition to model fine-tuning or enhanced prompting as discussed above, constrained generation [76–78] could prevent the generation of requests that are known a priori to be invalid. Furthermore, multiple models could interact, whereby a high-performing, generalist model handles the user conversation and the analysis flow, with input from more specialized models on the specific tools to invoke or the specific database queries to place. Specialized models could also assist with extraction of information from database responses (e.g. [79, 80]) or with protection against prompt injection attacks [81].

Fifth, an important future goal will be to reduce hallucinations. ChatGPT-4 hallucinates substantially less than ChatGPT-3.5 [82, 83], a trend possibly continued by the more recent ChatGPT-4o, but not as clearly by the “o1” models [84–86]. Yet we observed multiple instances of hallucination, with ChatGPT-4o generating code but hallucinating its output instead of actually running it, or claiming to have retrieved a result from a database when it had not. The former kind of hallucination appears tentatively to be infrequent in the specific context provided by GFA. The latter is problematic: a major motivation to insert results of database queries in the model’s prompt is precisely to limit hallucination. It appears to occur more frequently when errors arise during code execution or database communication, stressing the need to clearly propagate any errors to the model’s prompt.

Conclusion

We reported a practical way to augment an LLM with a modular tool that provides trustworthy information and that steers the LLM. Focusing on bioinformatics, we identified concrete improvements in LLM functionality while highlighting critical issues that remain. Limitations of our approach are its empirical nature, the lack of formal benchmarking, and that we do not know the extent to which the results may have been influenced by the context of the specific examples we picked—if only because the composition of ChatGPT’s training set is not public. Nonetheless, even if it did not yield a trustworthy AI bioinformatics assistant, our approach does unlock substantial functionality that is otherwise not available with plain LLM systems such as ChatGPT. Going forward, tools facilitating queries from authoritative, curated databases are likely to be of increasing importance in biology and beyond, given the decrease in the quality of the scientific literature [87] and given the possibility that LLM-generated content with errors will increasingly be found in web search results and injected back into LLM prompts. Additionally, should the raw performance of future LLMs plateau or come at an unsustainable cost, LLM augmentation offers a promising avenue to increase AI assistant reliability.

Key Points

  • Poor data retrieval, hallucination, and incorrect data processing make ChatGPT a poor out-of-the-box bioinformatics assistant.

  • GFA, an OpenAI custom GPT, in tandem with NagGPT, a middleware tool, facilitates retrieval of trusted genomics data and steering of live ChatGPT analysis.

  • Substantial outstanding issues must still be addressed before ChatGPT can serve as a reliable bioinformatics assistant.

Supplementary Material

supplemental_material_final_bbaf045
supplemental_files_S1_S2_S3_bbaf045_tar_bbaf045_xz

Acknowledgements

Thank you to the anonymous reviewers, whose comments substantially improved this manuscript.

Funding

None declared.

Data availability

The data and code underlying this article are available in the article, in its supplemental material, and at https://naggpt.com/NagGPT_source_code_repos/1/.

References

  • 1. Kandpal  N, Deng  H, Roberts  A. et al.  Large language models struggle to learn long-tail knowledge. Proceedings of the 40th International Conference on Machine Learning, in PMLR, 2023;202:15696–707. https://proceedings.mlr.press/v202/kandpal23a.html. [Google Scholar]
  • 2. Sun  K, Xu  Y, Zha  H. et al.  Head-to-tail: How knowledgeable are large language models (LLMs)? A.K.A. will LLMs replace knowledge graphs? In: Duh K, Gomez H, Bethard S. (eds). Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, 2024; 311–25. https://aclanthology.org/2024.naacl-long.18.
  • 3. Huang  L, Yu  W, Ma  W. et al.  A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst 2025;43:42. 10.1145/3703155. [DOI] [Google Scholar]
  • 4. Pal  S, Bhattacharya  M, Lee  S-S. et al.  A domain-specific next-generation large language model (LLM) or ChatGPT is required for biomedical engineering and research. Ann Biomed Eng  2024;52:451–4. 10.1007/s10439-023-03306-x. [DOI] [PubMed] [Google Scholar]
  • 5. Lewis  P, Perez  E, Piktus  A. et al.  Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst 2020;33:9459–74. http://arxiv.org/abs/2005.11401. [Google Scholar]
  • 6. De Cao  N, Aziz  W, Titov  I. Editing factual knowledge in language models. In: Moens M-F, Huang X, Specia L, Yih SW. (eds). Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021;6491–506. https://aclanthology.org/2021.emnlp-main.522.
  • 7. Sinitsin  A, Plokhotnyuk  V, Pyrkin  D. et al. Editable Neural Networks. 2020. http://arxiv.org/abs/2004.00345.
  • 8. Hu  Y, Chen  Q, du  J. et al.  Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc  2024;31:1812–20. 10.1093/jamia/ocad259. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Wu  Y, Hu  N, Bi  S., et al. Retrieve-Rewrite-Answer: A KG-to-Text Enhanced LLMs Framework for Knowledge Graph Question Answering. 2023. http://arxiv.org/abs/2309.11206.
  • 10. Razniewski  S, Yates  A, Kassner  N. et al. Language Models As or for Knowledge Bases. 2021. http://arxiv.org/abs/2110.04888.
  • 11. Hatakeyama-Sato  K, Igarashi  Y, Katakami  S. et al. Teaching Specific Scientific Knowledge into Large Language Models through Additional Training. 2023. http://arxiv.org/abs/2312.03360.
  • 12. He  Q, Wang  Y, Wang  W. Can Language Models Act as Knowledge Bases at Scale? 2024. http://arxiv.org/abs/2402.14273.
  • 13. Wang  J, Cheng  Z, Yao  Q. et al.  Bioinformatics and biomedical informatics with ChatGPT: Year one review. Quant Biol  2024;12:345–59. 10.1002/qub2.67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Piccolo  SR, Denny  P, Luxton-Reilly  A. et al.  Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course. PLoS Comput Biol  2023;19:e1011511. 10.1371/journal.pcbi.1011511. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Shue  E, Liu  L, Li  B. et al.  Empowering beginners in bioinformatics with ChatGPT. Quant Biol  2023;11:105–8. 10.15302/J-QB-023-0327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Jansen  JA, Manukyan  A, Al Khoury  N. et al. Leveraging Large Language Models for Data Analysis Automation. 2023. 10.1101/2023.12.11.571140. [DOI] [Google Scholar]
  • 17. Wang  L, Ge  X, Liu  L. et al.  Code interpreter for bioinformatics: Are we there yet?  Ann Biomed Eng  2024;52:754–6. 10.1007/s10439-023-03324-9. [DOI] [PubMed] [Google Scholar]
  • 18. Bubeck  S, Chandrasekaran  V, Eldan  R. et al. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. 2023. http://arxiv.org/abs/2303.12712.
  • 19. Nejjar  M, Zacharias  L, Stiehle  F. et al.  LLMs for science: Usage for code generation and data analysis. J Softw Evol Process  2024;37:e2723. 10.1002/smr.2723. [DOI] [Google Scholar]
  • 20. Kim  S, Moon  S, Tabrizi  R. et al.  An LLM compiler for parallel function calling. In: Proceedings of the 41st International Conference on Machine Learning. 2024. http://arxiv.org/abs/2312.04511.
  • 21. Cunningham  F, Allen  JE, Allen  J. et al.  Ensembl 2022. Nucleic Acids Res  2022;50:D988–95. 10.1093/nar/gkab1049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Sayers  EW, Bolton  EE, Brister  JR. et al.  Database resources of the national center for biotechnology information. Nucleic Acids Res  2022;50:D20–6. 10.1093/nar/gkab1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. The UniProt Consortium, Bateman  A, Martin  M-J. et al.  UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Res  2023;51:D523–31. 10.1093/nar/gkac1052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Harris  TW, Arnaboldi  V, Cain  S. et al.  WormBase: A modern model organism information resource. Nucleic Acids Res  2019;48:gkz920. 10.1093/nar/gkz920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Larkin  A, Marygold  SJ, Antonazzo  G. et al.  FlyBase: Updates to the Drosophila melanogaster knowledge base. Nucleic Acids Res  2021;49:D899–907. 10.1093/nar/gkaa1026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Maglott  D, Ostell  J, Pruitt  KD. et al.  Entrez gene: Gene-centered information at NCBI. Nucleic Acids Res  2007;35:D26–31. 10.1093/nar/gkl993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Kolata  G. Gene therapy allows an 11-year-old boy to hear for the first time. The New York Times  2024. https://www.nytimes.com/2024/01/23/health/deaf-gene-therapy.html. [Google Scholar]
  • 28. UniProt. UniProt entry for Q9HC10. 2024. https://www.uniprot.org/uniprotkb/Q9HC10/entry.
  • 29. Cock  PJA, Antao  T, Chang  JT. et al.  Biopython: Freely available python tools for computational molecular biology and bioinformatics. Bioinformatics  2009;25:1422–3. 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Lassmann  T. Kalign 3: Multiple sequence alignment of large datasets. Bioinformatics  2020;36:1928–9. 10.1093/bioinformatics/btz795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Karczewski  KJ, Francioli  LC, Tiao  G. et al.  The mutational constraint spectrum quantified from variation in 141,456 humans. Nature  2020;581:434–43. 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. He  H, Singh  AK. Graphs-at-a-time: Query language and access methods for graph databases. Proceedings of the 2008 ACM SIGMOD international conference on Management of data  2008; 405–18. 10.1145/1376616.1376660. [DOI]
  • 33. Bray  T. The JavaScript object notation (JSON). Data Interchange Format  2017;RFC8259. https://www.rfc-editor.org/info/rfc8259. [Google Scholar]
  • 34. Kyte  J, Doolittle  RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol  1982;157:105–32. 10.1016/0022-2836(82)90515-0. [DOI] [PubMed] [Google Scholar]
  • 35. Jin  Q, Yang  Y, Chen  Q. et al.  GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics  2024;40:btae075. 10.1093/bioinformatics/btae075. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Duque  A, Syed  A, Day  KV. et al.  Leveraging large language models to build and execute computational workflows. In: Proceedings of WORKS ’23: 18th Workshop on Workflows in Support of Large-Scale Science. Denver, CO: ACM, 2023. http://arxiv.org/abs/2312.07711.
  • 37. Joppa  LN, McInerny  G, Harper  R. et al.  Troubling trends in scientific software use. Science  2013;340:814–5. 10.1126/science.1231535. [DOI] [PubMed] [Google Scholar]
  • 38. Kamali  AH, Giannoulatou  E, Chen  TY. et al.  How to test bioinformatics software?  Biophys Rev  2015;7:343–52. 10.1007/s12551-015-0177-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Prather  J, Reeves  BN, Leinonen  J. et al.  The widening gap: The benefits and harms of generative AI for novice programmers. Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1  2024; 469–86. 10.1145/3632620.3671116. [DOI] [Google Scholar]
  • 40. Passi  S, Vorvoreanu  M. Overreliance on AI: Literature Review. Microsoft Research, 2022. https://www.microsoft.com/en-us/research/uploads/prod/2022/06/Aether-Overreliance-on-AI-Review-Final-6.21.22.pdf. [Google Scholar]
  • 41. Bringula  R. ChatGPT in a programming course: Benefits and limitations. Front Educ  2024;9:1248705. 10.3389/feduc.2024.1248705. [DOI] [Google Scholar]
  • 42. Yilmaz  R, Karaoglan Yilmaz  FG. The effect of generative artificial intelligence (AI)-based tool use on students’ computational thinking skills, programming self-efficacy and motivation. Comput Educ Artif Intell  2023;4:100147. 10.1016/j.caeai.2023.100147. [DOI] [Google Scholar]
  • 43. Kazemitabaar  M, Chow  J, Ma  CKT. et al.  Studying the effect of AI code generators on supporting novice learners in introductory programming. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems  2023; 1–23. 10.1145/3544548.3580919. [DOI]
  • 44. Meyer  A, Ruthard  J, Streichert  T. Dear ChatGPT – Can you teach me how to program an app for laboratory medicine?  J Lab Med  2024;48:197–201. 10.1515/labmed-2024-0034. [DOI] [Google Scholar]
  • 45. Ghosh  A, Li  H, Trout  AT. Large language models can help with biostatistics and coding needed in radiology research. Acad Radiol  2024;S1076633224006913. 10.1016/j.acra.2024.09.042. [DOI] [PubMed] [Google Scholar]
  • 46. AI4Science MR, Quantum  MA. The Impact of Large Language Models on Scientific Discovery: A Preliminary Study Using GPT-4. Redmond, USA: Microsoft, 2023. http://arxiv.org/abs/2311.07361.
  • 47. Duart  G, Graña-Montes  R, Pastor-Cantizano  N. et al.  Experimental and computational approaches for membrane protein insertion and topology determination. Methods  2024;226:102–19. 10.1016/j.ymeth.2024.03.012. [DOI] [PubMed] [Google Scholar]
  • 48. Hu  G, Liu  L, Xu  D. On the responsible use of Chatbots in bioinformatics. Genomics Proteomics Bioinformatics  2024;22:qzae002. 10.1093/gpbjnl/qzae002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Ferdowsi  K, Huang  R, James  MB. et al.  Validating AI-generated code with live programming. Proceedings of the CHI Conference on Human Factors in Computing Systems  2024;1–8. http://arxiv.org/abs/2306.09541. [Google Scholar]
  • 50. Reeves  BN, Prather  J, Denny  P. et al.  Prompts first, finally. 2024. http://arxiv.org/abs/2407.09231. [Google Scholar]
  • 51. Peng  S, Kalliamvakou  E, Cihon  P. et al.  The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. Redmond, USA: Microsoft, 2023. http://arxiv.org/abs/2302.06590.
  • 52. Noy  S, Zhang  W. Experimental evidence on the productivity effects of generative artificial intelligence. Science  2023;381:187–92. 10.1126/science.adh2586. [DOI] [PubMed] [Google Scholar]
  • 53. Silva  LB, Jimenez  RC, Blomberg  N. et al.  General guidelines for biomedical software development. F1000Res  2017;6:273. 10.12688/f1000research.10750.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Noor  A. Improving bioinformatics software quality through incorporation of software engineering practices. PeerJ Comput Sci  2022;8:e839. 10.7717/peerj-cs.839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Ferenc  K, Rauluseviciute  I, Hovan  L. et al.  Improving bioinformatics software quality through teamwork. Bioinformatics  2024;40:btae632. 10.1093/bioinformatics/btae632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Bhatia  S, Gandhi  T, Kumar  D. et al.  Unit test generation using generative AI : A comparative performance analysis of autogeneration tools. LLM4Code '24: Proceedings of the 1st International Workshop on Large Language Models for Code  2024; 54–61. 10.1145/3643795.3648396. [DOI] [Google Scholar]
  • 57. Yuan  Z, Lou  Y, Liu  M. et al.  No more Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. 2024. http://arxiv.org/abs/2305.04207.
  • 58. Uzzi  B, Mukherjee  S, Stringer  M. et al.  Atypical combinations and scientific impact. Science  2013;342:468–72. 10.1126/science.1240474. [DOI] [PubMed] [Google Scholar]
  • 59. Wu  L, Wang  D, Evans  JA. Large teams develop and small teams disrupt science and technology. Nature  2019;566:378–82. 10.1038/s41586-019-0941-9. [DOI] [PubMed] [Google Scholar]
  • 60. Resnik  DB, Hosseini  M. The ethics of using artificial intelligence in scientific research: New guidance needed for a new tool. AI Ethics  2024. 10.1007/s43681-024-00493-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Giannoulatou  E, Park  S-H, Humphreys  DT. et al.  Verification and validation of bioinformatics software without a gold standard: A case study of BWA and bowtie. BMC Bioinformatics  2014;15:S15. 10.1186/1471-2105-15-S16-S15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. OpenInterpreter. 2024. https://github.com/OpenInterpreter/open-interpreter.
  • 63. Gentleman  R. R Programming for Bioinformatics. New York, USA: Chapman and Hall/CRC, 2008. https://www.taylorfrancis.com/books/9781420063684.
  • 64. Zheng  T, Zhang  G, Shen  T. et al. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. 2024. http://arxiv.org/abs/2402.14658.
  • 65. Tang  X, Qian  B, Gao  R. et al.  BioCoder: A benchmark for bioinformatics code generation with large language models. Bioinformatics  2024;40:i266–76. 10.1093/bioinformatics/btae230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Schick  T, Dwivedi-Yu  J, Dessì  R., et al. Toolformer: language models can teach themselves to use tools. Adv Neural Inf Process Syst  2023;36:68539–51. https://arxiv.org/abs/2302.04761. [Google Scholar]
  • 67. Erdogan  LE, Lee  N, Jha  S. et al. TinyAgent: function calling at the edge. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2024; 80–88. http://arxiv.org/abs/2409.00608.
  • 68. Gao  S, Shi  Z, Zhu  M. et al.  Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum. AAAI  2024;38:18030–8. 10.1609/aaai.v38i16.29759. [DOI] [Google Scholar]
  • 69. Liu  W, Huang  X, Zeng  X. et al. ToolACE: Winning the Points of LLM Function Calling. 2024. http://arxiv.org/abs/2409.00920.
  • 70. Qin  Y, Liang  S, Ye  Y. et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. 2023. http://arxiv.org/abs/2307.16789.
  • 71. Liu  Y, Yuan  Y, Wang  C. et al. From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs. 2024. http://arxiv.org/abs/2402.18157.
  • 72. Patil  SG, Zhang  T, Wang  X. et al. Gorilla: Large Language Model Connected with Massive APIs. 2023. http://arxiv.org/abs/2305.15334.
  • 73. Singh  S, Karatzas  A, Fore  M. et al. An LLM-Tool Compiler for Fused Parallel Function Calling. 2024. http://arxiv.org/abs/2405.17438.
  • 74. Zhuang  Y, Chen  X, Yu  T. et al. ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search. 2023. http://arxiv.org/abs/2310.13227.
  • 75. Koziolek  H, Grüner  S, Hark  R. et al.  LLM-based and retrieval-augmented control code generation. LLM4Code '24: Proceedings of the 1st International Workshop on Large Language Models for Code  2024; 22–9. 10.1145/3643795.3648384. [DOI] [Google Scholar]
  • 76. Liu  MX, Liu  F, Fiannaca  AJ. et al.  We need structured output: Towards user-centered constraints on large language model output. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems  2024;1–9. http://arxiv.org/abs/2404.07362. [Google Scholar]
  • 77. Willard  BT, Louf  R. Efficient Guided Generation for Large Language Models. 2023. http://arxiv.org/abs/2307.09702.
  • 78. Beurer-Kellner  L, Fischer  M, Vechev  M. Guiding LLMs the Right Way: Fast, Non-Invasive Constrained Generation. 2024. http://arxiv.org/abs/2403.06988.
  • 79. Luo  R, Sun  L, Xia  Y. et al.  BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform  2022;23:bbac409. 10.1093/bib/bbac409. [DOI] [PubMed] [Google Scholar]
  • 80. Cinquin  O. ChIP-GPT: A managed large language model for robust data extraction from biomedical database records. Brief Bioinform  2024;25:bbad535. 10.1093/bib/bbad535. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Greshake  K, Abdelnabi  S, Mishra  S. et al.  Not what You’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security  2023; 79–90. 10.1145/3605764.3623985. [DOI] [Google Scholar]
  • 82. OpenAI, Achiam  J, Adler  S. et al. GPT-4 Technical Report. 2024; http://arxiv.org/abs/2303.08774.
  • 83. Chelli  M, Descamps  J, Lavoué  V. et al.  Hallucination rates and reference accuracy of ChatGPT and bard for systematic reviews: Comparative analysis. J Med Internet Res  2024;26:e53164. 10.2196/53164. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. OpenAI . OpenAI o1 System Card. 2024; https://cdn.openai.com/o1-system-card.pdf.
  • 85. Guan  MY, Joglekar  M, Wallace  E. et al. Deliberative Alignment: Reasoning Enables Safer Language Models. 2024. https://openai.com/index/deliberative-alignment/.
  • 86. Hughes  S, Bae  M. Vectara Hallucination Leaderboard. Palo Alto, USA: Vectara, 2024. https://github.com/vectara/hallucination-leaderboard.
  • 87. Hanson  MA, Barreiro  PG, Crosetto  P. et al.  The strain on scientific publishing. Quant Sci Stud  2024;5:823–43. 10.1162/qss_a_00327. [DOI] [Google Scholar]
