PLOS Computational Biology. 2023 Sep 28;19(9):e1011511. doi: 10.1371/journal.pcbi.1011511

Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course

Stephen R Piccolo 1,*, Paul Denny 2, Andrew Luxton-Reilly 2, Samuel H Payne 1, Perry G Ridge 1
Editor: Francis Ouellette 3
PMCID: PMC10564134  PMID: 37769024

Abstract

Computer programming is a fundamental tool for life scientists, allowing them to carry out essential research tasks. However, despite various educational efforts, learning to write code can be a challenging endeavor for students and researchers in life-sciences disciplines. Recent advances in artificial intelligence have made it possible to translate human-language prompts to functional code, raising questions about whether these technologies can aid (or replace) life scientists’ efforts to write code. Using 184 programming exercises from an introductory-bioinformatics course, we evaluated the extent to which one such tool—OpenAI’s ChatGPT—could successfully complete programming tasks. ChatGPT solved 139 (75.5%) of the exercises on its first attempt. For the remaining exercises, we provided natural-language feedback to the model, prompting it to try different approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings have implications for life-sciences education and research. Instructors may need to adapt their pedagogical approaches and assessment techniques to account for these new capabilities that are available to the general public. For some programming tasks, researchers may be able to work in collaboration with machine-learning models to produce functional code.

Author summary

Life scientists frequently write computer code when doing research. Computer programming can aid researchers in performing tasks that are not supported by existing tools. Programming can also help researchers to implement analytical logic in a way that documents their steps and thus enables others to repeat those steps. Many educational resources are available to teach computer programming, but this skill remains challenging for many researchers and students to master. Artificial-intelligence tools like OpenAI’s ChatGPT are able to interpret human-language requests to generate code. Accordingly, we evaluated the extent to which this technology might be used to perform programming tasks described by humans. To evaluate ChatGPT, we used requirements specified for 184 programming exercises taught in an introductory bioinformatics course at the undergraduate level. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings suggest that some educators may need to reconsider how they evaluate students’ programming abilities, and researchers might be able to collaborate with such tools in research settings.

Introduction

For decades, the life-sciences community has called for researchers to gain a greater awareness of “computing in all its forms” [1]. This need is now greater than ever. A 2016 survey of principal investigators from diverse biology disciplines revealed that almost 90% of researchers had a current or impending need to use computational methods [2]. Computers can help researchers formalize scientific processes [3], accelerate research progress [4], improve job prospects, and even learn biological concepts [5]. These opportunities have motivated the creation of interdisciplinary training programs, courses, workshops, and tutorials to teach computing skills in a life-sciences context [2,6–12]. In some circumstances, it is sufficient for researchers to understand computing concepts and learn to use existing tools; in others, learning to write computer code is invaluable [13]. A 2011 survey of scientists from many disciplines (other than computer science) found that researchers spent 35% of their time, on average, writing code [14]. Computer programming makes it possible to complete tasks not supported by existing tools, interface with software libraries, adapt algorithms based on custom needs, tidy data, and more [15–17]. In these applied scenarios, computer programs are often small [13] and used for only one project.

Scripting languages are well suited to such tasks because researchers can focus on high-level needs and worry less about memory management, code efficiency, and other technical details [15]. Python, a scripting language, has gained much acceptance among scientists [14] and programming educators [18], perhaps due to its relatively simple syntax [19] and the availability of libraries supporting common tasks [20–24]. However, learning to program is a daunting challenge for many researchers. Decades of research have sought to characterize common errors and identify effective ways for novices to learn programming skills [25–30]; much remains to be discovered.

Recent advances in artificial intelligence have shown promise for converting natural-language descriptions of programming tasks to functional code [31,32]. The first such large language models (LLMs) fine-tuned to generate code that captured widespread interest were OpenAI’s Codex and DeepMind’s AlphaCode [33,34]. These models were trained on millions of code examples, representing diverse programming tasks. In November 2022, OpenAI released ChatGPT, which uses an LLM fine-tuned with human feedback to generate natural dialogue-based text and code [35]. Researchers have speculated whether such models may be able to aid researchers—or even replace their efforts on basic programming tasks. For more complicated projects, LLMs might be able to assist in writing or debugging portions of the code. If successful in these settings, LLMs could reduce life scientists’ time spent on programming, leaving more time for other research tasks.

We undertook a study to assess the extent to which an LLM can solve basic computer-programming tasks. By understanding LLMs’ current capabilities and limits [36], we sought to gain perspective on their potential usefulness in life-sciences education and research. We used ChatGPT because it 1) was released recently, 2) is accessible with a Web browser, 3) can interact with users in a conversational manner, and 4) has garnered considerable attention among academics, industry competitors, and the general public [37–43]. By January 2023, ChatGPT had over 100 million active users [44].

We evaluated and documented ChatGPT’s effectiveness on Python-programming exercises from an introductory-bioinformatics course taught primarily to undergraduates. We evaluated how well ChatGPT could interpret the prompts and respond to human feedback to generate functional Python code. Here we describe quantitative and qualitative aspects of ChatGPT’s performance, describe ways that ChatGPT could aid life scientists in research, and discuss implications for teaching and assessing students’ programming capabilities in an educational context.

Methods

Programming exercises

Since 2012, Brigham Young University has offered an introductory bioinformatics course, Introduction to Bioinformatics. The course was designed for novice programmers who have an interest in biology. One learning outcome for the course is that “students will be able to create computer scripts in the Python programming language to manipulate biological data stored in diverse file formats.” To facilitate skill development, the instructors created Python programming exercises, which serve as formative assessments. We used online datasets, articles, and tools to create the exercises [45–54]. Six exercises in the second assignment were derived from an online course [55]. To our knowledge, none of the other exercises were in the public domain at the time of our experiment; thus, it is unlikely that they were used to train the LLM before our testing. We used the “Jan 30 Version Free Research Preview” version of ChatGPT, which used version 3.5 of the Generative Pre-trained Transformer (GPT) model.

The exercises are organized into 19 assignments, each designed to teach a particular concept. Students complete the assignments and exercises in a defined sequence. The assignments cover 1) relatively simple tasks like declaring and using variables, performing mathematical calculations, and writing conditional statements; 2) medium-difficulty tasks like working with strings, lists, loops, dictionaries, and files; and 3) more advanced tasks like writing regular expressions, manipulating tabular data, and creating data visualizations. Other assignments give students practice with techniques that they learned in previous assignments. At specific times throughout the course, students complete additional programming exercises as summative assessments (exams), culminating in an end-of-course summative assessment. We excluded the summative assessments from this study.

The programming exercises are delivered via CodeBuddy, a Web-based application that acts as an automated grader (https://github.com/srp33/CodeBuddy). For each exercise, students receive a prompt describing the problem’s context and requirements. The prompt sometimes includes basic code that students can use as a starting point. Each exercise has at least one test, including inputs and expected outputs. When applicable, the inputs consist of data file(s) provided within the prompt. The expected outputs may be text based (n = 179) or image based (n = 5). To generate the expected outputs, the instructor provides a solution; CodeBuddy executes the code and stores the output. The output of the student’s code must match the expected output exactly. Students can make multiple attempts, as needed, without penalty. In many cases, instructors provide test(s) for which the inputs and/or expected outputs are hidden; this helps to prevent students from writing code that does not address the stated requirements. We excluded these tests from the study to maintain consistency with what students see; we manually verified whether ChatGPT-generated code met the requirements.
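The exact-match grading described above can be illustrated with a minimal sketch. This is not CodeBuddy's implementation; the function names `run_and_capture` and `passes_test` are hypothetical, and the sketch assumes that each test compares only the text printed to standard output.

```python
import contextlib
import io


def run_and_capture(source: str) -> str:
    """Execute Python source code and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {"__name__": "__main__"})
    return buffer.getvalue()


def passes_test(candidate_source: str, expected_output: str) -> bool:
    """Pass only if the candidate's output equals the stored output exactly."""
    return run_and_capture(candidate_source) == expected_output


# Assumption: the expected output is produced by running the instructor's solution.
instructor_solution = "print(2 + 3)"
expected = run_and_capture(instructor_solution)

print(passes_test("print(5)", expected))             # True
print(passes_test("print('Answer:', 5)", expected))  # False: extra text in the output
```

Under this kind of comparison, code with correct logic still fails if it adds a label or changes whitespace.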

We used openpyxl (version 3.1.0) [56] to create a spreadsheet with information about each exercise. One column contains the prompt for each exercise, including the instructions and a summary of each test. Fig 1 shows an example of how the prompts were structured. For image-based tests, we did not include the expected outputs because ChatGPT does not accept images as input. To make the prompts more understandable to ChatGPT—and to mimic what students or researchers might do—we added natural-language transitions between each section of the prompt. Other columns in the spreadsheet include the instructors’ solutions and flags indicating whether each exercise was biology oriented. Many exercise prompts provide biology-based scenarios such that a basic understanding of biology concepts is helpful when interpreting the prompts.

Fig 1. Example prompt for a programming exercise delivered to ChatGPT.

Evaluation approach

We initiated a conversation with ChatGPT for each assignment. For one exercise at a time, we copied the prompt into ChatGPT’s Web-based interface. To assess functional correctness, we copied ChatGPT’s generated code into CodeBuddy. If the code did not pass all the tests, we continued the conversation with ChatGPT. In these interactions, we took the stance of a naive programmer who wishes to obtain functional code but is not necessarily able to provide detailed feedback about the code itself. We allowed ChatGPT a maximum of ten attempts per exercise. As we interacted with ChatGPT, we used the spreadsheet to record the dates of the interactions, ChatGPT’s generated code (its final attempt), the number of passed tests, the number of attempts made by ChatGPT, and comments describing our interactions. When our interactions with ChatGPT suggested that a prompt lacked clarity, we slightly modified the prompt and updated the spreadsheet accordingly.
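The evaluation procedure can be summarized as a loop with a fixed attempt budget. The sketch below is illustrative only: in the study, the prompts were pasted by hand into ChatGPT's Web interface and the feedback was written by a person, so `generate`, `passes_tests`, and `feedback_for` are hypothetical stand-ins for those manual steps.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 10  # the study allowed at most ten attempts per exercise


def evaluate_exercise(
    prompt: str,
    generate: Callable[[str], str],      # stand-in for the model's reply
    passes_tests: Callable[[str], bool],  # stand-in for the automated grader
    feedback_for: Callable[[str], str],   # stand-in for the human's feedback
) -> Optional[int]:
    """Return the number of attempts needed, or None if the exercise stays unsolved."""
    message = prompt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = generate(message)
        if passes_tests(code):
            return attempt
        # Continue the conversation with natural-language feedback, never code.
        message = feedback_for(code)
    return None


# Toy run: the first reply has a formatting problem, the second one passes.
replies = iter(["print('Total: 5')", "print(5)"])
attempts = evaluate_exercise(
    prompt="Print the sum of 2 and 3.",
    generate=lambda message: next(replies),
    passes_tests=lambda code: code == "print(5)",
    feedback_for=lambda code: "The output does not match the expected output.",
)
print(attempts)  # 2
```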

After completing our evaluation of ChatGPT, Google released Bard, a Web-based application that uses an LLM to generate text and code. We tested 66 of the 184 exercises using Bard (version: 2023.06.07). These exercises consisted of the first and last exercises in each assignment and all exercises in the following assignments: “01—Declaring and Converting Variables,” “07—Problem Solving,” “13—Advanced Functions & Additional Practice,” and “19—Additional Practice.”

When executing the Python code, we used version 3.8 of Python. To generate the manuscript figures and perform statistical analyses, we used the R statistical software (version 4.0.2) and the tidyverse packages (version 1.3.2) [57,58]. All statistical tests were two sided.

Results

After our filtering steps, 184 Python programming exercises were available for testing. ChatGPT successfully solved 139 (75.5%) of the exercises on its first attempt. When it was unsuccessful on the first attempt, we engaged in a dialog with ChatGPT, allowing up to 10 interactions. Table 1 summarizes these interactions. In 26 instances, we indicated that the code had resulted in a runtime error, and we provided the error message to ChatGPT. More commonly, the generated code’s output did not match the expected output, either due to a logic error (n = 44) or a simple formatting issue (n = 17). In many cases (typically after three or more interactions for a given exercise), we restated the original prompt (n = 30), provided a modified version of the prompt (n = 11), or simply asked the model to try again (n = 12). Rarely (n = 7), we provided a suggestion about the code itself (e.g., to change a function name or to use a different parameter). However, we never included code in our feedback.

Table 1. Summary of interactions between the human user and ChatGPT.

For the exercises that ChatGPT failed to solve on the first attempt, we categorized the subsequent interactions between the human user and the model. This table indicates the frequency of each interaction type.

Interaction type | Count
Indicated logic error | 44
Restated original prompt | 30
Described runtime error | 26
Described simple formatting issue | 21
Requested simply to try again | 12
Provided modified prompt | 11
Suggested simple code tweak | 7

Of the 45 exercises that did not pass on the first attempt, ChatGPT solved 27 within 1 or 2 subsequent attempts. Within 7 or fewer attempts total, ChatGPT solved 179 (97.3%) of the exercises (Fig 2). As summarized in Table 2, the five unsolved exercises are delivered in the middle or end of the course; each requires students to combine multiple types of programming skills. One of these is the course’s final exercise; the instructors’ solution uses 61 lines of code, nearly twice as many as any other solution. For the remaining exercises that failed, ChatGPT came close to passing the tests. Its solutions resulted in logic errors or runtime errors or produced outputs that did not match the expected outputs exactly.

Fig 2. Number of ChatGPT iterations per exercise.

For each exercise prompt, we gave ChatGPT up to 10 attempts at generating a code solution that successfully passed the tests. The counts above each bar represent the number of exercises that required a particular number of attempts.

Table 2. Summary of exercises that ChatGPT did not solve.

ChatGPT failed to solve 5 of the exercises within 10 attempts. This table summarizes characteristics of these exercises and provides a brief summary of complications that ChatGPT faced when attempting to solve them.

Assignment | Exercise | Prompt summary | Skills emphasized | Failure summary
09—Strings | 05—Is CpG island | Count the proportion of a DNA sequence that consists of ‘CG’ nucleotide pairs | Manipulating strings; performing mathematical calculations | Failed on one test representing an edge case (exactly 10% of the sequence consisted of CG pairs)
14—Regular Expressions 1 | 06—Find words that start with a vowel | Find words in a biological text that begin with vowels | Writing regular expressions; reading files | Trouble dealing with extra spaces or punctuation marks
15—Regular Expressions 2 | 06—Switch column order | Switch the first two columns in a tab-delimited text file | Writing regular expressions or using lists; reading files; writing files | Runtime errors; logic errors dealing with the + or - portion of blood types
19—Additional Practice | 08—Make inducible promoter—Part B | Identify in-frame start and stop codons in an mRNA sequence | Manipulating strings; reading files; using complex iteration logic | Failed to follow the instructions to look for the start codon in frame
19—Additional Practice | 10—Make inducible promoter—Part D | Identify restriction-enzyme binding sites upstream of a gene in a DNA sequence | Reading files; writing regular expressions; using complex iteration logic; using lists; using dictionaries; using conditionals | Runtime errors, various logic errors, failure to fully comprehend the prompt
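To show why a boundary value such as the 10% case in the first row can trip up a solution, here is a small sketch. The exercise's actual wording is not reproduced in this article, so the way the proportion is defined, the 10% threshold, and the `inclusive` flag are all assumptions made for illustration; the function names are hypothetical.

```python
def cg_pair_count(sequence: str) -> int:
    """Count occurrences of the dinucleotide 'CG' in a DNA sequence."""
    return sequence.upper().count("CG")


def is_cpg_island(sequence: str, percent_threshold: int = 10, inclusive: bool = True) -> bool:
    """Decide whether the share of nucleotides in 'CG' pairs reaches the threshold.

    Assumption: the proportion is (2 * number of CG pairs) / sequence length.
    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    if not sequence:
        return False
    nucleotides_in_pairs = 2 * cg_pair_count(sequence)
    left = nucleotides_in_pairs * 100
    right = percent_threshold * len(sequence)
    return left >= right if inclusive else left > right


# Edge case: 2 of 20 nucleotides (exactly 10%) belong to a CG pair.
edge_case = "CG" + "A" * 18
print(is_cpg_island(edge_case, inclusive=True))   # True
print(is_cpg_island(edge_case, inclusive=False))  # False
```

The two calls differ only in whether the comparison includes the boundary, which is the kind of detail a test at exactly 10% would expose.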

We used statistics to understand more about the scenarios in which ChatGPT either succeeded or failed. First, we used the length of the instructors’ solutions as an indicator for difficulty level. After removing comments (inline descriptions of how the code works) and blank lines, we compared the number of lines of code between the exercises that ChatGPT solved and those that it did not (Fig 3). The median for passing solutions was 6, and the median for non-passing solutions was 7; this difference was not statistically significant (Mann-Whitney U p-value: 0.2836). The lengths of the instructors’ solutions were significantly (positively) correlated with the lengths of ChatGPT’s solutions (Fig 4), both for the number of characters (Spearman’s rho = 0.89; p-value < 0.001) and the number of lines (Spearman’s rho = 0.83; p-value < 0.001). Another indicator of difficulty level is the exercise-prompt length. For passing solutions, the median was 2019 characters, while the median was 9115 for non-passing solutions. Although this difference was not statistically significant (Mann-Whitney U p-value: 0.1021), it is consistent with a recent study of computer science exercises [32].
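The line-count measure can be approximated with a few lines of Python. This is a sketch rather than the authors' analysis code (which was written in R); `count_code_lines` is a hypothetical helper, and it assumes that only whole-line `#` comments and blank lines are excluded.

```python
def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor whole-line '#' comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


example_solution = '''# Read a DNA sequence and report its length
sequence = "ACGT"

# Print the result
print(len(sequence))  # a trailing comment does not make this a comment line
'''

print(count_code_lines(example_solution))  # 2
```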

Fig 3. Lines of Python code per instructor solution.

Course instructors provided a solution for each exercise. This plot illustrates the number of lines of code for each solution, after removing comment lines.

Fig 4. Comparison of code-solution lengths for instructor solutions versus ChatGPT solutions.

This illustrates the relationship between A) the number of characters or B) the number of lines of code, for each exercise, after removing comment lines. The dashed, red line is the identity line.

The number of attempts provides additional insight into ChatGPT’s capabilities but should be cautiously interpreted because ChatGPT exhibits stochasticity. Whether ChatGPT provides a correct answer on the first or a later attempt, eventual success shows that its probabilistic model can aid users. However, a smaller number of attempts might suggest an ability to formulate a valid response more readily, thus requiring less time by the user. The number of attempts per exercise was significantly correlated with the length of the instructors’ solution (rho = 0.234; p = 2.2e-16) and the length of the prompt (rho = 0.31; p = 2.2e-16). These correlations held, whether or not we considered the five exercises that ChatGPT failed to solve.
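For readers unfamiliar with rank correlation, the following sketch computes Spearman's rho as the Pearson correlation of average ranks. It uses only the standard library and made-up numbers; the study's own values came from R, and the toy data below are not taken from the study.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (variance_x * variance_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: solution length (lines) and attempts needed for five exercises.
solution_lines = [2, 4, 6, 9, 15]
attempts_needed = [1, 1, 2, 1, 4]
print(round(spearman_rho(solution_lines, attempts_needed), 3))
```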

Of the 184 prompts, 98 (53.3%) were framed in a biological context. Of the five exercises that ChatGPT did not solve, four were framed in a biological context (Fisher’s exact test p-value = 0.37). The median length (characters) of biology-oriented prompts was 3203, whereas the median was 1437 for the remaining prompts (Mann-Whitney U p-value = 1.6e-09). In the course, we frequently use biological data (e.g., genome sequences, medical observations, narrative text) to teach analysis skills and make the exercises more authentic. We included these data so that ChatGPT could evaluate the files’ structure. For 24 exercises, the prompt size exceeded the maximum allowed by ChatGPT. After we truncated the data to the first few lines, ChatGPT was successful at solving all of these exercises. On four other occasions, we shortened parts of the prompt as we interacted with ChatGPT to attempt to provide clarity. For example, we shortened the descriptions of how the code would be tested. ChatGPT eventually solved two of these four exercises.
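The Fisher's exact test reported above can be re-derived from the counts in the text. The sketch below is a plain-Python reconstruction, not the authors' R code. The solved-exercise counts (94 biology-oriented, 85 not) are inferred here by subtracting the unsolved exercises from the stated totals (98 and 86).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one. Integer counts keep the
    comparison exact.
    """
    row1 = a + b
    col1 = a + c
    col2 = b + d
    total_ways = comb(col1 + col2, row1)

    def ways(k: int) -> int:
        return comb(col1, k) * comb(col2, row1 - k)

    observed = ways(a)
    lowest = max(0, row1 - col2)
    highest = min(row1, col1)
    extreme = sum(ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed)
    return extreme / total_ways


# Rows: unsolved, solved. Columns: biology-oriented, not biology-oriented.
p_value = fisher_exact_two_sided(4, 1, 94, 85)
print(round(p_value, 2))  # 0.37
```

The result agrees with the reported p-value of 0.37, which suggests that the unsolved exercises were not disproportionately biology oriented.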

We note additional challenges that ChatGPT faced when interpreting the programming prompts. On 17 exercises, ChatGPT used correct logic but produced outputs that were different from the expected outputs (for example, “Number of worms in the last box: 5” instead of “5”). Eventually, ChatGPT solved all of these exercises. On 25 exercises, ChatGPT generated code that produced logic errors; it eventually solved 20 of these exercises. On 10 exercises, ChatGPT generated code that produced runtime errors (exceptions); it eventually solved 8 of these exercises. On two exercises, ChatGPT generated passing code that did not directly address the prompt. For example, in one case, the prompt called for using a regular expression (text-based pattern matching), but ChatGPT used iteration logic instead; we marked these exercises as passing because the automatic grader did not verify which type of logic they used. On five occasions, we noted parts of the prompt that may have been ambiguous. We clarified these prompts; subsequently, ChatGPT solved four of these exercises.

In using ChatGPT to solve these programming problems, we observed several practical issues that may impact the value offered by ChatGPT to researchers and students. For 13 exercises that did not pass on the first attempt, we asked ChatGPT to try an alternative approach, and/or we re-delivered the original prompt. In these cases, we sought to take advantage of its stochastic nature, perhaps resulting in code that used a considerably different strategy. ChatGPT eventually passed 11 of these 13 exercises. For 41 (22.2%) exercises in total, ChatGPT used at least one programming technique that would have been unfamiliar to most students in the course. Many of these techniques are never taught in the course, whereas others are introduced in later units. Finally, following an unsuccessful first attempt for six exercises, ChatGPT generated code that did not address the original prompt. Conceivably, the model “forgot” earlier parts of the conversation or was “distracted” by subsequent inputs. Eventually, it solved all of these exercises.

Although the focus of this study was not to compare LLMs, we wished to approximate how well our findings would generalize to another LLM. We evaluated Google Bard’s code-generation ability for 66 of the Python exercises. Bard solved 33 (50.0%) within one attempt and 45 (68.2%) within 10 attempts (Fig 5).

Fig 5. Number of Bard iterations per exercise.

For 66 exercise prompts, we gave Google Bard up to 10 attempts at generating a code solution that successfully passed the tests. The counts above each bar represent the number of exercises that required a particular number of attempts.

Discussion

These findings demonstrate that modern LLMs can solve many basic Python programming tasks, often in a biological context. On educational assessments requiring basic programming skills, students might seek help from LLMs when it is available. Additionally, in some settings, researchers might be able to rely on LLMs’ abilities to translate natural-language descriptions to code. Researchers have already begun to explore this capability in practice [59]. We anticipate that as the models evolve, students and researchers will increasingly author programming prompts in addition to code.

Anecdotally, we have found that authoring programming prompts is not always easy. During our evaluations, communicating with the model was cognitively taxing at times. In addition, these conversations were sometimes awkward. Although ChatGPT can retain a memory of previous interactions, its default response was to provide a solution; in many cases, it might have been more helpful for ChatGPT to request additional information or clarification regarding the problem. It was often more effective for us to restate the original prompt than engage in a back-and-forth dialog. On a positive note, ChatGPT was exceptionally effective at determining which parts of a given prompt were most informative; for example, it seemed to identify relevant aspects of biological context and ignore extraneous details. Currently, LLMs do not execute code; thus, they often cannot predict the output of code [60]. This is one area in which human feedback remains critical.

For 60+ years, researchers have been working to automate program synthesis [61,62]. Recent efforts have focused on training neural networks on large code repositories [33,34,63,64]. Our results show that ChatGPT represents a considerable advance compared to prior models. Chen, et al. [33] evaluated Codex’s ability to solve short- to medium-length programming exercises (median solution = 5.5 lines of code). When delivering the prompts, they used “docstrings” (structured descriptions of functions). Codex was successful for 28.8% of these exercises in a single attempt; when making 100 attempts per exercise, it solved 77.5% of the exercises [33]. In an additional study, Austin, et al. used a different set of exercises that were either mathematical or focused on core programming skills [60]. In contrast to Chen, et al., they used natural-language prompts (one or a few sentences). The solutions had a median length of 5 lines. Using various LLMs, they solved as many as 83.8% of the mathematical problems and 60% of the remaining problems (within 100 attempts). For a subset of the problems, they provided human-language feedback to the models (up to four interactions); maximum accuracy was 65%. In an educational context, Finnie-Ansley, et al. showed that Codex could solve 82.6% of programming exercises from an introductory computer-science course within 10 attempts and that the model would have ranked among the top quartile of students in the course [31]. Finally, in a similar approach to ours, Denny, et al. used Copilot (a development environment plug-in powered by OpenAI’s Codex model) to solve 166 programming exercises designed for novice computer science students [65]. For problems that initially failed, they observed similar improvements in the model’s performance through natural-language modifications to the prompts. However, only 80% of the problems were ultimately solved. 
Given that the problems they analyzed were also designed for novices, the superior performance we observed may be due to improvements in the models themselves in the intervening six months.

Aside from the use of ChatGPT, our work differs from prior evaluations in scope and context. Previous studies used exercises that evaluated the models’ abilities to solve mathematical problems or to use core programming skills like processing lists, processing strings, or evaluating integer sequences. Our exercises required similar techniques and higher-level tasks like parsing data files, writing data files, creating graphics, and using external Python packages. Furthermore, more than half of our exercises were framed in a biological context. LLMs may be most helpful for routine tasks that appear frequently in training sets and only need to be modified for a particular purpose; however, our results show that LLMs can be used in new and diverse contexts as well.

Our findings have important implications for education. Unless LLMs demonstrate an ability to replace all human programming efforts, it will remain necessary for students (and others) to gain programming skills [66]. In our course, preventing students from using LLMs on formative assessments (homework) would be impossible. However, summative assessments (exams) are the primary way we determine students’ final grades; these assessments are invigilated, and students cannot access the Internet. Therefore, we retain confidence in the validity of grades determined under secure assessment conditions. Extensive practice is a critical part of learning to write code [67,68]. Thus, if students rely on LLMs to generate answers to formative assessments without first devising their own solutions, they may be more likely to perform poorly on summative assessments. Indeed, over-reliance by novices was a key risk identified by Chen et al. when releasing the Codex model [33]. With the ease of use and wide availability of tools like Copilot and ChatGPT, novices may quickly learn to rely on auto-suggested solutions without thinking about the computational steps involved—or reading problem statements carefully. Furthermore, if students copy and paste code without understanding it—as has been observed for an online forum [69]—they may underperform on summative assessments. One way that instructors could counter this behavior is to generate student-specific questions about their code. Lehtinen, et al. used an LLM to generate multiple-choice questions about code that students had submitted in an introductory-programming course [70]. Students who struggled to answer these questions were more likely to perform poorly or drop the course. LLMs also provide opportunities to make learning processes more efficient. For example, to aid students on programming exercises, the instructor could allow access to an LLM, which could act as an intelligent tutor. 
When a student struggled to complete a given exercise, the tutor could ingest the student’s code and the exercise requirements and offer suggestions [71–73]. Doing so may reduce the need for instructors or teaching assistants to provide help. Finally, instructors might be able to use LLMs when creating new exercises to evaluate whether their prompts are clear.

It remains essential to have a human in the loop to evaluate the outputs of LLMs. Students and researchers must be competent at code comprehension and code evaluation. LLMs often produce code that does not meet the stated requirements; additionally, edge cases may not be specified as part of prompts. As a (simple) validation, we compared ChatGPT’s output against the expected outputs that we had defined before beginning our study. This approach aligns with the educational context that we considered (an introductory course). However, in subsequent courses and the “real world,” other types of validation would be necessary. Educators may need to shift pedagogical practice toward ensuring that students can understand code that has been generated, evaluating whether generated code meets specifications, debugging generated code, adapting code to different library versions, etc. Fig 6 provides recommendations on how to use LLMs effectively in an educational or research context.

Fig 6. Recommendations for using LLMs to generate code.

We deliberately chose to allow ChatGPT up to 10 attempts to solve each exercise. Firstly, this criterion aligns with our pedagogical approach. The exercises we tested are formative. Accordingly, failing, receiving feedback, and re-attempting are part of the learning process [67,68,74]. Secondly, allowing multiple attempts reflects how biologists could use LLMs in research. If LLM-generated code does not function correctly on the first attempt, the researcher could ask the model to revise or generate a new solution. Thirdly, allowing multiple attempts per exercise is consistent with what others have reported [31,60].

Our study has several limitations. We applied one particular version of one LLM to all 184 exercises. We applied a second LLM (Google Bard) to a subset of the exercises. We do not know how our findings would generalize to other models or versions; however, the performance of LLM-based code generators will likely continue to improve as model sizes increase. The programming exercises we evaluated do not necessarily represent skills that would be taught in other introductory bioinformatics courses or used broadly in bioinformatics research. We used the Python programming language; our findings might not generalize to other languages. Future studies can shed additional light on how LLMs might be helpful for bioinformatics education and research.

Another limitation is that our evaluation process was subjective. When the initially generated solutions did not pass, the human user judged which types of feedback would be most helpful in each interaction. Other users would have interacted differently with the models. Furthermore, the human user was not a student but an instructor with 25 years of programming experience and 15 years of Python experience. In our attempt to mimic novice programmers, we rarely suggested tweaks to the code (Table 1); perhaps students would have described problematic aspects of generated code more (or less) frequently. Additionally, students might have provided more (or less) context to the models about runtime errors that occurred.

In this study, we provide evidence that dialog-based LLMs, such as ChatGPT, can aid in solving basic programming exercises, with or without biological relevance. However, despite generally excellent performance, much remains to be learned about the extent to which these models can replace human programming efforts. In an authentic research setting, where an auto-grader cannot provide instant feedback on the correctness of model-generated code, there is a risk that relying on the models' outputs may produce erroneous results. Nevertheless, our findings have important implications for educators and researchers who seek to incorporate programming skills into their work. With the help of machine-learning models, instructors may be able to provide more personalized and efficient feedback to students, and researchers might be able to accelerate their work.

Acknowledgments

Brandon Pickett, Justin Miller, Corinne Sexton, Ashlyn Powell, and Eric Upton-Rowley contributed to programming exercises that were used in this study.

Data Availability

We created a GitHub repository (https://github.com/srp33/ChatGPT_Bioinformatics) that includes the data we collected in this study. The instructors’ solutions have been removed so that students cannot see the solutions for exercises that ChatGPT did not solve. Due to cell-size limitations in Microsoft Excel, we exported the spreadsheets to HTML. Researchers who wish to reuse the data can import the HTML files using R code and then export them to other formats; an example is provided in our GitHub repository. The repository also includes a record of our conversations with ChatGPT. The files are in Markdown format. The researcher’s portions of each conversation are prefixed with **Human:**. ChatGPT’s portions of each conversation are prefixed with **Assistant:**. The code that we used to analyze the data and generate figures is available in our GitHub repository.
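For readers who wish to process the conversation records programmatically, the prefix convention described above suggests a straightforward way to split a transcript into turns. The following is a minimal sketch that assumes each prefix appears at the start of a line; the exact layout of the files in the repository has not been verified here, and the function name `split_turns` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches a speaker prefix at the start of a line, e.g. "**Human:**".
# Assumption: prefixes always begin a line in the Markdown files.
TURN_PREFIX = re.compile(r"^\*\*(Human|Assistant):\*\*\s*", re.MULTILINE)


def split_turns(markdown_text: str) -> List[Tuple[str, str]]:
    """Split a transcript into (speaker, message) pairs in order of appearance."""
    matches = list(TURN_PREFIX.finditer(markdown_text))
    turns = []
    for i, match in enumerate(matches):
        # A turn runs from the end of its prefix to the start of the next prefix.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        turns.append((match.group(1), markdown_text[match.end():end].strip()))
    return turns
```

Counting the `Human` turns in a parsed conversation gives one rough way to estimate how many prompts were needed for an exercise, provided each conversation file covers a single exercise (another assumption).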

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Beynon RJ. CABIOS editorial. Bioinformatics. 1985. Jan;1(1):1–1.
  • 2. Barone L, Williams J, Micklos D. Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLOS Computational Biology. 2017. Oct;13(10):e1005755. doi: 10.1371/journal.pcbi.1005755
  • 3. Guzdial M. Teaching computing for everyone. J Comput Sci Coll. 2006. Apr;21(4):6.
  • 4. Baker M. Scientific computing: Code alert. Nature. 2017;541(7638):563–5.
  • 5. Guzdial M. Computational thinking and using programming to learn. In: Learner-centered design of computing education: Research on computing for everyone. Cham: Springer International Publishing; 2016. p. 37–51.
  • 6. Zatz MM. Bioinformatics training in the USA. Brief Bioinform. 2002;3(4):353–60. doi: 10.1093/bib/3.4.353
  • 7. Kulkarni-Kale U, Sawant S, Chavan V. Bioinformatics education in India. Brief Bioinform. 2010;11(6):616–25. doi: 10.1093/bib/bbq027
  • 8. Welch L, Lewitter F, Schwartz R, Brooksbank C, Radivojac P, Gaeta B, et al. Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies. PLOS Computational Biology. 2014. Mar;10(3):e1003496. doi: 10.1371/journal.pcbi.1003496
  • 9. Williams JJ, Teal TK. A vision for collaborative training infrastructure for bioinformatics. Ann N Y Acad Sci. 2017;1387(1):54–60. doi: 10.1111/nyas.13207
  • 10. Mulder N, Schwartz R, Brazas MD, Brooksbank C, Gaeta B, Morgan SL, et al. The development and application of bioinformatics core competencies to improve bioinformatics training and education. PLOS Computational Biology. 2018. Feb;14(2):e1005772. doi: 10.1371/journal.pcbi.1005772
  • 11. Shaffer JG, Mather FJ, Wele M, Li J, Tangara CO, Kassogue Y, et al. Expanding Research Capacity in Sub-Saharan Africa Through Informatics, Bioinformatics, and Data Science Training Programs in Mali. Front Genet. 2019;10.
  • 12. Attwood TK, Blackford S, Brazas MD, Davies A, Schneider MV. A global perspective on evolving bioinformatics and data science training needs. Brief Bioinform. 2019;20(2):398–404. doi: 10.1093/bib/bbx100
  • 13. Sayres MAW, Hauser C, Sierk M, Robic S, Rosenwald AG, Smith TM, et al. Bioinformatics core competencies for undergraduate life sciences education. PLOS ONE. 2018. Jun;13(6):e0196878. doi: 10.1371/journal.pone.0196878
  • 14. Prabhu P, Jablin TB, Raman A, Zhang Y, Huang J, Kim H, et al. A survey of the practice of computational science. In: State Pract Rep. New York, NY, USA: Association for Computing Machinery; 2011. p. 1–2. (SC ’11).
  • 15. Ekmekci B, McAnany CE, Mura C. An Introduction to Programming for Bioscientists: A Python-Based Primer. PLOS Computational Biology. 2016. Jun;12(6):e1004867. doi: 10.1371/journal.pcbi.1004867
  • 16. Wickham H. Tidy Data. J Stat Softw. 2014;59(10).
  • 17. Dasu T, Johnson T. Exploratory data mining and data cleaning. John Wiley & Sons; 2003.
  • 18. Simon, Mason R, Crick T, Davenport JH, Murphy E. Language Choice in Introductory Programming Courses at Australasian and UK Universities. In: Proc 49th ACM Tech Symp Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2018. p. 852–7. (SIGCSE ’18).
  • 19. Fourment M, Gillings MR. A comparison of common programming languages used in bioinformatics. BMC Bioinformatics. 2008. Feb;9(1):82. doi: 10.1186/1471-2105-9-82
  • 20. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. doi: 10.1093/bioinformatics/btp163
  • 21. McKinney W. Data Structures for Statistical Computing in Python. In: Proc 9th Python Sci Conf. 2010. p. 6.
  • 22. Walt S van der, Colbert SC, Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering. 2011. Mar;13(2):22–30.
  • 23. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  • 24. Géron A. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 1st ed. O’Reilly Media, Inc.; 2017.
  • 25. Perkins DN, Hancock C, Hobbs R, Martin F, Simmons R. Conditions of learning in novice programmers. J Educ Comput Res. 1986;2(1):37–55.
  • 26. Kelleher C, Pausch R. Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers. ACM Comput Surv. 2005. Jun;37(2):83–137.
  • 27. Lahtinen E, Ala-Mutka K, Järvinen HM. A study of the difficulties of novice programmers. SIGCSE Bull. 2005. Jun;37(3):14–8.
  • 28. Luxton-Reilly A, Albluwi I, Becker BA, Giannakos M, Kumar AN, Ott L, et al. Introductory programming: A systematic literature review. In: Proc Companion 23rd Annu ACM Conf Innov Technol Comput Sci Educ. 2018. p. 55–106.
  • 29. Smith R, Rixner S. The Error Landscape: Characterizing the Mistakes of Novice Programmers. In: Proc 50th ACM Tech Symp Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2019. p. 538–44. (SIGCSE ’19).
  • 30. Becker BA, Denny P, Pettit R, Bouchard D, Bouvier DJ, Harrington B, et al. Compiler Error Messages Considered Unhelpful: The Landscape of Text-Based Programming Error Message Research. In: Proc Work Group Rep Innov Technol Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2019. p. 177–210. (ITiCSE-WGR ’19).
  • 31. Finnie-Ansley J, Denny P, Becker BA, Luxton-Reilly A, Prather J. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In: Proc 24th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2022. p. 10–9. (ACE ’22).
  • 32. Finnie-Ansley J, Denny P, Luxton-Reilly A, Santos EA, Prather J, Becker BA. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI’s Codex on CS2 Programming Exercises. In: Proc 25th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2023. p. 97–104. (ACE ’23).
  • 33. Chen M, Tworek J, Jun H, Yuan Q, Pinto HP de O, Kaplan J, et al. Evaluating Large Language Models Trained on Code [Internet]. arXiv; 2021. [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2107.03374
  • 34. Li Y, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, et al. Competition-level code generation with AlphaCode. Science. 2022. Dec;378(6624):1092–7. doi: 10.1126/science.abq1158
  • 35. ChatGPT: Optimizing Language Models for Dialogue. OpenAI. [Cited 2023 September 19]. Available from https://openai.com/blog/chatgpt.
  • 36. Hendler J. Understanding the limits of AI coding. Science. 2023;379(6632):548–8. doi: 10.1126/science.adg4246
  • 37. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: Five priorities for research. Nature. 2023. Feb;614(7947):224–6. doi: 10.1038/d41586-023-00288-7
  • 38. Thorp HH. ChatGPT is fun, but not an author. Science. 2023. Jan;379(6630):313–3. doi: 10.1126/science.adg7879
  • 39. Kung TH, Cheatham M, Medenilla A, Sillos C, Leon LD, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023. Feb;2(2):e0000198. doi: 10.1371/journal.pdig.0000198
  • 40. Jiao W, Wang W, Huang J, Wang X, Tu Z. Is ChatGPT A Good Translator? A Preliminary Study. 2023. Jan; Available from https://arxiv.org/abs/2301.08745.
  • 41. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: Ethical challenges for medical publishing. The Lancet Digital Health. 2023. Feb;0(0). doi: 10.1016/S2589-7500(23)00019-5
  • 42. Jalil S, Rafi S, LaToza TD, Moran K, Lam W. ChatGPT and Software Testing Education: Promises & Perils [Internet]. arXiv; 2023. [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2302.03287
  • 43. Elias J. Google is asking employees to test potential ChatGPT competitors, including a chatbot called ‘Apprentice Bard’. CNBC. https://www.cnbc.com/2023/01/31/google-testing-chatgpt-like-chatbot-apprentice-bard-with-employees.html; 2023.
  • 44. Hu K. ChatGPT sets record for fastest-growing user base—analyst note. Reuters. 2023. Feb; Available from https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01.
  • 45. Arel-Bundock V. Rdatasets: A collection of datasets originally distributed in various R packages. 2023.
  • 46. DiCiccio TJ, Efron B. Bootstrap confidence intervals. Stat Sci. 1996;11(3):189–228.
  • 47. Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345(6202):1369–72.
  • 48. Consortium U. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–15. doi: 10.1093/nar/gky1049
  • 49. Random Name Generator Generated full names. Random Lists. Available from https://www.randomlists.com/random-names.
  • 50. Ellis MJ, Gillette M, Carr SA, Paulovich AG, Smith RD, Rodland KK, et al. Connecting genomic alterations to cancer biology with proteomics: The NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 2013;3(10):1108–12. doi: 10.1158/2159-8290.CD-13-0219
  • 51. Edwards NJ, Oberti M, Thangudu RR, Cai S, McGarvey PB, Jacob S, et al. The CPTAC data portal: A resource for cancer proteomics research. J Proteome Res. 2015;14(6):2707–13. doi: 10.1021/pr501254j
  • 52. Lindgren CM, Adams DW, Kimball B, Boekweg H, Tayler S, Pugh SL, et al. Simplified and unified access to cancer proteogenomic data. J Proteome Res. 2021;20(4):1902–10. doi: 10.1021/acs.jproteome.0c00919
  • 53. Huang LS, Mathew B, Li H, Zhao Y, Ma SF, Noth I, et al. The mitochondrial cardiolipin remodeling enzyme lysocardiolipin acyltransferase is a novel target in pulmonary fibrosis. Am J Respir Crit Care Med. 2014;189(11):1402–15. doi: 10.1164/rccm.201310-1917OC
  • 54. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: Archive for functional genomics data sets - update. Nucleic Acids Research. 2012. Nov;41(D1):D991–5. doi: 10.1093/nar/gks1193
  • 55. White E. Programming for Biologists. Available from http://www.programmingforbiologists.org.
  • 56. Hunt J, Hunt J. Working with excel files. Adv Guide Python 3 Program. 2019;249–55.
  • 57. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
  • 58. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4(43):1686.
  • 59. Owens B. How Nature readers are using ChatGPT. Nature. 2023. Feb; doi: 10.1038/d41586-023-00500-8
  • 60. Austin J, Odena A, Nye M, Bosma M, Michalewski H, Dohan D, et al. Program Synthesis with Large Language Models [Internet]. arXiv; 2021. [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2108.07732
  • 61. Simon HA. Experiments with a Heuristic Compiler. J ACM. 1963. Oct;10(4):493–506.
  • 62. Manna Z, Waldinger RJ. Toward automatic program synthesis. Commun ACM. 1971. Mar;14(3):151–65.
  • 63. Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, et al. Codebert: A pre-trained model for programming and natural languages. ArXiv Prepr ArXiv200208155 [Internet]. 2020; Available from: https://arxiv.org/abs/2002.08155
  • 64. Clement CB, Drain D, Timcheck J, Svyatkovskiy A, Sundaresan N. PyMT5: Multi-mode translation of natural language and Python code with transformers. ArXiv Prepr ArXiv201003150 [Internet]. 2020; Available from: https://arxiv.org/abs/2010.03150
  • 65. Denny P, Kumar V, Giacaman N. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In: Proc 54th ACM Tech Symp Comput Sci Educ V 1. New York, NY, USA: Association for Computing Machinery; 2023. p. 1136–42. (SIGCSE 2023).
  • 66. Yellin DM. The Premature Obituary of Programming. Commun ACM. 2023. Jan;66(2):41–4.
  • 67. Robins AV, Margulieux LE, Morrison BB. Cognitive sciences for computing education. Camb Handb Comput Educ Res. 2019;231–75.
  • 68. Denny P, Luxton-Reilly A, Craig M, Petersen A. Improving complex task performance using a sequence of simple practice tasks. In: Proc 23rd Annu ACM Conf Innov Technol Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2018. p. 4–9. (ITiCSE 2018).
  • 69. López-Nores M, Blanco-Fernández Y, Bravo-Torres JF, Pazos-Arias JJ, Gil-Solla A, Ramos-Cabrer M. Experiences from placing Stack Overflow at the core of an intermediate programming course. Comput Appl Eng Educ. 2019;27(3):698–707.
  • 70. Lehtinen T, Haaranen L, Leinonen J. Automated Questionnaires About Students’ JavaScript Programs: Towards Gauging Novice Programming Processes. In: Proc 25th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2023. p. 49–58. (ACE ’23).
  • 71. Crow T, Luxton-Reilly A, Wuensche B. Intelligent tutoring systems for programming education: A systematic review. In: Proc 20th Australas Comput Educ Conf. 2018. p. 53–62.
  • 72. MacNeil S, Tran A, Hellas A, Kim J, Sarsa S, Denny P, et al. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In: Proc 54th ACM Tech Symp Comput Sci Educ V 1. New York, NY, USA: Association for Computing Machinery; 2023. p. 931–7. (SIGCSE 2023).
  • 73. Sarsa S, Denny P, Hellas A, Leinonen J. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In: Proc 2022 ACM Conf Int Comput Educ Res—Vol 1. New York, NY, USA: Association for Computing Machinery; 2022. p. 27–43. (ICER ’22; vol. 1).
  • 74. Shute VJ. Focus on formative feedback. Rev Educ Res. 2008;78(1):153–89.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011511.r001

Decision Letter 0

Francis Ouellette

16 Jun 2023

Dear Dr. Piccolo,

Thank you very much for submitting your manuscript "Many bioinformatics programming tasks can be automated with ChatGPT" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

It will be specifically important to address the comments from all three reviewers, including reviewer #1.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

@bffo

BF Francis Ouellette

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This manuscript argues that ChatGPT can and should be used in research settings in bioinformatics, and, to a lesser extent, in teaching coding. The work is based on the experience of the authors in teaching an introductory coding class for bioinformatics, using code generated by querying ChatGPT with problems prepared for that class, and investigating the solutions provided via an automated grader used in educational settings.

The exercises the authors provide seem to be very basic introductory programming tasks for teaching basic coding concepts. String handling, control structures, loops, lists, dictionaries, etc. Some are bioinformatics-adapted and the final tasks seem to be more advanced with use of Pandas, CLI parsing and some visualization. However, those more complex tasks, by the authors own data, are the tasks in which ChatGPT failed. Therefore, I do not see evidence that ChatGPT I believe that the title, “Many bioinformatics programming tasks can be automated with ChatGPT”, is misleading, and should perhaps be changed to “ChatGPT receives a C grade on introductory programming tasks”.

Having read this manuscript, I believe it clearly demonstrates why using ChatGPT as a programming aide for scientists is harmful, and why teaching programming with ChatGPT is probably even more harmful. To do so I will start by quoting the first sentence in the Discussion:

“These findings are remarkable and signal a new era for life scientists. For many basic- to moderate-level programming tasks, researchers no longer need to write code from scratch.”

First, I will address this statement that has to do with the suggested use of ChatGPT in scientific coding. I see this statement as highly problematic. First: the authors have shown that for 25% of the tasks, ChatGPT requires several iterations until it generates a correct answer, if ever. Students are expected to have a higher success rate when submitting their exercise. One may argue that the back-and-forthing the authors did with ChatGPT to improve the code might be equivalent to the debugging process prior to grade submission, but it isn’t.

Second, the authors seem to advocate the use of ChatGPT to simplify simple code writing, pretty much the same way a calculator is used to simplify arithmetic: a tool to avoid basic errors, and to get on with the more important tasks of using the code for analysis rather than investing time in writing and debugging the code. Except that a calculator is always right (2+3=5, always) whereas with ChatGPT the answer is 25% wrong as it might vary between 2+3=4 in one iteration and 2+3=6 in another, with various other plausible but wrong results. Relying on ChatGPT-generated code would be like relying on a calculator that generated only 75% of the answers correctly in the first try. That scientist will not have the capability to debug the code. A coding-competent scientist may be tempted to use ChatGPT to write the code, and gloss over the details of the result. If one sees ChatGPT as a standard of truth, without understanding the code at all, one does not have the necessary skills to debug. Debugging, and not initial code writing, is the bulk of programming time, effort, and ability as coding is mostly about debugging and further adapting code to changing libraries, operating system, language versions and user input and requirements. The latter is especially true for research code which is used for exploratory analysis. If one does not have the skills to debug, one will not have the skills to modify code as required when the analysis requires changing, when the format of the input data changes, or any other myriad issues. They will have no tools for creating software tests that look for things such as edge cases and other problems that may arise with ChatGPT-generated code that is plausible, yet will fail, and the scientist generating the code will not even know that it failed, as the output will be plausible, but wrong. In a good scenario, the code generated by ChatGPT may not be completely wrong, and provide correct analysis to input, but only in some cases.

Finally, I don’t see any of the tasks they provided to ChatGPT as “moderate”. They are all introductory and easy. Most have little real-world relevance, although they have great educational value in teaching basic coding.

As for the educational aspect: here the authors seem to be more reserved. Quote: “Our findings have important implications for education. Until LLMs demonstrate an ability to replace all human programming efforts, it will remain necessary for students (and others) to gain programming skills”. This statement has the implicit assumption that programming skills have a threshold level that needs to be acquired. I disagree with that assumption, as programming is an iterative process that requires domain knowledge, creativity, synthesis, and many other skills. There is no threshold, there is constant skill acquisition in many dimensions.

However, I do agree with the authors that programming skills should be gained and that students should be able to do them themselves without resorting to a code-generating LLM. To go back to the calculator analogy, one needs to know how to do pencil-and-paper addition or long division, even if one will hardly ever use it in the future; otherwise one is arithmetically illiterate. The caution they advocate here is well-placed. It is therefore important to teach programming without the use of an LLM, as the authors suggest.

To go to the Conclusions: “In this study, we provide evidence that dialogue-based LLMs, such as ChatGPT, can aid in solving programming exercises in the life sciences, particularly in the field of bioinformatics.” I disagree. The authors have not demonstrated that they have provided anything but very basic problems to ChatGPT, and I am not sure that they are research or real-world-relevant bioinformatics tasks but rather very basic programming exercises.

“However, despite generally excellent performance, these models cannot replace the need for human programming efforts entirely.”

75% is not “excellent”, it is a “C” grade. Especially on the very simple problems that the authors placed in the Additional Materials.

“In an authentic research setting, where an auto-grader is not available to provide instant feedback on the correctness of model-generated code, there is a risk that relying on their outputs may produce erroneous results.” I agree. And it may be harmful.

“With the help of machine learning models, instructors can provide more personalized and efficient feedback to students, and researchers can accelerate their work by automating programming tasks”

There is no demonstration of the former regarding students (how will instructors provide more personalized and efficient feedback to students using an LLM rather than grading software? The LLM in question was used for task solving, not for task feedback). As for the latter, using a code generator that is 25% wrong on even the most simple tasks (which are not even applicable in an authentic research or other real-world setting), does not accelerate research. I suspect that it probably slows it down and introduces unnecessary errors.

My suggestion is for the authors to review the data and their conclusions. It seems to me that the results clearly demonstrate that the use of an LLM for generating basic code, and worse, for teaching programming, should be highly discouraged.

Minor comments:

Please provide the URL for the OBF repository in the Data Availability Section

Please provide Results.html also as an Excel sheet or an ods file. It is hard to read as HTML.

Which Version of ChatGPT (dated, if possible) and Python were used?

Reviewer #2: The manuscript from Piccolo, Denny, Luxton-Reilly, Payne and Ridge explains the significance of large language models like ChatGPT applied to bioinformatics programming tasks. The methods explain how the researchers tested ChatGPT on several tasks, and evaluated its performance and the ability to correct answers guided by human prompts. The researchers found that ChatGPT solved 97.3% of the tasks within 7 or fewer attempts.

I believe the study to be very timely and interesting, but I have some suggestions on how to clarify its purpose:

- I would change the title to be less specific to ChatGPT and more general about LLMs. I understand that ChatGPT is well-known and highly discussed at the moment, and that this is the LLM that was used for the study; however, I believe that the results of the study are applicable beyond ChatGPT only, and that keeping it in the title exposes the manuscript to the danger of "early obsolescence". This is just a suggestion and I would leave the decision to the authors' preference, also considering their sentence in the discussion about the limitations of the study applicability.

- In the introduction, I believe that the purpose of the study should be better clarified. Was it to test if bioinformaticians' job can be sped up? Do the authors think they could quantify this, comparing the time to formulate ChatGPT prompts and to write code independently?

- Or was the purpose to discuss ChatGPT's potential as a learning tool? The introduction mentions that "learning to program is a daunting challenge for many researchers". To discuss LLMs' implications in this field, I believe more data should be provided about the level of experience of the humans providing prompts to ChatGPT in this study. Were they all experts, even though they "took the stance of a naive programmer"? If so, would ChatGPT performance result change if the prompts were formulated by real beginner programmers?

- Even though all the conversations between humans and ChatGPT are shared, I would appreciate more summarizing considerations about the prompts to correct ChatGPT's previous errors. I don't find the explanation exhaustive: "our feedback focused on helping ChatGPT understand the exercise requirements and was restricted to natural language such that no source code was present in any of our prompts". Was the type of error (generated by the previous version of the code) provided? If not, and if it was rather an effort to rephrase the prompt in such a way that it could be better understood by the LLM, how much expert knowledge was involved in this process? And how did the researchers decide between the two alternatives: "helping ChatGPT understand the exercise" and "indicated that the previously generated solutions were unsuccessful and asked ChatGPT to try an alternative approach"?

I believe that the additional details suggested above would strengthen the message of the paper. In addition, the ambiguity I find in the introduction (is the purpose to demonstrate ChatGPT efficacy in learning or in working tasks?) reflects in some conflicts I find in the final discussion.

- If the purpose is to demonstrate that bioinformaticians will work quicker with ChatGPT, if we assume that they will all do so once they get a job in the field, and if the purpose of the course is to prepare them for their future working environments, why do summative assessments at the University not allow access to the Internet?

- In relation to the above, the manuscript comments on the danger of "over-reliance by novices", and this suggests that this is the reason why summative assessments still don't incorporate the use of LLMs. Hence, the discussion seems to indicate that LLMs are not considered by the authors (yet) a good tool for learning, but this is not what is suggested by the introduction (e.g. "learning to program is a daunting challenge for many researchers").

- I would also be interested in knowing more about the suggestions "one way that instructors could counter this behavior is to use LLMs to generate student-specific questions about their code" and "when students attempt to devise solutions but become stuck, they may be able to use LLMs as intelligent tutors".

In summary, I believe the study to be relevant, well structured and deserving to be published. However, I would appreciate greater clarity on its premises, aims, and conclusions from the point of view of the authors. Given the inherently subjective nature of my comments, I am in any case happy for the authors to incorporate my suggestions to the extent they see fit, and I would accept the submission even with minor or no changes.

Reviewer #3: The work tests the use of an LLM, specifically ChatGPT, to generate bioinformatics-relevant programming code. The authors conclude that 1) the ChatGPT solutions are generic and can fail to provide correct solutions for specific biology problems, 2) a human in the loop is required to evaluate the solutions, and 3) since LLMs are easily accessible, educators will have to account for their usage by themselves and by their students and adapt teaching and assessments accordingly.

It would be good to provide a clear table of dos and don'ts.

The versions of the GPT model used must be mentioned.

A big gap in the study is the lack of a comparison with another LLM. It is likely that the responses will vary between models, and it is important to highlight the differences due to prompts, model types and model training.

While LLMs can be used to assist with coding tasks, a human in the loop is required to evaluate the outcomes. The current solutions are not self-sufficient.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Lisanna Paladin

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011511.r003

Decision Letter 1

Francis Ouellette

12 Sep 2023

Dear Dr. Piccolo,

We are pleased to inform you that your manuscript 'Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

@bffo

BF Francis Ouellette

Section Editor

PLOS Computational Biology 

Patricia M Palagi

Section Editor

PLOS Computational Biology 

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have carefully read and answered all of my previous remarks. I appreciate their thorough responses, and the sweeping changes they made to the paper. However, we still have a fundamental disagreement as to the credible use of ChatGPT. I will address what I see as the crux of the matter.

"As described in our earlier comments, we believe that iteratively generating code and receiving feedback on functional correctness is a valid approach both in an educational setting and in research settings".

My question is: why invest time and effort in developing the dubious "skill" of ChatGPT prompt engineering, rather than teach the real skill of writing and debugging a program? I do not see prompt engineering as formative learning associated with programming; it is simply a highly idiosyncratic way to get the generative AI du jour to perform adequately. Debugging, by contrast, is a large part (perhaps the biggest part) of learning how to program. I'm at a loss as to how anyone can learn how to program without this fundamental skill. Throwing stochastically generated lines of code at a code-checker and tweaking the prompts until converging on a solution is not, in my opinion, a valid way to teach programming, especially when in real-life situations there will not be such a code-checker, and programmers would need to know how to generate a proper test set and test their code by themselves. The authors seem to imply that debugging can be introduced afterwards, in a more advanced teaching setting, but I do not believe that to be correct. Proper debugging skills are fundamental to coding and should be practiced at every level.

"Without approval we cannot provide details"

Although that is understandable, it is unfortunate, as it hampers both the authors' ability to defend their claim, and mine to accept it.

ChatGPT's specious stochastic output can be highly tempting for the beginning (and not-so-beginning) programmer to accept at face value, even though the authors have demonstrated that it is 25% erroneous. It is no different than any other fact-finding use of ChatGPT, which, left unscrutinized (or scrutinized in a cursory or wrong fashion), can lead to disastrous results due to ChatGPT's "hallucinatory" behavior.

My question is: why teach programming using a stochastic code generator that is known to be erroneous? Would that not be like teaching arithmetic using a calculator that sometimes outputs 25+32=56 and sometimes 25+32=54? Would it not be better to do without the calculator altogether and teach long addition properly?

"Thank you for this suggestion. We attempted to use Excel, but the number of characters in some of the programming prompts exceeded the maximum limit for a cell, so the prompts were truncated. Thus, we used HTML. Although the HTML solution has disadvantages, we feel it is a reasonable compromise."

A CSV or TSV file would probably be easier to parse, then. HTML would be good for visualization of the results, but not for making them FAIR.

Reviewer #2: Thank you for taking into consideration my comments and suggestions.

Reviewer #4: This well structured report describes a study to investigate the potential of LLMs in aiding life scientists with computer programming tasks to help learning. Given that programming is an essential tool for life scientists but remains a challenging skill to master, this study evaluated how well ChatGPT can interpret and generate functional code based on human-language prompts in a training exercise setting.

A key issue is that naïve-to-coding students will inevitably use LLMs as they learn to code – are they of any help? Will the LLMs be able to provide appropriate code?

The results of this study detail the responses and potential of LLMs in assisting with programming skills.

The manner in which the teaching of coding will be impacted relies upon a better understanding of how well LLMs can code from prompts. Here the work makes a contribution in the setting of bioinformatics-related coding exercises.

The majority of the other reviewers' concerns have been addressed, and the concerns I may have raised have been reflected in their comments and responses.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Lisanna Paladin

Reviewer #4: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011511.r004

Acceptance letter

Francis Ouellette

25 Sep 2023

PCOMPBIOL-D-23-00520R1

Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course

Dear Dr Piccolo,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Judit Kozma

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    Supplementary Materials

    Attachment

    Submitted filename: PLOS_Comp_Bio_Response_to_Reviewers.pdf

    Data Availability Statement

    We created a GitHub repository (https://github.com/srp33/ChatGPT_Bioinformatics) that includes the data we collected in this study. The instructors’ solutions have been removed so that students cannot see the solutions for exercises that ChatGPT did not solve. Due to cell-size limitations in Microsoft Excel, we exported the spreadsheets to HTML. Researchers who wish to reuse the data can import the HTML files using R code and then export them to other formats; an example is provided in our GitHub repository. The repository also includes a record of our conversations with ChatGPT. The files are in Markdown format. The researcher’s portions of each conversation are prefixed with **Human:**. ChatGPT’s portions of each conversation are prefixed with **Assistant:**. The code that we used to analyze the data and generate figures is available in our GitHub repository.
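The conversation format described above lends itself to simple parsing. The following is a minimal Python sketch, not the authors' analysis code, of how a reader might split one such Markdown file into speaker turns. It assumes that each **Human:** or **Assistant:** prefix begins a line and that a turn continues until the next prefix; the function names and the sample text are hypothetical, and counting Assistant turns as attempts is an assumption made here for illustration.

```python
import re
from typing import List, Tuple

# Speaker prefixes as described in the Data Availability Statement.
# Assumption: a prefix always appears at the start of a line.
TURN_PATTERN = re.compile(r"^\*\*(Human|Assistant):\*\*\s*(.*)$")


def parse_conversation(markdown_text: str) -> List[Tuple[str, str]]:
    """Split a conversation into (speaker, text) turns, in order."""
    turns: List[Tuple[str, str]] = []
    speaker = None
    buffer: List[str] = []
    for line in markdown_text.splitlines():
        match = TURN_PATTERN.match(line)
        if match:
            # A new prefix closes the previous turn, if there was one.
            if speaker is not None:
                turns.append((speaker, "\n".join(buffer).strip()))
            speaker = match.group(1)
            buffer = [match.group(2)]
        elif speaker is not None:
            # Lines without a prefix belong to the current turn.
            buffer.append(line)
    if speaker is not None:
        turns.append((speaker, "\n".join(buffer).strip()))
    return turns


def count_attempts(turns: List[Tuple[str, str]]) -> int:
    """Count Assistant turns (assumed here to be one per attempt)."""
    return sum(1 for who, _ in turns if who == "Assistant")


# Hypothetical sample text, not taken from the repository.
SAMPLE = (
    "**Human:** Write a function.\n\n"
    "**Assistant:** Here is code:\n\nprint('hi')\n\n"
    "**Human:** That failed.\n\n"
    "**Assistant:** Try this."
)

if __name__ == "__main__":
    for who, text in parse_conversation(SAMPLE):
        print(who, "->", text.splitlines()[0])
```

Any text that precedes the first prefix in a file is ignored by this sketch. The actual files in the repository may require adjustments, for example if a prefix is indented or appears inside quoted text.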


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
