European Journal of Human Genetics. 2023 May 29;32(4):466–468. doi: 10.1038/s41431-023-01396-8

Analysis of large-language model versus human performance for genetics questions

Dat Duong 1, Benjamin D Solomon 1
PMCID: PMC10999420  PMID: 37246194

Abstract

Large-language models like ChatGPT have recently received a great deal of attention. One area of interest pertains to how these models could be used in biomedical contexts, including in relation to human genetics. To assess one facet of this, we compared the performance of ChatGPT versus human respondents (13,642 human responses) in answering 85 multiple-choice questions about aspects of human genetics. Overall, ChatGPT did not perform significantly differently (p = 0.8327) than human respondents; ChatGPT was 68.2% accurate, compared to 66.6% accuracy for human respondents. Both ChatGPT and humans performed better on memorization-type questions than on critical thinking questions (p < 0.0001). When asked the same question multiple times, ChatGPT frequently provided different answers (16% of initial responses changed), including for responses that were initially correct as well as initially incorrect, and gave plausible explanations for both correct and incorrect answers. ChatGPT’s performance was impressive, but the model currently demonstrates significant shortcomings for clinical or other high-stakes use. Addressing these limitations will be important to guide adoption in real-life situations.

Subject terms: Diseases, Genetics

Introduction

Artificial intelligence (AI) applications, including subsets like deep learning (DL), have strong potential in biomedicine, including genetics [1–3]. Recently, large-language models (LLMs) like ChatGPT (https://chat.openai.com/chat) have been in the spotlight, including through demonstrations of medical knowledge [4, 5]. LLMs like ChatGPT use a specific type of DL architecture called a transformer. The model is trained on a large text dataset drawn from sources like books, articles, and websites, and learns to predict the next word in a sequence following a prompt such as a question. Training involves unsupervised learning, where the model learns to make predictions without explicit labels or annotations. Once trained, the model can be fine-tuned to improve performance on a specific task [6].
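
As a toy illustration of the next-word-prediction objective described above (a simple bigram counter, not a transformer, and far simpler than how ChatGPT actually works), consider the following Python sketch; the two-sentence training corpus is invented for the example:

```python
# Toy next-word prediction by counting bigrams: the same training
# objective that LLMs optimize at vastly larger scale with neural
# networks. The corpus here is invented for illustration only.
from collections import Counter, defaultdict

corpus = (
    "autosomal dominant conditions affect every generation . "
    "autosomal recessive conditions can skip generations ."
).split()

# "Training": count which word follows which in the corpus.
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word seen after `word` in training."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else "<unknown>"

print(predict_next("autosomal"))  # 'dominant' (tie broken by insertion order)
print(predict_next("conditions"))  # 'affect'
```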

We aimed to explore how well ChatGPT would perform compared to human respondents in answering questions about human genetics.

Materials and methods

To help evaluate this model in genetics, we asked ChatGPT to answer a series of multiple-choice questions that had been posted on two social media platforms (Twitter: @BenjaminSolomo2; Mastodon: @solomonbenjamind@genomic.social). These questions have been posted weekly or biweekly since 2013, with answers and explanations given at the end of each week; over 430 questions have been posted to date. For this analysis, we used a subset of 85 questions posted starting in 2021, as ChatGPT was trained on data prior to this date. Via the social media polls, these 85 questions received a total of 13,642 responses. The questions were answered anonymously through poll functions on the social media platforms, so it is not possible to gauge the precise expertise of the respondents. However, the accounts are primarily followed by individuals in the field of genetics, and the questions have been publicly suggested as useful for board and examination preparation in genetics. The questions cover topics related to human genetics, including general knowledge, clinical genetics and patient diagnosis and management, molecular genetics and disease causes, and inheritance and risk calculations. We did not use questions with images in this analysis.

To pose the questions to ChatGPT initially, we queried the online platform by inputting batches of 10–20 questions at a time. We chose this batch size because we noticed that ChatGPT would typically answer ~10–15 questions before ceasing to respond. Each question was accompanied by four possible answers, one of which was correct. We instructed ChatGPT to pick the single best answer to each question. As ChatGPT sometimes provides different answers when asked multiple times, we asked all questions twice. In doing this, we did not provide any feedback to ChatGPT, since ChatGPT may modify answers according to prompts. Finally, for questions where ChatGPT answered incorrectly on either or both attempts, we asked the question a third time and also requested an explanation for the answer chosen in this final attempt.
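
We queried ChatGPT through its web interface by hand. Purely as a hypothetical sketch, the same ask-twice protocol could be scripted against the OpenAI API (pre-1.0 `openai` Python package); the model name, prompt wording, and `graded_correct` grading callback below are illustrative assumptions, not part of the study:

```python
# Hypothetical sketch of the ask-twice protocol via the OpenAI API
# (the study itself used the web interface manually). Assumes the
# pre-1.0 `openai` package; model and prompts are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

def ask_once(question: str) -> str:
    """Pose one question in a fresh, single-turn conversation."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model; not what the study used
        messages=[{
            "role": "user",
            "content": "Pick the single best answer.\n\n" + question,
        }],
    )
    return response["choices"][0]["message"]["content"]

def run_protocol(question: str, graded_correct) -> dict:
    """Ask twice with no feedback between attempts; if either attempt is
    graded incorrect, ask a third time and request an explanation.
    `graded_correct` is a hypothetical callback that scores an answer
    against the key (grading was done manually in the study)."""
    first, second = ask_once(question), ask_once(question)
    result = {"first": first, "second": second}
    if not (graded_correct(first) and graded_correct(second)):
        result["third"] = ask_once(question + "\n\nExplain your choice.")
    return result
```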

To analyze the results, we compared two categorical variables: the outcome (correct versus incorrect answer) and the type of respondent (ChatGPT versus human). The null hypothesis was that there was no relationship between the two variables. To test this, we applied the chi-square test with Yates’ correction (which accounts for the discreteness of the distributions) to compute the p value. We checked our values with Fisher’s exact test, which gave very similar values for all calculations.
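
As a concrete sketch of these calculations (assuming the SciPy library), the reported p values can be recovered from the counts given below in Table 1:

```python
# Reproducing the reported statistics from the Table 1 counts.
# Rows: ChatGPT vs. human respondents; columns: correct vs. incorrect.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

overall = np.array([[58, 27],        # ChatGPT: correct, incorrect
                    [9080, 4562]])   # respondents (13,642 responses)

chi2, p, dof, expected = chi2_contingency(overall, correction=True)  # Yates
print(f"overall: p = {p:.4f}")       # reported: p = 0.8327

_, p_fisher = fisher_exact(overall)  # cross-check, as in the paper
print(f"Fisher:  p = {p_fisher:.4f}")

# Majority-vote comparison ("most common response" as the group answer)
majority = np.array([[58, 27],       # ChatGPT
                     [78, 7]])       # respondents' most common response
chi2, p, dof, expected = chi2_contingency(majority, correction=True)
print(f"majority vote: p = {p:.4f}") # reported: p = 0.0003
```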

Results

Questions, answers, and explanations (the same materials provided via social media to the human respondents), including ChatGPT’s explanations for any question it answered incorrectly, are provided in Supplementary File 1. A summary of human responses is detailed in Supplementary Table 1. For cross-referencing, the questions are numbered according to the scheme used on social media.

ChatGPT did not perform significantly differently (p = 0.8327) than human respondents; ChatGPT was 68.2% accurate, compared to 66.6% accuracy for humans. If we measured the respondents’ group accuracy by choosing the most common response as the overall group answer, the respondents significantly outperformed ChatGPT (p = 0.0003) (Table 1).

Table 1.

Performance of ChatGPT versus respondents.

                       Correct   Incorrect   Accuracy
Total
  ChatGPT                   58          27      68.2%
  Respondents             9080        4562      66.6%
Most common response
  ChatGPT                   58          27      68.2%
  Respondents               78           7      91.8%

Unless otherwise noted, calculations here and below were done according to ChatGPT’s initial answers.

We were interested in how ChatGPT performed on questions considered to focus on “fact look-up” or memorization (M) versus critical thinking (C). To evaluate this, we divided the questions into these categories based on our assessment of their contents (Supplementary File 1). Both ChatGPT and humans performed significantly better on memorization questions than on critical thinking questions (for both, p < 0.0001). When comparing ChatGPT’s results to those of the humans (Table 2), ChatGPT did not perform significantly differently than the human respondents for memorization (p = 0.3053) or critical thinking questions (p = 0.0871); a verification sketch follows Table 2.

Table 2.

Comparison of results of memorization versus critical thinking questions.

                       Correct   Incorrect   Accuracy
Memorization
  ChatGPT                   53          13      80.3%
  Respondents             7137        2509      74.0%
Critical thinking
  ChatGPT                    5          14      26.3%
  Respondents             1943        2053      48.6%
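
These Table 2 comparisons can be checked with the same Yates-corrected chi-square test described in the Materials and methods (a sketch assuming SciPy):

```python
# Checking the Table 2 comparisons with the Yates-corrected chi-square
# test; counts taken directly from Table 2.
import numpy as np
from scipy.stats import chi2_contingency

def yates_p(table):
    """p value from the chi-square test with Yates' correction."""
    return chi2_contingency(np.array(table), correction=True)[1]

# Memorization vs. critical thinking, within each respondent type
print(yates_p([[53, 13], [5, 14]]))           # ChatGPT: p < 0.0001
print(yates_p([[7137, 2509], [1943, 2053]]))  # humans:  p < 0.0001

# ChatGPT vs. humans, within each question type
print(yates_p([[53, 13], [7137, 2509]]))      # memorization:      ~0.3053
print(yates_p([[5, 14], [1943, 2053]]))       # critical thinking: ~0.0871
```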

ChatGPT often provided different answers to the same questions, with 14 answers changing between the initial and second attempts (16.5% of the 85 questions). These changes involved responses that were initially correct as well as ones that were initially incorrect (Fig. 1). When asked for explanations of the initially incorrect answers, ChatGPT would again sometimes provide different answers along with the explanation. We noted that ChatGPT sometimes embellished its responses. For example, as shown in the supplementary files, it once added the acronym “(ECG)” after correctly answering “Electrocardiogram” and added the phrase “and segregation testing” after correctly answering “Parental testing”. Both additions are logically correct, though neither was part of the answer choices. ChatGPT would also sometimes provide full explanations without prompting, and would sometimes produce error messages.

Fig. 1. Summary of ChatGPT’s responses. The Sankey plot (constructed via Flourish, https://app.flourish.studio/projects) shows ChatGPT’s initial and second responses to the 85 questions used in the study.
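
For readers who wish to reproduce a figure of this kind programmatically, a minimal sketch using Plotly’s Sankey support follows (the published figure was built with Flourish). The four flow counts are inferred from the reported totals: 58/27 initially correct/incorrect, 14 changed answers, 2 of which went from correct to incorrect:

```python
# Sketch of a Fig. 1-style Sankey diagram using Plotly. Flow counts
# are inferred from the reported totals, not taken from the figure:
# 58 initially correct (56 stayed correct, 2 became incorrect) and
# 27 initially incorrect (12 became correct, 15 stayed incorrect).
import plotly.graph_objects as go

labels = ["Initial: correct", "Initial: incorrect",
          "Second: correct", "Second: incorrect"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1],    # indices into `labels`
        target=[2, 3, 2, 3],
        value=[56, 2, 12, 15],  # inferred flows; sum to 85 questions
    ),
))
fig.update_layout(title_text="ChatGPT's initial vs. second responses")
fig.show()
```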

ChatGPT’s explanations of wrong answers were all plausible in the sense of being believable and logically consistent (though sometimes incorrect). Of the explanations given when ChatGPT initially provided incorrect answers, ChatGPT subsequently gave the correct answer along with the explanation in 10 instances (37.0% of the initial 27 incorrect answers). In 2 instances (7.4%), ChatGPT gave the correct explanation (explaining why the correct answer was correct) but still indicated that it chose the incorrect answer. In 7 instances (25.9%), ChatGPT appeared to use incorrect information to select the answer; this was most frequent for esoteric subjects. ChatGPT struggled with calculation-based and inheritance-type questions, accounting for 6 incorrect answers (22.2%). ChatGPT appeared to misunderstand the question in 2 instances (7.4%). For the 2 questions that ChatGPT answered correctly initially and incorrectly the second time, it provided the correct answer and explanation when asked for the explanation.

Discussion

We were impressed with ChatGPT’s performance, in part because its answers were (via previous training on vast amounts of data) almost instantaneous, and were compared against humans who could look up answers or discuss the questions through social media. ChatGPT performed better on memorization than on critical thinking questions. This aligns with the prevailing view of LLMs: because the model learns only the distribution of words, it is not expected to outperform humans at tasks traditionally considered to require critical thinking (e.g., in response to ChatGPT’s proliferation in schools, some teachers are redesigning assignments to emphasize critical thinking). This differential performance may change with time as LLMs evolve.

We did not fine-tune or redesign ChatGPT, though this could be done at some cost. There are also open-source alternatives that could be utilized. We note that it is not possible to know when ChatGPT is correct without additional verification: ChatGPT will offer a plausible explanation for an incorrect answer and may give completely different responses and explanations when re-queried. In medical or other high-stakes settings, this is concerning. However, strategies like labeling responses with confidence scores, as done in AlphaFold, may help [7].

Overall, approaches using LLMs have shown that AI models can perform well in a variety of medical testing situations [4, 5]. Performance will improve, perhaps rapidly, and tools like ChatGPT may become ubiquitous, much as other forms of AI are currently used in genomics every day [8, 9]. Among many examples, patients and research subjects may increasingly turn to ChatGPT and other such interfaces to learn about conditions that may affect them, and clinicians may use these models to inquire about the suggested work-up or treatment for a certain patient. Potential negative consequences are critical to consider. As our analysis shows, the output may include spurious or unverifiable information. The algorithms may also reflect biases, or may rely on confounding data [10, 11]. Overall, much work remains to ensure that these types of applications are beneficial and do not harm vulnerable populations.

It seems fitting to offer ChatGPT the final words here; after describing the premise of this paper to ChatGPT, we asked it to provide a succinct concluding paragraph (query given 1/18/23 via the link in the Introduction). It wrote the following:

In conclusion, models like ChatGPT have the potential to greatly impact the field of clinical genetics by providing rapid and accurate responses to a wide range of genetic-related questions. This can aid healthcare professionals in the diagnosis and treatment of genetic disorders, as well as provide patients and their families with easily accessible information about their conditions. In addition, ChatGPT’s ability to understand and respond to natural language queries could make genetic information more widely available to a non-expert audience. As the field of genetics continues to advance, the use of natural language processing models like ChatGPT will become increasingly important in both research and clinical settings.

Supplementary information

Supplementary File 1 (104.6KB, docx)
Supplementary Table 1 (21.5KB, xlsx)

Author contributions

DD contributed to: formal analysis, investigation, methodology, and writing (review and editing). BDS contributed to: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, and writing (original draft).

Funding

This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

Data availability

All data used and presented are available in the paper and supplementary files.

Competing interests

The authors receive salary and research support from the intramural program of the National Human Genome Research Institute. BDS is the co-Editor-in-Chief of the American Journal of Medical Genetics, and has published some of the questions mentioned in this study in a book, as well as other questions [12]. Both editing/publishing activities are conducted as an approved outside activity, separate from his US Government role.

Ethics approval

No individual data were collected or analyzed (there was no access to individual respondent data); per discussion with NIH bioethics/IRB, the analyses described here are considered “not human subjects research” and do not require IRB review or formal exemption.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41431-023-01396-8.

References

1. Ledgister Hanchard SE, Dwyer MC, Liu S, Hu P, Tekendo-Ngongang C, Waikel RL, et al. Scoping review and classification of deep learning in medical genetics. Genet Med. 2022;24:1593–603. doi: 10.1016/j.gim.2022.04.025.
2. Schaefer J, Lehne M, Schepers J, Prasser F, Thun S. The use of machine learning in rare diseases: a scoping review. Orphanet J Rare Dis. 2020;15:145. doi: 10.1186/s13023-020-01424-6.
3. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70. doi: 10.1186/s13073-019-0689-8.
4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138. 2022.
5. Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR; FRCR-AI Study Collaborators. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826. doi: 10.1136/bmj-2022-072826.
6. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194. doi: 10.1038/s41746-022-00742-2.
7. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. doi: 10.1038/s41586-021-03819-2.
8. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176:535–48.e24. doi: 10.1016/j.cell.2018.12.015.
9. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36:983–7. doi: 10.1038/nbt.4235.
10. DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell. 2021;3:610–9. doi: 10.1038/s42256-021-00338-7.
11. Tekendo-Ngongang C, Owosela B, Fleischer N, Addissie YA, Malonga B, Badoe E, et al. Rubinstein-Taybi syndrome in diverse populations. Am J Med Genet A. 2020;182:2939–50. doi: 10.1002/ajmg.a.61888.
12. Solomon BD. Medical Genetics and Genomics: Questions for Board Review. Hoboken: Wiley; 2022.
