Skip to main content
AMIA Summits on Translational Science Proceedings logoLink to AMIA Summits on Translational Science Proceedings
. 2024 May 31;2024:679.

A Comparison of Google and ChatGPT for Automatic Generation of Health-related Multiple-choice Questions

Vivien Song 1, David Kauchak 1, John Hamre 1, Nick Morgenstein 1, Gondy Leroy 2
PMCID: PMC11141817  PMID: 38827114

Abstract

Critical to producing accessible content is an understanding of what characteristics affect understanding and comprehension. To answer this question, we are producing a large corpus of health-related texts with associated questions that can be read or listened to by study participants to measure the difficulty of the underlying content, which can later be used to better understand text difficulty and user comprehension. In this paper, we examine methods for automatically generating multiple-choice questions using Google’s related questions and ChatGPT. Overall, we find both algorithms generate reasonable questions that are complementary; ChatGPT questions are more similar to the snippet while Google related-search questions have more lexical variation.

Question-Answer Corpus Generation

To simulate a patient searching for health-related materials, we used 10K terms from the UMLS that were tagged as either “disease” or “syndrome” and submitted them as a search query to Google. We extracted the “related questions” along with snippets with potential answers resulting in 57K medical text snippets that have an associated question and answer generated by Google. For this study, we randomly selected 500 questions where the answer. For each unique snippet, we also use GPT-3.5-turbo through OpenAI’s API to generate a multiple-choice question resulting in a second question for each snippet.

Evaluation

To evaluate the content, coverage, and difficulty of the questions we utilized four different metrics: that compare the: average BLEU score between question and each snippet sentence, proportion of words in the question that occurred in the snippet (unigram precision), BERTScore between question and snippet, and the percentage of distinct bigrams. The first three metrics measure how similar the question is to the snippet, with BLEU and unigram precision focusing on lexical overlap and BERTScore semantic overlap. The proportion of unique bigrams in the question measures textual variation, with lower values indicating more variation.

Table 1 shows the results of the question evaluation metrics. ChatGPT questions are more similar to the text than the Google questions: ChatGPT has a significantly higher BLEU, a higher unigram precision, and a higher BERTScore. However, ChatGPT has a lower frequency of distinct bigrams by 4%, indicating that the ChatGPT generated question has less structural variation. In summary, both multiple-choice questions types were reasonable and could be used for measuring comprehension of the text. Given that they reflect very different approaches and different characteristics they can provide a complementary view of user understanding. Future research will look more at how the questions differ as well as the variation in potential wrong answers.

Table 1.

Question evaluation metrics for Google related-search questions and ChatGPT questions

Type BLEU (sent) Unigram precision BERTScore Distinct bigrams
Google 0.042 0.55 -0.048 0.88
ChatGPT 0.20 0.72 0.013 0.84

Acknowledgements

This research was supported by NLM-NIH Award Number R01LM011975. The content does not necessarily represent the official views of the NIH.


Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association

RESOURCES