As recently highlighted in the New England Journal of Medicine,1,2 artificial intelligence (AI) has the potential to revolutionize the field of medicine. While AI undoubtedly represents a set of extremely powerful technologies, it is not infallible. Accordingly, in their illustrative paper on potential medical applications of the recently launched large language model GPT-4, Lee et al. point out that chatbot applications of this AI-driven large language model occasionally produce false responses and that “A false response by GPT-4 is sometimes referred to as a ‘hallucination’.”1 Indeed, it has become standard in AI to refer to a response that is not justified by the training data as a hallucination.3 We find this terminology problematic for the following 2 reasons:
It is an imprecise metaphor. Hallucination is a medical term used to describe a sensory perception occurring in the absence of an external stimulus. AI models do not have sensory perceptions as such—and when they make errors, these errors do not occur in the absence of external stimuli. Rather, the data on which AI models are trained can (metaphorically) be considered external stimuli—as can the prompts eliciting the (occasionally false) responses.
More importantly, it is a highly stigmatizing metaphor. Hallucinations can accompany many illnesses, primarily neurological or mental, and represent a hallmark symptom of schizophrenia.4 Individuals with schizophrenia experience stigma from many parts of society, with inappropriate metaphorical use of the word schizophrenia (with negative connotation) being one of the sources.5 Metaphorical use of hallucination (also with a clear negative connotation) in AI—a field with clear links to both medicine in general and psychiatry specifically1,2,6,7—is therefore very unfortunate. Notably, this is occurring at a time when reducing stigma is a top priority for psychiatry at large, in order to improve the lives of those living with mental illness.
It is not constructive to merely criticize terminology without providing an alternative. Therefore, given the topic and the timing, we sought advice from AI. Specifically, we first turned to GPT-3.5—the predecessor of GPT-4:
Human: People tend to use the term “hallucination” when a generative artificial intelligence chatbot gives/makes up a response that does not seem to be justified by its training data.
GPT-3.5: While it is possible for a generative artificial intelligence chatbot to produce responses that seem to be disconnected from its training data, the term “hallucination” may not be the most accurate description of this phenomenon.
Hallucinations are typically defined as sensory experiences that occur in the absence of any external stimulus, and are commonly associated with medical conditions such as schizophrenia, delirium, or drug-induced psychosis. The term “hallucination” is not typically used in the context of AI-generated responses, as these responses are not sensory experiences in the traditional sense.
A more appropriate term to describe the phenomenon you mentioned might be “non sequitur” or “unrelated response”. These terms suggest that the AI-generated response is not logically connected to the input it received, rather than implying a sensory experience that is not based on external stimuli. It is important to use accurate terminology when discussing AI-generated responses to ensure that we do not inadvertently create confusion or misunderstandings about the nature of these systems.
(ChatGPT from OpenAI. March 2023 version)
Although part of this answer is likely driven by the fact that the training data for GPT-3.5 do not cover the period in which use of the term hallucination gained popularity in the field of AI,3 the answer seems spot on. Non sequitur, Latin for “it does not follow,” is a term commonly used in philosophy and rhetoric to describe inferences that do not follow from their premises. This is exactly what occurs when AI models provide responses (inferences) that are not justified by the training data or the training itself (premises). As a “sensitivity analysis,” we asked GPT-4 to assess this alternative terminology proposed by GPT-3.5:
Human: I just had this correspondence with ChatGPT: (The question to GPT-3.5 and its answer were inserted, followed by “End of correspondence.”). Do you agree with the response from ChatGPT?
GPT-4: As an AI language model, I am part of the same system you had a correspondence with (ChatGPT). The response provided is accurate and well-reasoned. It’s true that the term “hallucination” might not be the best fit for describing AI-generated responses that seem disconnected from their training data. The suggested alternatives, “non sequitur” or “unrelated response,” more accurately describe this phenomenon without implying a sensory experience or medical condition. Using precise terminology when discussing AI-generated responses helps to avoid confusion and misunderstandings about the nature of these systems.
(GPT-4 from OpenAI. March 2023 version)
GPT-4 clearly agreed. We then went on to inquire whether it would be useful to write a brief paper on this topic to disseminate the suggestion of changing terminology:
Human: Thanks. I am a researcher working in the field of AI in medicine. I am considering writing and publishing a brief paper asking my colleagues to change the terminology from “hallucinations” to “non sequitur”. Does that sound like a good idea?
GPT-4: As a researcher in the field of AI in medicine, your perspective on terminology and its impact on understanding and communication is important. Writing and publishing a brief paper on this topic can be a good idea, as it encourages discussion and promotes clarity in the field. By proposing the change from “hallucinations” to “non sequitur,” you can help to reduce confusion and ensure that conversations around AI-generated responses are more accurate and precise.
In your paper, consider presenting arguments for the change, explaining the potential implications of using inappropriate terminology, and providing examples to illustrate the difference between the two terms. By engaging in this dialogue, you can contribute to the ongoing refinement of language and understanding within the AI and medical communities.
(GPT-4 from OpenAI. March 2023 version)
Confirmed. We acknowledge that non sequitur does not cover all false responses generated by AI models. Indeed, AI models can also make “hasty generalizations,” ie, the fallacy of making (too) strong claims based on (too) limited data (see figure 1 for representative examples of non sequitur and hasty generalization). GPT-4 agreed that this term would also be useful in the context of AI:
Fig. 1. Hypothetical examples of a non sequitur and a hasty generalization in artificial intelligence.
Human: While non sequitur seems to be an appropriate term to describe very specific logical fallacies in AI, could it perhaps be good to also use the term “hasty generalization” in other cases?
GPT-4: Yes, using the term “hasty generalization” can be appropriate in certain cases to describe specific types of AI-generated responses. A hasty generalization is a logical fallacy that occurs when someone makes a generalization based on insufficient or unrepresentative evidence. In the context of AI, this could refer to instances where the AI model generates a response based on limited or biased information from its training data, leading to an inaccurate or overly broad conclusion.
In your paper, you can discuss the different types of logical fallacies and inaccuracies that can arise in AI-generated responses, such as non sequiturs and hasty generalizations. By providing examples and clarifying the distinctions between these fallacies, you can further contribute to more accurate communication and understanding within the field.
It’s important to keep in mind that AI language models like ChatGPT are probabilistic, which means they are not perfect and can sometimes generate responses that include logical fallacies. Identifying and addressing these issues is an essential part of ongoing AI research and development.
(GPT-4 from OpenAI. March 2023 version)
At this point, we realized that many other terms from philosophy and rhetoric describing logical fallacies of different types (eg, “false analogy,” “appeal to authority,” and “false dilemma”) could also be used to label false responses from AI models (see examples in the Supplementary Material) at a much more specific level than is currently the case with hallucination.3 Increased specificity in error labeling, based on a standardized vocabulary, will facilitate insights into the underlying error-generating mechanisms of AI models and, in turn, help reduce inaccuracies and improve their overall performance and reliability.
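To make the idea of a standardized vocabulary concrete, the sketch below shows, in Python, one hypothetical way such fallacy-based error labels could be represented and attached to model responses for annotation purposes. It is purely illustrative and not part of our original correspondence; the names ErrorLabel and label_response, as well as the particular label definitions, are assumptions made for the example.

```python
# A minimal, hypothetical sketch of a standardized vocabulary of fallacy-based
# error labels for false AI-generated responses. All names and definitions are
# illustrative assumptions, not an established annotation standard.
from enum import Enum


class ErrorLabel(Enum):
    """Illustrative labels for false AI responses, named after classical fallacies."""
    NON_SEQUITUR = "response does not follow from the prompt or training data"
    HASTY_GENERALIZATION = "overly broad claim drawn from too little data"
    FALSE_ANALOGY = "conclusion rests on a superficial or misleading analogy"
    APPEAL_TO_AUTHORITY = "claim justified only by citing a (possibly irrelevant) authority"
    FALSE_DILEMMA = "only two options presented when more exist"


def label_response(response: str, label: ErrorLabel, note: str = "") -> dict:
    """Attach a standardized error label (and an optional free-text note) to a response."""
    return {
        "response": response,
        "label": label.name,
        "definition": label.value,
        "note": note,
    }


if __name__ == "__main__":
    # Hypothetical example of annotating a false response as a hasty generalization.
    annotated = label_response(
        "All patients respond well to this treatment.",
        ErrorLabel.HASTY_GENERALIZATION,
        note="Claim generalized from a small, unrepresentative sample.",
    )
    print(annotated)
```

An enumeration like this is only one possible representation; the substantive point is that each label carries an explicit definition, which is what would enable more specific error analysis than the blanket term hallucination.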
In summary, this paper represents an appeal to the field of AI to change its labeling of false responses. There is no need to use an imprecise and stigmatizing metaphor when there are already specific labels available.
Supplementary Material
Acknowledgments
The authors thank Christian Winther Topp, MA (philosophy), for fruitful discussion.
Conflicts of interest
SDØ received the 2020 Lundbeck Foundation Young Investigator Prize. SDØ owns/has owned units of mutual funds with stock tickers DKIGI, IAIMWC, SPIC25KL, and WEKAFKI, and has owned units of exchange-traded funds with stock tickers BATE, TRET, QDV5, QDVH, QDVE, SADM, IQQH, USPY, EXH2, 2B76, and EUNL. KLN reports no conflicts of interest.
Contributor Information
Søren Dinesen Østergaard, Department of Clinical Medicine, Aarhus University, Aarhus, Denmark; Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark.
Kristoffer Laigaard Nielbo, Department of Culture and Society, Center for Humanities Computing, Aarhus University, Aarhus, Denmark.
Funding
There was no funding for this work. SDØ is supported by grants from the Novo Nordisk Foundation (grant number: NNF20SA0062874), the Lundbeck Foundation (grant numbers: R358-2020-2341 and R344-2020-1073), the Danish Cancer Society (grant number: R283-A16461), the Central Denmark Region Fund for Strengthening of Health Science (grant number: 1-36-72-4-20), the Danish Agency for Digitisation Investment Fund for New Technologies (grant number 2020-6720) and Independent Research Fund Denmark (grant numbers: 7016-00048B and 2096-00055A).
References
1. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–1239. doi: 10.1056/NEJMsr2214184
2. Haug CJ, Drazen JM. Artificial intelligence and machine learning in clinical medicine, 2023. N Engl J Med. 2023;388(13):1201–1208. doi: 10.1056/NEJMra2302038
3. Wikipedia. Hallucination (artificial intelligence). Accessed April 8, 2023. https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)
4. Marder SR, Cannon TD. Schizophrenia. N Engl J Med. 2019;381(18):1753–1761.
5. Vahabzadeh A, Wittenauer J, Carr E. Stigma, schizophrenia and the media: exploring changes in the reporting of schizophrenia in major U.S. newspapers. J Psychiatr Pract. 2011;17(6):439–446. doi: 10.1097/01.pra.0000407969.65098.35
6. Corona Hernández H, Corcoran C, Achim AM, et al. Natural language processing markers for psychosis and other psychiatric disorders: emerging themes and research agenda from a cross-linguistic workshop. Schizophr Bull. 2023;49(suppl_2):S86–S92. doi: 10.1093/schbul/sbac215
7. Salazar de Pablo G, Studerus E, Vaquerizo-Serrano J, et al. Implementing precision psychiatry: a systematic review of individualized prediction models for clinical practice. Schizophr Bull. 2021;47(2):284–297. doi: 10.1093/schbul/sbaa120