JMIR Formative Research. 2024 May 17;8:e53985. doi: 10.2196/53985

Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study

Yukinori Harada 1,2, Tetsu Sakamoto 1, Shu Sugimoto 3, Taro Shimizu 1
Editor: Amaryllis Mavragani
Reviewed by: Md Belal Bin Heyat, Byron Crowe
PMCID: PMC11143391  PMID: 38758588

Abstract

Background

Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited.

Objective

This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world.

Methods

This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We included only patients who used an AI-based symptom checker at the index visit and whose diagnosis was confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the presence of the final diagnosis in the list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker’s diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year).

Results

A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 45.1% (172/381), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariable logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker.

Conclusions

A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.

Keywords: atypical presentations, diagnostic accuracy, diagnosis, diagnostics, symptom checker, uncommon diseases, symptom checkers, uncommon, rare, artificial intelligence

Introduction

Diagnostic errors are a significant global patient safety issue [1]. In outpatient settings, diagnostic errors are evident in 1%-5% of cases [2-5]. Notably, the risk of such errors increases for outpatients unexpectedly admitted shortly after their initial visit [6,7]. The most common factors contributing to diagnostic errors in outpatient settings include problems with data integration, interpretation, and differential diagnosis [3,5,8-10]. To address this, the integration of diagnostic decision support systems, such as differential diagnosis generators, into clinical practice is recommended [11].

Differential diagnosis generators produce possible differential diagnoses by processing clinical information through algorithms, thereby supporting clinicians by reducing the likelihood of overlooking possible diagnoses and countering the cognitive biases inherent to the diagnostic process [12]. Early deployment of differential diagnosis generators can augment an existing list of differential diagnoses, increasing the odds of including the correct diagnosis [13], and can prompt more thorough history taking [14]. Therefore, current symptom checkers—generators that produce differential diagnoses based on the inputs from patients themselves before they encounter a clinician—are potentially promising tools to reduce diagnostic errors. Indeed, some symptom checkers have already been used in clinical practice [15-17] and even in national health services, such as the National Health Service 111 system in the United Kingdom [18].

Given that modern artificial intelligence (AI) is designed to be dynamic and to evolve according to real-world data [19], one might expect the performance of AI-based symptom checkers to improve over time. At the same time, performance may also decline when models are updated with feedback data from different populations and settings. Monitoring such shift and drift in AI performance is required to use AI-based symptom checkers effectively and safely [19,20]. However, developers do not usually disclose details such as how their AI algorithms have changed or which clinical indicators have improved. One set of studies using the same sets of clinical vignettes found that the diagnostic accuracy of symptom checkers improved from 2015 to 2020 [21]. However, because these case vignettes are publicly available, the developers may have trained symptom checker algorithms using these cases. Therefore, it remains unknown whether symptom checkers improve their diagnostic performance over time [12]. Moreover, because clinical vignettes have been found to have considerable inherent limitations when used to assess diagnostic accuracy in comparison with real-world data [22], longitudinal evaluations of the performance of symptom checkers in the real world are needed.

Concerns have arisen regarding the low diagnostic accuracy of current symptom checker output, which often lags behind that of physicians [12,17,23]. Inaccurate initial diagnoses can be detrimental, steering clinicians toward errors [24]. One major hurdle in the accuracy of symptom checker outputs is patient input variability. Differences in symptom interpretation, clinical literacy, input sequencing, and symptom listings can profoundly influence the quality of a symptom checker’s output [12,25]. Another challenge is the disparity between simulated and real-world data. Previous research has indicated a diminished accuracy of symptom checker output when applied to real cases instead of fictional vignettes [23]. This could be attributed to the fact that crafted vignettes often provide typical presentations [25], whereas real cases include more atypical presentations and may contribute to diagnostic errors [26]. Therefore, symptom checkers should be trained using real-world patient data, covering a diverse range of cases and including atypical presentations, to improve their accuracy [12,25]. The real-world application and refinement of these tools after development are crucial.

Therefore, this study was conducted to assess the changes in the accuracy of differential diagnosis lists created by AI-based symptom checkers in the real world. This paper defined AI-based symptom checkers as those using contemporary machine learning models.

The contributions of our proposed work are summarized as follows: (1) we provide the data of 3-year longitudinal changes in the diagnostic performance of contemporary machine learning–based symptom checkers and (2) we also provide factors related to the diagnostic performance of AI-based symptom checkers.

Methods

Study Design and Participants

This was a single-center, retrospective, observational study. Patients who visited the internal medicine outpatient clinic at Nagano Chuo Hospital without an appointment between May 1, 2019, and April 30, 2022, and who were then admitted within 30 days after their index visit were considered eligible. We set this inclusion criterion because admission within 30 days after the index visit was considered a useful way to capture patients with a high risk of diagnostic errors [27-31]; diagnostic decision support systems are particularly needed for this population. We included only patients who used an AI-based symptom checker that identified 10 possible differential diagnoses (Ubie Inc) at the index visit and excluded patients for whom the AI-based symptom checker produced fewer than 10 differential diagnoses, whose diagnosis was not confirmed, or who were admitted for a reason unassociated with their index visit complaint. For patients who used the AI-based symptom checker multiple times at different outpatient visits or who were admitted twice or more, we included only data from the first outpatient visit and admission (others were excluded as duplicates). An overview of this study is shown in Figure 1.
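The date-based part of these criteria can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the function names are hypothetical, counting a same-day admission as "within 30 days" is an assumption, and the exclusions that required chart review (unconfirmed diagnosis, unrelated admission, duplicates) are not modeled.

```python
from datetime import date

# Study window taken from the text above.
STUDY_START = date(2019, 5, 1)
STUDY_END = date(2022, 4, 30)


def is_eligible(index_visit, admission, n_differentials):
    """Check the date and list-length criteria (hypothetical helper).

    Assumption: admission on the day of the index visit counts as day 0
    and is treated as falling within the 30-day window.
    """
    in_period = STUDY_START <= index_visit <= STUDY_END
    days_to_admission = (admission - index_visit).days
    admitted_within_30_days = 0 <= days_to_admission <= 30
    return in_period and admitted_within_30_days and n_differentials == 10


def study_year(index_visit):
    """Return 1, 2, or 3 for the May-to-April study years."""
    if not STUDY_START <= index_visit <= STUDY_END:
        raise ValueError("index visit is outside the study period")
    # A visit in May or later belongs to the study year that starts
    # in the same calendar year; earlier months belong to the previous one.
    return index_visit.year - 2019 + (1 if index_visit.month >= 5 else 0)
```

For example, `study_year(date(2020, 4, 30))` falls in the first year, whereas `study_year(date(2020, 5, 1))` falls in the second.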

Figure 1.


Overview of the study. This study included patients who visited the internal medicine outpatient clinic at a community hospital without an appointment between May 2019 and April 2022 and were admitted within 30 days after their index visit. This study included only patients who used an AI-based symptom checker that identified 10 possible differential diagnoses at the index visit. The final diagnoses were categorized into common or uncommon diseases, and clinical presentations were categorized into typical or atypical. The change in the diagnostic accuracy of the AI-based symptom checker over 3 years was assessed by using a chi-square test by dividing the study duration into 3 periods: from May 2019 to April 2020 (first year), from May 2020 to April 2021 (second year), and from May 2021 to April 2022 (third year). A multivariable logistic regression analysis was conducted to identify factors independently associated with the diagnostic accuracy of the top 10 differential diagnosis lists created by the AI-based symptom checker. AI: artificial intelligence.

Ethical Considerations

The study complied with the principles of the Declaration of Helsinki. The research ethics committee of Nagano Chuo Hospital approved this study (NCR202208) and waived the requirement for written informed consent from the participants because of the opt-out method used in this study. We informed the participants by providing detailed information about the study in the outpatient waiting area at Nagano Chuo Hospital and on the hospital’s website. The study data are deidentified. There was no compensation for the participants.

AI-Based Symptom Checker

Details of the AI-based symptom checker assessed in this study have been described previously [7,32]. In brief, the AI-based symptom checker converted the data entered by patients on tablet terminals into medical terms. Patients entered their background information, such as age, sex, and chief complaint, as free text on a tablet in the waiting room. The AI-based symptom checker then asked approximately 20 questions, one by one, tailored to the patient. Based on the previous answers of the same patient, the questions were optimized to generate the most relevant list of potential differential diagnoses. The hospital staff at Nagano Chuo Hospital provided support to the patients when they found it difficult to input information independently. Physicians could view the entered data as a summarized medical history with the top 10 possible differential diagnoses along with their ranks. According to the developer’s website, this AI-based symptom checker improved quality through feedback from more than 1500 medical institutions. However, we could not show the mathematical expression and algorithm of the machine learning model because the developer did not disclose a detailed machine learning methodology.

Data Collection

We retrospectively collected data from the patients’ electronic health records. The following data were collected: date of the index visit, age, sex, medical history recorded by the AI-based symptom checker (including chief complaints, history of present illness, past medical history, family history, and social history), 10 differential diagnoses developed by the AI-based symptom checker, and the final diagnosis. The final diagnosis was judged independently by 2 researchers (YH and SS) based on the descriptions in the medical records, and disagreements were resolved through discussion. Final diagnoses were coded by the first author (YH) using the ICD-11 (International Classification of Diseases, 11th Revision) codes. Final diagnoses were further categorized into common or uncommon diagnoses based on whether the incidence was more than 1 in 2000 (common disease) or not (uncommon disease) [33]; unclear cases were judged by 2 researchers (YH and T Sakamoto) through discussion. According to the final diagnosis and medical history created by the AI-based symptom checker, 2 researchers (YH and T Sakamoto) independently judged all cases as typical or atypical, and conflicts were resolved by discussion.

Primary Outcome

The primary outcome measure was the accuracy of the differential diagnosis list created using the AI-based symptom checker. The accuracy of the differential diagnosis list created by the AI-based symptom checker was defined as the presence of the final diagnosis in the list of 10 differential diagnoses created by the AI-based symptom checker. Two researchers (YH and T Sakamoto) independently judged the accuracy of the differential diagnosis list created by the AI-based symptom checker, and conflicts were resolved through discussion. The accuracy of the AI over 3 years was also assessed in the following subgroups: age 65 years and older and younger than 65 years, men and women, single and multiple chief complaints, common and uncommon disease, and typical and atypical presentation.
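As a minimal sketch of this outcome definition, the Python below counts a case as accurate when the final diagnosis appears among the first 10 entries of the list. Matching by normalized string equality is an assumption made for illustration; in the study, 2 researchers judged each case independently. The function names and example cases are hypothetical.

```python
def list_is_accurate(final_diagnosis, differential_list, k=10):
    """Return True if the final diagnosis appears in the top-k list.

    Assumption: diagnoses are compared as case-insensitive strings,
    which is much cruder than the researchers' clinical judgment.
    """
    top_k = [d.strip().lower() for d in differential_list[:k]]
    return final_diagnosis.strip().lower() in top_k


def top_k_accuracy(cases, k=10):
    """Return (hits, total, proportion) over (final, list) pairs."""
    hits = sum(list_is_accurate(final, ddx, k) for final, ddx in cases)
    return hits, len(cases), hits / len(cases)


# Hypothetical cases, not study data.
example_cases = [
    ("Acute appendicitis", ["Gastroenteritis", "Acute appendicitis"]),
    ("Pulmonary embolism", ["Pneumonia", "Asthma"]),
]
hits, total, proportion = top_k_accuracy(example_cases)
print(f"{hits}/{total} lists contained the final diagnosis")
```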

Statistical Analysis

Continuous or ordinal data are presented as mean and SD or median and quantiles and compared using a 2-tailed t test, Mann-Whitney U test, or ANOVA. Categorical or binary data are presented as numbers and percentages and compared using the chi-square or Fisher exact test. To assess the change in the diagnostic accuracy of the AI-based symptom checker over 3 years, we compared the accuracy of the differential diagnosis lists created by the AI-based symptom checker using a chi-square test by dividing the study duration into 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). We calculated 108 patients as the minimum required sample size based on an α error of .05, power of 0.80, effect size of 0.30 (medium), and degrees of freedom of 2. We also created a multivariable logistic regression model that included the correctness of the differential diagnosis list created by the AI-based symptom checker as the dependent variable and the visit year (first, second, and third year), age (as a continuous variable), sex (male or female), typicality of presentation (typical or atypical), and commonality of final diagnosis (common or uncommon) as independent variables; these variables were selected as confounders because they were considered to be associated with the accuracy of the differential diagnosis list created by the AI-based symptom checker. P values below .05 were considered significant. All statistical analyses were performed using R (version 4.1.0; R Foundation for Statistical Computing).
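The sample size figure can be checked with a short standard-library Python sketch. It assumes the effect size is Cohen's w, so that the noncentrality parameter is N × w², and it relies on closed forms that hold only for 2 degrees of freedom. The function names are hypothetical, and this is not the authors' R code.

```python
import math


def power_chisq_df2(n, w, alpha=0.05, terms=200):
    """Approximate power of a chi-square test with 2 degrees of freedom.

    The noncentral chi-square distribution is written as a Poisson
    mixture of central chi-squares with df = 2, 4, 6, ..., whose
    survival functions have the closed form
    exp(-x/2) * sum_{i<=j} (x/2)^i / i!.
    """
    mu = n * w * w / 2.0            # half the noncentrality parameter
    half_crit = -math.log(alpha)    # for df=2, critical value = -2*ln(alpha)
    power = 0.0
    poisson_weight = math.exp(-mu)  # Poisson(j=0; mu)
    term = 1.0                      # half_crit**i / i! for i=0
    partial_sum = 1.0               # running sum of the terms above
    for j in range(terms):
        # alpha equals exp(-half_crit), so alpha * partial_sum is the
        # central survival function for df = 2 * (j + 1).
        power += poisson_weight * alpha * partial_sum
        poisson_weight *= mu / (j + 1)
        term *= half_crit / (j + 1)
        partial_sum += term
    return power


def minimum_sample_size(w, target_power=0.80, alpha=0.05):
    """Smallest N whose power reaches the target (simple linear search)."""
    n = 1
    while power_chisq_df2(n, w, alpha) < target_power:
        n += 1
    return n


print(minimum_sample_size(0.30))
```

Under these assumptions the search should land on the reported minimum of 108 patients.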

Results

Baseline Characteristics

Of the 484 eligible cases, 103 were excluded (duplication: 20, admission unrelated to the index visit: 9, no final diagnosis: 18, and AI produced fewer than 10 differential diagnoses: 56). Therefore, 381 cases were finally included in the analysis. The mean age was 68 (SD 18) years, and 205 (53.8%) were men. In total, 174 (45.7%) patients entered more than 1 complaint. Diseases of the digestive system were the most common final diagnosis category (n=128, 33.6%), followed by diseases of the circulatory system (n=55, 14.4%), respiratory system (n=44, 11.5%), neoplasms (n=42, 11%), and infectious or parasitic diseases (n=26, 6.8%). Regarding commonality and typicality, 257 (67.5%) were common diseases, and 298 (78.2%) were typical presentations. Typical presentation of common disease was the most common group (n=205, 53.8%), followed by typical presentation of uncommon disease (n=93, 24.4%), atypical presentation of common disease (n=52, 13.6%), and atypical presentation of uncommon disease (n=31, 8.1%). The number of patients was higher in the first year than in the second and third years (Table 1) due to the COVID-19 pandemic. Although there was a significant difference in age, no significant differences were observed in other baseline characteristics among the 3 groups.

Table 1.

Baseline characteristics of patients who visited the internal medicine outpatient clinic at Nagano Chuo Hospital without an appointment and were then admitted within 30 days, over 3 years from May 2019 to April 2022.


Characteristic | First year [a] (n=219) | Second year [b] (n=72) | Third year [c] (n=90) | P value
Age (years), mean (SD) | 70 (18) | 63 (15) | 64 (17) | .002
Men, n (%) | 114 (52.1) | 40 (55.6) | 51 (56.7) | .72
Multiple chief complaints, n (%) | 104 (47.5) | 32 (44.4) | 38 (42.2) | .68
Common disease, n (%) | 146 (66.7) | 45 (62.5) | 66 (73.3) | .32
Typical presentation, n (%) | 164 (74.9) | 60 (83.3) | 74 (82.2) | .18

[a] The first year was from May 1, 2019, to April 30, 2020.

[b] The second year was from May 1, 2020, to April 30, 2021.

[c] The third year was from May 1, 2021, to April 30, 2022.

Primary Outcome

Overall, the final diagnosis was included in the top 10 differential diagnosis lists created by the AI-based symptom checker in 172 (45.1%) patients. The accuracy of the differential diagnosis list created by the AI-based symptom checker did not significantly differ among the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). There was also no significant difference in the accuracy of the AI differential diagnosis list among the 3 years in the subgroups (Table 2). In the subgroups with uncommon diseases and atypical presentations, the accuracy of the AI differential diagnosis list was <30%. Some examples of cases with uncommon diseases and atypical presentations are shown in Multimedia Appendix 1.
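The year-by-year comparison can be reproduced from the reported counts with a Pearson chi-square test on a 3 × 2 table. The sketch below uses only the Python standard library; the P value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2), so it does not generalize to other table sizes. The function names are hypothetical, and the authors ran their analysis in R.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df2(statistic):
    """Upper-tail probability of a chi-square with exactly 2 df."""
    return math.exp(-statistic / 2.0)


# (correct lists, patients) per study year, as reported in the text.
counts = [(97, 219), (32, 72), (43, 90)]
table = [[correct, total - correct] for correct, total in counts]

statistic, df = chi_square_independence(table)
p_value = p_value_df2(statistic)
print(f"chi-square={statistic:.3f}, df={df}, P={p_value:.2f}")
```

With these counts the P value rounds to the reported .85.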

Table 2.

Proportion of patients with a correct diagnosis included in the top 10 differential diagnosis list generated by artificial intelligence among patients who visited the internal medicine outpatient clinic at Nagano Chuo Hospital without an appointment and were then admitted within 30 days, over 3 years from May 2019 to April 2022.


Group | Total (N=381), n/N (%) | First year [a] (n=219), n/N (%) | Second year [b] (n=72), n/N (%) | Third year [c] (n=90), n/N (%) | P value
Overall accuracy | 172/381 (45.1) | 97/219 (44.3) | 32/72 (44.4) | 43/90 (47.7) | .85
Age ≥65 years | 110/243 (45.3) | 69/159 (43.4) | 15/35 (42.9) | 26/49 (53.1) | .47
Age <65 years | 62/138 (44.9) | 28/60 (46.7) | 17/37 (45.9) | 17/41 (41.5) | .87
Men | 102/205 (49.8) | 53/114 (46.5) | 20/40 (50) | 29/51 (56.9) | .47
Women | 70/176 (39.8) | 44/105 (41.9) | 12/32 (37.5) | 14/39 (35.9) | .77
Single chief complaint | 103/207 (49.8) | 53/115 (46.1) | 23/40 (57.5) | 27/52 (51.9) | .43
Multiple chief complaints | 69/174 (39.7) | 44/104 (42.3) | 9/32 (28.1) | 16/38 (42.1) | .34
Common disease | 142/257 (55.3) | 79/146 (54.1) | 27/45 (60) | 36/66 (54.5) | .78
Uncommon disease | 30/124 (24.2) | 18/73 (24.7) | 5/27 (18.5) | 7/24 (29.2) | .67
Typical presentation | 160/298 (53.7) | 88/164 (53.7) | 29/60 (48.3) | 43/74 (58.1) | .53
Atypical presentation | 12/83 (14.5) | 9/55 (16.4) | 3/12 (25) | 0/16 (0) | .10

[a] The first year was from May 1, 2019, to April 30, 2020.

[b] The second year was from May 1, 2020, to April 30, 2021.

[c] The third year was from May 1, 2021, to April 30, 2022.

Logistic Regression Model

In the multivariable logistic regression model, the year of the index visit was not significantly associated with whether the final diagnosis was included in the top 10 differential diagnosis lists created by the AI-based symptom checker (Table 3). By contrast, the commonality of disease and typicality of presentation were significantly associated with the accuracy of the differential diagnosis list created by the AI-based symptom checker.

Table 3.

A logistic regression model for whether the correct diagnosis was included in the differential diagnosis list generated by artificial intelligence among patients who visited the internal medicine outpatient clinic at Nagano Chuo Hospital without an appointment and were then admitted within 30 days.

Variables | OR [a] (95% CI) | P value
Year of visit: second year [b] (reference: first year [c]) | 0.84 (0.45-1.54) | .57
Year of visit: third year [d] (reference: first year) | 0.88 (0.51-1.54) | .67
Age (for 1-year increase) | 0.99 (0.98-1.00) | .16
Men (reference: women) | 1.42 (0.92-2.30) | .11
Multiple complaints (reference: single complaint) | 0.70 (0.44-1.11) | .13
Common disease (reference: uncommon disease) | 4.13 (2.50-6.98) | <.001
Typical presentation (reference: atypical presentation) | 6.92 (3.62-14.2) | <.001

[a] OR: odds ratio.

[b] The second year was from May 1, 2020, to April 30, 2021.

[c] The first year was from May 1, 2019, to April 30, 2020.

[d] The third year was from May 1, 2021, to April 30, 2022.
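For orientation, crude (unadjusted) odds ratios can be computed directly from the subgroup counts in Table 2. The Python sketch below uses a Wald 95% CI on the log scale, which is an assumption about the interval method. Because it ignores the other covariates, its results are not expected to match the adjusted estimates in Table 3 (4.13 and 6.92), which come from the authors' multivariable model. The function name is hypothetical.

```python
import math


def crude_odds_ratio(hits_group, n_group, hits_ref, n_ref, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    The first two arguments describe the group of interest and the
    last two describe the reference group.
    """
    a, b = hits_group, n_group - hits_group  # group: correct / incorrect
    c, d = hits_ref, n_ref - hits_ref        # reference: correct / incorrect
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Counts from Table 2: common vs uncommon disease, typical vs atypical.
common = crude_odds_ratio(142, 257, 30, 124)
typical = crude_odds_ratio(160, 298, 12, 83)

for label, (estimate, lower, upper) in [("common", common), ("typical", typical)]:
    print(f"{label}: OR {estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

The crude estimates point in the same direction as the adjusted ones, which is the only comparison this sketch is meant to support.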

Discussion

Principal Results

In this study, at a community hospital in Japan, a 3-year longitudinal assessment of the performance of an AI-based symptom checker showed no change in the diagnostic accuracy of its differential diagnosis lists in outpatients admitted within 30 days of their index visit. In the exploratory subgroup and multivariable logistic regression analyses, the commonality of disease and typicality of presentation were significantly associated with the accuracy of the differential diagnosis list created by the AI-based symptom checker.

Implications of the Study

This study suggests that current AI-based symptom checkers used in the real world may not improve their diagnostic performance over time. In this study, no improvement in the diagnostic accuracy of AI was observed, even in the common disease and typical presentation subgroups. Machine learning, using data with reliable teaching labels, is required to improve the accuracy of AI-based symptom checkers. However, patients may not always be able to accurately provide their final diagnosis, which may prevent effective machine learning. In addition, even if symptom checkers are used in health care facilities, reliable feedback may not be guaranteed because of diagnostic uncertainty, low diagnostic quality, and care fragmentation. The results of this study indicate that the developers and users of AI-based symptom checkers should be more responsible for improving the diagnostic quality of AI-based symptom checkers by providing reliable feedback on diagnostic labels.

This study’s results can also be viewed from another perspective. We concluded that the performance of the AI-based symptom checker did not improve over time because its diagnostic accuracy did not change. However, the developer may also have set indicators other than diagnostic accuracy, such as the impact of service use, clinical and cost-effectiveness, and patient satisfaction, to improve the algorithm of the AI-based symptom checker [17]. Balancing the different outcomes may limit the increase in diagnostic accuracy. In addition, since we do not know the ideal and theoretical upper limit of diagnostic accuracy in specific clinical contexts with some restrictions, it is also possible that the diagnostic accuracy of some AI-based symptom checkers has already reached the theoretical upper limit of their performance. For example, minimizing questions to save time may reduce the diagnostic performance. Indeed, our previous study showed that physicians’ diagnostic accuracy was only 56% when reading the information taken by the same AI-based symptom checker used in this study [34]. Therefore, the judgment that no improvement in diagnostic accuracy was observed in this study may be unfair. We need a standard method with clear indicators for an unbiased and fair evaluation of the improvement of the performance of AI-based symptom checkers.

Comparison With Prior Work

Longitudinal comparisons of the diagnostic performance of symptom checkers in the real world are scarce; however, several studies have assessed changes in the diagnostic accuracy of symptom checkers using clinical vignettes. According to Schmieding et al [21], the rate of correct diagnoses listed among the top 10 differential diagnoses of symptom checkers was at least 15% higher in 2020 than in 2015 using the same clinical vignettes. In contrast, other studies suggested that the diagnostic accuracy of symptom checkers did not change from 2015 to 2020 when using some of the new vignettes [21,35]. Considering these and our study results, the diagnostic accuracy of symptom checkers may be improved for prototypical or standardized patients; however, because there are many variants of demographic patterns and clinical presentations in the real world, slight improvements may not result in the overall improvement of diagnostic accuracy.

In this study, the diagnostic accuracy of the AI-based symptom checker for uncommon diseases was approximately 30 percentage points lower than that for common diseases (24.2% vs 55.3%); similarly, the diagnostic accuracy for atypical presentations was approximately 40 percentage points lower than that for typical presentations (14.5% vs 53.7%). The diagnostic accuracy of symptom checkers may depend on the urgency of the clinical condition as well as common and uncommon conditions [17]. Indeed, a previous study also showed that the correct diagnosis was less frequently listed in the top 10 differential diagnoses of symptom checkers for uncommon diseases than for common diseases, with a difference of 60 percentage points (8% vs 68%) [35]. Our study provides evidence that atypical presentations, another aspect of uncommon conditions, may also negatively affect the diagnostic accuracy of symptom checkers. Uncommon diseases and atypical presentations are associated with a high risk of diagnostic error [26,36]. From this perspective, our data indicate that current and future symptom checkers should be further trained with data on uncommon conditions, such as uncommon diseases and atypical presentations, to improve diagnostic quality in clinical practice. According to a previous study, symptom checkers can collect only 30% of all pertinent findings and are not good at collecting pertinent negative findings [37]. Considering that collecting pertinent findings is vital for diagnosing uncommon conditions, training with data on uncommon conditions and a system of high-quality feedback and reinforcement by expert diagnosticians are warranted for future symptom checkers.

Recent emerging generative AI-related tools such as ChatGPT (OpenAI Inc), a chatbot that uses a large language model, have been studied for their potential as new differential diagnosis generators. Several studies have demonstrated the high diagnostic accuracy of ChatGPT for simple to complex clinical cases using clinical vignettes and published case reports [38-40]. However, these studies input clinical information, including test results. Regarding symptom checking, while one study showed ChatGPT exhibited high accuracy in symptom checking for a broad range of diseases using the Mayo Clinic symptom checker as a benchmark [41], another study showed no difference in diagnostic accuracy between current symptom checkers and ChatGPT for patients with urgent or emergent clinical problems [15]. In addition, regarding ChatGPT, there is a concern that the near-infinite range of possible inputs and outputs prevents standardized regulations [15,42]. Furthermore, generative AI did not seem to overcome the problem of current symptom checkers that worsened diagnostic accuracy in cases of uncommon conditions [43]. Therefore, generative AI-related tools cannot be effective symptom checkers right now. However, compared to current symptom checkers, the diagnostic performance of generative AI-related tools can rapidly improve over time. Indeed, some studies showed that ChatGPT-4 outperformed ChatGPT-3.5 in diagnostic performance [38,43,44]. Therefore, generative AI-related tools may be a choice for diagnosis generators before patient-clinician encounters in the near future.

Limitations

This study has some limitations. First, the details of how the symptom checker model used in this study was modified, including the type of machine learning methods or manual updates used and the frequency of modification, remained unclear. Second, 3 years may not be appropriate for assessing contemporary machine learning model improvement since there is no standard time frame to assess the improvement of the machine learning model. However, considering that AI-related tools such as ChatGPT show rapid performance improvement, 3 years can be considered enough. Third, the COVID-19 pandemic may have affected our results because of the low number of participants in the second and third years. Fourth, because this was a single-center retrospective study and we only included patients admitted within 30 days of the index outpatient visit, the results should be interpreted with caution regarding generalizability. Fifth, because there was no validated tool to assess the typicality of the presentation, which was assessed based on the information produced by the AI, the classification of typicality in this study may have been biased. This was also true for disease commonality, which could change if other criteria for uncommon diseases were applied.

Conclusions

A 3-year single-center, retrospective, observational study of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, currently implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with lower diagnostic accuracy of the differential diagnosis lists generated by the AI-based symptom checker. In the future, symptom checkers should be trained to recognize uncommon conditions.

Abbreviations

AI: artificial intelligence

ICD-11: International Classification of Diseases, 11th Revision

Multimedia Appendix 1

Examples of cases with uncommon diseases or atypical presentations.

Data Availability

The data sets generated and analyzed during this study are available from the corresponding author on reasonable request.

Footnotes

Authors' Contributions: YH conceptualized this study, wrote the manuscript, and conducted all statistical analyses. YH, SS, and T Sakamoto collected data. T Shimizu supervised the manuscript creation and revision. All authors reviewed the final manuscript.

Conflicts of Interest: None declared.




Articles from JMIR Formative Research are provided here courtesy of JMIR Publications Inc.
