This survey study investigates the association of use of an artificial intelligence (AI) application with consumer understanding of dermatology cases.
Key Points
Question
Can artificial intelligence (AI)–powered tools help consumers with understanding dermatology cases?
Findings
In this survey study of 2345 individuals who had sought information on skin concerns, participants examining deidentified retrospective dermatology cases comprising images and metadata were more willing to name a condition with an AI application’s assistance and were more accurate at naming the condition. Participants’ understanding of next steps was not more accurate.
Meaning
These findings suggest that AI applications may be able to help consumers understand the condition depicted in a case, but further progress remains in improving user understanding of possible next steps.
Abstract
Importance
The potential association of artificial intelligence (AI)–powered informational tools with consumer health needs outcomes is not well understood. Such tools could potentially be associated with the volume and distribution of patients seeking care.
Objective
To investigate the association of an AI-powered dermatology application with changes in consumer understanding of retrospective skin condition cases.
Design, Setting, and Participants
This survey study used a between-participants randomized survey with 3 arms: control (using existing tools, such as web search), AI (with access to predictions from a prototype AI application), and a Wizard of Oz method (with the same interface as in the AI group but using dermatologist panel ground truth differentials instead of AI predictions) run on a commercial panel platform. Participants interpreted retrospective deidentified skin condition cases containing images and structured medical history. US-based participants who self-reported having sought information for a skin concern within the past year were included. The survey was conducted March 17 to May 16, 2023, and data were analyzed from June 2023 through November 2024.
Interventions
Participants in AI and Wizard of Oz arms were presented 3 to 7 conditions per case patient as a horizontal carousel of condition cards. Cards included textbook images of each condition and brief descriptions and could be expanded to show more textbook images and details about each condition’s background and treatment. Conditions in the Wizard of Oz arm had identical visual presentation, but predicted conditions were the dermatologist-provided ground truth.
Main Outcomes and Measures
Participants in each arm self-reported whether they could name the condition depicted, and if so, they could list 1 or more conditions they thought were shown. Participants also reported the next step they thought appropriate for each case patient, as well as their self-reported confidence in their assessment and satisfaction with the information search experience. Condition name and next step accuracy were assessed against a reference diagnosis from dermatologists, and next steps were derived from the conditions.
Results
Among 2345 participants (509 aged 30-39 years [21.71%]; 1650 female [70.36%]), 11 725 participant case patient reads were obtained across 3 study arms. Compared with the control group (41.21%; 95% CI, 39.66%-42.76%), participants were more willing to name a condition in AI (62.26%; 95% CI, 60.75%-63.76%) and Wizard of Oz (61.76%; 95% CI, 60.21%-63.28%) arms (both P < .001; permutation test with false discovery rate correction), with increased accuracy for AI (22.79%; 95% CI, 21.48%-24.09%; P < .001) and Wizard of Oz (36.20%; 95% CI, 34.70%-37.73%; P = .002) vs control (7.86%; 95% CI, 7.03%-8.71%). Next-step accuracy increased for the Wizard of Oz (62.95%; 95% CI, 61.42%-64.44%; P < .001) compared with the control group (60.10%; 95% CI, 58.55%-61.65%).
Conclusions and Relevance
In this study, AI applications were associated with increased accuracy and confidence of consumer understanding of skin concerns, with the degree of improvement in accuracy varying by the accuracy of presented conditions; benefits further improved when predictions were as accurate as possible. Imperfect guessing accuracy when predictions presented matched dermatologist differentials highlighted the need for further design improvements (such as the condition information presented) to help consumers better understand skin conditions.
Introduction
Skin conditions represent a significant global health burden, affecting an estimated 2 billion individuals. There is a substantial disparity in access to dermatological care, with only 28% of reported skin conditions evaluated by dermatologists, and this is compounded by factors such as long wait times, particularly for underserved populations like Medicaid recipients. Artificial intelligence (AI)–based tools present an opportunity to bridge the access gap by empowering clinicians, consumers, or both. AI algorithms have demonstrated performance comparable to that of panels of trained specialists, and AI assistance has been associated with increased diagnostic agreement between dermatologists and nonspecialist clinicians.
Direct-to-consumer AI applications may enable consumers to find information about their dermatologic concerns and inform next steps in management, such as whether to seek clinical care. While even a general internet search may be associated with improved diagnostic accuracy, AI offers the prospect of more precise and accessible information retrieval. Access to model predictions has been associated with improved accuracy in assessments of skin lesion malignancy and treatment options for readers with a range of prior medical training. Direct-to-consumer AI applications may be associated with increased confidence among users in identifying their own skin concerns and may better align with specialist assessment.
Despite this promise, several factors warrant further investigation. Potential risks include inaccurate predictions, increased anxiety, and unnecessary health care use. AI-based interventions may serve subpopulations differently due to variations in training data, potentially impacting cases with rare conditions or image artifacts. Variability in consumer-provided image quality and observed underperformance on darker skin tones also raise concerns about equitable access and accuracy, although fine-tuning strategies show promise in mitigating these biases. We investigated whether everyday users may benefit from AI assistance by presenting consumers with deidentified case patient vignettes containing images of dermatological conditions and structured medical history information and asking them to report their understanding of the condition and best next steps.
Methods
To evaluate whether an AI-powered application could assist consumers in understanding their skin concerns and seeking the appropriate care, we conducted a survey study in which participants were asked to imagine a retrospective case as their own skin issue and answer a series of questions. For each case patient, participants saw 1 to 3 images, along with structured clinical metadata that included age, sex, the duration of symptoms, affected body parts, reason why the patient was seeking care, and any nondermatological symptoms. Given that this study was retrospective and used deidentified datasets, the need for further review and consent was waived by the Advarra Institutional Review Board. This study is reported following the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline.
AI-Powered Application
This study tested a research-only prototype AI-powered application (Figure 1). The application presented a list of matching conditions for each case patient. Each matched condition included a picture of its canonical presentation and a text description. Upon tapping on a condition, the user could see supplemental information about the condition, including its symptoms, treatment options, risk of contagion, and typical severity and duration.
Figure 1. Example Screenshots of Artificial Intelligence–Powered Application Interface.
A, Case details section containing images from the case and other clinically relevant metadata. For some cases, some metadata fields were not available and were therefore left blank (eg, "Where the issue is primarily located" in the illustrated example). B, Scrollable carousel of potential matching conditions. Each matching condition was a selectable card. C, Example information page that was presented when participants selected an example condition. The information page showed example textbook images of the condition and text information about symptoms, contagiousness, severity, treatment, duration, how common it is, additional information, and a “What you should know” section with details such as whether people with the condition often saw a physician. The interface was available in desktop and mobile formats; the desktop presentation is shown here.
In the AI arm, conditions were from an AI model developed for research purposes based on prior published work. The model was updated with different architecture and training data, with advancements described and evaluated using an external validation set in Rikhye et al. This model had accuracy ranging from 80.9% for the top 1 condition on the test dataset used in this study to 99.2% for the top 7 conditions (eTable 1 in Supplement 1).
Depending on model confidence scores, the top 3 to 7 matching conditions were shown on the interface. The number of conditions shown was chosen to cumulatively cover the top 90% of the probability mass. The ordering of conditions was determined by these condition scores (from highest to lowest), although the scores were not directly presented to the user on the interface and participants were not informed of whether the ordering conveyed any specific meaning. Due to a technical error, the conditions in the Wizard of Oz arm (described subsequently) were all presented but in lowest to highest confidence score order rather than highest to lowest. Because dermatologist differentials tended to have fewer conditions than the AI application’s predictions list (eTable 2 in Supplement 1), all conditions were visible without scrolling for 92 of 126 case patients in the Wizard of Oz arm (73.0%).
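As an illustration of the display rule described above, the following minimal sketch selects conditions in descending score order until 90% of the total score is covered, while keeping the count between 3 and 7. The function name `select_conditions`, the example scores, and the exact way the 90% rule interacts with the 3-to-7 bounds are assumptions made for illustration; the study does not publish its implementation.

```python
def select_conditions(scores, mass=0.90, min_k=3, max_k=7):
    """Return condition names, highest score first.

    Hypothetical helper: keeps adding conditions until the cumulative score
    reaches `mass` of the total, but never shows fewer than `min_k` or more
    than `max_k` conditions (assumed reading of the "top 3 to 7" rule).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    shown, cumulative = [], 0.0
    for name, score in ranked:
        if len(shown) >= max_k:
            break  # upper bound on the carousel length
        if len(shown) >= min_k and cumulative >= mass * total:
            break  # enough probability mass already covered
        shown.append(name)
        cumulative += score
    return shown


# Illustrative scores only (not study data).
example = {"eczema": 0.60, "psoriasis": 0.25, "tinea": 0.08,
           "acne": 0.04, "urticaria": 0.03}
print(select_conditions(example))
```

With a flat score distribution the sketch stops at 7 conditions even if 90% of the mass is not yet covered, and with a single dominant condition it still shows 3.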
Dataset
The dataset used in the study consisted of 127 deidentified case patients from a teledermatology service. It was previously used in the validation of AI systems to diagnose skin conditions. These case patients had no overlap with the dataset used in the development of the AI-powered application.
Study Design
We conducted this multireader, multicase, survey-based study online between March 17 and May 16, 2023. Participants were randomly assigned to 1 of 3 study arms. They were recruited via a commercial panel provider (Qualtrics) using stratified sampling. Sample sizing was done qualitatively, aiming for as close to 1000 participants per study arm as possible given budget and survey recruitment constraints at the time of the study. All participants were US based, aged 18 years or older, and screened for having experienced a skin issue or having cared for someone who did in the 6 months prior to their recruitment. To minimize selection bias and ensure a sample representative of the target population, we applied stratified sampling quotas based on age and sex distributions. During study onboarding, participants consented to viewing images of skin conditions and self-reported metadata, including age, sex, race and ethnicity, and comfort in filling out medical forms (as an indicator of health literacy). Race and ethnicity were assessed to understand any association of these factors with study end points. Available categories were American Indian or Alaska Native; Asian; Black or African American; Hispanic, Latino, or Spanish origin; Middle Eastern or North African; Native Hawaiian or Other Pacific Islander; White; another race or ethnicity not listed; and prefer not to answer. Middle Eastern or North African, another race or ethnicity not listed, and prefer not to answer were combined as other due to low population numbers.
Participants were randomized into 1 of 3 arms: control, AI (the predictions came from the deep learning model developed for research, described previously), and Wizard of Oz. Participants in the control group viewed only case patient details. They were instructed to use external resources that they had previously used for their own concerns to research this case. They did not have access to the application’s potential matches section. In the AI arm, matching conditions were produced by the AI model. In the Wizard of Oz arm, participants viewed the same interface as those in the AI arm, but matching conditions used differential diagnoses from dermatologists instead of AI predictions. This mimicked a perfect AI because these differentials also constituted the ground truth for evaluation.
After participants reviewed and researched a case patient, they were asked to imagine that the skin lesion was their own and answer a series of survey questions (described subsequently). The median (IQR) time on task across 5 case patients for participants was 17.0 (11.9-26.5) minutes (eFigure 2 in Supplement 1). Although this varied across arms (eFigure 2 in Supplement 1), differences were not significant.
Primary Measures: Condition Naming and Next Step Accuracy
The primary end point for the study was condition naming accuracy, measured as the rate of case patients in which participants named a condition that was also in the top 3 by dermatologist differential. To be scored as accurate, participants must have answered yes to the question “Based on the research you just conducted, do you believe that you can name the condition pictured above?” and provided a condition name that was mapped to a condition in the top 3 for differential diagnosis from dermatologists. We required both steps to avoid a potential confound of different guessing rates in each arm.
Reference standard differential diagnoses for each case patient were derived from review by a panel of 3 dermatologists, similar to prior work, who were blinded to results of this study. Dermatologists also provided recommended next steps for each case patient, used as ground truth for next step accuracy.
When comfortable naming the condition, participants provided free-text names. These free-text responses were compared against the top 3 ground truth differential diagnosis using the Med-PaLM 2 model version 2.0 (Google), used June 2023; further details are in the eMethods and eFigure 3 in Supplement 1.
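The two-step scoring rule for the primary end point can be sketched as follows. This is a simplified illustration, not the study's code: the `Read` record and its field names are hypothetical, and `mapped_condition` stands in for the output of the automated free-text mapping step, which is not reproduced here. Following the text, a read counts as accurate only if the participant answered yes to the naming question and the mapped name is in the dermatologists' top 3; the denominator is assumed to be all reads in the arm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Read:
    """One participant's review of one case patient (hypothetical record)."""
    can_name: bool                   # answered "yes" to the naming question
    mapped_condition: Optional[str]  # free-text answer after mapping (assumed upstream step)
    reference_top3: Sequence[str]    # top 3 of the dermatologist differential


def is_accurate(read: Read) -> bool:
    # Both steps are required: willingness to name AND a top-3 match.
    return bool(read.can_name and read.mapped_condition in read.reference_top3)


def naming_rate(reads: Sequence[Read]) -> float:
    return sum(r.can_name for r in reads) / len(reads)


def naming_accuracy(reads: Sequence[Read]) -> float:
    # Denominator is every read, so arms with different guessing rates stay comparable.
    return sum(is_accurate(r) for r in reads) / len(reads)


top3 = ("eczema", "psoriasis", "tinea")
reads = [
    Read(True, "eczema", top3),   # named and correct
    Read(True, "acne", top3),     # named but not in the top 3
    Read(False, None, top3),      # declined to name
    Read(False, "eczema", top3),  # declined, so not scored as accurate
]
print(naming_rate(reads), naming_accuracy(reads))
```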
Secondary Measures
We measured user-reported confidence in provided condition guesses and next steps, perceived satisfaction with search results and search time, perceived relevance of search results, satisfaction with the search tool (asked only in AI and Wizard of Oz arms, where a tool was available), and perceived degree to which the assessed case patient matched their own skin or the skin of someone they would care for. Each question was provided with a fully labeled, 5-point response scale (eg, “How confident are you in the condition you named?” would have responses “Not at all confident,” “Slightly confident,” “Moderately confident,” “Very confident,” and “Extremely confident”). For analysis, responses were binarized to group the top 2 bins for each question (eg, very or extremely confident vs lower responses).
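The binarization of the 5-point scales can be illustrated with a short sketch. The scale labels are taken from the confidence example in the text; the function names `top2_box` and `top2_rate` are hypothetical.

```python
CONFIDENCE_SCALE = [
    "Not at all confident",
    "Slightly confident",
    "Moderately confident",
    "Very confident",
    "Extremely confident",
]


def top2_box(response, scale):
    """True if the response falls in the 2 highest categories of the scale."""
    return scale.index(response) >= len(scale) - 2


def top2_rate(responses, scale):
    """Share of responses in the top 2 categories."""
    return sum(top2_box(r, scale) for r in responses) / len(responses)


# Illustrative responses only (not study data).
answers = ["Very confident", "Moderately confident",
           "Extremely confident", "Slightly confident"]
print(top2_rate(answers, CONFIDENCE_SCALE))
```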
For next step assessment, participants answered the question “Based on the research you just conducted, what is the next step that you think would be the most appropriate?” Possible response options were: (1) “I would do nothing unless it continued or got worse, as I expect it will get better on its own”; (2) “I would treat it myself at home using a home remedy, creams/ointments/gels, or over-the-counter medications”; (3) “I would schedule a non-urgent visit with my healthcare provider (primary care doctor, dermatologist, nurse practitioner)”; and (4) “I would schedule an urgent or emergency visit with my provider (for today) or seek urgent care.”
Evaluation and Statistical Analysis
Metrics were calculated as binary accuracy or error rates or rates of selecting the top 2 bins on 5-point scales. We computed 95% CIs by bootstrap with 5000 samples. Statistical comparisons between arms used a permutation test blocked by case patient to control for case patient–level variance (as in Ruamviboonsuk et al). To correct for multiple comparisons, we adjusted P values using the Benjamini-Hochberg procedure to control the false discovery rate at 5%.
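The three procedures named above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the study's analysis code: the bootstrap uses a percentile interval and resamples individual reads (the resampling unit is not specified in the text), the permutation test shuffles arm labels within each case patient and uses the absolute difference in rates as the test statistic, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def bootstrap_ci(outcomes, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of binary outcomes (assumed variant)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo_idx = int(alpha / 2 * n_boot)
    return means[lo_idx], means[n_boot - 1 - lo_idx]


def mean_diff(labels, outcomes):
    a = [o for lab, o in zip(labels, outcomes) if lab == "a"]
    b = [o for lab, o in zip(labels, outcomes) if lab == "b"]
    return sum(a) / len(a) - sum(b) / len(b)


def blocked_permutation_p(case_ids, labels, outcomes, n_perm=2000, seed=0):
    """Two-sided permutation P value; arm labels ("a"/"b") are shuffled within each case."""
    rng = random.Random(seed)
    observed = abs(mean_diff(labels, outcomes))
    by_case = defaultdict(list)
    for i, case in enumerate(case_ids):
        by_case[case].append(i)
    hits = 0
    for _ in range(n_perm):
        permuted = list(labels)
        for idx in by_case.values():
            shuffled = [labels[i] for i in idx]
            rng.shuffle(shuffled)  # labels only move within the same case patient
            for i, lab in zip(idx, shuffled):
                permuted[i] = lab
        if abs(mean_diff(permuted, outcomes)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids P = 0


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison would then be declared significant at a 5% false discovery rate when its adjusted P value is below .05.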
We measured the association of participant demographic factors with accuracy (eTables 6-7 in Supplement 1) using a mixed-effects logistic regression model fit with the lme4 package version 1.1.37 in R statistical software (R Project for Statistical Computing). To account for the nested nature of the data given multiple observations per participant and per clinical case patient, participant and case patient identification numbers were modeled as random effects. For regressions, significance was set at α = .05, and P values were 2-sided. Data were analyzed from June 2023 through November 2024.
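One way to write the regression described above is sketched below. The notation is an assumption made for illustration: $y_{ij}$ is the binary outcome (for example, accurate condition naming) for participant $i$ on case patient $j$, $\mathbf{x}_{ij}$ collects the study arm and demographic covariates, and the random effects are taken to be random intercepts, which the text does not state explicitly.

```latex
\operatorname{logit} \Pr(y_{ij} = 1)
  = \beta_0 + \boldsymbol{\beta}^{\top} \mathbf{x}_{ij} + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2)
```

Under this form, a coefficient on a study arm indicator is a log odds relative to the reference arm, conditional on the participant and case patient effects.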
Results
Study Participant Rate
A total of 9772 participants entered the study, with 2345 individuals (509 aged 30-39 years [21.71%]; 1650 female [70.36%]; 138 Asian [5.88%], 410 Black [17.48%], 153 Hispanic [6.52%], 1390 White [59.28%], and 200 with multiple races or ethnicities [8.53%]) completing the study (24.0% incidence rate; eFigure 1 in Supplement 1) and contributing a total of 11 725 participant case patient reads (Table). Compared with the general US population, this sample had a somewhat higher representation of females (70.36% vs 50.50%) and of individuals aged 30 to 49 years (948 participants [40.43%] vs 26.51%) (Table; eTable 3 in Supplement 1).
Table. Demographic Distributions of Retrospective Case Patients and Participants.
| Demographic | Case patients (N = 127), No. (%) [a] | Participants (N = 2345), No. (%) [a] |
|---|---|---|
| Age, y | | |
| ≤29 | 30 (23.62) | 356 (15.18) |
| 30-39 | 23 (18.11) | 509 (21.71) |
| 40-49 | 18 (14.17) | 439 (18.72) |
| 50-59 | 30 (23.62) | 353 (15.05) |
| 60-69 | 19 (14.96) | 382 (16.29) |
| 70-79 | 6 (4.72) | 270 (11.51) |
| ≥80 | 1 (0.79) | 36 (1.54) |
| Sex | | |
| Female | 83 (65.35) | 1650 (70.36) |
| Male | 44 (34.65) | 681 (29.04) |
| Other or no response | 0 | 14 (0.60) |
| Self-reported race and ethnicity | | |
| American Indian or Alaska Native | 1 (0.79) | 0 |
| Asian | 19 (14.96) | 138 (5.88) |
| Black or African American | 6 (4.72) | 410 (17.48) |
| Hispanic, Latino, or Spanish origin | 54 (42.52) | 153 (6.52) |
| Native Hawaiian or Pacific Islander | 2 (1.57) | 0 |
| White | 34 (26.77) | 1390 (59.28) |
| Multiple | 0 | 200 (8.53) |
| Other [c] | 8 (6.30) | 0 |
| Prefer not to say or other answer | 3 (2.36) | 54 (2.30) |

[a] Each row represents 1 demographic category based on participant self-report, along with counts and percentages of case patients reviewed and participants reviewing case patients. Participants were randomly assigned to review 5 case patients each.
[b] Participants had the option of choosing more than 1 race or ethnicity category.
[c] Other includes Middle Eastern or North African, another race or ethnicity not listed, and prefer not to answer categories from the original question. Self-reported categories for another race and ethnicity included Hungarian, Mexican American, Native American, South Asian, Spanish American, West Indian, and multiracial.
Condition Naming and Next Step Accuracy
Participants in AI (62.26%; 95% CI, 60.75%-63.76%; P < .001) and Wizard of Oz (61.76%; 95% CI, 60.21%-63.28%; P < .001) arms were significantly more willing to guess condition names for provided case patients than the control group (41.21%; 95% CI, 39.66%-42.76%) (Figure 2; eTable 4 in Supplement 1). Participants were also considerably more accurate at naming conditions that agreed with dermatologist assessment. Compared with a reference of 7.86% (95% CI, 7.03%-8.71%) accuracy in the control arm, condition naming accuracy was nearly 3 times as large in the AI arm (22.79%; 95% CI, 21.48%-24.09%; P < .001; log odds = 1.58; 95% CI, 1.37-1.79) and more than 4 times as large in the Wizard of Oz arm (36.20%; 95% CI, 34.70%-37.73%; P = .002; log odds = 2.50; 95% CI, 2.29-2.71). By contrast, next step accuracy was increased for the Wizard of Oz arm (62.95%; 95% CI, 61.42%-64.44%; P < .001) but not for the AI arm compared with control (60.10%; 95% CI, 58.55%-61.65%) (eTable 4 in Supplement 1). Detailed analysis of example case patients helped and not helped by AI assistance is provided in eResults, eTable 5, and eFigures 4 to 8 in Supplement 1.
Figure 2. Study Metrics by Arm.
Condition naming rate, condition name accuracy, next step accuracy, and overcall and undercall rates are summarized as described previously; other metrics report the percentage rate of responses in the top 2 bins of each 5-point scale (eg, very or extremely satisfied for satisfaction metrics). Error bars are 95% CIs from bootstrap resampling. AI indicates artificial intelligence.
We also examined undercall and overcall rates for next steps (for which participants may suggest next steps more or less urgent than dermatologists). Next step errors were more likely to undercall than overcall in the control arm (27.34%; 95% CI, 25.94%-28.73% vs 7.65%; 95% CI, 6.80%-8.48%; P < .001). While overcall rates were not significantly different across arms, participants in the AI arm were significantly more likely to undercall (29.76%; 95% CI, 28.38%-31.19%) vs the control arm (P = .02) and Wizard of Oz arm (27.08%; 95% CI, 25.66%-28.50%; P = .03) (eTable 4 in Supplement 1).
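The undercall and overcall definitions can be illustrated by treating the 4 next step options as an ordinal urgency scale from 1 (do nothing) to 4 (urgent or emergency visit). The sketch below assumes a single reference next step per case patient and exact-match scoring; the short option labels and function names are hypothetical.

```python
from collections import Counter

# Ordinal urgency codes following the numbering of the survey options.
NEXT_STEPS = {
    1: "do nothing unless it continues or worsens",
    2: "treat at home",
    3: "non-urgent visit",
    4: "urgent or emergency visit",
}


def classify_next_step(chosen, reference):
    """Compare a participant's choice with the dermatologist reference."""
    if chosen == reference:
        return "accurate"
    # Less urgent than the reference is an undercall; more urgent is an overcall.
    return "undercall" if chosen < reference else "overcall"


def next_step_rates(pairs):
    """pairs: iterable of (chosen, reference) urgency codes."""
    pairs = list(pairs)
    counts = Counter(classify_next_step(c, r) for c, r in pairs)
    return {k: counts[k] / len(pairs) for k in ("accurate", "undercall", "overcall")}


# Illustrative reads only (not study data).
print(next_step_rates([(3, 3), (2, 3), (1, 3), (4, 3)]))
```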
Secondary Measures and Assistance
In addition to accuracy measures, secondary measures were also improved in both assisted arms (Figure 2; eTable 4 in Supplement 1). Confidence in the guessed condition name was higher in AI (48.93%; 95% CI, 46.96%-50.91%) and Wizard of Oz (52.97%; 95% CI, 51.00%-54.94%) arms compared with the control arm (42.63%; 95% CI, 40.19%-45.08%; P < .001 for both). However, confidence in the planned next step was significantly higher in the Wizard of Oz arm (55.4%; 95% CI, 53.93%-57.00%; P = .03) but not in the AI arm (52.87%; 95% CI, 51.37%-54.43%; P = .97) compared with control (52.92%; 95% CI, 51.34%-54.52%) (eTable 4 in Supplement 1). Satisfaction with overall search results (AI: 43.11%; 95% CI, 41.58%-44.67%; Wizard of Oz: 43.59%; 95% CI, 42.04%-45.14%; control: 28.68%; 95% CI, 27.26%-30.10%), relevance of results, and time spent searching were significantly higher in AI and Wizard of Oz arms (all P < .001) (Figure 2; eTable 4 in Supplement 1). We also asked about satisfaction with the AI interface in AI and Wizard of Oz arms; there was no significant difference in satisfaction with the tool between the 2 arms (Figure 2; eTable 4 in Supplement 1).
We also observed an unexpected increase in the rate at which participants considered the skin of the reviewed case patients as similar to their own skin or the skin of someone they would care for (Figure 2). For both of these measures, the increase was significantly higher in the AI arm compared with the Wizard of Oz arm (eTable 4 in Supplement 1).
Accuracy Associations Across Participant Subgroups
We examined whether different case patient or participant characteristics were associated with condition naming and next step accuracy. We applied logistic regressions with study arm and participant and case patient demographic variables as regressors (eTables 6-7 in Supplement 1) against outcomes of correct condition naming and next step assessment, respectively. Controlling for these factors did not substantially reduce the associations of study arm with these outcomes, suggesting that the association of AI assistance with accuracy was not substantially modulated by participant subgroups. For condition naming accuracy, we observed an association with the age of the patient in the case being evaluated but not with participant age (eTable 6 in Supplement 1). Most other factors did not show a pattern of associations.
Discussion
This survey study reports results of a large-scale reader study comprising layperson participants evaluating dermatological case patient vignettes. We found that providing assistance in the form of a set of matching conditions from an AI model or dermatologist differential presented as AI output was associated with increased willingness of participants to name matching conditions, likelihood to provide accurate matching conditions, and likelihood to feel confident in their condition assessments and satisfied with their search experience. We also found significant increases in accuracy and confidence of planned next steps compared with dermatologist assessments in Wizard of Oz but not AI arms, suggesting that the potential of these tools depends on their accuracy.
These results have implications for the potential association of direct-to-consumer AI-powered tools with health outcomes and health care use. Our results also have implications for equitable access to care and the importance of ensuring high accuracy in AI algorithms for disease assessment.
Implications for Direct-to-Consumer Health Applications
Our findings suggest that AI assistance may be associated with substantial improvement in laypeople’s ability to understand skin concerns, with a nearly 3-fold increase in the ability to accurately name a skin condition compared with existing search tools. This builds on previous findings that AI assistance was associated with improved diagnostic accuracy for clinicians and nonclinicians. Our results are consistent with those of Han et al but extend to a larger scale (2345 participants vs 23 participants). Even with these improvements, condition-naming accuracy rates were below 50%, suggesting that this is a difficult task for which more support is needed (including, when appropriate, direct consultation with clinicians).
Our observed accuracy increases were much larger in magnitude for condition naming (significant, 3- to 4-fold increases for AI and Wizard of Oz arms) than for next step assessment (significant only for the Wizard of Oz arm, with a 5% increase). This suggests that the change in management decisions associated with AI applications may be more modest, consistent with prior studies on general health search. Our findings suggest benefits beyond those in Levine et al, given that AI assistance in our study was compared with a baseline of internet search. There may be further benefit in better informing laypeople about appropriate next steps for given skin conditions so that improved condition understanding from direct-to-consumer AI tools translates into care decisions.
Results Across Racial and Ethnic Groups
Our analysis of different participant subgroups indicated fairly modest differences across demographic groups in the association of AI assistance with accuracy. We did not observe consistent patterns of differences in condition naming or next step accuracy across participant racial and ethnic groups, household income levels, or health literacy levels (as measured by comfort in filling out medical forms). We observed higher next step accuracy for female vs male participants. Outcomes associated with age mostly involved the age of the individual in the case patient (itself correlated with the conditions), rather than the participant age. This suggests that the ability to use AI assistance may not vary substantially across participant age subgroups but that instead the type of case may be associated with AI accuracy and thus assistive outcomes. These results do not provide evidence that AI tools provide disparate benefits of condition understanding across demographic subgroups. This is consistent with recent analyses suggesting that, given models that have been trained or fine-tuned to reduce disparities, performance of deep-learning systems may be comparable for groups with historically worse health outcomes. We note that while our analyses examined associations of case patient and participant sex, age, and race and ethnicity, they did not examine skin tone, which may be a key dimension for equity considerations in dermatology.
Limitations
This study has several limitations. It involves retrospective review of preexisting case patient vignettes. It is thus an indirect assessment of the potential association of AI tools with health decision-making. The behavior of laypeople regarding their own concerns may vary, in particular in terms of assessments of risk and determining next steps based on access to care and other factors. Our study also included the technical error of lowest to highest score order in the Wizard of Oz arm, resulting in our observed differences with the AI arm being potentially underestimated. Furthermore, it is unclear whether and how participants in our control group may have used online resources, including AI. (However, the study was conducted in early 2023, when use of AI and particularly multimodal AI may have been less common.) The information shown per condition in AI and Wizard of Oz arms was also based solely on the condition and not customized based on the case images, which may have reduced the ability of participants to better understand ideal next steps based on case patient–specific nuances.
In addition, participants and case patients shown here may differ from the overall population who may use AI technology like this. This may have been exacerbated by relatively low response rates, which introduced some potential for sampling bias. While we aimed to broadly target US-based participants, panels somewhat overrepresented some groups, such as females and people aged 30 to 39 years. The set of case patients that participants evaluated was drawn from teledermatology and may differ in condition distribution from what may be observed from widely available direct-to-consumer tools. Thus, results may not apply to populations not well represented in the study, including those outside the US and those less likely to respond to online surveys. Recent research suggests that patient-taken images may be of lower quality compared with clinic-taken images for teledermatology. This implies that images taken by consumers in such apps may have different quality considerations that may impact accuracy.
Conclusions
This survey study’s findings suggest that direct-to-consumer AI-powered dermatology applications may benefit layperson users by enabling them to more accurately understand their skin concerns and make appropriate decisions about care. There may be additional gains in further improving the accuracy of algorithms, helping users beyond condition recognition, and specifically helping with care-seeking decisions.
eMethods.
eResults.
eFigure 1. Diagram of participant recruitment and completion flow
eFigure 2. Time on task distributions for each arm in the study
eFigure 3. Heatmaps showing joint distribution of automated response coding
eFigure 4. Overlap, precision and recall of AI predictions against reference standard dermatologist differentials
eFigure 5. Condition naming accuracy broken down by overlap between AI app predictions and dermatologist ground truth
eFigure 6. Condition naming accuracy broken down by skin condition category
eFigure 7. Next step accuracy broken down by self-reported confidence in the next step
eFigure 8. Examples of case patients with varying AI assistance associations
eTable 1. Accuracy of app AI predictions against dermatologist-panel’s top-3 differential
eTable 2. Distribution of number of matching conditions for each reviewed case patient for the AI and Wizard of Oz arms
eTable 3. Additional metadata distributions for participants in the study
eTable 4. Detailed metrics for each experiment arm
eTable 5. Top-matching conditions from the dermatologist differentials for each case patient
eTable 6. Logistic regression results for the outcome of accurately naming conditions
eTable 7. Logistic regression results for the outcome of next step accuracy
Data Sharing Statement
References
- 1. GBD 2017 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018;392(10159):1789-1858. doi: 10.1016/S0140-6736(18)32279-7
- 2. Feldman SR, Fleischer AB Jr, Williford PM, White R, Byington R. Increasing utilization of dermatologists by managed care: an analysis of the National Ambulatory Medical Care Survey, 1990-1994. J Am Acad Dermatol. 1997;37(5 Pt 1):784-788. doi: 10.1016/S0190-9622(97)70118-X
- 3. Creadore A, Desai S, Li SJ, et al. Insurance acceptance, appointment wait time, and dermatologist access across practice types in the US. JAMA Dermatol. 2021;157(2):181-188. doi: 10.1001/jamadermatol.2020.5173
- 4. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi: 10.1038/nature21056
- 5. Han SS, Park GH, Lim W, et al. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: automatic construction of onychomycosis datasets by region-based convolutional deep neural network. PLoS One. 2018;13(1):e0191493. doi: 10.1371/journal.pone.0191493
- 6. Jain A, Way D, Gupta V, et al. Development and assessment of an artificial intelligence-based tool for skin condition diagnosis by primary care physicians and nurse practitioners in teledermatology practices. JAMA Netw Open. 2021;4(4):e217249. doi: 10.1001/jamanetworkopen.2021.7249
- 7. Levine DM, Mehrotra A. Assessment of diagnosis and triage in validated case vignettes among nonphysicians before and after internet search. JAMA Netw Open. 2021;4(3):e213287. doi: 10.1001/jamanetworkopen.2021.3287
- 8. Krogue JD, Sayres R, Hartford J, et al. Searching for dermatology information online using images vs text: a randomized study. medRxiv. Preprint posted online October 27, 2024. doi: 10.1101/2024.10.25.24316155
- 9. Nelson CA, Pérez-Chada LM, Creadore A, et al. Patient perspectives on the use of artificial intelligence for skin cancer screening: a qualitative study. JAMA Dermatol. 2020;156(5):501-512. doi: 10.1001/jamadermatol.2019.5014
- 10. Sayres R, Devon-Sand A, Schaekermann M, et al. Navigating skin concerns with AI: a human-centered investigation of a dermatology app in a diverse community. In: Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery; 2025:1-16.
- 11. Devon-Sand A, Sayres R, Liu Y, et al. A multiparty collaboration to engage diverse populations in community-centered artificial intelligence research. Mayo Clin Proc Digit Health. 2024;2(3):463-469. doi: 10.1016/j.mcpdig.2024.07.001
- 12. Han SS, Park I, Eun Chang S, et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J Invest Dermatol. 2020;140(9):1753-1761. doi: 10.1016/j.jid.2020.01.019
- 13. Babic B, Gerke S, Evgeniou T, Cohen IG. Direct-to-consumer medical machine learning and artificial intelligence applications. Nat Mach Intell. 2021;3:283-287. doi: 10.1038/s42256-021-00331-0
- 14. Smak Gregoor AM, Sangers TE, Bakker LJ, et al. An artificial intelligence based app for skin cancer detection evaluated in a population based setting. NPJ Digit Med. 2023;6(1):90. doi: 10.1038/s41746-023-00831-w
- 15. Li Z, Koban KC, Schenck TL, Giunta RE, Li Q, Sun Y. Artificial intelligence in dermatology image analysis: current developments and future trends. J Clin Med. 2022;11(22):6826. doi: 10.3390/jcm11226826
- 16. Rikhye RV, Hong GE, Singh P, et al. Differences between patient and clinician-taken images: implications for virtual care of skin conditions. Mayo Clin Proc Digit Health. 2024;2(1):107-118. doi: 10.1016/j.mcpdig.2024.01.005
- 17. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022;8(32):eabq6147. doi: 10.1126/sciadv.abq6147
- 18. Schaekermann M, Spitz T, Pyles M, et al. Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study. EClinicalMedicine. 2024;70:102479. doi: 10.1016/j.eclinm.2024.102479
- 19. Jain A, Way D, Gupta V, et al. Race- and Ethnicity-Stratified Analysis of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners. Iproceedings. 2022;8(1):e36885. doi: 10.2196/36885
- 20. Liu Y, Jain A, Eng C, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900-908. doi: 10.1038/s41591-020-0842-3
- 21. Rikhye RV, Loh A, Hong GE, et al. Closing the AI generalisation gap by adjusting for dermatology condition distribution differences across clinical settings. EBioMedicine. 2025;116:105766. doi: 10.1016/j.ebiom.2025.105766
- 22. Wallace LS, Rogers ES, Roskos SE, Holiday DB, Weiss BD. Brief report: screening items to identify patients with limited health literacy skills. J Gen Intern Med. 2006;21(8):874-877. doi: 10.1111/j.1525-1497.2006.00532.x
- 23. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31(3):943-950. doi: 10.1038/s41591-024-03423-7
- 24. Ruamviboonsuk P, Tiwari R, Sayres R, et al. Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study. Lancet Digit Health. 2022;4(4):e235-e244. doi: 10.1016/S2589-7500(22)00017-6
- 25. Chen S. Multiple Testing and False Discovery Rate Control: Theory Methods and Algorithms. UC San Diego; 2019.
- 26. US Census Bureau. S0101: age and sex. 2024. Accessed December 17, 2025. https://data.census.gov/table/ACSST1Y2024.S0101
- 27. R Core Team. The R project for statistical computing. 2023. Accessed March 13, 2026. https://www.R-project.org/