J Am Med Inform Assoc. 2018 Jul 11;25(9):1248–1258. doi: 10.1093/jamia/ocy072

Table 3.

Study characteristics and results from the evaluation of conversational agents supporting clinicians and both patients and clinicians

Column key: Author, year^a | Health domain | CA purpose | Study type and methods | Evaluation measures and main findings (subdivided into: Technical performance; User experience; Health-related measures)
Technology supporting clinicians^b
Philip et al., 2017^22
  Health domain: Mental health (depression)
  CA purpose: Clinical interview (major depressive disorder diagnosis)
  Study type and methods: Crossover RCT; 179 patients underwent 2 clinical interviews in random order (ECA and psychiatrist)
  Technical performance: NR
  User experience: High acceptability of the ECA: score 25.4 (range 0–30) on the Acceptability e-Scale (validated)
  Health-related measures: Sens. = 49%, spec. = 93%, PPV = 63%, NPV = 88% (severe depressive symptoms: sens. = 73%, spec. = 95%); AUC: 0.71 (95% CI 0.59–0.81)
Lucas et al., 2017^35
  Health domain: Mental health (PTSD)
  CA purpose: Clinical interview (PTSD diagnosis)
  Study type and methods: Quasi-experimental; PTSD-related questions. Study 1: n=29, single group, post-deployment assessment + anonymized survey + ECA. Study 2: n=132, single group, ECA + anonymized survey
  Technical performance: NR
  User experience: NR
  Health-related measures: Study 1: participants reported more PTSD symptoms when asked by the ECA than by the other 2 modalities (p=.02). Study 2: no significant differences
Philip et al., 2014^23
  Health domain: Obstructive sleep apnea (daytime sleepiness)
  CA purpose: Clinical interview (excessive daytime sleepiness diagnosis)
  Study type and methods: Quasi-experimental; 32 patients + 30 healthy volunteers, single group; 2 similar clinical interviews based on the Epworth Sleepiness Scale (ESS), first with the ECA, then with a physician
  Technical performance: NR
  User experience: Most subjects had a positive perception of the ECA and considered the ECA interview a good experience (non-validated questionnaire, 7 questions)
  Health-related measures: Sens. > 0.89, spec. > 0.81 (sleepiest patients: sens. and spec. > 98%); ESS scores from ECA and physician interviews were correlated (r=0.95; p<.001)
Beveridge and Fox, 2006^25
  Health domain: Breast cancer
  CA purpose: Data collection and clinician decision support (referral to a cancer specialist)
  Study type and methods: Quasi-experimental; 6 users interacted with the system following scripted scenarios; dialogues were analyzed
  Technical performance: Speech recognition: 71.8% word accuracy; 59.2% sentence recognition; 78.0% concept accuracy; 76.1% semantic recognition. Dialogue manager: 80.8% successful task completion; 8.2% of turns spent correcting errors
  User experience: Ease of use: moderate (nq). Of 691 system responses, 79.2% were “appropriate,” 4.6% “borderline appropriate/inappropriate,” 14.5% “completely inappropriate,” 1.2% “incomprehensible,” and 0.6% “total failure.” Issues: spoken language understanding and dialogue management
  Health-related measures: NR
Technology supporting patients and clinicians^b
Black et al., 2005,^29 Harper et al., 2008,^30 Griol et al., 2013^31
  Health domain: Type 2 diabetes
  CA purpose: Data collection, telemonitoring
  Study type and methods: Quasi-experimental + content analysis of dialogues + interviews. Black 2005: 8 weeks, 5 patients with diabetes. Harper 2008: 16 weeks, 13 patients asked to call the CA once/week. Griol 2013: 6 participants following a set of scripted scenarios; 150 dialogues
  Technical performance: Black 2005: 90.4% successful task completion, 74.7% recognition success. Harper 2008: 92.2% successful task completion, 97.2% recognition accuracy. Griol 2013: 97% successful task completion, 25% confirmation rate, 91% error correction
  User experience: Black 2005: patients mentioned they appreciated the level of personalization achieved by the system. Harper 2008: user satisfaction 85% (measurement tool NR); issues with speech recognition and technical problems that resulted in system disconnections
  Health-related measures: Harper 2008: self-reported behavior change (e.g. physical activity, diet) (nq); 19 alerts were generated for the healthcare professionals; therapeutic optimization occurred for 12 patients
Levin and Levin, 2006^32
  Health domain: Pain monitoring
  CA purpose: Data collection
  Study type and methods: Quasi-experimental; 24 participants used the CA as a pain monitoring voice diary over 2 weeks; 177 data collection sessions
  Technical performance: Data capture rate: 98% (2% flagged for transcription); task-oriented dialogue turns: 82%
  User experience: Users became more efficient with experience, increasing the percentage of interrupted prompts and task-oriented dialogue
  Health-related measures: NR
Giorgino et al., 2005,^33 Azzini et al., 2003^34
  Health domain: Hypertension
  CA purpose: Data collection, telemonitoring
  Study type and methods: Quasi-experimental + content analysis; 15 users (each assigned a disease profile); 400 dialogues transcribed and analyzed
  Technical performance: Authors report satisfactory performance, but evaluation data are not reported in detail; 80% successful task completion; 35% confirmation questions
  User experience: NR
  Health-related measures: NR

Abbreviations: AUC: Area Under the Curve; CA: conversational agent; CI: confidence interval; ECA: Embodied Conversational Agent; ESS: Epworth Sleepiness Scale; nq: not quantified in the paper; NR: not reported; p: p-value, measure of statistical significance; PTSD: Post Traumatic Stress Disorder; r: correlation coefficient; RCT: randomized controlled trial; sens.: sensitivity; spec.: specificity

^a Studies evaluating the same conversational agent were grouped together.
^b Technology supporting clinicians: systems that support clinical work in the healthcare setting (e.g. a CA substituting for a clinician in a clinical interview with diagnostic purposes). Technology supporting patients and clinicians: systems that support both consumers in their daily lives and clinical work in the healthcare setting (e.g. telemonitoring systems involving a CA).
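The four diagnostic metrics reported for the ECA depression interview (sens., spec., PPV, NPV) are linked through Bayes' rule once a disease prevalence is fixed. As a minimal sketch of that relationship, the snippet below back-solves the prevalence implied by the reported PPV and checks that the resulting NPV matches the table; the ~20% prevalence is an inferred value for illustration only, not a figure reported by the study.

```python
def prevalence_from_ppv(sens, spec, ppv):
    """Back-solve the prevalence p implied by a reported PPV, using
    PPV = sens*p / (sens*p + (1 - spec)*(1 - p))."""
    return ppv * (1 - spec) / (sens * (1 - ppv) + ppv * (1 - spec))

def npv(sens, spec, p):
    """Negative predictive value from sensitivity, specificity, prevalence:
    NPV = spec*(1 - p) / (spec*(1 - p) + (1 - sens)*p)."""
    return spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)

# Figures reported for the ECA interview (Philip et al., 2017)
sens, spec, reported_ppv = 0.49, 0.93, 0.63

p = prevalence_from_ppv(sens, spec, reported_ppv)
print(round(p, 2))                 # implied prevalence ≈ 0.20
print(round(npv(sens, spec, p), 2))  # ≈ 0.88, consistent with the reported NPV
```

The internal consistency check (reported NPV of 88% follows from the other three figures) is a useful sanity test when reading diagnostic-accuracy tables, since PPV and NPV, unlike sensitivity and specificity, shift with the prevalence in the study sample.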