Table 3. Study characteristics and results from the evaluation of conversational agents supporting clinicians and both patients and clinicians
Author, year^a | Health domain | CA purpose | Study type and methods | Evaluation measures and main findings: technical performance | Evaluation measures and main findings: user experience | Evaluation measures and main findings: health-related measures
---|---|---|---|---|---|---
Technology supporting clinicians^b | | | | | |
Philip et al., 2017^22 | Mental health (depression) | Clinical interview (major depressive disorder diagnosis) | Crossover RCT [179 patients underwent 2 clinical interviews in random order (ECA and psychiatrist)] | NR | • High acceptability of the ECA: score 25.4 (0–30) on the Acceptability e-Scale (validated) | • Sens.=49%, spec.=93%, PPV=63%, NPV=88% (severe depressive symptoms: sens.=73% and spec.=95%); AUC: 0.71 (95% CI 0.59–0.81)
Lucas et al., 2017^35 | Mental health (PTSD) | Clinical interview (PTSD diagnosis) | Quasi-experimental [PTSD-related questions; Study 1: n=29, single group, post-deployment assessment + anonymized survey + ECA; Study 2: n=132, single group, ECA + anonymized survey] | NR | NR | • Study 1: participants reported more PTSD symptoms when asked by the ECA than by the other 2 modalities (p=.02) • Study 2: no significant differences
Philip et al., 2014^23 | Obstructive sleep apnea (daytime sleepiness) | Clinical interview (excessive daytime sleepiness diagnosis) | Quasi-experimental [32 patients + 30 healthy volunteers, single group; 2 similar clinical interviews (based on the Epworth Sleepiness Scale (ESS)), first with the ECA, then with a physician] | NR | • Most subjects had a positive perception of the ECA and considered the ECA interview a good experience (non-validated questionnaire, 7 questions) | • Sens.>0.89, spec.>0.81 (sleepiest patients: sens. and spec.>98%) • ESS scores from the ECA and physician interviews were correlated (r=0.95; p<.001)
Beveridge and Fox, 2006^25 | Breast cancer | Data collection and clinician decision support (referral to a cancer specialist) | Quasi-experimental [6 users interacted with the system following scripted scenarios; dialogues were analyzed] | • Speech recognition: 71.8% word accuracy; 59.2% sentence recognition; 78.0% concept accuracy; 76.1% semantic recognition • Dialogue manager: 80.8% successful task completion; 8.2% of turns spent correcting errors | • Ease of use: moderate (nq) • Of 691 system responses, 79.2% were rated “appropriate,” 4.6% “borderline appropriate/inappropriate,” 14.5% “completely inappropriate,” 1.2% “incomprehensible,” and 0.6% “total failure” • Issues: spoken language understanding and dialogue management | NR
Technology supporting patients and clinicians^b | | | | | |
Black et al. 2005^29, Harper et al. 2008^30, Griol et al. 2013^31 | Type 2 diabetes | Data collection, telemonitoring | Quasi-experimental + content analysis of dialogues + interviews [Black 2005: 8 weeks, 5 patients with diabetes] [Harper 2008: 16 weeks, 13 patients asked to call the CA once/week] [Griol 2013: 6 participants following a set of scripted scenarios, 150 dialogues] | • Black 2005: 90.4% successful task completion, 74.7% recognition success • Harper 2008: 92.2% successful task completion, 97.2% recognition accuracy • Griol 2013: 97% successful task completion, 25% confirmation rate, 91% error correction | Black 2005: • Patients mentioned they appreciated the level of personalization achieved by the system. Harper 2008: • User satisfaction: 85% (measurement tool NR) • Issues with speech recognition and technical problems that resulted in system disconnections | Harper 2008: • Self-reported behavior change (e.g. physical activity, diet) (nq) • 19 alerts were generated for the healthcare professionals; therapeutic optimization occurred for 12 patients
Levin and Levin, 2006^32 | Pain monitoring | Data collection | Quasi-experimental [24 participants used the CA as a pain-monitoring voice diary for 2 weeks; 177 data collection sessions] | • Data capture rate: 98% (2% flagged for transcription) • Task-oriented dialogue turns: 82% | • Users became more efficient with experience, increasing the percentage of interrupted prompts and task-oriented dialogue | NR
Giorgino et al. 2005^33, Azzini et al. 2003^34 | Hypertension | Data collection, telemonitoring | Quasi-experimental + content analysis [15 users (each assigned a disease profile); 400 dialogues transcribed and analyzed] | • The authors report satisfactory performance, but evaluation data are not reported in detail • 80% successful task completion; 35% confirmation questions | NR | NR
Abbreviations: AUC: Area Under the Curve; CA: conversational agent; CI: confidence interval; ECA: Embodied Conversational Agent; ESS: Epworth Sleepiness Scale; NPV: negative predictive value; nq: not quantified in the paper; NR: not reported; p: p-value, measure of statistical significance; PPV: positive predictive value; PTSD: Post-Traumatic Stress Disorder; r: correlation coefficient; RCT: randomized controlled trial; sens.: sensitivity; spec.: specificity
^a Studies evaluating the same conversational agent were grouped together. ^b Technology supporting clinicians: systems that support clinical work in the healthcare setting (e.g. a CA substituting for a clinician in a clinical interview for diagnostic purposes). Technology supporting patients and clinicians: systems that support both consumers in their daily lives and clinical work in the healthcare setting (e.g. telemonitoring systems involving a CA).
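For orientation, the diagnostic accuracy measures reported in Table 3 (sens., spec., PPV, NPV) follow their standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN); the generic formulas below are a reminder of these conventions and are not taken from the reviewed studies.

```latex
\mathrm{Sens.} = \frac{TP}{TP + FN}, \qquad
\mathrm{Spec.} = \frac{TN}{TN + FP}, \qquad
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{NPV} = \frac{TN}{TN + FN}
```

The AUC reported by Philip et al. 2017^22 summarizes the trade-off between sensitivity and specificity across all possible decision thresholds of the ECA-administered interview score.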