Table 3. Comparison of completeness and readability of chatbot responses on US prostate cancer screening guidelines.
| Study | Chatbot name | Completeness, n/N (%) | Average readability score, mean (SD) | Average readability score (%), mean (SD) |
|---|---|---|---|---|
| This study | PCI<sup>a</sup> | 17/23 (74) | 64.5 (8.7) | —<sup>b</sup> |
| Zhu et al [15] | ChatGPT | 21/22 (95)<sup>c</sup> | — | 100 (NR<sup>d</sup>) |
| Zhu et al [15] | ChatGPT Plus | 20.3/22 (92)<sup>c</sup> | — | 100 (NR) |
| Zhu et al [15] | ChatSonic | 14.3/22 (65) | — | 95 (NR) |
| Zhu et al [15] | YouChat | 10.34/22 (47) | — | 98 (NR) |
| Zhu et al [15] | Neeva AI | 8.8/22 (40) | — | 84 (NR) |
| Zhu et al [15] | Perplexity Detailed | 6.6/22 (30) | — | 95 (NR) |
| Zhu et al [15] | Perplexity Concise | 6.6/22 (30) | — | 95 (NR) |
| Owens et al [23] | ChatGPT 3.5 standard response | 6/11 (54) | 38.0 (7.6) | — |
| Owens et al [23] | ChatGPT 3.5 low literacy response | 4/11 (34) | 70.3 (7.2)<sup>e</sup> | — |
| Owens et al [23] | ChatGPT 4.0 standard response | 7/11 (63) | 43.1 (9.2) | — |
| Owens et al [23] | ChatGPT 4.0 low literacy response | 7/11 (63) | 74.1 (9.9)<sup>e</sup> | — |
| Owens et al [23] | Google Gemini standard response | 6/11 (54) | 55.7 (10.4) | — |
| Owens et al [23] | Google Gemini low literacy response | 5/11 (45) | 81.0 (3.6)<sup>e</sup> | — |
| Owens et al [23] | Google Gemini Advanced standard response | 6/11 (54) | 66.3 (9.4)<sup>e</sup> | — |
| Owens et al [23] | Google Gemini Advanced low literacy response | 6/11 (54) | 79.4 (5.1)<sup>e</sup> | — |
| Owens et al [23] | Microsoft Copilot standard response | 8/11 (72) | 50.8 (9.3) | — |
| Owens et al [23] | Microsoft Copilot low literacy response | 6/11 (54) | 65.1 (6.6)<sup>e</sup> | — |
| Owens et al [23] | Microsoft Copilot Pro standard response | 7/11 (63) | 61.2 (9.5) | — |
| Owens et al [23] | Microsoft Copilot Pro low literacy response | 6/11 (54) | 78.8 (4.7)<sup>e</sup> | — |

<sup>a</sup>PCI: Prostate Cancer Info.
<sup>b</sup>Not applicable.
<sup>c</sup>Chatbot had a higher completeness score than PCI.
<sup>d</sup>NR: not reported.
<sup>e</sup>Chatbot had a definitively higher readability score than PCI based on Flesch-Kincaid readability. Other scores may also be higher but were not based on a validated measure.