| Topic | Study design | Results | Key findings | Reference |
| --- | --- | --- | --- | --- |
| Diabetes | Differentiating ChatGPT-generated from human answers to 10 questions about diabetes. | 59.5% of participants could identify ChatGPT-generated answers in a binary assessment. | Previous ChatGPT use showed the strongest correlation among all factors (p = 0.003); linguistic features may hold higher predictive value than the content itself. | Hulman et al., 2023 [14] |
| Microbiology | Answers to 96 first- and second-order questions based on the competency-based medical education (CBME) curriculum. | Mean score of 4.07 (±0.32) out of 5, corresponding to an accuracy of about 80%. | ChatGPT has the potential to be an effective tool for automated question answering in microbiology. | Das et al., 2023 [10] |
| Medical responses | Answers to 284 medical questions across 17 specialties. | Largely correct responses, with a mean accuracy score of 4.8/6 and a mean completeness score of 2.5/3. | Substantial improvements were observed in 34 of the 36 questions initially scored 1–2. ChatGPT generated largely accurate information, although with important limitations. | Johnson et al., 2023 [11] |
| Cancer | Comparison with the National Cancer Institute's (NCI) answers to 13 questions from a cancer-related web page. | 96.9% overall agreement on the accuracy of cancer information. | Few noticeable differences were found in word count or readability between the NCI and ChatGPT answers. | Johnson et al., 2023 [16] |
| Cirrhosis and hepatocellular carcinoma (HCC) | Performance comparison between ChatGPT and physicians or trainees on 164 questions. | Overall accuracy of 76.9% on quality measures. | ChatGPT demonstrated strong knowledge of cirrhosis and HCC but lacked comprehensiveness in diagnosis and preventive medicine. | Yeo et al., 2023 [15] |
| Prostate cancer | Comparison of five state-of-the-art large language models in providing information on 22 common prostate cancer questions. | ChatGPT had the highest accuracy and comprehensiveness scores among the five large language models, with satisfactory readability for patients. | Large language models with internet-connected datasets were not superior to ChatGPT, and the paid version of ChatGPT did not outperform the free version. | Zhu et al., 2023 [17] |
| Toxicology | Response to a case of acute organophosphate poisoning. | Subjective assessment. | ChatGPT answered well and offered good explanations of the underlying reasoning. | Abdel-Messih et al., 2023 [18] |
| Shoulder impingement syndrome (SIS) | Analyzed the ability to provide information on SIS. | Subjective assessment. | ChatGPT could provide useful medical information and treatment options for patients with SIS, including symptoms, similar diseases, orthopedic tests, and exercise recommendations, but potential biases and inappropriate information must be taken into account. | Kim J-hee, 2022 [19] |
| Infection | Antimicrobial advice for eight questions based on hypothetical infection scenarios. | Subjective assessment of appropriateness, consistency, safety, and antimicrobial stewardship implications. | ChatGPT exhibited deficiencies in situational awareness, inference, and consistency in clinical practice. | Howard et al., 2023 [20] |
| Ophthalmology | Tested accuracy on two question banks used for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) exam. | 55.8% and 42.7% accuracy on the two 260-question simulated exams. | Performance of ChatGPT varied across ophthalmic subspecialties; domain-specific pre-training might improve its performance. | Antaki et al., 2023 [21] |
| Neuropathic pain | Tested performance on 50 pairs of causal relationships in neuropathic pain diagnosis. | ChatGPT tended to make false-negative errors, showing high precision but low recall in binary true/false assessments (illustrated in the sketch after this table). | ChatGPT lacked consistency and stability in the context of neuropathic pain diagnosis. Treating large language models' causal claims as causal discovery results requires caution because of fundamental differences between the tasks and biases in causal benchmarks. | Tu et al., 2023 [22] |
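To make the "high precision, low recall" pattern reported by Tu et al. concrete, the short Python sketch below computes both metrics from a confusion matrix; the counts are hypothetical and chosen only to illustrate how many false negatives drive recall down while precision stays high. They are not data from the study.

```python
# Minimal sketch with hypothetical counts (not taken from Tu et al.):
# a model that makes mostly false-negative errors shows high precision but low recall.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical outcome on binary true/false causal-relationship judgments:
# the model confirms only relationships it is most certain about (few false positives)
# but misses many genuine ones (many false negatives).
tp, fp, fn = 20, 2, 25
precision, recall = precision_recall(tp, fp, fn)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# -> precision = 0.91, recall = 0.44
```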