Table 3. Assessment of ChatGPT models’ performance per topic and the overall performance across topics.
O&P: Ova and parasite examination; AST: antimicrobial susceptibility testing; MRSA: methicillin-resistant Staphylococcus aureus; UTI: urinary tract infection; PCR: polymerase chain reaction; ID: microbial identification; Dx: diagnostic approach; CLEAR: Completeness, Lack of false information, Evidence support, Appropriateness, and Relevance.
Each average score is the sum of the two raters’ scores divided by 2.
Query | Query classification | Average CLEAR score for ChatGPT-3.5 | Average CLEAR score for ChatGPT-4 | t-test |
Average performance in ID | | 3.4 (Very good) | 3.83 (Very good) | t(3)=-3.087, P=0.054 |
Q1 (O&P examination) | ID | 1.7 (Poor) | 1.8 (Satisfactory) | |
Q5 (Candida albicans identification) | ID | 4.1 (Very good) | 4.8 (Excellent) | |
Q7 (Brucella spp. identification) | ID | 4.4 (Excellent) | 4.7 (Excellent) | |
Q10 (Salmonella enterica identification) | ID | 3.4 (Very good) | 4.0 (Very good) | |
Average performance in AST | | 1.87 (Satisfactory) | 2.37 (Satisfactory) | t(2)=-1.387, P=0.300 |
Q2 (AST for colistin) | AST | 1.4 (Poor) | 1.6 (Poor) | |
Q3 (MRSA resistance to all beta-lactams) | AST | 2.8 (Good) | 3.2 (Good) | |
Q4 (Enterococci resistance to clindamycin) | AST | 1.4 (Poor) | 2.3 (Satisfactory) | |
Average performance in Dx | | 2.4 (Satisfactory) | 3.2 (Good) | t(2)=-2.402, P=0.138 |
Q6 (Laboratory diagnosis of UTI) | Dx | 2.9 (Good) | 2.9 (Good) | |
Q8 (Interpretation of real-time PCR testing for respiratory viruses/atypical bacteria) | Dx | 1.5 (Poor) | 3.5 (Very good) | |
Q9 (Sputum quality assessment for microbiologic culture) | Dx | 2.8 (Good) | 3.3 (Good) | |
Overall performance across the three categories | | 2.64 (Good) | 3.21 (Good) | t(9)=-3.143, P=0.012 |
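
The table reports t statistics with n - 1 degrees of freedom (t(3) for the four ID queries, t(9) for all ten queries), which is what a paired comparison of the two models on the same queries would produce. The sketch below assumes a paired t-test on the per-query average scores; the table does not state the test variant, and the names `SCORES` and `paired_t` are illustrative rather than taken from the study. It uses only the Python standard library, so it computes the t statistic and degrees of freedom but not the P value.

```python
import math
import statistics

# Per-query average CLEAR scores copied from Table 3 as
# (ChatGPT-3.5, ChatGPT-4) pairs, grouped by query classification.
SCORES = {
    "ID": [(1.7, 1.8), (4.1, 4.8), (4.4, 4.7), (3.4, 4.0)],
    "AST": [(1.4, 1.6), (2.8, 3.2), (1.4, 2.3)],
    "Dx": [(2.9, 2.9), (1.5, 3.5), (2.8, 3.3)],
}


def paired_t(pairs):
    """Return (t statistic, degrees of freedom) for a paired t-test.

    Assumption: the test is paired, i.e. it is run on the per-query
    differences between the two models.
    """
    diffs = [gpt35 - gpt4 for gpt35, gpt4 in pairs]
    n = len(diffs)
    # Standard error of the mean difference (sample SD, n - 1 denominator).
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    t_id, df_id = paired_t(SCORES["ID"])
    print(f"ID: t({df_id}) = {t_id:.2f}")

    all_pairs = [pair for pairs in SCORES.values() for pair in pairs]
    t_all, df_all = paired_t(all_pairs)
    print(f"Overall: t({df_all}) = {t_all:.2f}")
```

Under these assumptions the sketch gives roughly t(3) = -3.09 for the ID queries and t(9) = -3.14 across all queries, in line with the values in the table. Because the inputs are the rounded scores shown in the table, small differences from statistics computed on unrounded data are possible.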