TABLE 2.
(Top) Descriptive statistics of periodontal residents' performance by year of training. (Middle) Section sub‐analysis: performance of each large language model by exam section, with p values for the pairwise differences in performance. (Bottom) Performance of the artificial intelligence models on the most difficult periodontal in‐service exam questions.
Exam Year | 2020 (368) | 2021 (337) | 2022 (305) | 2023 (302) | 2020–2023 |
---|---|---|---|---|---|
PGY‐1 Residents | | | | | |
N | 174 (33.14%) | 158 (33.05%) | 182 (32.21%) | 192 (36.16%) | 706 |
Avg Score | 214.86 ± 29.94 (58.39% ± 29.94) | 216.04 ± 33.43 (64.06% ± 33.43) | 203.13 ± 28.76 (66.57% ± 28.76) | 193.18 ± 34.56 (63.92% ± 34.56) | 206.43 ± 32.84 (63.48% ± 31.67) |
PGY‐2 Residents | | | | | |
N | 182 (34.67%) | 170 (35.56%) | 203 (35.92%) | 198 (37.29%) | 753 |
Avg Score | 222.58 ± 30.17 (60.53% ± 30.17) | 229.51 ± 33.04 (68.02% ± 33.04) | 216.57 ± 28.97 (71.00% ± 28.97) | 206.82 ± 33.27 (68.45% ± 33.27) | 218.86 ± 31.11 (66.25% ± 31.61) |
PGY‐3 Residents | | | | | |
N | 169 (32.19%) | 150 (31.38%) | 180 (31.85%) | 141 (26.55%) | 640 |
Avg Score | 229.07 ± 28.75 (62.22% ± 28.75) | 232.61 ± 35.31 (68.99% ± 35.31) | 219.29 ± 28.81 (71.84% ± 28.81) | 223.91 ± 28.95 (74.18% ± 28.95) | 224.32 ± 30.32 (69.06% ± 30.45) |
All Residents | | | | | |
N | 525 | 479 | 565 | 531 | 2375 |
Avg Score | 222.11 ± 30.14 (60.35% ± 30.14) | 226.03 ± 34.57 (67.04% ± 34.57) | 213.11 ± 29.63 (69.81% ± 29.63) | 206.43 ± 34.76 (68.35% ± 34.76) | 214.62 ± 32.71 (66.39% ± 32.77) |
Google Gemini | 260 (70.65%) | 247 (73.29%) | 231 (75.73%) | 218 (72.18%) | 956 (72.86%) |
GPT‐3.5 | 230 (62.5%) | 230 (68.24%) | 213 (69.83%) | 179 (59.27%) | 852 (64.93%) |
GPT‐4 | 290 (78.80%) | 266 (78.93%) | 247 (80.98%) | 241 (79.80%) | 1044 (79.57%) |
GPT‐3.5 vs. Gemini | <0.01 b | 0.15 | 0.1 | <0.001 b | <0.001 b |
GPT‐4 vs. Gemini | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
GPT‐4 vs. GPT‐3.5 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
Section | GPT‐4 score | Gemini score | GPT‐3.5 score | p Value: GPT‐4 vs. Gemini | p Value: GPT‐4 vs. GPT‐3.5 | p Value: GPT‐3.5 vs. Gemini |
---|---|---|---|---|---|---|
Embryology and Anatomy | 80 (83.33%) | 77 (80.2%) | 67 (69.8%) | .02 | <.01 | .09 |
Biostatistics, Experimental Design, and Data Analysis | 14 (93.33%) | 11 (73.3%) | 13 (86.7%) | .58 | .26 | .37 |
Biochemistry‐Physiology | 114 (93.44%) | 103 (84.4%) | 104 (85.2%) | <.001 | <.001 | .85 |
Microbiology and Immunology | 101 (88.59%) | 92 (80.7%) | 94 (82.5%) | <.001 | <.001 | .73 |
Periodontal Etiology and Pathology | 109 (78.41%) | 97 (69.8%) | 83 (59.7%) | <.001 | <.001 | .07 |
Pharmacology and Therapeutics | 131 (91.60%) | 123 (86.0%) | 118 (82.5%) | <.001 | <.001 | .41 |
Diagnosis | 77 (70%) b | 66 (60.0%) | 61 (55.5%) | .32 | .01 | .49 |
Treatment Planning and Prognosis | 79 (70.53%) | 69 (61.6%) | 42 (37.5%) | <.001 | .03 | <.001 |
Therapy | 206 (69.12%) | 184 (61.7%) | 140 (47.0%) | <.001 | <.001 | <.001 |
Oral Pathology/Oral Medicine | 148 (90.79%) | 134 (82.2%) | 130 (79.8%) | <.001 | <.001 | .57 |
Difficult Questions
Model | Correct | Total | Percentage | p‐Value |
---|---|---|---|---|
GPT‐4 | 80 | 127 | 62.99% | .02 a , .09 b |
GPT‐3.5 | 69 | 127 | 54.33% | .02 a , .70 c |
Gemini | 73 | 127 | 57.48% | .09 b , .70 c |
Residents | 52 | 127 | 40.52% | — |
Abbreviations: Avg, Average; N, Number of residents who participated in the exam. Bold values indicate a statistically significant p value.
a p Value for chi‐square test, comparing GPT‐4 versus GPT‐3.5.
b p Value for chi‐square test, comparing GPT‐4 versus Gemini (formerly Bard).
c p Value for chi‐square test, comparing GPT‐3.5 versus Gemini.
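
As a rough illustration of the chi‐square comparisons named in the footnotes, the sketch below runs a Pearson chi‐square test on the pooled 2020–2023 counts of correct answers for GPT‐4 and GPT‐3.5. Several things here are assumptions rather than the authors' stated procedure: the test is treated as unpaired with no continuity correction, the helper name `chi_square_2x2` is hypothetical, and the pooled denominator is taken to be the sum of the four per‐year counts shown in the column headers (368 + 337 + 305 + 302).

```python
import math


def chi_square_2x2(correct_a, correct_b, total):
    """Pearson chi-square (no continuity correction) for two models that
    each answered `total` questions. Returns (chi2, p_value)."""
    observed = [
        [correct_a, total - correct_a],  # model A: correct, incorrect
        [correct_b, total - correct_b],  # model B: correct, incorrect
    ]
    grand_total = 2 * total
    column_sums = [observed[0][j] + observed[1][j] for j in range(2)]

    chi2 = 0.0
    for row in observed:
        for j, obs in enumerate(row):
            # Each row has the same size, so expected = row total * column share.
            expected = total * column_sums[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    # With 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed pooled denominator: sum of the per-year counts in the table header.
TOTAL_QUESTIONS = 368 + 337 + 305 + 302

# Pooled correct answers from the 2020-2023 column: GPT-4 = 1044, GPT-3.5 = 852.
chi2, p_value = chi_square_2x2(1044, 852, TOTAL_QUESTIONS)
print(f"chi2 = {chi2:.2f}, p < 0.001: {p_value < 0.001}")
```

Under these assumptions the pooled comparison falls below the 0.001 threshold, in line with the entry reported for GPT‐4 versus GPT‐3.5 in the 2020–2023 column. A paired test or a continuity correction would give different values, so this sketch should not be expected to reproduce every p value in the table.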