2024 Jul 18;60(2):121–133. doi: 10.1111/jre.13323

TABLE 2.

(Top) Descriptive statistics of periodontal residents' performance by year of training. (Middle) Section sub‐analysis: performance of each large language model by exam section, with p values comparing the differences in performance between models. (Bottom) Performance of the artificial intelligence models on the most difficult periodontal in‐service exam questions.

	Exam Year
	2020 (368)	2021 (337)	2022 (305)	2023 (302)	2020–2023
PGY‐1 Residents
N	174 (33.14%)	158 (33.05%)	182 (32.21%)	192 (36.16%)	706
Avg Score	214.86 ± 29.94 (58.39% ± 29.94)	216.04 ± 33.43 (64.06% ± 33.43)	203.13 ± 28.76 (66.57% ± 28.76)	193.18 ± 34.56 (63.92% ± 34.56)	206.43 ± 32.84 (63.48% ± 31.67)
PGY‐2 Residents
N	182 (34.67%)	170 (35.56%)	203 (35.92%)	198 (37.29%)	753
Avg Score	222.58 ± 30.17 (60.53% ± 30.17)	229.51 ± 33.04 (68.02% ± 33.04)	216.57 ± 28.97 (71.00% ± 28.97)	206.82 ± 33.27 (68.45% ± 33.27)	218.86 ± 31.11 (66.25% ± 31.61)
PGY‐3 Residents
N	169 (32.19%)	150 (31.38%)	180 (31.85%)	141 (26.55%)	640
Avg Score	229.07 ± 28.75 (62.22% ± 28.75)	232.61 ± 35.31 (68.99% ± 35.31)	219.29 ± 28.81 (71.84% ± 28.81)	223.91 ± 28.95 (74.18% ± 28.95)	224.32 ± 30.32 (69.06% ± 30.45)
All Residents
N	525	479	565	531	2375
Avg Score	222.11 ± 30.14 (60.35% ± 30.14)	226.03 ± 34.57 (67.04% ± 34.57)	213.11 ± 29.63 (69.81% ± 29.63)	206.43 ± 34.76 (68.35% ± 34.76)	214.62 ± 32.71 (66.39% ± 32.77)
Google Gemini	260 (70.65%)	247 (73.29%)	231 (75.73%)	218 (72.18%)	956 (72.86%)
GPT‐3.5	230 (62.5%)	230 (68.24%)	213 (69.83%)	179 (59.27%)	852 (64.93%)
GPT‐4	290 (78.80%)	266 (78.93%)	247 (80.98%)	241 (79.80%)	1044 (79.57%)
GPT‐3.5 vs. Bard	<0.01 ^b	0.15	0.1	<0.001 ^b	<0.001 ^b
GPT‐4 vs. Bard	<0.001	<0.001	<0.001	<0.001	<0.001
GPT‐4 vs. GPT‐3.5	<0.001	<0.001	<0.001	<0.001	<0.001

Section	GPT‐4 score	Gemini score	GPT‐3.5 score	p Value: GPT‐4 vs. Gemini	p Value: GPT‐4 vs. GPT‐3.5	p Value: GPT‐3.5 vs. Gemini
Embryology and Anatomy	80 (83.33%)	77 (80.2%)	67 (69.8%)	.02	<.01	.09
Biostatistics, Experimental Design, and Data Analysis	14 (93.33%)	11 (73.3%)	13 (86.7%)	.58	.26	.37
Biochemistry‐Physiology	114 (93.44%)	103 (84.4%)	104 (85.2%)	<.001	<.001	.85
Microbiology and Immunology	101 (88.59%)	92 (80.7%)	94 (82.5%)	<.001	<.001	.73
Periodontal Etiology and Pathology	109 (78.41%)	97 (69.8%)	83 (59.7%)	<.001	<.001	.07
Pharmacology and Therapeutics	131 (91.60%)	123 (86.0%)	118 (82.5%)	<.001	<.001	.41
Diagnosis	77 (70%) ^b	66 (60.0%)	61 (55.5%)	.32	.01	.49
Treatment Planning and Prognosis	79 (70.53%)	69 (61.6%)	42 (37.5%)	<.001	.03	<.001
Therapy	206 (69.12%)	184 (61.7%)	140 (47.0%)	<.001	<.001	<.001
Oral Pathology/Oral Medicine	148 (90.79%)	134 (82.2%)	130 (79.8%)	<.001	<.001	.57

Difficult Questions
Model	Correct	Total	Percentage	p‐Value
GPT‐4	80	127	62.99%	.02 ^a , .09 ^b
GPT‐3.5	69	127	54.33%	.02 ^a , .70 ^c
Gemini	73	127	57.48%	.09 ^b , .70 ^c
Residents	52	127	40.52%	—

Abbreviations: Avg, Average; N, Number of residents who participated in the exam. Bold values indicate a statistically significant p value.

^a p Value for chi‐square test, comparing GPT‐4 versus GPT‐3.5.

^b p Value for chi‐square test, comparing GPT‐4 versus Bard.

^c p Value for chi‐square test, comparing GPT‐3.5 versus Bard.
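
The footnotes name a chi‐square test for the pairwise model comparisons. The following is a minimal sketch of how such a comparison could be computed from correct/total counts, assuming an uncorrected Pearson chi‐square test on a 2×2 table of correct versus incorrect answers. The table does not state the exact test variant, so this assumption may not match the authors' procedure and the output should not be expected to reproduce the tabulated p values. The function names `percent_correct` and `pearson_chi_square_2x2` are hypothetical helpers introduced only for illustration.

```python
import math


def percent_correct(correct, total):
    """Share of correctly answered questions, as a percentage with 2 decimals."""
    return round(100 * correct / total, 2)


def pearson_chi_square_2x2(correct_a, correct_b, total):
    """Uncorrected Pearson chi-square test for two models that each answered
    `total` questions. Assumption: the answers are treated as two independent
    samples. Returns (statistic, p_value)."""
    observed = [
        [correct_a, total - correct_a],  # model A: correct, incorrect
        [correct_b, total - correct_b],  # model B: correct, incorrect
    ]
    grand_total = 2 * total
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    statistic = 0.0
    for row in observed:
        for j, count in enumerate(row):
            # Expected count under the null hypothesis of equal accuracy.
            expected = total * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts taken from the "Difficult Questions" rows: GPT-4 (80/127), GPT-3.5 (69/127).
stat, p = pearson_chi_square_2x2(80, 69, 127)
print(f"GPT-4: {percent_correct(80, 127)}%  GPT-3.5: {percent_correct(69, 127)}%")
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

The same call can be repeated for any pair of rows that share a question total, for example the per‐year counts in the top panel.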