2025 Aug 6;27:e74231. doi: 10.2196/74231

Table 3. Large language model concordance and agreement metrics in complaints classification.

HCAT (GP)^a field	GPT-3.5 turbo: concordance, %	GPT-3.5 turbo: Cohen κ	GPT-4o mini: concordance, %	GPT-4o mini: Cohen κ	Claude 3.5 Sonnet: concordance, %	Claude 3.5 Sonnet: Cohen κ	GPT-3.5 versus GPT-4o: P value^b	GPT-3.5 versus Claude 3.5: P value^b	GPT-4o versus Claude 3.5: P value^b
Domain	78.4	0.612	79.4	0.623^c	80.5^c	0.619	.41	.01^d	.14
Category	64.3	0.520	69.8^c	0.571^c	69.2	0.568	<.001^d	<.001^d	.59
Severity	48.8	0.201	53.9^c	0.226	51.3	0.239^c	<.001^d	.03^d	.02^d
Stage of care	60.7	0.468	66.1	0.534	68.1^c	0.561^c	.38	<.001^d	.02^d
Patient harm	57.5	0.114	74.2	0.162	75.0^c	0.175^c	<.001^d	<.001^d	.25
Average	61.9	—^e	68.7	—	68.8	—	—	—	—

^a HCAT (GP): Healthcare Complaint Analysis Tool (General Practice).

^b McNemar test comparing concordance between each pair of models.

^c Highest value in the row for that metric (concordance or Cohen κ).

^d Statistically significant (P<.05).

^e Not applicable.
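The three metrics reported in Table 3 can be illustrated with a minimal Python sketch. It assumes concordance and Cohen κ are computed for each model against a reference coding, and that the McNemar test compares two models on whether each one matched that reference. The function names and the toy labels are hypothetical, and the exact binomial form of the McNemar test is used here for simplicity; the table does not state which variant the authors applied.

```python
from collections import Counter
from math import comb


def concordance(pred, gold):
    """Percentage of items where the model label equals the reference label."""
    return 100.0 * sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(pred, gold):
    """Cohen kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_o = sum(p == g for p, g in zip(pred, gold)) / n
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    # Chance agreement from the two raters' marginal label frequencies.
    p_e = sum(pred_counts[k] * gold_counts[k] for k in gold_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1.0 - p_e)


def mcnemar_exact(pred_a, pred_b, gold):
    """Two-sided exact McNemar P value on the discordant pairs."""
    # b: model A correct and model B wrong; c: the reverse.
    b = sum(a == g and m != g for a, m, g in zip(pred_a, pred_b, gold))
    c = sum(a != g and m == g for a, m, g in zip(pred_a, pred_b, gold))
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical toy data: eight complaints coded into three made-up labels.
gold = ["C", "M", "R", "C", "M", "C", "R", "C"]
model_a = ["C", "M", "C", "C", "R", "C", "R", "M"]
model_b = ["C", "M", "R", "C", "M", "C", "C", "C"]

if __name__ == "__main__":
    print(concordance(model_a, gold), cohen_kappa(model_a, gold))
    print(concordance(model_b, gold), cohen_kappa(model_b, gold))
    print(mcnemar_exact(model_a, model_b, gold))
```

Because κ subtracts the agreement expected by chance, a field with a skewed label distribution can show high concordance alongside a low κ, as in the Patient harm row.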