2025 Aug 6;27:e74231. doi: 10.2196/74231

Table 3. Large language model concordance and agreement metrics in complaints classification.

| HCAT (GP)^a field | GPT-3.5 turbo: concordance, % | GPT-3.5 turbo: Cohen κ | GPT-4o mini: concordance, % | GPT-4o mini: Cohen κ | Claude 3.5 Sonnet: concordance, % | Claude 3.5 Sonnet: Cohen κ | GPT-3.5 versus GPT-4o: P value^b | GPT-3.5 versus Claude 3.5: P value^b | GPT-4o versus Claude 3.5: P value^b |
|---|---|---|---|---|---|---|---|---|---|
| Domain | 78.4 | 0.612 | 79.4 | 0.623^c | 80.5^c | 0.619 | .41 | .01^d | .14 |
| Category | 64.3 | 0.520 | 69.8^c | 0.571^c | 69.2 | 0.568 | <.001^d | <.001^d | .59 |
| Severity | 48.8 | 0.201 | 53.9^c | 0.226 | 51.3 | 0.239^c | <.001^d | .03^d | .02^d |
| Stage of care | 60.7 | 0.468 | 66.1 | 0.534 | 68.1^c | 0.561^c | .38 | <.001^d | .02^d |
| Patient harm | 57.5 | 0.114 | 74.2 | 0.162 | 75.0^c | 0.175^c | <.001^d | <.001^d | .25 |
| Average | 61.9 | ^e | 68.7 | | 68.8 | | | | |
^a HCAT (GP): Healthcare Complaint Analysis Tool (General Practice).

^b McNemar test comparing the concordance of the two AI models.

^c Highest concordance or Cohen κ value within the row.

^d Statistically significant (P<.05).

^e Not applicable.
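For readers who want to see how the three kinds of figures in Table 3 relate, the following is a minimal sketch in Python. It is not the authors' analysis code. The function names, the toy labels, and the choice of an exact (binomial) McNemar test are assumptions made for illustration; the table does not state which McNemar variant was used. Only the concordance values in the final check are taken from the table itself.

```python
from collections import Counter
from math import comb


def concordance(reference, predicted):
    """Percentage of items where the model label equals the reference label."""
    matches = sum(r == p for r, p in zip(reference, predicted))
    return 100.0 * matches / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    p_observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    labels = ref_counts.keys() | pred_counts.keys()
    p_expected = sum(ref_counts[k] * pred_counts[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_expected) / (1.0 - p_expected)


def mcnemar_exact(correct_1, correct_2):
    """Two-sided exact McNemar P value (assumed variant) for paired outcomes.

    correct_1 and correct_2 are booleans: whether each model agreed with
    the reference coding on the same item.
    """
    only_1 = sum(a and not b for a, b in zip(correct_1, correct_2))
    only_2 = sum(b and not a for a, b in zip(correct_1, correct_2))
    discordant = only_1 + only_2
    if discordant == 0:
        return 1.0
    k = min(only_1, only_2)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2.0 * tail)


# Hypothetical toy data: an imbalanced field where one label dominates.
reference = ["none"] * 8 + ["harm"] * 2
model = ["none"] * 10
toy_concordance = concordance(reference, model)
toy_kappa = cohen_kappa(reference, model)

# Hypothetical paired outcomes for two models on the same 10 complaints.
model_1_correct = [True] * 6 + [False] * 4
model_2_correct = [True] * 10
toy_p = mcnemar_exact(model_1_correct, model_2_correct)

# Concordance values copied from Table 3, used to check the Average row.
table_concordance = {
    "GPT-3.5 turbo": [78.4, 64.3, 48.8, 60.7, 57.5],
    "GPT-4o mini": [79.4, 69.8, 53.9, 66.1, 74.2],
    "Claude 3.5 Sonnet": [80.5, 69.2, 51.3, 68.1, 75.0],
}
averages = {m: round(sum(v) / len(v), 1) for m, v in table_concordance.items()}
```

In the toy data the model always predicts the majority label, so concordance is high while κ is zero. This is one way a field such as patient harm can pair a concordance near 75% with a κ below 0.2. The final dictionary reproduces the Average row (61.9, 68.7, 68.8) from the five per-field concordance values.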