Table 3. Large language model concordance and agreement metrics in complaints classification.
| HCAT (GP)a field | GPT-3.5 turbo concordance, % | GPT-3.5 turbo Cohen κ | GPT-4o mini concordance, % | GPT-4o mini Cohen κ | Claude 3.5 Sonnet concordance, % | Claude 3.5 Sonnet Cohen κ | GPT-3.5 versus GPT-4o, P valueb | GPT-3.5 versus Claude 3.5, P valueb | GPT-4o versus Claude 3.5, P valueb |
|---|---|---|---|---|---|---|---|---|---|
| Domain | 78.4 | 0.612 | 79.4 | 0.623c | 80.5c | 0.619 | .41 | .01d | .14 |
| Category | 64.3 | 0.520 | 69.8c | 0.571c | 69.2 | 0.568 | <.001d | <.001d | .59 |
| Severity | 48.8 | 0.201 | 53.9c | 0.226 | 51.3 | 0.239c | <.001d | .03d | .02d |
| Stage of care | 60.7 | 0.468 | 66.1 | 0.534 | 68.1c | 0.561c | .38 | <.001d | .02d |
| Patient harm | 57.5 | 0.114 | 74.2 | 0.162 | 75.0c | 0.175c | <.001d | <.001d | .25 |
| Average | 61.9 | —e | 68.7 | — | 68.8 | — | — | — | — |
aHCAT (GP): Healthcare Complaint Analysis Tool (General Practice).
bMcNemar test comparing the concordance of the AI models.
cHighest value within the row.
dSignificant P value (<.05).
eNot applicable.
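The two statistics reported in Table 3 are standard: Cohen κ measures chance-corrected agreement between a model's labels and the reference labels, and the McNemar test compares the concordance of two models scored against the same reference. A minimal pure-Python sketch of both (illustrative only; function names and the continuity-corrected χ² variant of McNemar are our assumptions, not the authors' code):

```python
import math
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def mcnemar_p(truth, pred_a, pred_b):
    """Continuity-corrected McNemar chi-square P value comparing the
    concordance of two classifiers against the same reference labels."""
    # discordant pairs: one model correct where the other is wrong
    b = sum(t == x and t != y for t, x, y in zip(truth, pred_a, pred_b))
    c = sum(t != x and t == y for t, x, y in zip(truth, pred_a, pred_b))
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2))
```

For example, two models whose errors fall on entirely different items produce many discordant pairs and hence a small McNemar P value, which is what distinguishes the significant comparisons (marked d) from the non-significant ones in the table.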