Table 2.
Internal consistency by tagging strategy and task.
| Coding strategy | T1 (Macro F1) | T2 (MAE) | T3 (Accuracy) | T4 (Accuracy) | T5 (Accuracy) |
|---|---|---|---|---|---|
| Outsourced humans | 0.672 | 0.786 | 0.581 | 0.367 | 0.352 |
| GPT-3.5-turbo | 0.985 | 0.048 | 0.971 | 0.924 | 0.938 |
| GPT-4-turbo | 0.995 | 0.024 | 0.976 | 0.971 | 0.967 |
| Claude 3 Opus | 0.999 | 0.014 | 1.000 | 0.995 | 0.990 |
| Claude 3.5 Sonnet | 0.997 | 0.010 | 1.000 | 0.986 | 0.986 |
Each row represents a coding strategy and each column a task. Cell values measure consistency by scoring the first draw against the second draw (which we treat as the true label), using each task’s performance metric. Higher consistency is indicated by values close to 1 for T1, T3, T4, and T5, and by values close to 0 for T2, whose metric (MAE) is an error measure.
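As an illustration of the consistency measure described in the note, the following minimal sketch scores a first draw against a second draw that is treated as the reference. The function names and the toy labels are hypothetical and are not taken from the study; the macro-averaged F1 shown here is one standard definition (the unweighted mean of per-class F1 over all classes seen in either draw) and is assumed rather than confirmed to match the exact computation behind T1.

```python
from typing import Hashable, Sequence


def accuracy(first: Sequence[Hashable], second: Sequence[Hashable]) -> float:
    """Share of items where the first draw equals the second (reference) draw."""
    return sum(a == b for a, b in zip(first, second)) / len(second)


def mean_absolute_error(first: Sequence[float], second: Sequence[float]) -> float:
    """Average absolute gap between the two draws; 0 means identical draws."""
    return sum(abs(a - b) for a, b in zip(first, second)) / len(second)


def macro_f1(first: Sequence[Hashable], second: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, with the second draw as reference."""
    classes = set(first) | set(second)
    scores = []
    for c in classes:
        pairs = list(zip(first, second))
        tp = sum(a == c and b == c for a, b in pairs)  # both draws say c
        fp = sum(a == c and b != c for a, b in pairs)  # only first draw says c
        fn = sum(a != c and b == c for a, b in pairs)  # only second draw says c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy draws for a categorical task and a numeric task.
draw_1_labels = ["a", "b", "a", "c"]
draw_2_labels = ["a", "b", "b", "c"]
draw_1_scores = [1, 2, 3, 4]
draw_2_scores = [1, 2, 4, 4]

print(accuracy(draw_1_labels, draw_2_labels))
print(macro_f1(draw_1_labels, draw_2_labels))
print(mean_absolute_error(draw_1_scores, draw_2_scores))
```

Two identical draws yield 1 for accuracy and macro F1 and 0 for MAE, which is why the direction of "higher consistency" differs between T2 and the other tasks.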
