2025 Nov 17;15:40122. doi: 10.1038/s41598-025-23798-y

Table 2.

Internal consistency by tagging strategy and task.

Coding strategy	T1 (Macro F1)	T2 (MAE)	T3 (Accuracy)	T4 (Accuracy)	T5 (Accuracy)
Outsourced humans	0.672	0.786	0.581	0.367	0.352
GPT-3.5-turbo	0.985	0.048	0.971	0.924	0.938
GPT-4-turbo	0.995	0.024	0.976	0.971	0.967
Claude 3 Opus	0.999	0.014	1.000	0.995	0.990
Claude 3.5 Sonnet	0.997	0.010	1.000	0.986	0.986

Each row represents a coding strategy and each column a task. Cell values measure consistency as how well the first draw replicates the labels obtained in the second draw (which we treat as the true labels), scored with each task’s performance metric. Higher consistency is indicated by values close to 1 for T1, T3, T4, and T5, and by values close to 0 for T2.
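
As an illustration of the computation described in the note, the following minimal sketch scores one draw against another with each type of metric. It assumes that T1's macro-averaged metric is macro F1, that MAE is the mean absolute error, and that labels are paired by item. The function names and the toy labels are hypothetical and are not taken from the study's code or data.

```python
def accuracy(first_draw, second_draw):
    """Share of items where the first draw matches the second (reference) draw."""
    matches = sum(a == b for a, b in zip(first_draw, second_draw))
    return matches / len(second_draw)


def mean_absolute_error(first_draw, second_draw):
    """Average absolute gap between paired numeric labels (0 means identical)."""
    gaps = [abs(a - b) for a, b in zip(first_draw, second_draw)]
    return sum(gaps) / len(gaps)


def macro_f1(first_draw, second_draw):
    """Unweighted mean of per-class F1, with the second draw as the reference."""
    labels = sorted(set(first_draw) | set(second_draw))
    pairs = list(zip(first_draw, second_draw))
    scores = []
    for label in labels:
        tp = sum(p == label and t == label for p, t in pairs)
        fp = sum(p == label and t != label for p, t in pairs)
        fn = sum(p != label and t == label for p, t in pairs)
        # Per-class F1 = 2TP / (2TP + FP + FN); the denominator is never zero
        # because every label occurs in at least one of the two draws.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two draws of labels for the same four items.
draw_1_classes = ["a", "b", "a", "c"]
draw_2_classes = ["a", "b", "b", "c"]
draw_1_scores = [1, 2, 3, 4]
draw_2_scores = [1, 2, 4, 4]

acc = accuracy(draw_1_classes, draw_2_classes)
f1 = macro_f1(draw_1_classes, draw_2_classes)
mae = mean_absolute_error(draw_1_scores, draw_2_scores)
```

All three metrics, as defined in this sketch, are symmetric in the two draws, so the result does not depend on which draw is treated as the reference.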