. 2024 Mar 27;24:78. doi: 10.1186/s12874-024-02203-8

Table 3.

Comparing human raters against ChatGPT at threshold ≥ 3, across the whole dataset

Columns: human rater; evaluation metric; human rater value [95% CI]; ChatGPT value [95% CI]; P-value (two-tailed)
GP 1
Sensitivity 0.55 [0.48,0.63] 0.95 [0.91,0.98] < 0.001
Specificity 0.94 [0.93,0.96] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.58 [0.50,0.66] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.94 [0.92,0.95] 0.99 [0.98,1.00] < 0.001
Positive Likelihood Ratio 9.70 [7.48,13.34] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.47 [0.39,0.56] 0.08 [0.03,0.14] < 0.001
Balanced Accuracy 0.75 [0.71,0.79] 0.80 [0.77,0.82] 0.016
Jaccard Index 0.39 [0.33,0.47] 0.27 [0.23,0.31] < 0.001
False Negative Rate 0.45 [0.37,0.52] 0.05 [0.02,0.09] < 0.001
Proportion Missed 0.06 [0.05,0.08] 0.01 [0.00,0.02] < 0.001
GP 2
Sensitivity 0.55 [0.46,0.63] 0.95 [0.91,0.98] < 0.001
Specificity 0.99 [0.98,0.99] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.86 [0.79,0.93] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.94 [0.92,0.95] 0.99 [0.98,1.00] < 0.001
Positive Likelihood Ratio 44.20 [26.11,90.48] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.46 [0.38,0.54] 0.08 [0.03,0.14] < 0.001
Balanced Accuracy 0.77 [0.72,0.81] 0.80 [0.77,0.82] 0.16
Jaccard Index 0.50 [0.42,0.58] 0.27 [0.23,0.31] < 0.001
False Negative Rate 0.45 [0.37,0.54] 0.05 [0.02,0.09] < 0.001
Proportion Missed 0.06 [0.05,0.08] 0.01 [0.00,0.02] < 0.001
GP 3
Sensitivity 0.74 [0.66,0.80] 0.95 [0.91,0.98] < 0.001
Specificity 0.94 [0.92,0.95] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.62 [0.55,0.68] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.96 [0.95,0.97] 0.99 [0.98,1.00] < 0.001
Positive Likelihood Ratio 11.37 [9.00,14.60] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.28 [0.21,0.36] 0.08 [0.03,0.14] < 0.001
Balanced Accuracy 0.84 [0.80,0.87] 0.80 [0.77,0.82] 0.076
Jaccard Index 0.50 [0.44,0.57] 0.27 [0.23,0.31] < 0.001
False Negative Rate 0.26 [0.20,0.34] 0.05 [0.02,0.09] < 0.001
Proportion Missed 0.04 [0.03,0.05] 0.01 [0.00,0.02] < 0.001
Voting Consensus (GPs)
Sensitivity 0.62 [0.54,0.70] 0.95 [0.91,0.98] < 0.001
Specificity 0.98 [0.97,0.99] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.83 [0.75,0.89] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.95 [0.93,0.96] 0.99 [0.98,1.00] < 0.001
Positive Likelihood Ratio 34.35 [22.94,58.37] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.39 [0.31,0.46] 0.08 [0.03,0.14] < 0.001
Balanced Accuracy 0.80 [0.76,0.84] 0.80 [0.77,0.82] 0.906
Jaccard Index 0.55 [0.48,0.62] 0.27 [0.23,0.31] < 0.001
False Negative Rate 0.38 [0.30,0.46] 0.05 [0.02,0.09] < 0.001
Proportion Missed 0.05 [0.04,0.07] 0.01 [0.00,0.02] < 0.001
Specific Consensus (GPs)
Sensitivity 0.32 [0.25,0.39] 0.95 [0.91,0.98] < 0.001
Specificity 1.00 [0.99,1.00] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.94 [0.87,1.00] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.91 [0.89,0.93] 0.99 [0.98,1.00] < 0.001
Positive Likelihood Ratio 111.15 [46.58,∞] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.68 [0.61,0.76] 0.08 [0.03,0.14] < 0.001
Balanced Accuracy 0.66 [0.62,0.69] 0.80 [0.77,0.82] < 0.001
Jaccard Index 0.31 [0.24,0.38] 0.27 [0.23,0.31] 0.392
False Negative Rate 0.68 [0.61,0.75] 0.05 [0.02,0.09] < 0.001
Proportion Missed 0.09 [0.07,0.11] 0.01 [0.00,0.02] < 0.001
Sensitive Consensus (GPs)
Sensitivity 0.90 [0.85,0.95] 0.95 [0.91,0.98] 0.074
Specificity 0.89 [0.87,0.91] 0.65 [0.62,0.68] < 0.001
Precision (PPV) 0.53 [0.47,0.59] 0.28 [0.24,0.32] < 0.001
Negative Predictive Value 0.98 [0.98,0.99] 0.99 [0.98,1.00] 0.364
Positive Likelihood Ratio 7.93 [6.72,9.64] 2.71 [2.46,3.01] < 0.001
Negative Likelihood Ratio 0.11 [0.06,0.17] 0.08 [0.03,0.14] 0.364
Balanced Accuracy 0.89 [0.87,0.92] 0.80 [0.77,0.82] < 0.001
Jaccard Index 0.50 [0.44,0.56] 0.27 [0.23,0.31] < 0.001
False Negative Rate 0.10 [0.05,0.15] 0.05 [0.02,0.09] 0.074
Proportion Missed 0.02 [0.01,0.02] 0.01 [0.00,0.02] 0.364
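
The metrics in Table 3 can all be derived from a single 2×2 confusion matrix. The sketch below shows one plausible way to compute them in Python. It is not the authors' code: the function name `diagnostic_metrics` and the example counts are hypothetical, and "Proportion Missed" is assumed here to mean false negatives divided by all screened records, which the table itself does not define. The sketch also assumes both classes are present and that at least one positive and one negative prediction were made, so no denominator is zero.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn):
    """Compute Table 3 style metrics from confusion-matrix counts.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # false positive rate (1 - specificity)
    fnr = fn / (tp + fn)           # false negative rate (1 - sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # A likelihood ratio is infinite when its denominator is zero,
        # which matches the unbounded upper CI limit reported in the table.
        "positive_lr": sensitivity / fpr if fpr > 0 else math.inf,
        "negative_lr": fnr / specificity if specificity > 0 else math.inf,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "jaccard": tp / (tp + fp + fn),
        "false_negative_rate": fnr,
        # Assumption: share of all records that are missed positives.
        "proportion_missed": fn / n,
    }


# Illustrative counts only (not the study data), chosen so that
# sensitivity is 0.95 and specificity is 0.65.
example = diagnostic_metrics(tp=95, fp=350, fn=5, tn=650)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, the positive likelihood ratio is 0.95 / 0.35 and the negative likelihood ratio is 0.05 / 0.65, the same relationships that link the ChatGPT sensitivity and specificity in the table to its reported likelihood ratios. Confidence intervals and P-values are not covered by this sketch.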