2024 Mar 27;24:78. doi: 10.1186/s12874-024-02203-8

Table 3.

Comparing human raters against ChatGPT at threshold ≥ 3, across the whole dataset

Human raters	Evaluation metric	Human rater value [95% CI]	ChatGPT value [95% CI]	P-value (two-tailed)
GP 1	Sensitivity	0.55 [0.48,0.63]	0.95 [0.91,0.98]	< 0.001
	Specificity	0.94 [0.93,0.96]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.58 [0.50,0.66]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.94 [0.92,0.95]	0.99 [0.98,1.00]	< 0.001
	Positive Likelihood Ratio	9.70 [7.48,13.34]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.47 [0.39,0.56]	0.08 [0.03,0.14]	< 0.001
	Balanced Accuracy	0.75 [0.71,0.79]	0.80 [0.77,0.82]	0.016
	Jaccard Index	0.39 [0.33,0.47]	0.27 [0.23,0.31]	< 0.001
	False Negative Rate	0.45 [0.37,0.52]	0.05 [0.02,0.09]	< 0.001
	Proportion Missed	0.06 [0.05,0.08]	0.01 [0.00,0.02]	< 0.001
GP 2	Sensitivity	0.55 [0.46,0.63]	0.95 [0.91,0.98]	< 0.001
	Specificity	0.99 [0.98,0.99]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.86 [0.79,0.93]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.94 [0.92,0.95]	0.99 [0.98,1.00]	< 0.001
	Positive Likelihood Ratio	44.20 [26.11,90.48]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.46 [0.38,0.54]	0.08 [0.03,0.14]	< 0.001
	Balanced Accuracy	0.77 [0.72,0.81]	0.80 [0.77,0.82]	0.16
	Jaccard Index	0.50 [0.42,0.58]	0.27 [0.23,0.31]	< 0.001
	False Negative Rate	0.45 [0.37,0.54]	0.05 [0.02,0.09]	< 0.001
	Proportion Missed	0.06 [0.05,0.08]	0.01 [0.00,0.02]	< 0.001
GP 3	Sensitivity	0.74 [0.66,0.80]	0.95 [0.91,0.98]	< 0.001
	Specificity	0.94 [0.92,0.95]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.62 [0.55,0.68]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.96 [0.95,0.97]	0.99 [0.98,1.00]	< 0.001
	Positive Likelihood Ratio	11.37 [9.00,14.60]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.28 [0.21,0.36]	0.08 [0.03,0.14]	< 0.001
	Balanced Accuracy	0.84 [0.80,0.87]	0.80 [0.77,0.82]	0.076
	Jaccard Index	0.50 [0.44,0.57]	0.27 [0.23,0.31]	< 0.001
	False Negative Rate	0.26 [0.20,0.34]	0.05 [0.02,0.09]	< 0.001
	Proportion Missed	0.04 [0.03,0.05]	0.01 [0.00,0.02]	< 0.001
Voting Consensus (GPs)	Sensitivity	0.62 [0.54,0.70]	0.95 [0.91,0.98]	< 0.001
	Specificity	0.98 [0.97,0.99]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.83 [0.75,0.89]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.95 [0.93,0.96]	0.99 [0.98,1.00]	< 0.001
	Positive Likelihood Ratio	34.35 [22.94,58.37]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.39 [0.31,0.46]	0.08 [0.03,0.14]	< 0.001
	Balanced Accuracy	0.80 [0.76,0.84]	0.80 [0.77,0.82]	0.906
	Jaccard Index	0.55 [0.48,0.62]	0.27 [0.23,0.31]	< 0.001
	False Negative Rate	0.38 [0.30,0.46]	0.05 [0.02,0.09]	< 0.001
	Proportion Missed	0.05 [0.04,0.07]	0.01 [0.00,0.02]	< 0.001
Specific Consensus (GPs)	Sensitivity	0.32 [0.25,0.39]	0.95 [0.91,0.98]	< 0.001
	Specificity	1.00 [0.99,1.00]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.94 [0.87,1.00]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.91 [0.89,0.93]	0.99 [0.98,1.00]	< 0.001
	Positive Likelihood Ratio	111.15 [46.58,∞]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.68 [0.61,0.76]	0.08 [0.03,0.14]	< 0.001
	Balanced Accuracy	0.66 [0.62,0.69]	0.80 [0.77,0.82]	< 0.001
	Jaccard Index	0.31 [0.24,0.38]	0.27 [0.23,0.31]	0.392
	False Negative Rate	0.68 [0.61,0.75]	0.05 [0.02,0.09]	< 0.001
	Proportion Missed	0.09 [0.07,0.11]	0.01 [0.00,0.02]	< 0.001
Sensitive Consensus (GPs)	Sensitivity	0.90 [0.85,0.95]	0.95 [0.91,0.98]	0.074
	Specificity	0.89 [0.87,0.91]	0.65 [0.62,0.68]	< 0.001
	Precision (PPV)	0.53 [0.47,0.59]	0.28 [0.24,0.32]	< 0.001
	Negative Predictive Value	0.98 [0.98,0.99]	0.99 [0.98,1.00]	0.364
	Positive Likelihood Ratio	7.93 [6.72,9.64]	2.71 [2.46,3.01]	< 0.001
	Negative Likelihood Ratio	0.11 [0.06,0.17]	0.08 [0.03,0.14]	0.364
	Balanced Accuracy	0.89 [0.87,0.92]	0.80 [0.77,0.82]	< 0.001
	Jaccard Index	0.50 [0.44,0.56]	0.27 [0.23,0.31]	< 0.001
	False Negative Rate	0.10 [0.05,0.15]	0.05 [0.02,0.09]	0.074
	Proportion Missed	0.02 [0.01,0.02]	0.01 [0.00,0.02]	0.364
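
Several rows in Table 3 are functions of sensitivity and specificity alone, so they can be cross-checked against the first two rows of each block. The Python sketch below does this for the ChatGPT column, assuming the standard textbook definitions: positive likelihood ratio = sensitivity / (1 − specificity), negative likelihood ratio = (1 − sensitivity) / specificity, balanced accuracy = the mean of sensitivity and specificity, and false negative rate = 1 − sensitivity. The function name `derived_metrics` is hypothetical, and the table does not confirm that the authors computed their values this way. Confidence intervals and P-values are not reproduced.

```python
import math


def derived_metrics(sensitivity: float, specificity: float) -> dict:
    """Metrics that depend only on sensitivity and specificity.

    Assumption: standard textbook definitions, not necessarily the
    exact procedure used to produce Table 3.
    """
    false_positive_rate = 1.0 - specificity
    false_negative_rate = 1.0 - sensitivity

    # A perfect specificity gives an unbounded positive likelihood ratio
    # (the table shows an upper bound of infinity in one such case).
    if false_positive_rate == 0:
        lr_pos = math.inf
    else:
        lr_pos = sensitivity / false_positive_rate

    # Guard the degenerate case where no negatives are classified correctly.
    if specificity == 0:
        lr_neg = math.inf
    else:
        lr_neg = false_negative_rate / specificity

    return {
        "positive_likelihood_ratio": lr_pos,
        "negative_likelihood_ratio": lr_neg,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "false_negative_rate": false_negative_rate,
    }


if __name__ == "__main__":
    # Point estimates reported for ChatGPT at threshold >= 3.
    chatgpt = derived_metrics(sensitivity=0.95, specificity=0.65)
    for name, value in chatgpt.items():
        print(f"{name}: {value:.2f}")
```

Because the inputs are already rounded to two decimals, recomputed values can differ from the tabulated ones, most visibly for the human-rater likelihood ratios, where a small change in specificity near 1 shifts the ratio substantially.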