2021 May 25;10(6):30. doi: 10.1167/tvst.10.6.30

Table 2.

Performance Comparison Among Our Approach, Baseline A, Baseline B, and Baseline C on the Full Testing Dataset

	Accuracy/No. (%, 95% CI)	FNR/No. (%, 95% CI)	Recall/No. (%, 95% CI)	Specificity/No. (%, 95% CI)	AUC % (95% CI)	P Values
Our approach	101 (88.60%, 82.76%–94.43%)	7 (12.28%, 6.26%–18.31%)	50 (87.72%, 81.69%–93.74%)	51 (89.47%, 83.84%–95.11%)	92.74% (87.71%–97.76%)	<0.001
Baseline A	93 (81.58%, 74.46%–88.70%)	19 (33.33%, 24.68%–41.99%)	38 (66.67%, 58.01%–75.32%)	55 (96.49%, 93.11%–99.87%)	83.58% (76.11%–91.05%)	<0.001
Baseline B	81 (71.05%, 62.73%–79.38%)	23 (40.35%, 31.34%–49.36%)	34 (59.65%, 50.64%–68.66%)	47 (82.46%, 75.47%–89.44%)	74.05% (64.96%–83.14%)	<0.001
Baseline C	91 (79.82%, 72.46%–87.19%)	6 (10.53%, 4.89%–16.16%)	51 (89.47%, 83.84%–95.11%)	40 (70.18%, 61.78%–78.57%)	87.32% (80.71%–93.93%)	<0.001

Performance comparison among our approach (alternate gradient descent with binary output), baseline A (two single-modal CNNs as a 3-output task), baseline B (interpretability classifiers followed by two single-modal CNNs as a 2-output task), and baseline C (two-stream CNNs representing state-of-the-art methods for 2-modal image analysis) on the full testing dataset.^†
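The count columns can be cross-checked against the reported percentages with a little arithmetic. The sketch below is not from the paper. It assumes, as the counts suggest, that FNR and recall share a positive-class denominator of 57 (7 + 50), that specificity uses a negative class of the same size, and that the accuracy count is the sum of the recall and specificity counts over all 114 test cases. The names `TABLE2`, `N_POS`, `N_NEG`, and `percentages` are illustrative.

```python
# Counts transcribed from Table 2: (accuracy, FNR, recall, specificity).
TABLE2 = {
    "Our approach": (101, 7, 50, 51),
    "Baseline A": (93, 19, 38, 55),
    "Baseline B": (81, 23, 34, 47),
    "Baseline C": (91, 6, 51, 40),
}

N_POS = 57  # assumption: false negatives + true positives (FNR and recall counts)
N_NEG = 57  # assumption: inferred from the specificity counts and percentages


def percentages(acc, fn, tp, tn):
    """Return (accuracy, FNR, recall, specificity) in percent, rounded to 2 decimals."""
    total = N_POS + N_NEG
    pairs = ((acc, total), (fn, N_POS), (tp, N_POS), (tn, N_NEG))
    return tuple(round(100 * num / den, 2) for num, den in pairs)


for name, (acc, fn, tp, tn) in TABLE2.items():
    # Correct predictions are true positives plus true negatives.
    assert acc == tp + tn
    # False negatives and true positives together make up the positive class.
    assert fn + tp == N_POS
    print(name, percentages(acc, fn, tp, tn))
```

Under these assumptions, every row's counts are internally consistent and the rounded percentages agree with the tabulated ones.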

^† Statistics in italic correspond to better performance achieved by the baselines than by our approach (in this table, the specificity of Baseline A and the FNR and recall of Baseline C) and are discussed in detail in the Results section. CIs for accuracy, FNR, recall, and specificity were generated following the Wilson score interval.⁴⁷ The CI for AUC was computed following Hanley et al.⁴⁸
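For readers who want to compute intervals of the two kinds named in this footnote, here is a minimal sketch. It is not the authors' code. It assumes the standard Wilson score formula without continuity correction, and it assumes that the AUC interval uses the Hanley and McNeil standard error with a normal approximation. The function names and the per-class size of 57 (inferred from the table counts) are illustrative assumptions.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def hanley_mcneil_interval(auc, n_pos, n_neg, z=Z95):
    """Normal-approximation CI for AUC using the Hanley-McNeil standard error."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    return auc - z * se, auc + z * se


# Example inputs taken from the "Our approach" row; 57 per class is an assumption.
acc_lo, acc_hi = wilson_interval(101, 114)
auc_lo, auc_hi = hanley_mcneil_interval(0.9274, 57, 57)
print(f"accuracy CI: {acc_lo:.4f} to {acc_hi:.4f}")
print(f"AUC CI: {auc_lo:.4f} to {auc_hi:.4f}")
```

With 57 cases per class, the AUC sketch gives roughly 87.7% to 97.8% for an AUC of 92.74%, close to the tabulated interval. The Wilson bounds need not coincide with the tabulated ones, since they depend on the denominator and the variant of the interval that was used.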

P values were generated by performing McNemar's test between the predictions and labels.
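McNemar's test works on paired binary outcomes and looks only at the discordant pairs. When predictions are paired with labels, those are the false positives and the false negatives, and the test asks whether the two kinds of error are equally frequent. The footnote does not say which variant was used (exact binomial, chi-square, with or without continuity correction), so the sketch below shows the exact two-sided version as one possibility. The function names and the toy vectors are hypothetical and are not data from the study.

```python
from math import comb


def discordant_counts(predictions, labels):
    """Count false positives (b) and false negatives (c) in paired binary data."""
    b = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    c = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of asymmetry
    k = min(b, c)
    # Under the null hypothesis, each discordant pair falls on either side
    # with probability 1/2, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical toy data, for illustration only.
toy_predictions = [1, 1, 1, 1, 0, 1, 0, 0]
toy_labels = [1, 0, 0, 0, 0, 1, 0, 1]
b, c = discordant_counts(toy_predictions, toy_labels)
print(b, c, mcnemar_exact(b, c))
```

The exact form avoids the chi-square approximation, which becomes unreliable when the number of discordant pairs is small.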