Table 7:

For the Adult Cohort, we show the p-value for the two-sided Wilcoxon paired signed-rank test comparing the second-placed (LiviaNET) and third-placed (DeepNet) teams to the top-ranked team (CERES2) across the four hierarchies of labeling (Coarse, Lobe, Vermis, Lobule) and for the combination of all 38 labels (Consolidated). The mean Dice overlap for each method, at the respective hierarchy, is shown next to the method’s name.

Hierarchy	Method (mean Dice overlap)		p-value
Coarse	CERES2 0.9118	vs. LiviaNET 0.8967	6.9 × 10⁻³ †
Coarse	CERES2 0.9118	vs. DeepNet 0.8908	6.1 × 10⁻⁵ ‡
Lobe	CERES2 0.8395	vs. LiviaNET 0.8289	2.2 × 10⁻¹
Lobe	CERES2 0.8395	vs. DeepNet 0.8021	1.9 × 10⁻⁴ †
Vermis	CERES2 0.8302	vs. LiviaNET 0.8012	1.2 × 10⁻²
Vermis	CERES2 0.8302	vs. DeepNet 0.8003	5.6 × 10⁻⁴ †
Lobule	CERES2 0.7657	vs. LiviaNET 0.7168	5.5 × 10⁻⁵ ‡
Lobule	CERES2 0.7657	vs. DeepNet 0.7382	1.2 × 10⁻⁵ ‡
Consolidated	CERES2 0.8013	vs. LiviaNET 0.7657	3.0 × 10⁻⁷ ‡
Consolidated	CERES2 0.8013	vs. DeepNet 0.7719	3.1 × 10⁻¹² ‡

† Denotes weak statistical significance (p-value < 0.001).

‡ Denotes strong statistical significance (p-value < 0.0001).
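
As a rough illustration of the test named in the caption, the sketch below implements a two-sided Wilcoxon paired signed-rank test in plain Python. It is only a sketch under stated assumptions: the function name `wilcoxon_signed_rank` is hypothetical, the p-value uses a normal approximation (the study may have used an exact test), and the per-subject Dice values are made up for illustration and are not the study's data.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon paired signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t  # used for the tie correction of the variance
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Mean and tie-corrected variance of w_plus under the null hypothesis.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)

    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


# Hypothetical per-subject Dice overlaps for two methods (NOT study data).
method_a = [0.91, 0.90, 0.93, 0.89, 0.92, 0.94, 0.90, 0.91]
method_b = [0.89, 0.91, 0.90, 0.88, 0.90, 0.91, 0.89, 0.88]

w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)
print(f"W+ = {w_plus}, two-sided p = {p_value:.4f}")
```

The normal approximation is reasonable only for moderately large samples; for small cohorts an exact test is preferable, and in practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used instead of hand-written code.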