2022 Mar 2;5:826402. doi: 10.3389/frai.2022.826402

Table 1.

Model performance results for the baseline single-report metastases prediction model and the three novel multi-report metastases prediction models.

| Model | Metric | Training: Lung (n = 5,413) | Training: Liver (n = 1,943) | Training: Adrenal (n = 2,874) | Testing: Lung (n = 1,160) | Testing: Liver (n = 417) | Testing: Adrenal (n = 617) | Validation: Lung (n = 1,160) | Validation: Liver (n = 417) | Validation: Adrenal (n = 616) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TF-IDF ensemble model (Baseline) | Accuracy | 99.69% (±0.15%) | 99.95% (±0.10%) | 99.23% (±0.32%) | 92.33% (±1.53%) | 90.12% (±2.86%) | 96.60% (±1.43%) | 93.80% (±1.39%) | 92.50% (±2.53%) | 96.10% (±1.53%) |
| | Precision | 0.9977 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.8553 (±0.02) | 0.9060 (±0.03) | 0.9444 (±0.02) | 0.9080 (±0.02) | 0.8990 (±0.03) | 1.0000 (±0.00) |
| | Recall | 0.9833 (±0.00) | 0.9983 (±0.00) | 0.8932 (±0.01) | 0.6733 (±0.03) | 0.7794 (±0.04) | 0.4595 (±0.04) | 0.6860 (±0.03) | 0.8310 (±0.04) | 0.5000 (±0.04) |
| | F1-score | 0.9904 (±0.00) | 0.9991 (±0.00) | 0.9436 (±0.01) | 0.7535 (±0.02) | 0.8379 (±0.04) | 0.6182 (±0.04) | 0.7815 (±0.02) | 0.8637 (±0.03) | 0.6667 (±0.04) |
| Simple CNN | Accuracy | 99.93% (±5.21%) | 99.85% (±7.59%) | 100% (±0.00%) | 97.41% (±0.91%) | 98.56% (±1.14%) | 99.03% (±0.77%) | 96.64% (±1.04%) | 98.56% (±1.14%) | 99.51% (±0.55%) |
| | Precision | 0.9956 (±0.00) | 0.9950 (±0.00) | 1.0000 (±0.00) | 0.9526 (±0.01) | 0.9851 (±0.01) | 0.9429 (±0.02) | 0.9526 (±0.01) | 0.9746 (±0.02) | 0.9592 (±0.02) |
| | Recall | 1.0000 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.8960 (±0.02) | 0.9706 (±0.02) | 0.8919 (±0.02) | 0.8564 (±0.02) | 0.9746 (±0.02) | 0.9792 (±0.01) |
| | F1-score | 0.9978 (±0.00) | 0.9975 (±0.00) | 1.0000 (±0.00) | 0.9234 (±0.02) | 0.9778 (±0.01) | 0.9167 (±0.02) | 0.8920 (±0.02) | 0.9746 (±0.02) | 0.9691 (±0.02) |
| Augmented CNN | Accuracy | 99.98% (±0.04%) | 99.90% (±0.14%) | 99.97% (±0.06%) | 97.41% (±0.91%) | 98.56% (±1.14%) | 98.87% (±0.83%) | 96.81% (±1.01%) | 99.04% (±0.94%) | 99.68% (±0.45%) |
| | Precision | 0.9989 (±0.00) | 0.9966 (±0.00) | 0.9952 (±0.00) | 0.9388 (±0.01) | 0.9710 (±0.02) | 0.9167 (±0.02) | 0.9467 (±0.01) | 0.9831 (±0.01) | 0.9792 (±0.01) |
| | Recall | 1.0000 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.9109 (±0.02) | 0.9853 (±0.01) | 0.8919 (±0.02) | 0.8511 (±0.02) | 0.9831 (±0.01) | 0.9792 (±0.01) |
| | F1-score | 0.9994 (±0.00) | 0.9983 (±0.00) | 0.9976 (±0.00) | 0.9246 (±0.02) | 0.9781 (±0.01) | 0.9041 (±0.02) | 0.8964 (±0.02) | 0.9831 (±0.01) | 0.9792 (±0.01) |
| Bidirectional LSTM | Accuracy | 97.97% (±0.38%) | 99.23% (±0.39%) | 99.72% (±0.19%) | 96.66% (±1.03%) | 98.56% (±1.14%) | 98.70% (±0.89%) | 97.16% (±0.96%) | 98.32% (±1.23%) | 99.03% (±0.77%) |
| | Precision | 0.9052 (±0.01) | 0.9798 (±0.01) | 0.9660 (±0.01) | 0.8465 (±0.02) | 0.9853 (±0.01) | 0.8919 (±0.02) | 0.8404 (±0.02) | 0.9661 (±0.02) | 0.9375 (±0.02) |
| | Recall | 0.9366 (±0.01) | 0.9873 (±0.00) | 0.9803 (±0.01) | 0.8976 (±0.02) | 0.9781 (±0.01) | 0.8919 (±0.02) | 0.9054 (±0.02) | 0.9702 (±0.02) | 0.9375 (±0.02) |
| | F1-score | 0.9206 (±0.01) | 0.9835 (±0.01) | 0.9731 (±0.01) | 0.8713 (±0.02) | 0.9817 (±0.01) | 0.8919 (±0.02) | 0.8717 (±0.02) | 0.9682 (±0.02) | 0.9375 (±0.02) |

Organ datasets are split into three subsets: training (70%), testing (15%), and validation (15%). The n values give the size of each set. The highest values for each organ in each performance metric are bolded. Values in parentheses are the 95% confidence interval margins, rounded to two decimal places.
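
The table does not say how the confidence margins were computed. As a rough sketch, assuming a normal-approximation (Wald) interval for a proportion, the margin is z * sqrt(p * (1 - p) / n) with z = 1.96 for 95%. Under that assumption the baseline testing accuracies for lung and liver give margins that agree with the reported ±1.53% and ±2.86%; other entries may have been derived differently. The same snippet also shows the standard F1 definition as the harmonic mean of precision and recall. The function names `wald_margin` and `f1_score` are illustrative, not taken from the article.

```python
import math


def wald_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumption: the table's margins follow this formula; the source does not say so.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Baseline (TF-IDF ensemble) on the lung testing set, values read from the table.
lung_test_margin = wald_margin(0.9233, 1160)
lung_test_f1 = f1_score(0.8553, 0.6733)

print(f"±{lung_test_margin * 100:.2f}%")  # ±1.53%
print(f"{lung_test_f1:.4f}")              # 0.7535
```

Because the margin shrinks with the square root of n, the smaller liver sets (n = 417) carry wider intervals than the lung sets (n = 1,160) at similar accuracy.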