Table 1.
Model | Metric | Training: Lung (n = 5,413) | Training: Liver (n = 1,943) | Training: Adrenal (n = 2,874) | Testing: Lung (n = 1,160) | Testing: Liver (n = 417) | Testing: Adrenal (n = 617) | Validation: Lung (n = 1,160) | Validation: Liver (n = 417) | Validation: Adrenal (n = 616)
---|---|---|---|---|---|---|---|---|---|---
TF-IDF ensemble model (Baseline) | Accuracy | 99.69% (±0.15%) | 99.95% (±0.10%) | 99.23% (±0.32%) | 92.33% (±1.53%) | 90.12% (±2.86%) | 96.60% (±1.43%) | 93.80% (±1.39%) | 92.50% (±2.53%) | 96.10% (±1.53%)
TF-IDF ensemble model (Baseline) | Precision | 0.9977 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.8553 (±0.02) | 0.9060 (±0.03) | 0.9444 (±0.02) | 0.9080 (±0.02) | 0.8990 (±0.03) | 1.0000 (±0.00)
TF-IDF ensemble model (Baseline) | Recall | 0.9833 (±0.00) | 0.9983 (±0.00) | 0.8932 (±0.01) | 0.6733 (±0.03) | 0.7794 (±0.04) | 0.4595 (±0.04) | 0.6860 (±0.03) | 0.8310 (±0.04) | 0.5000 (±0.04)
TF-IDF ensemble model (Baseline) | F1-score | 0.9904 (±0.00) | 0.9991 (±0.00) | 0.9436 (±0.01) | 0.7535 (±0.02) | 0.8379 (±0.04) | 0.6182 (±0.04) | 0.7815 (±0.02) | 0.8637 (±0.03) | 0.6667 (±0.04)
Simple CNN | Accuracy | 99.93% (±5.21%) | 99.85% (±7.59%) | 100% (±0.00%) | 97.41% (±0.91%) | 98.56% (±1.14%) | 99.03% (±0.77%) | 96.64% (±1.04%) | 98.56% (±1.14%) | 99.51% (±0.55%)
Simple CNN | Precision | 0.9956 (±0.00) | 0.9950 (±0.00) | 1.0000 (±0.00) | 0.9526 (±0.01) | 0.9851 (±0.01) | 0.9429 (±0.02) | 0.9526 (±0.01) | 0.9746 (±0.02) | 0.9592 (±0.02)
Simple CNN | Recall | 1.0000 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.8960 (±0.02) | 0.9706 (±0.02) | 0.8919 (±0.02) | 0.8564 (±0.02) | 0.9746 (±0.02) | 0.9792 (±0.01)
Simple CNN | F1-score | 0.9978 (±0.00) | 0.9975 (±0.00) | 1.0000 (±0.00) | 0.9234 (±0.02) | 0.9778 (±0.01) | 0.9167 (±0.02) | 0.8920 (±0.02) | 0.9746 (±0.02) | 0.9691 (±0.02)
Augmented CNN | Accuracy | 99.98% (±0.04%) | 99.90% (±0.14%) | 99.97% (±0.06%) | 97.41% (±0.91%) | 98.56% (±1.14%) | 98.87% (±0.83%) | 96.81% (±1.01%) | 99.04% (±0.94%) | 99.68% (±0.45%)
Augmented CNN | Precision | 0.9989 (±0.00) | 0.9966 (±0.00) | 0.9952 (±0.00) | 0.9388 (±0.01) | 0.9710 (±0.02) | 0.9167 (±0.02) | 0.9467 (±0.01) | 0.9831 (±0.01) | 0.9792 (±0.01)
Augmented CNN | Recall | 1.0000 (±0.00) | 1.0000 (±0.00) | 1.0000 (±0.00) | 0.9109 (±0.02) | 0.9853 (±0.01) | 0.8919 (±0.02) | 0.8511 (±0.02) | 0.9831 (±0.01) | 0.9792 (±0.01)
Augmented CNN | F1-score | 0.9994 (±0.00) | 0.9983 (±0.00) | 0.9976 (±0.00) | 0.9246 (±0.02) | 0.9781 (±0.01) | 0.9041 (±0.02) | 0.8964 (±0.02) | 0.9831 (±0.01) | 0.9792 (±0.01)
Bidirectional LSTM | Accuracy | 97.97% (±0.38%) | 99.23% (±0.39%) | 99.72% (±0.19%) | 96.66% (±1.03%) | 98.56% (±1.14%) | 98.70% (±0.89%) | 97.16% (±0.96%) | 98.32% (±1.23%) | 99.03% (±0.77%)
Bidirectional LSTM | Precision | 0.9052 (±0.01) | 0.9798 (±0.01) | 0.9660 (±0.01) | 0.8465 (±0.02) | 0.9853 (±0.01) | 0.8919 (±0.02) | 0.8404 (±0.02) | 0.9661 (±0.02) | 0.9375 (±0.02)
Bidirectional LSTM | Recall | 0.9366 (±0.01) | 0.9873 (±0.00) | 0.9803 (±0.01) | 0.8976 (±0.02) | 0.9781 (±0.01) | 0.8919 (±0.02) | 0.9054 (±0.02) | 0.9702 (±0.02) | 0.9375 (±0.02)
Bidirectional LSTM | F1-score | 0.9206 (±0.01) | 0.9835 (±0.01) | 0.9731 (±0.01) | 0.8713 (±0.02) | 0.9817 (±0.01) | 0.8919 (±0.02) | 0.8717 (±0.02) | 0.9682 (±0.02) | 0.9375 (±0.02)

Each organ dataset is split into three subsets for training (70%), testing (15%), and validation (15%). The n values give the size of each subset. The highest values for each organ in each performance metric are bolded. Values in parentheses are the ± margins of the 95% confidence interval, rounded to two decimal places.
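
The source does not state how the confidence intervals were computed. The sketch below assumes the normal-approximation (Wald) interval for a proportion, with margin z·sqrt(p(1 − p)/n) and z = 1.96; this assumption is consistent with several of the tabulated margins, for example the TF-IDF testing accuracies. The function name `wald_half_width` is hypothetical and used only for illustration.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Margin of a normal-approximation (Wald) interval for a proportion.

    p is the observed proportion (e.g. accuracy as a fraction), n the subset
    size, and z the standard-normal quantile (1.96 for a 95% interval).
    Assumption: Table 1 uses this interval; the source does not say so.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / n)


# TF-IDF baseline, testing accuracy, values taken from Table 1.
lung = wald_half_width(0.9233, 1160)   # Lung, n = 1,160
liver = wald_half_width(0.9012, 417)   # Liver, n = 417

print(f"Lung:  ±{lung * 100:.2f}%")    # prints "Lung:  ±1.53%"
print(f"Liver: ±{liver * 100:.2f}%")   # prints "Liver: ±2.86%"
```

Both results match the margins reported for those cells. The Wald interval is known to be unreliable when p is close to 0 or 1, which applies to many of the training-set entries, so alternatives such as the Wilson interval may be preferable there.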