Table 1.
Quantitative (Justified Trust) and qualitative (Explanation Satisfaction) comparison of CX-ToM with a random-guessing baseline, a no-explanation (NO-X) baseline, and other state-of-the-art XAI frameworks such as CAM, Grad-CAM, LIME, LRP, SmoothGrad, TCAV, CEM, and CVE.
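XAI framework | Justified trust (±std) | Explanation satisfaction: Confidence (±std) | Explanation satisfaction: Usefulness (±std) | Explanation satisfaction: Appropriate detail (±std) | Explanation satisfaction: Understandability (±std) | Explanation satisfaction: Sufficiency (±std) |
---|---|---|---|---|---|---|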
Non-expert subject pool | ||||||
Random guessing | 6.6% | NA | NA | NA | NA | NA |
NO-X | 21.4 ± 2.7% | NA | NA | NA | NA | NA |
CAM (Zhou et al., 2016) | 24.0 ± 1.9% | 4.2 ± 1.8 | 3.6 ± 0.8 | 2.2 ± 1.9 | 3.2 ± 0.9 | 2.6 ± 1.3 |
Grad-CAM (Selvaraju et al., 2017a) | 29.2 ± 3.1% | 4.1 ± 1.1 | 3.2 ± 1.9 | 3.0 ± 1.6 | 4.2 ± 1.1 | 3.2 ± 1.0 |
LIME (Ribeiro et al., 2016) | 46.1 ± 1.2% | 5.1 ± 1.8 | 4.2 ± 1.6 | 3.9 ± 1.1 | 4.1 ± 2.0 | 4.3 ± 1.6 |
SHAP (Lundberg and Lee, 2017) | 40.9 ± 2.0% | 4.8 ± 3.0 | 3.9 ± 1.1 | 3.6 ± 1.9 | 3.8 ± 1.4 | 4.0 ± 2.3 |
LRP (Bach et al., 2015) | 31.1 ± 2.5% | 1.1 ± 2.2 | 2.8 ± 1.0 | 1.6 ± 1.7 | 2.8 ± 1.0 | 2.1 ± 1.8 |
SmoothGrad (Smilkov et al., 2017) | 37.6 ± 2.9% | 1.4 ± 1.0 | 2.2 ± 1.8 | 2.8 ± 1.0 | 3.1 ± 0.8 | 2.9 ± 0.8 |
TCAV (Kim et al., 2018) | 49.7 ± 3.3% | 3.6 ± 2.1 | 3.2 ± 1.8 | 3.3 ± 1.6 | 3.6 ± 2.1 | 3.9 ± 1.1 |
CEM (Dhurandhar et al., 2018) | 51.0 ± 2.1% | 4.1 ± 1.4 | 3.4 ± 1.4 | 3.1 ± 2.1 | 2.9 ± 0.9 | 3.3 ± 1.6 |
CVE (Goyal et al., 2019) | 50.9 ± 3.0% | 3.8 ± 1.9 | 3.1 ± 0.9 | 3.6 ± 2.1 | 4.1 ± 1.2 | 4.2 ± 1.2 |
Fault-lines without ToM | 69.1 ± 2.1% | 6.2 ± 1.2 | 6.6 ± 0.7 | 7.2 ± 0.9 | 7.1 ± 0.6 | 6.2 ± 0.8 |
CX-ToM (fault-lines with ToM) | 72.1 ± 1.1% | 6.9 ± 0.8 | 6.5 ± 0.9 | 7.8 ± 1.2 | 7.7 ± 0.2 | 6.9 ± 0.6 |
Expert subject pool | ||||||
NO-X | 28.1 ± 4.1% | NA | NA | NA | NA | NA |
CAM (Zhou et al., 2016) | 37.1 ± 3.9% | 3.2 ± 1.8 | 3.3 ± 1.4 | 3.1 ± 2.1 | 3.1 ± 1.8 | 2.9 ± 1.9 |
Grad-CAM (Selvaraju et al., 2017a) | 39.1 ± 2.1% | 3.7 ± 1.2 | 3.1 ± 2.2 | 2.7 ± 1.9 | 3.7 ± 1.1 | 3.4 ± 1.6 |
LIME (Ribeiro et al., 2016) | 42.1 ± 3.1% | 3.1 ± 2.2 | 3.0 ± 1.2 | 2.8 ± 1.9 | 3.1 ± 2.2 | 2.8 ± 1.7 |
LRP (Bach et al., 2015) | 51.1 ± 3.1% | 3.2 ± 4.1 | 3.5 ± 1.6 | 4.2 ± 1.5 | 4.3 ± 1.0 | 3.9 ± 0.9 |
SmoothGrad (Smilkov et al., 2017) | 40.7 ± 2.1% | 3.1 ± 1.0 | 2.9 ± 1.2 | 3.8 ± 1.5 | 3.3 ± 1.1 | 3.1 ± 1.0 |
TCAV (Kim et al., 2018) | 55.1 ± 3.3% | 3.9 ± 2.8 | 3.6 ± 1.6 | 4.1 ± 1.3 | 4.9 ± 1.2 | 3.9 ± 0.8 |
CEM (Dhurandhar et al., 2018) | 61.1 ± 2.2% | 4.8 ± 1.6 | 3.7 ± 1.6 | 4.0 ± 1.2 | 3.7 ± 1.0 | 4.0 ± 1.1 |
CVE (Goyal et al., 2019) | 64.5 ± 3.7% | 4.1 ± 2.3 | 3.9 ± 1.5 | 4.6 ± 1.5 | 4.5 ± 1.4 | 3.9 ± 1.2 |
Fault-lines without ToM | 70.5 ± 1.3% | 5.7 ± 1.1 | 4.9 ± 0.8 | 5.8 ± 1.2 | 6.9 ± 1.1 | 6.4 ± 1.0 |
CX-ToM (fault-lines with ToM) | 74.5 ± 0.7% | 6.1 ± 0.8 | 5.3 ± 0.4 | 5.9 ± 1.2 | 7.1 ± 0.8 | 6.9 ± 0.7 |