Table 2. Summary of knowledge distillation approaches that distill knowledge from parts of the teacher model other than, or in addition to, its soft labels when training the student model.
In the case of several students, the results of the student with the largest size reduction are reported. In the case of several datasets, the dataset associated with the lowest accuracy reduction is recorded. Baseline models have the same size as the corresponding student models, but they were trained without the teacher models. Entries marked "(increase)" or "(decrease)" denote a change in the direction opposite to the column heading.
| Reference | Targeted architecture | Utilized data | Reduction in accuracy compared to teacher | Improvement in accuracy compared to baseline | Reduction in size |
|---|---|---|---|---|---|
| Offline distillation | | | | | |
| Lopes, Fenu & Starner (2017) | CNN | MNIST | 4.8% | 5.699% (decrease) | 50% |
| Yim et al. (2017) | ResNet | CIFAR-10 | 0.3043% (increase) | – | – |
| Gao et al. (2018) | ResNet | CIFAR-100 | 2.889% | 7.813% | 96.20% |
| Wang et al. (2019) | U-Net | Janelia (Peng et al., 2015) | – | – | 78.99% |
| He et al. (2019) | MobileNetV2 | PASCAL (Everingham et al., 2010) | 4.868% (mIoU) | – | 92.13% |
| Heo et al. (2019) | WRN | ImageNet to MIT scene (Quattoni & Torralba, 2009) | 6.191% (increase) | 14.123% | 70.66% |
| Li et al. (2019) | CNN | UIUC-Sports (Li et al., 2010) | 7.431% | 16.89% | 95.86% |
| Liu et al. (2019) | ResNet | CIFAR-10 | 0.831% | 2.637% | 73.59% |
| Online distillation | | | | | |
| Zhou et al. (2018) | WRN | CIFAR-10 | 1.006% | 1.37% | 66% |
| Zhang et al. (2019) | ResNet18 | CIFAR-100 | 13.72% | – | – |
| Kim et al. (2019) | CNN | CIFAR-100 | 5.869% | – | – |
| Walawalkar, Shen & Savvides (2020) | ResNet | CIFAR-10 | 1.019% | 1.095% | 96.36% |
| Chung et al. (2020) | WRN | CIFAR-100 | 1.557% | 6.768% | 53.333% |
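The common thread across these approaches is a training objective that augments (or replaces) Hinton-style soft-label matching with signals drawn from other parts of the teacher, such as intermediate feature maps. The PyTorch sketch below illustrates such a combined objective; the function name `distillation_loss`, the plain MSE feature-matching term, and the hyperparameter values are illustrative assumptions rather than the formulation of any specific paper in the table.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      labels, T=4.0, alpha=0.5, beta=0.1):
    """Combine hard-label, soft-label, and feature-matching terms.

    T, alpha, and beta are illustrative hyperparameters, not values
    taken from any of the surveyed papers.
    """
    # Standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    # Soft-label term (Hinton et al., 2015): match the teacher's
    # temperature-softened output distribution; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Feature-matching term: align an intermediate student feature
    # with the corresponding teacher feature (assumes matching shapes;
    # in practice a learned projection often bridges the dimensions).
    feat = F.mse_loss(student_feat, teacher_feat)

    return (1 - alpha) * hard + alpha * soft + beta * feat
```

In the offline setting summarized in the upper half of the table, `teacher_logits` and `teacher_feat` come from a fixed, pre-trained teacher; in the online setting, the teacher (or a peer ensemble) is trained jointly with the student, so these tensors are recomputed at each step rather than produced by a frozen model.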