PeerJ Comput Sci. 2021 Apr 14;7:e474. doi: 10.7717/peerj-cs.474

Table 2. Summary of knowledge distillation approaches that distill knowledge from parts of the teacher model other than, or in addition to, its soft labels when training the student models.

When several students were evaluated, results for the student with the largest size reduction are reported. When several datasets were used, the dataset with the lowest reduction in accuracy is reported. Baseline models have the same size as the corresponding student models but were trained without the teacher models.

| Reference | Targeted architecture | Utilized data | Reduction in accuracy vs. teacher | Improvement in accuracy vs. baseline | Reduction in size |
|---|---|---|---|---|---|
| Offline distillation | | | | | |
| Lopes, Fenu & Starner (2017) | CNN | MNIST | 4.8% | 5.699% (decrease) | 50% |
| Yim et al. (2017) | ResNet | CIFAR-10 | 0.3043% (increase) | | |
| Gao et al. (2018) | ResNet | CIFAR-100 | 2.889% | 7.813% | 96.20% |
| Wang et al. (2019) | U-Net | Janelia (Peng et al., 2015) | | | 78.99% |
| He et al. (2019) | MobileNetV2 | PASCAL (Everingham et al., 2010) | 4.868% (mIoU) | | 92.13% |
| Heo et al. (2019) | WRN | ImageNet to MIT scene (Quattoni & Torralba, 2009) | 6.191% (increase) | 14.123% | 70.66% |
| Li et al. (2019) | CNN | UIUC-Sports (Li et al., 2010) | 7.431% | 16.89% | 95.86% |
| Liu et al. (2019) | ResNet | CIFAR-10 | 0.831% | 2.637% | 73.59% |
| Online distillation | | | | | |
| Zhou et al. (2018) | WRN | CIFAR-10 | 1.006% | 1.37% | 66% |
| Zhang et al. (2019) | ResNet18 | CIFAR-100 | | 13.72% | |
| Kim et al. (2019) | CNN | CIFAR-100 | | 5.869% | |
| Walawalkar, Shen & Savvides (2020) | ResNet | CIFAR-10 | 1.019% | 1.095% | 96.36% |
| Chung et al. (2020) | WRN | CIFAR-100 | 1.557% | 6.768% | 53.333% |
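For context, the approaches in the table extend or replace the classic soft-label objective, in which a pretrained, frozen teacher's softened predictions supervise the student alongside the ground-truth labels. The following is a minimal PyTorch sketch of that baseline loss; the temperature and weighting defaults are illustrative assumptions, not settings taken from any of the surveyed papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Soft-label distillation loss (Hinton-style); values are illustrative."""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)

    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In offline distillation, `teacher_logits` come from a fixed, already-trained teacher, so the teacher's forward pass carries no gradient.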
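The online rows differ in when the teacher signal is produced: there is no pretrained teacher, and the networks supervise each other while both are still training. Below is a minimal sketch of one such scheme, assuming a deep-mutual-learning-style setup with two peer networks; the function name and hyperparameter values are hypothetical and do not come from the surveyed papers.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net_a, net_b, opt_a, opt_b, images, labels,
                         temperature=2.0, beta=1.0):
    """One online-distillation step: two peers exchange softened predictions."""
    logits_a = net_a(images)
    logits_b = net_b(images)

    # Each peer matches the other's softened distribution. The peer's
    # logits are detached so a network is not updated through its peer.
    kl_a = F.kl_div(F.log_softmax(logits_a / temperature, dim=1),
                    F.softmax(logits_b.detach() / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    kl_b = F.kl_div(F.log_softmax(logits_b / temperature, dim=1),
                    F.softmax(logits_a.detach() / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2

    # Both peers also train on the hard labels.
    loss_a = F.cross_entropy(logits_a, labels) + beta * kl_a
    loss_b = F.cross_entropy(logits_b, labels) + beta * kl_b

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()

    return loss_a.item(), loss_b.item()
```

Because the teacher and student train jointly, several online methods in the table report no size reduction or teacher gap, which is why those cells are empty.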