PeerJ Comput Sci. 2021 Apr 14;7:e474. doi: 10.7717/peerj-cs.474

Table 2. Summary of knowledge distillation approaches that distill knowledge from parts of the teacher model other than, or in addition to, its soft labels when training the student models.

When several students were evaluated, results for the student with the largest size reduction are reported. When several datasets were used, the dataset with the lowest reduction in accuracy is reported. Baseline models have the same size as the corresponding student models but were trained without the teacher models.

| Reference | Targeted architecture | Utilized data | Reduction in accuracy vs. teacher | Improvement in accuracy vs. baseline | Reduction in size |
|---|---|---|---|---|---|
| Offline distillation | | | | | |
| Lopes, Fenu & Starner (2017) | CNN | MNIST | 4.8% | 5.699% (decrease) | 50% |
| Yim et al. (2017) | ResNet | CIFAR-10 | 0.3043% (increase) | | |
| Gao et al. (2018) | ResNet | CIFAR-100 | 2.889% | 7.813% | 96.20% |
| Wang et al. (2019) | U-Net | Janelia (Peng et al., 2015) | | | 78.99% |
| He et al. (2019) | MobileNetV2 | PASCAL (Everingham et al., 2010) | 4.868% (mIoU) | | 92.13% |
| Heo et al. (2019) | WRN | ImageNet to MIT scene (Quattoni & Torralba, 2009) | 6.191% (increase) | 14.123% | 70.66% |
| Li et al. (2019) | CNN | UIUC-Sports (Li et al., 2010) | 7.431% | 16.89% | 95.86% |
| Liu et al. (2019) | ResNet | CIFAR-10 | 0.831% | 2.637% | 73.59% |
| Online distillation | | | | | |
| Zhou et al. (2018) | WRN | CIFAR-10 | 1.006% | 1.37% | 66% |
| Zhang et al. (2019) | ResNet18 | CIFAR-100 | | 13.72% | |
| Kim et al. (2019) | CNN | CIFAR-100 | | 5.869% | |
| Walawalkar, Shen & Savvides (2020) | ResNet | CIFAR-10 | 1.019% | 1.095% | 96.36% |
| Chung et al. (2020) | WRN | CIFAR-100 | 1.557% | 6.768% | 53.333% |
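For context, the approaches in the table extend or replace the classic soft-label objective, in which a pretrained, frozen teacher's softened predictions supervise the student alongside the ground-truth labels. The following is a minimal PyTorch sketch of that baseline loss; the temperature and weighting defaults are illustrative assumptions, not settings taken from any of the surveyed papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Soft-label distillation loss (Hinton-style); values are illustrative."""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)

    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In offline distillation, `teacher_logits` come from a fixed, already-trained teacher, so the teacher's forward pass carries no gradient.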
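The online rows differ in when the teacher signal is produced: there is no pretrained teacher, and the networks supervise each other while both are still training. Below is a minimal sketch of one such scheme, assuming a deep-mutual-learning-style setup with two peer networks; the function name and hyperparameter values are hypothetical and do not come from the surveyed papers.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net_a, net_b, opt_a, opt_b, images, labels,
                         temperature=2.0, beta=1.0):
    """One online-distillation step: two peers exchange softened predictions."""
    logits_a = net_a(images)
    logits_b = net_b(images)

    # Each peer matches the other's softened distribution. The peer's
    # logits are detached so a network is not updated through its peer.
    kl_a = F.kl_div(F.log_softmax(logits_a / temperature, dim=1),
                    F.softmax(logits_b.detach() / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    kl_b = F.kl_div(F.log_softmax(logits_b / temperature, dim=1),
                    F.softmax(logits_a.detach() / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2

    # Both peers also train on the hard labels.
    loss_a = F.cross_entropy(logits_a, labels) + beta * kl_a
    loss_b = F.cross_entropy(logits_b, labels) + beta * kl_b

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()

    return loss_a.item(), loss_b.item()
```

Because the teacher and student train jointly, several online methods in the table report no size reduction or teacher gap, which is why those cells are empty.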