
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation.

Authors :
Hossain, Md Imtiaz
Akhter, Sharmen
Hong, Choong Seon
Huh, Eui-Nam
Source :
Applied Soft Computing; Jul 2024, Vol. 159
Publication Year :
2024

Abstract

Knowledge distillation is one of the most persuasive approaches to model compression, transferring representational expertise from a large deep-learning teacher model to a small student network. Although numerous techniques have been proposed to improve teacher representations at the logits level, no study has examined the weaknesses of the teacher's representations at the feature level during distillation. Moreover, in a trained deep-learning model, not all kernels are uniformly activated when making specific predictions. Transferring this knowledge may cause the student to learn a suboptimal intrinsic distribution and prevent existing distillation methods from reaching their full potential. Motivated by these issues, this study analyses the generalization capability of teachers with and without a uniformly activated channel distribution. Preliminary investigations and theoretical analyses show that partly uniforming, or smoothing, feature maps yields improved representations that enrich generalization capability. Based on these observations, it is hypothesized that distillation-based explicit supervision using smoothed feature maps together with cross-entropy loss plays a significant role in improving generalization. Hence, this paper proposes a novel technique called Partly Unified Recalibrated Feature (PURF) map distillation. The proposed method recalibrates the feature maps by intercommunicating representational cues among nearest-neighbor channels. PURF increases the performance of state-of-the-art knowledge distillation (KD) methods across architectures by improving generalization, model compression, few-shot training, transferability, and robustness transfer on standard benchmark datasets. PURF achieves a 1.51% average accuracy improvement on seven diverse architectures in image classification tasks and raises the accuracy of state-of-the-art knowledge distillation methods by an average of 1.91% across architectures. Moreover, PURF achieves an average of 2.02% and 0.96% higher accuracy in transferability and robustness tasks, respectively, on standard benchmark datasets.

• A novel technique, PURF, helps SOTA KD methods maximize their potential.
• A generalization technique that does not decrease training performance.
• Improves the performance of knowledge distillation tasks.
• A generalization method that imposes explicit feature-level smoothness constraints.
• An extensive analysis of feature-level smoothness as a regularization technique.
• Can be employed as the top-head unit for existing distillation techniques.

[ABSTRACT FROM AUTHOR]
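The abstract describes recalibrating teacher feature maps by sharing representational cues among nearest-neighbor channels and supervising the student with the smoothed maps alongside cross-entropy loss, but this record does not give the exact formulation. The following is a minimal sketch of one plausible reading, not the authors' implementation: each teacher channel is averaged with its neighboring channels via a uniform 1D blur over the channel axis, and the smoothed map is used as a feature-level target. The function names, the neighborhood width k, and the MSE matching loss are illustrative assumptions.

import torch
import torch.nn.functional as F

def smooth_channels(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Average each channel of a (B, C, H, W) feature map with its nearest-neighbor
    # channels by blurring along the channel axis with a uniform kernel of odd width k.
    b, c, h, w = feat.shape
    x = feat.permute(0, 2, 3, 1).reshape(-1, 1, c)                          # (B*H*W, 1, C)
    kernel = torch.ones(1, 1, k, device=feat.device, dtype=feat.dtype) / k  # neighbor-averaging kernel
    x = F.conv1d(x, kernel, padding=k // 2)                                 # length stays C since k is odd
    return x.reshape(b, h, w, c).permute(0, 3, 1, 2)                        # back to (B, C, H, W)

def purf_like_feature_loss(student_feat: torch.Tensor,
                           teacher_feat: torch.Tensor,
                           k: int = 3) -> torch.Tensor:
    # Match the student's features to the smoothed ("partly unified") teacher features.
    with torch.no_grad():
        target = smooth_channels(teacher_feat, k)
    return F.mse_loss(student_feat, target)

# Hypothetical usage: add the feature-level term to the usual cross-entropy objective,
# with beta as an illustrative weighting hyperparameter.
# loss = F.cross_entropy(student_logits, labels) + beta * purf_like_feature_loss(s_feat, t_feat)

Because the smoothing runs only on the teacher side, such a term can sit on top of an existing distillation pipeline, which is consistent with the abstract's claim that PURF can serve as a top-head unit for existing distillation techniques.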

Details

Language :
English
ISSN :
1568-4946
Volume :
159
Database :
Supplemental Index
Journal :
Applied Soft Computing
Publication Type :
Academic Journal
Accession number :
177288694
Full Text :
https://doi.org/10.1016/j.asoc.2024.111579