Annealing Knowledge Distillation
DOI:
10.18653/v1/2021.eacl-main.212
Publication Date:
2021-10-20T07:16:31Z
AUTHORS (4)
ABSTRACT
Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as “dark knowledge”) besides the given regular hard labels in the training set. However, it is shown in the literature that the larger the gap between the teacher and the student networks, the more difficult their training is using knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher’s soft-targets incrementally and more efficiently. Our Annealing-KD technique is based on a gradual transition over annealed soft-targets generated by the teacher at different temperatures in an iterative process, so that the student is trained to follow the annealed teacher output in a step-by-step manner. This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our method. We did a comprehensive set of experiments on different tasks such as image classification (CIFAR-10 and 100) and NLP language inference with BERT-based models on the GLUE benchmark and consistently got superior results.
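
The abstract describes Annealing-KD only at a high level. The following is a minimal, hypothetical PyTorch sketch of what a two-phase annealed distillation loop of this kind could look like, assuming the annealed target is the teacher's output scaled by a factor that grows from 1/T_max toward 1 over the epochs and matched with an MSE loss, followed by ordinary fine-tuning on the hard labels. The function names, schedule, and hyperparameters are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

def annealing_kd_phase1(student, teacher, loader, epochs, t_max, lr=1e-3, device="cpu"):
    # Phase 1 (sketch): the student regresses onto annealed teacher outputs whose
    # scale rises from 1/t_max toward 1 as training proceeds.
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    mse = nn.MSELoss()
    teacher.eval()
    for epoch in range(epochs):
        # Linear annealing schedule (an assumption made for this sketch).
        scale = 1.0 / t_max + (1.0 - 1.0 / t_max) * epoch / max(epochs - 1, 1)
        for x, _ in loader:
            x = x.to(device)
            with torch.no_grad():
                annealed_target = scale * teacher(x)  # softened teacher output
            loss = mse(student(x), annealed_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

def annealing_kd_phase2(student, loader, epochs, lr=1e-4, device="cpu"):
    # Phase 2 (sketch): standard fine-tuning of the student on the hard labels.
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = ce(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student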