Patient Knowledge Distillation for BERT Model Compression
FOS: Computer and information sciences
Machine Learning (cs.LG)
Computation and Language (cs.CL)
DOI: 10.48550/arXiv.1908.09355
Publication Date: 2019-01-01
AUTHORS (4): Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu
ABSTRACT
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
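The intermediate-layer matching described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' released implementation: it assumes the student and teacher encoders expose per-layer hidden states as lists of [batch, seq_len, dim] tensors, and the helper names (select_teacher_layers, patient_loss, pkd_total_loss) and the default hyperparameter values (alpha, beta, temperature) are illustrative placeholders rather than the settings reported in the paper.

import torch.nn.functional as F


def select_teacher_layers(num_teacher_layers, num_student_layers, strategy="skip"):
    """Map the student's intermediate layers onto teacher layers (0-based indices).

    The student's top layer is supervised by the usual soft-label loss, so only
    its num_student_layers - 1 intermediate layers are matched here.
    """
    k = num_student_layers - 1
    if strategy == "last":
        # PKD-Last: the k teacher layers directly below the teacher's top layer.
        teacher_ids = list(range(num_teacher_layers - 1 - k, num_teacher_layers - 1))
    elif strategy == "skip":
        # PKD-Skip: every (num_teacher_layers // num_student_layers)-th teacher layer.
        step = num_teacher_layers // num_student_layers
        teacher_ids = list(range(step - 1, num_teacher_layers - 1, step))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return list(zip(range(k), teacher_ids))  # (student_layer, teacher_layer) pairs


def patient_loss(student_hidden, teacher_hidden, layer_map):
    """Mean squared distance between L2-normalized [CLS] states of matched layers."""
    loss = 0.0
    for s_idx, t_idx in layer_map:
        h_s = F.normalize(student_hidden[s_idx][:, 0], dim=-1)  # [CLS] token state
        h_t = F.normalize(teacher_hidden[t_idx][:, 0], dim=-1)
        loss = loss + F.mse_loss(h_s, h_t)
    return loss


def pkd_total_loss(student_logits, teacher_logits, labels,
                   student_hidden, teacher_hidden, layer_map,
                   alpha=0.5, beta=100.0, temperature=5.0):
    """Hard-label cross-entropy + temperature-scaled soft-label KD + patient term.

    alpha, beta and temperature are hyperparameters; the defaults here are
    placeholders for illustration, not the values tuned in the paper.
    """
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence against the teacher's softened distribution (equivalent,
    # up to a constant in the student parameters, to soft cross-entropy).
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    pt = patient_loss(student_hidden, teacher_hidden, layer_map)
    return (1.0 - alpha) * ce + alpha * kd + beta * pt

For example, with a 12-layer teacher and a 6-layer student, strategy="last" matches the student's five intermediate layers to teacher layers 7 through 11, while strategy="skip" matches them to layers 2, 4, 6, 8, and 10, mirroring the two strategies named in the abstract.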