Patient Knowledge Distillation for BERT Model Compression

FOS: Computer and information sciences; Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.1908.09355
Publication Date: 2019-01-01
ABSTRACT
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
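The abstract names the two layer-mapping strategies only at a high level. The sketch below is a minimal PyTorch-style illustration, not the authors' released implementation: it shows one way to combine a PKD-Skip / PKD-Last layer mapping with an objective made of hard-label cross-entropy, temperature-scaled soft-label distillation, and a patient term matching L2-normalised [CLS] hidden states between mapped layers. Function names, hyperparameter defaults (alpha, beta, temperature), and the assumption that hidden states arrive as bottom-to-top lists of [batch, seq_len, hidden] tensors are illustrative assumptions.

```python
# Illustrative PKD sketch (assumes PyTorch); not the paper's released code.
import torch
import torch.nn.functional as F


def pkd_layer_map(teacher_layers: int, student_layers: int, scheme: str) -> list:
    """Map each supervised student layer to a teacher layer (0-based indices).

    The student's top layer is excluded here; it is supervised through the
    output-distillation term instead.
    """
    m = student_layers - 1                              # intermediate layers to supervise
    if scheme == "skip":                                # PKD-Skip: every k-th teacher layer
        k = teacher_layers // student_layers
        return [k * (i + 1) - 1 for i in range(m)]      # e.g. 12 -> 6: teacher layers 2,4,6,8,10
    if scheme == "last":                                # PKD-Last: the layers just below the teacher's top
        return [teacher_layers - 1 - m + i for i in range(m)]  # e.g. 12 -> 6: layers 7..11
    raise ValueError(f"unknown scheme: {scheme!r}")


def patient_loss(student_hidden, teacher_hidden, mapping):
    """Mean squared distance between L2-normalised [CLS] vectors of each
    supervised student layer and its mapped teacher layer."""
    loss = 0.0
    for s_idx, t_idx in enumerate(mapping):
        s = F.normalize(student_hidden[s_idx][:, 0], dim=-1)  # [CLS] vector
        t = F.normalize(teacher_hidden[t_idx][:, 0], dim=-1)
        loss = loss + F.mse_loss(s, t)
    return loss


def pkd_objective(student_logits, teacher_logits, labels,
                  student_hidden, teacher_hidden, mapping,
                  alpha=0.5, beta=100.0, temperature=5.0):
    """(1 - alpha) * hard CE + alpha * soft distillation + beta * patient term.
    Hyperparameter values here are placeholders, not the paper's tuned settings."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    pt = patient_loss(student_hidden, teacher_hidden, mapping)
    return (1 - alpha) * ce + alpha * kd + beta * pt
```

For a 12-layer teacher compressed into a 6-layer student, pkd_layer_map(12, 6, "skip") selects teacher layers {2, 4, 6, 8, 10} and pkd_layer_map(12, 6, "last") selects {7, 8, 9, 10, 11}, matching the two strategies described in the abstract.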