Patient Knowledge Distillation for BERT Model Compression

FOS: Computer and information sciences; Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.1908.09355
Publication Date: 2019-01-01
ABSTRACT
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
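The abstract names the two layer-mapping strategies only at a high level. The sketch below is a minimal PyTorch-style illustration, not the authors' released implementation: it shows one way to combine a PKD-Skip / PKD-Last layer mapping with an objective made of hard-label cross-entropy, temperature-scaled soft-label distillation, and a patient term matching L2-normalised [CLS] hidden states between mapped layers. Function names, hyperparameter defaults (alpha, beta, temperature), and the assumption that hidden states arrive as bottom-to-top lists of [batch, seq_len, hidden] tensors are illustrative assumptions.

```python
# Illustrative PKD sketch (assumes PyTorch); not the paper's released code.
import torch
import torch.nn.functional as F


def pkd_layer_map(teacher_layers: int, student_layers: int, scheme: str) -> list:
    """Map each supervised student layer to a teacher layer (0-based indices).

    The student's top layer is excluded here; it is supervised through the
    output-distillation term instead.
    """
    m = student_layers - 1                              # intermediate layers to supervise
    if scheme == "skip":                                # PKD-Skip: every k-th teacher layer
        k = teacher_layers // student_layers
        return [k * (i + 1) - 1 for i in range(m)]      # e.g. 12 -> 6: teacher layers 2,4,6,8,10
    if scheme == "last":                                # PKD-Last: the layers just below the teacher's top
        return [teacher_layers - 1 - m + i for i in range(m)]  # e.g. 12 -> 6: layers 7..11
    raise ValueError(f"unknown scheme: {scheme!r}")


def patient_loss(student_hidden, teacher_hidden, mapping):
    """Mean squared distance between L2-normalised [CLS] vectors of each
    supervised student layer and its mapped teacher layer."""
    loss = 0.0
    for s_idx, t_idx in enumerate(mapping):
        s = F.normalize(student_hidden[s_idx][:, 0], dim=-1)  # [CLS] vector
        t = F.normalize(teacher_hidden[t_idx][:, 0], dim=-1)
        loss = loss + F.mse_loss(s, t)
    return loss


def pkd_objective(student_logits, teacher_logits, labels,
                  student_hidden, teacher_hidden, mapping,
                  alpha=0.5, beta=100.0, temperature=5.0):
    """(1 - alpha) * hard CE + alpha * soft distillation + beta * patient term.
    Hyperparameter values here are placeholders, not the paper's tuned settings."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    pt = patient_loss(student_hidden, teacher_hidden, mapping)
    return (1 - alpha) * ce + alpha * kd + beta * pt
```

For a 12-layer teacher compressed into a 6-layer student, pkd_layer_map(12, 6, "skip") selects teacher layers {2, 4, 6, 8, 10} and pkd_layer_map(12, 6, "last") selects {7, 8, 9, 10, 11}, matching the two strategies described in the abstract.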