Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
Overfitting
Pruning
Benchmark (surveying)
DOI:
10.48550/arxiv.2110.08190
Publication Date:
2021-01-01
AUTHORS (9)
ABSTRACT
Conventional wisdom in pruning Transformer-based language models is that pruning reduces the model's expressiveness and thus is more likely to cause underfitting rather than overfitting. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis, that is: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.
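
To make the setting concrete, below is a minimal, hypothetical sketch (PyTorch) of the general idea the abstract describes: a model is pruned progressively during fine-tuning while a dense teacher supervises the sparse student through knowledge distillation. All names and hyperparameters here (cubic_sparsity, magnitude_prune, T, alpha, the toy networks) are illustrative assumptions, not the authors' implementation, and the paper's error-bound analysis is not reproduced.

# Sketch: progressive pruning during fine-tuning with knowledge distillation.
# A dense "teacher" guides a progressively sparsified "student".
import torch
import torch.nn as nn
import torch.nn.functional as F


def cubic_sparsity(step, total_steps, final_sparsity=0.9):
    """Progressive sparsity schedule: ramps from 0 to final_sparsity."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)


def magnitude_prune(model, sparsity):
    """Zero out the smallest-magnitude weights in each Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(sparsity * w.numel())
            if k > 0:
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).float())


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL to the dense teacher."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


# Toy usage: tiny classifiers stand in for the Transformer teacher/student.
teacher = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).eval()
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

total_steps = 100
for step in range(total_steps):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    with torch.no_grad():
        t_logits = teacher(x)
    loss = distillation_loss(student(x), t_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Raise sparsity gradually so pruning does not shock the fine-tuned weights.
    magnitude_prune(student, cubic_sparsity(step, total_steps))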