ShareBERT: Embeddings Are Capable of Learning Hidden Layers
DOI:
10.1609/aaai.v38i16.29781
Publication Date:
2024-03-25T11:50:39Z
AUTHORS (4)
ABSTRACT
The deployment of Pre-trained Language Models on memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature have shown that small models often exhibit a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that consists of sharing parameters between embeddings and hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture called ShareBERT, which can preserve up to 95.5% of BERT Base performance using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We empirically show that our proposal does not negatively affect the model's learning capabilities and is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
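The core idea the abstract describes is reusing the embedding matrix as the source of the encoder's hidden-layer weights, so the encoder itself contributes almost no new parameters. The sketch below is only a minimal, hypothetical illustration of that kind of embedding-to-hidden parameter tying in PyTorch; the class name SharedEncoderLayer, the slicing scheme, and all shapes are assumptions for illustration and do not reproduce the actual ShareBERT architecture, which is defined in the paper.

```python
# Minimal sketch of embedding/hidden-layer parameter sharing.
# Assumption: a hidden projection borrows its weights from the token
# embedding matrix instead of allocating its own parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoderLayer(nn.Module):
    """Toy feed-forward sub-layer whose projection reuses the embedding weights."""

    def __init__(self, shared_embedding: nn.Embedding):
        super().__init__()
        vocab_size, hidden_size = shared_embedding.weight.shape
        # Tie to the embedding parameters: no new weight matrix is created,
        # only a small bias vector (hypothetical sharing scheme).
        self.shared_weight = shared_embedding.weight
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Use the first `hidden_size` rows of the shared embedding matrix
        # as a (hidden_size x hidden_size) projection.
        hidden_size = x.size(-1)
        w = self.shared_weight[:hidden_size, :]
        return torch.relu(F.linear(x, w, self.bias))


if __name__ == "__main__":
    embedding = nn.Embedding(num_embeddings=30522, embedding_dim=128)
    layer = SharedEncoderLayer(embedding)
    tokens = torch.randint(0, 30522, (2, 16))
    out = layer(embedding(tokens))
    print(out.shape)  # torch.Size([2, 16, 128])
```

In a sketch like this, the encoder layer adds only a bias vector of its own; any gradient flowing through the projection also updates the shared embedding matrix, which is the sense in which the embeddings "learn" the hidden layers.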