CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
DOI:
10.48550/arXiv.2405.17233
Publication Date:
2024-05-27
AUTHORS (7)
ABSTRACT
Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of the parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, an outlier reservation scheme is developed to retain some parameters in their original floating point precision, trading off memory cost for boosted model performance. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code will be released soon.
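For illustration, the minimal sketch below mimics the three strategies the abstract describes: per-column K-Means centroids, an outlier-guided per-column bit allocation, and full-precision reservation of the largest-magnitude weights. It is not the authors' released implementation; the function names (kmeans_1d, assign_bits, quantize_column) and the kurtosis-based column scoring used as a stand-in for the paper's precision search are illustrative assumptions.

```python
import numpy as np

def kmeans_1d(values, n_centroids, n_iters=25):
    """Plain 1-D K-Means: returns centroids and per-value assignments."""
    # Initialize centroids from quantiles so clusters start well spread.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_centroids))
    for _ in range(n_iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_centroids):
            members = values[assign == k]
            if members.size:
                centroids[k] = members.mean()
    # Final assignment against the converged centroids.
    assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

def assign_bits(W, lo=2, hi=4, hi_frac=0.25):
    """Illustrative stand-in for outlier-guided precision search:
    columns with the heaviest-tailed value distributions (highest
    kurtosis) receive the higher bit-width."""
    kurt = ((W - W.mean(0)) ** 4).mean(0) / (W.var(0) ** 2 + 1e-8)
    n_hi = max(1, int(hi_frac * W.shape[1]))
    bits = np.full(W.shape[1], lo)
    bits[np.argsort(kurt)[-n_hi:]] = hi
    return bits

def quantize_column(col, n_bits, outlier_frac=0.01):
    """Quantize one weight column: reserve the largest-magnitude
    fraction at full precision, K-Means-quantize the rest."""
    n_keep = max(1, int(outlier_frac * col.size))
    mask = np.zeros(col.size, dtype=bool)
    mask[np.argsort(np.abs(col))[-n_keep:]] = True  # reserved outliers
    centroids, assign = kmeans_1d(col[~mask], 2 ** n_bits)
    deq = col.copy()
    deq[~mask] = centroids[assign]  # outliers stay in floating point
    return deq

def quantize_matrix(W, outlier_frac=0.01):
    """Column-level quantization of a whole weight matrix."""
    bits = assign_bits(W)
    cols = [quantize_column(W[:, j], bits[j], outlier_frac)
            for j in range(W.shape[1])]
    return np.stack(cols, axis=1)

# Example: quantize a random 128x64 weight matrix at a 2/4-bit mix.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)).astype(np.float32)
W_q = quantize_matrix(W)
print("mean squared error:", float(np.mean((W - W_q) ** 2)))
```

Operating column-by-column, as above, is what lets centroids and bit-widths adapt to each column's value distribution instead of sharing one grid across the whole matrix.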