Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
DOI: 10.48550/arxiv.2502.01770
Publication Date: 2025-01-01
AUTHORS (11)
ABSTRACT
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, and their high computational and memory requirements often limit real-world deployment. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, further reducing the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, yielding substantially better accuracy than prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational cost of long-context inference.

We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD incurs a performance loss of just $\mathbf{1.78}\%$ on GLUE compared to $9.08\%$ for state-of-the-art binarization work, and $\mathbf{2.5}\%$ on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and an $\mathbf{87}\%$ power reduction compared to its standard-attention counterpart.
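The abstract describes replacing query-key dot products with Hamming distance computations over {-1, +1} vectors and pruning low-impact attention weights. The NumPy sketch below illustrates only that arithmetic identity (for {-1, +1} vectors, the dot product equals d - 2 x Hamming distance) together with a simple threshold-based pruning step; the sign binarization, the threshold, and all helper names are illustrative assumptions, not the paper's distilled training procedure or hardware kernels.

```python
import numpy as np

def binarize(x):
    # Sign binarization to {-1, +1}; a stand-in for HAD's distilled binarization (assumption).
    return np.where(x >= 0, 1.0, -1.0)

def hamming_attention_scores(Q, K):
    # For {-1, +1} vectors q and k of dimension d, q . k = d - 2 * hamming(q, k),
    # so the scaled dot-product score can be recovered from a popcount-style
    # Hamming distance instead of a full multiply-accumulate.
    d = Q.shape[-1]
    Qb, Kb = binarize(Q), binarize(K)
    hamming = (d - Qb @ Kb.T) / 2            # mismatching positions per query/key pair
    return (d - 2.0 * hamming) / np.sqrt(d)  # equivalent to (Qb @ Kb.T) / sqrt(d)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparsify(weights, threshold=0.05):
    # Illustrative pruning of low-impact attention weights, then renormalization;
    # the paper's actual sparsification criterion may differ.
    pruned = np.where(weights >= threshold, weights, 0.0)
    return pruned / pruned.sum(axis=-1, keepdims=True)

# Toy usage: 4 query tokens, 6 key/value tokens, head dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))

attn = sparsify(softmax(hamming_attention_scores(Q, K)))
out = attn @ V
print(out.shape)  # (4, 8)
```

In hardware, the Hamming distance in the sketch would map to XOR plus popcount over bit-packed vectors, which is where the reported area and power savings come from.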