Pushing the Limits of BFP on Narrow Precision LLM Inference
Artificial Intelligence (cs.AI)
Hardware Architecture (cs.AR)
DOI:
10.48550/arxiv.2502.00026
Publication Date:
2025-01-21
AUTHORS (6)
ABSTRACT
The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic complexity. These nonlinear operations are predominantly executed using inefficient floating-point formats, which makes the system difficult to optimize for both software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP variant that overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing. (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations with the DBFP format. (iii) An RTL-level DBFP-based engine implemented to support DB-Attn, applicable to both FPGA and ASIC. Results show that DB-Attn delivers significant improvements with negligible accuracy loss, achieving a 74% GPU speedup on the Softmax of LLaMA and a 10x lower-overhead performance improvement over SOTA designs.
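For context, the baseline BFP idea the abstract builds on is grouping tensor values so that each group shares a single exponent while individual elements keep low-bit mantissas. The sketch below is a minimal NumPy illustration of plain BFP quantization and dequantization, not the paper's DBFP, pivot-focus, or adaptive grouping scheme; the group size of 16, the 8-bit mantissa width, and the rounding policy are illustrative assumptions.

import numpy as np

def bfp_quantize(x: np.ndarray, group_size: int = 16, mantissa_bits: int = 8):
    """Quantize a 1-D tensor into BFP groups that each share one exponent."""
    pad = (-len(x)) % group_size
    groups = np.pad(x, (0, pad)).reshape(-1, group_size)

    # Shared exponent per group, chosen from the largest magnitude element
    # so the biggest value in the group stays representable.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    safe_max = np.maximum(max_abs, np.finfo(np.float32).tiny)
    shared_exp = np.ceil(np.log2(safe_max))

    # Scale each group into the signed integer range of the mantissa.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissas = np.clip(np.round(groups / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1)
    return mantissas.astype(np.int32), shared_exp, pad

def bfp_dequantize(mantissas, shared_exp, pad, mantissa_bits: int = 8):
    """Reconstruct approximate floating-point values from BFP groups."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    x = (mantissas * scale).reshape(-1)
    return x[:len(x) - pad] if pad else x

if __name__ == "__main__":
    x = np.random.randn(100).astype(np.float32)
    m, e, pad = bfp_quantize(x)
    x_hat = bfp_dequantize(m, e, pad)
    print("max abs reconstruction error:", np.max(np.abs(x - x_hat)))

Because every element in a group reuses the same exponent, multiplications reduce to cheap integer mantissa arithmetic plus exponent addition, which is why BFP maps well onto accelerator hardware for the linear layers mentioned in the abstract.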