Pushing the Limits of BFP on Narrow Precision LLM Inference
Artificial Intelligence (cs.AI)
Hardware Architecture (cs.AR)
DOI:
10.48550/arxiv.2502.00026
Publication Date:
2025-01-21
AUTHORS (6)
ABSTRACT
The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic complexity. These nonlinear operations are predominantly executed using inefficient floating-point formats, which makes the system difficult to optimize for both software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP variant that overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing. (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations with the DBFP format. (iii) An RTL-level DBFP-based engine implemented to support DB-Attn, applicable to both FPGA and ASIC. Results show that DB-Attn delivers significant improvements with negligible accuracy loss, achieving a 74% GPU speedup on the Softmax of LLaMA and a 10x lower-overhead performance improvement over SOTA designs.
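For context, the baseline BFP idea the abstract builds on is grouping tensor values so that each group shares a single exponent while individual elements keep low-bit mantissas. The sketch below is a minimal NumPy illustration of plain BFP quantization and dequantization, not the paper's DBFP, pivot-focus, or adaptive grouping scheme; the group size of 16, the 8-bit mantissa width, and the rounding policy are illustrative assumptions.

import numpy as np

def bfp_quantize(x: np.ndarray, group_size: int = 16, mantissa_bits: int = 8):
    """Quantize a 1-D tensor into BFP groups that each share one exponent."""
    pad = (-len(x)) % group_size
    groups = np.pad(x, (0, pad)).reshape(-1, group_size)

    # Shared exponent per group, chosen from the largest magnitude element
    # so the biggest value in the group stays representable.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    safe_max = np.maximum(max_abs, np.finfo(np.float32).tiny)
    shared_exp = np.ceil(np.log2(safe_max))

    # Scale each group into the signed integer range of the mantissa.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissas = np.clip(np.round(groups / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1)
    return mantissas.astype(np.int32), shared_exp, pad

def bfp_dequantize(mantissas, shared_exp, pad, mantissa_bits: int = 8):
    """Reconstruct approximate floating-point values from BFP groups."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    x = (mantissas * scale).reshape(-1)
    return x[:len(x) - pad] if pad else x

if __name__ == "__main__":
    x = np.random.randn(100).astype(np.float32)
    m, e, pad = bfp_quantize(x)
    x_hat = bfp_dequantize(m, e, pad)
    print("max abs reconstruction error:", np.max(np.abs(x - x_hat)))

Because every element in a group reuses the same exponent, multiplications reduce to cheap integer mantissa arithmetic plus exponent addition, which is why BFP maps well onto accelerator hardware for the linear layers mentioned in the abstract.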