Productively Generating a High-Performance Linear Algebra Library on FPGAs

DOI: 10.1145/3723046 Publication Date: 2025-03-11T16:50:21Z
ABSTRACT
Linear algebra computations can be greatly accelerated using spatial accelerators on FPGAs. As a standard building block of linear algebra applications, BLAS covers a wide range of compute patterns that vary vastly in data reuse, bottleneck resources, matrix storage layouts, and data types. However, existing implementations of BLAS routines on FPGAs face a dilemma between productivity and performance: they either require extensive human effort or fail to exploit the properties of individual routines for acceleration. We introduce Lasa, a framework composed of a programming model and a compiler, designed to resolve the dilemma by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. The programming model realizes systolic arrays using uniform recurrence equations and space-time transforms. We propose streaming tensors, an intuitive dataflow-style abstraction that uniformly describes the movement, storage, and transposition of input and output data across the spatial components. From the streaming tensors, our compiler automatically builds a customized memory hierarchy on an FPGA and further specializes the architecture with transparent FPGA-specific optimizations. Using this framework, we develop a complete BLAS library, demonstrating performance on par with expert-written HLS code for BLAS level 3 routines, 76%-94% of machine peak for level 1 and 2 routines, and 1.6X-13X speedup by leveraging matrix properties such as symmetry, triangularity, and bandedness.
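The abstract's mention of "uniform recurrence equations and space-time transforms" refers to the classic URE formulation of matrix multiply, in which operands propagate between neighboring index points and a space-time transform maps two index dimensions onto a 2-D systolic array (space) and the third onto clock steps (time). The sketch below is not Lasa's actual DSL; it is a minimal, hedged Python illustration of that formulation, with space = (i, j) and time = k.

```python
import numpy as np

def matmul_ure(A, B):
    """Evaluate C = A @ B via the uniform recurrence equations
        a(i,j,k) = a(i,j-1,k)            # A values flow along j
        b(i,j,k) = b(i-1,j,k)            # B values flow along i
        c(i,j,k) = c(i,j,k-1) + a*b      # partial sums accumulate along k
    A space-time transform would assign (i, j) to the PE grid (space)
    and k to clock cycles (time), yielding a 2-D systolic array."""
    I, K = A.shape
    K2, J = B.shape
    assert K == K2, "inner dimensions must match"
    a = np.zeros((I, J, K))
    b = np.zeros((I, J, K))
    c = np.zeros((I, J, K))
    for k in range(K):               # time axis under space = (i, j)
        for i in range(I):
            for j in range(J):
                # Boundary index points read from memory; interior
                # points only receive values from a fixed-distance
                # neighbor -- the "uniform" dependence property.
                a[i, j, k] = A[i, k] if j == 0 else a[i, j - 1, k]
                b[i, j, k] = B[k, j] if i == 0 else b[i - 1, j, k]
                acc = 0.0 if k == 0 else c[i, j, k - 1]
                c[i, j, k] = acc + a[i, j, k] * b[i, j, k]
    return c[:, :, K - 1]            # drained results after the last step
```

Because every dependence has a constant distance vector, the same loop nest can be retimed and laid out in space without changing the values computed, which is what makes the systolic mapping mechanical for a compiler.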