NFDI4DS | UHH-SEMS - Publication Details

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

Keywords: Application-specific integrated circuit; Hardware acceleration; Symmetric multiprocessor system; Speedup

DOI: 10.1145/3686163 Publication Date: 2024-08-05T15:51:50Z


AUTHORS (14)

Jinming Zhuang

Jason Lau

Hanchen Ye

Zhuoping Yang

Shixin Ji

Jack Lo

Kristof Denolf

Stephen Neuendorffer

Alex Jones

Jingtong Hu

Yiyu Shi

Deming Chen

Jason Cong

Peipei Zhou

ABSTRACT

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?

We identify the biggest system throughput bottleneck as resulting from the mismatch between the massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification.

We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in the FP32 data type, respectively, which obtain 5.29\(\times\), 32.51\(\times\), 1.00\(\times\), and 1.00\(\times\) throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively.

We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this article and to enable other users to learn and leverage the CHARM framework and tools in their systems: https://github.com/arc-research-lab/CHARM
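The arithmetic behind the abstract's headline figures can be checked with a short back-of-the-envelope sketch. It is not the analytical model that CHARM itself uses, which this page does not describe. The core count, clock rate, peak throughput, and BERT figures are taken from the abstract. The padding-based utilization function, the tile shape, and the layer shape are hypothetical stand-ins chosen only to show why a small MM can leave a large monolithic accelerator mostly idle.

```python
# Back-of-the-envelope sketch only; NOT the CHARM analytical model.

AIE_COUNT = 400       # AI Engine processors (from the abstract)
CLOCK_HZ = 1e9        # 1 GHz (from the abstract)
PEAK_TFLOPS = 6.4     # FP32 peak (from the abstract)


def flops_per_cycle_per_core(peak_tflops, cores, clock_hz):
    """FP32 operations each core must retire per cycle to reach the peak."""
    return peak_tflops * 1e12 / (cores * clock_hz)


def implied_baseline_tflops(throughput_tflops, gain):
    """Baseline throughput implied by a reported throughput and gain."""
    return throughput_tflops / gain


def padded_utilization(m, k, n, tile_m, tile_k, tile_n):
    """Useful fraction of work when an (m x k) by (k x n) multiply is padded
    up to whole accelerator tiles. Assumed toy model: it ignores bandwidth,
    pipelining, and scheduling effects."""
    def pad(dim, tile):
        return -(-dim // tile) * tile  # round up to a multiple of tile
    useful = m * k * n
    executed = pad(m, tile_m) * pad(k, tile_k) * pad(n, tile_n)
    return useful / executed


if __name__ == "__main__":
    per_core = flops_per_cycle_per_core(PEAK_TFLOPS, AIE_COUNT, CLOCK_HZ)
    print(f"FP32 ops per cycle per AI Engine: {per_core:.1f}")

    # BERT FP32: 1.46 TFLOPS at a 5.29x gain (both from the abstract).
    baseline = implied_baseline_tflops(1.46, 5.29)
    print(f"Implied monolithic BERT throughput: {baseline:.3f} TFLOPS "
          f"({100 * baseline / PEAK_TFLOPS:.1f}% of peak)")

    # Hypothetical small layer on a hypothetical large tile shape.
    util = padded_utilization(512, 64, 512, 1536, 128, 1024)
    print(f"Toy utilization of a small MM on a large tile: {100 * util:.1f}%")
```

In this toy model a layer whose dimensions are exact multiples of the tile reaches full utilization, while a layer much smaller than the tile wastes most of the array. That is the mismatch the abstract cites as motivation for composing several differently sized accelerators.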


REFERENCES (54)

CITATIONS (3)
