Jinming Zhuang

ORCID: 0000-0003-3659-339X
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Low-power high-performance VLSI design
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Memory and Neural Computing
  • Algorithms and Data Compression
  • Advanced Neural Network Applications
  • Green IT and Sustainability
  • Modular Robots and Swarm Intelligence
  • Network Packet Processing and Optimization
  • Real-Time Systems Scheduling
  • Environmental Impact and Sustainability
  • Advanced Materials and Mechanics
  • Image Enhancement Techniques
  • Real-time simulation and control systems
  • CCD and CMOS Imaging Sensors
  • Distributed and Parallel Computing Systems
  • Advanced Image and Video Retrieval Techniques

Brown University
2024-2025

University of Pittsburgh
2023-2024

John Brown University
2024

University of Maryland, College Park
2023

University of Electronic Science and Technology of China
2021

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic (PL), and AI Engine processors (AIE) optimized for AI/ML. An array of 400 AIEs executing at 1 GHz can theoretically provide up to...

10.1145/3543622.3573210 article EN cc-by-nc-sa 2023-02-10
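
As a quick sanity check on the theoretical peak this abstract truncates, the commonly cited 6.4 TFLOPs FP32 figure for a 400-AIE array at 1 GHz follows from per-engine MAC throughput. A minimal sketch, assuming 8 FP32 MACs per AIE per cycle (taken from AMD's public AIE material; treat it as an assumption here):

```python
# Back-of-the-envelope FP32 peak for a 400-AIE array at 1 GHz.
# ASSUMPTION: 8 FP32 MACs per AIE per cycle; one MAC counts as
# 2 floating-point operations (multiply + add).
num_aies = 400
freq_hz = 1e9
macs_per_cycle = 8        # assumed per-AIE FP32 MAC width
ops_per_mac = 2

peak_tflops = num_aies * freq_hz * macs_per_cycle * ops_per_mac / 1e12
print(f"theoretical peak: {peak_tflops:.1f} TFLOPS")  # -> 6.4 TFLOPS
```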

With the increase in the computation intensity of the chip, the mismatch between layer shapes and the available resources significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architectures to maximize throughput. However, using them could potentially increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. From the observations, we find...

10.1145/3626202.3637569 preprint EN cc-by-nc-sa 2024-04-01
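
To make the two execution models concrete, here is a hypothetical toy latency model; all layer workloads and accelerator throughputs below are invented for illustration. A monolithic accelerator runs layers back-to-back, while per-layer spatial accelerators form a pipeline, trading single-input latency for steady-state throughput:

```python
# Toy model contrasting the abstract's two execution models.
# All GFLOP counts and throughput numbers are hypothetical.

layers = [("conv1", 4.0), ("conv2", 1.0), ("fc", 0.25)]   # (name, GFLOPs)

def temporal(layers, gflops):
    # (1) One monolithic accelerator launches layers back-to-back.
    return sum(work / gflops for _, work in layers)

def spatial(layers, gflops_each):
    # (2) One accelerator per layer, pipelined: a single input pays every
    # stage's latency, but steady-state throughput is set by the slowest
    # stage (the pipeline initiation interval).
    stages = [work / g for (_, work), g in zip(layers, gflops_each)]
    return sum(stages), max(stages)

print("temporal latency (s):", temporal(layers, gflops=100.0))
lat, interval = spatial(layers, gflops_each=[60.0, 25.0, 15.0])
print("spatial latency (s):", lat, "pipeline interval (s):", interval)
```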

As the increasing complexity of Neural Network (NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., the Versal ACAP architecture, featuring programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs, which has a theoretical throughput of up to 6.4 TFLOPs in FP32, 25.6 TOPs in INT16, and 102.4 TOPs in INT8. However, the higher level of heterogeneity makes it non-trivial to achieve high performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide...

10.1109/dac56929.2023.10247981 article EN 2023-07-09
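
The three peak numbers quoted above differ only in the per-cycle MAC count of each datatype. A sketch, assuming 8/32/128 MACs per AIE per cycle for FP32/INT16/INT8 (figures from AMD's public AIE material; treat them as assumptions here):

```python
# The 6.4 TFLOPs / 25.6 TOPs / 102.4 TOPs figures scale with the per-cycle
# MAC count of each datatype on an AIE (assumed values below).
NUM_AIES, FREQ_HZ, OPS_PER_MAC = 400, 1e9, 2
for dtype, macs_per_cycle in [("FP32", 8), ("INT16", 32), ("INT8", 128)]:
    peak = NUM_AIES * FREQ_HZ * macs_per_cycle * OPS_PER_MAC / 1e12
    print(f"{dtype:>5}: {peak:6.1f}")  # -> 6.4, 25.6, 102.4
```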

While vision transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (<1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce the model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an...

10.1109/tcad.2024.3443692 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2024-11-01

Low-precision quantization in convolutional neural network (CNN) inference has been proved effective for reducing computation complexity and bandwidth requirements. Mixed-precision CNNs manage to benefit from low precision while maintaining accuracy. In this paper, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs. The micro-architecture of MP-OPU considers sharing the computation core with mixed-precision weights and activations to improve efficiency. In addition,...

10.1109/fpl53798.2021.00014 article EN 2021-08-01
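
For readers unfamiliar with the mechanics, the following is a minimal sketch of symmetric linear quantization with different weight/activation bit-widths, the basic operation that mixed-precision inference builds on. The bit-width choices and per-tensor scaling are illustrative, not MP-OPU's actual scheme:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric per-tensor quantization to a signed integer grid.
    qmax = 2 ** (bits - 1) - 1                  # e.g., 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
a = np.random.randn(64, 64).astype(np.float32)
qw, sw = quantize(w, bits=8)   # weights at 8 bits
qa, sa = quantize(a, bits=4)   # activations at 4 bits (mixed precision)

# Integer matmul accumulates exactly; one float rescale recovers the result.
y = (qw @ qa).astype(np.float32) * (sw * sa)
print("max abs quantization error:", np.abs(y - w @ a).max())
```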

Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling the system while reducing the design complexity and cost stemming from traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets...

10.1109/asp-dac58780.2024.10473961 article EN 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC) 2024-01-22
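
As a flavor of the orchestration problem, below is a hypothetical sketch of one cost an interconnect study must reason about: the inter-chiplet traffic induced by a kernel-to-chiplet placement. All kernel names and byte counts are invented:

```python
# Hypothetical placement-cost sketch: only producer/consumer pairs placed
# on different chiplets pay the die-to-die (D2D) link cost; co-located
# kernels communicate on-chip.

# (producer, consumer, MB transferred per inference) -- illustrative
edges = [("embed", "attn", 4.0), ("attn", "ffn", 12.0), ("ffn", "attn", 12.0)]
placement = {"embed": 0, "attn": 0, "ffn": 1}       # kernel -> chiplet id

def cross_chiplet_traffic(edges, placement):
    return sum(mb for src, dst, mb in edges
               if placement[src] != placement[dst])

print(cross_chiplet_traffic(edges, placement), "MB over D2D links")
```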

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic, and AI Engine processors optimized for AI/ML. With 400 AIEs, it provides up to 6.4 TFLOPs performance for 32-bit...

10.48550/arxiv.2301.02359 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Arbitrary-precision integer multiplication is the core kernel of many applications, including scientific computing, cryptographic algorithms, etc. Existing acceleration of arbitrary-precision multiplication includes CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware intrinsics of low-bit function units (32/64-bit), arbitrary-precision multiplication can be calculated using Karatsuba decomposition or Schoolbook decomposition by decomposing the two large operands into several small operands, generating a set of small multiplications that are processed either in...

10.1109/iccad57390.2023.10323754 article EN 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2023-10-28
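
A minimal sketch of the Karatsuba decomposition named in the abstract: one n-bit multiply becomes three roughly n/2-bit multiplies plus shifts and adds, recursing until the operands fit a low-bit hardware unit (64-bit is assumed below):

```python
# Karatsuba decomposition: x*y via three half-width multiplies.
# ASSUMPTION: the hardware's native multiplier is 64-bit wide.
LIMB_BITS = 64

def karatsuba(x, y, bits):
    if bits <= LIMB_BITS:
        return x * y                          # one native hardware multiply
    half = bits // 2
    xh, xl = x >> half, x & ((1 << half) - 1)
    yh, yl = y >> half, y & ((1 << half) - 1)
    hi = karatsuba(xh, yh, half)
    lo = karatsuba(xl, yl, half)
    # The middle term reuses hi and lo, saving the fourth multiply.
    mid = karatsuba(xh + xl, yh + yl, half + 1) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

x, y = 2**1000 + 12345, 2**1000 + 67890
assert karatsuba(x, y, 2048) == x * y
```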

Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling the system while reducing the design complexity and cost stemming from traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets...

10.48550/arxiv.2311.16417 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

There is a growing call for greater amounts of increasingly agile computational power in edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these choices. Unfortunately, decadal efforts to address operational energy efficiency...

10.1145/3634769.3634798 article EN cc-by-sa 2023-10-28

There is a growing call for greater amounts of increasingly agile computational power in edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these choices. Unfortunately, decadal efforts to address operational energy efficiency...

10.48550/arxiv.2312.02991 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic, and AI Engine processors optimized for AI/ML. An array of 400 AIEs executing at 1 GHz can provide up to 6.4 TFLOPS performance...

10.1145/3686163 article EN ACM Transactions on Reconfigurable Technology and Systems 2024-08-05
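
Mapping a large MM onto an array of processing elements starts from a tiling step like the one sketched below; the tile sizes are illustrative, not the paper's actual design points. Each (i, j) output tile is independent, which is what lets tiles run on different engines in parallel:

```python
import numpy as np

TM, TN, TK = 32, 32, 32      # hypothetical per-tile dimensions

def tiled_matmul(A, B):
    # Decompose one large matmul into independent tile-level matmuls,
    # each small enough for a single processing element.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TM == 0 and N % TN == 0 and K % TK == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TM):
        for j in range(0, N, TN):        # (i, j) tiles are independent
            for k in range(0, K, TK):    # k-loop accumulates partial sums
                C[i:i+TM, j:j+TN] += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
    return C

A = np.random.randn(128, 64).astype(np.float32)
B = np.random.randn(64, 96).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```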

As the increasing complexity of Neural Network (NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., the Versal ACAP architecture, featuring programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs, which has a theoretical throughput of up to 6.4 TFLOPs in FP32, 25.6 TOPs in INT16, and 102.4 TOPs in INT8. However, the higher level of heterogeneity makes it non-trivial to achieve high performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide...

10.48550/arxiv.2305.18698 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Arbitrary-precision integer multiplication is the core kernel of many applications in simulation, cryptography, etc. Existing acceleration of arbitrary-precision multiplication includes CPUs, GPUs, FPGAs, and ASICs. Among these accelerators, FPGAs are promised to provide both good energy efficiency and flexibility. Surprisingly, in our implementations, the FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation of fabrication. Therefore, key questions arise: Where do the efficiency gains of CPUs and GPUs come from? Can...

10.48550/arxiv.2309.12275 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
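
For contrast with Karatsuba, here is a sketch of the Schoolbook decomposition this line of work also considers: split each operand into limbs and form all pairwise partial products, a regular n^2 structure that CPUs, GPUs, and FPGAs can parallelize. The 64-bit limb width is an assumption:

```python
# Schoolbook decomposition: n^2 limb-by-limb partial products.
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def schoolbook(x, y, n):
    xs, ys = to_limbs(x, n), to_limbs(y, n)
    acc = 0
    for i, xi in enumerate(xs):          # all n^2 limb multiplies are
        for j, yj in enumerate(ys):      # independent and parallelizable
            acc += (xi * yj) << (LIMB_BITS * (i + j))
    return acc

x, y = 2**255 - 19, 2**255 - 1           # each fits in four 64-bit limbs
assert schoolbook(x, y, n=4) == x * y
```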