Jinming Zhuang

ORCID: 0000-0003-3659-339X
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Low-power high-performance VLSI design
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Memory and Neural Computing
  • Algorithms and Data Compression
  • Advanced Neural Network Applications
  • Green IT and Sustainability
  • Modular Robots and Swarm Intelligence
  • Network Packet Processing and Optimization
  • Real-Time Systems Scheduling
  • Environmental Impact and Sustainability
  • Advanced Materials and Mechanics
  • Image Enhancement Techniques
  • Real-time simulation and control systems
  • CCD and CMOS Imaging Sensors
  • Distributed and Parallel Computing Systems
  • Advanced Image and Video Retrieval Techniques

Brown University
2024-2025

University of Pittsburgh
2023-2024

John Brown University
2024

University of Maryland, College Park
2023

University of Electronic Science and Technology of China
2021

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic (PL), and AI Engine processors (AIE) optimized for AI/ML. An array of 400 AIEs executing at 1 GHz can theoretically provide up to...

10.1145/3543622.3573210 article EN cc-by-nc-sa 2023-02-10
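
As a quick sanity check on the theoretical peak this abstract truncates, the commonly cited 6.4 TFLOPs FP32 figure for a 400-AIE array at 1 GHz follows from per-engine MAC throughput. A minimal sketch, assuming 8 FP32 MACs per AIE per cycle (taken from AMD's public AIE material; treat it as an assumption here):

```python
# Back-of-the-envelope FP32 peak for a 400-AIE array at 1 GHz.
# ASSUMPTION: 8 FP32 MACs per AIE per cycle; one MAC counts as
# 2 floating-point operations (multiply + add).
num_aies = 400
freq_hz = 1e9
macs_per_cycle = 8        # assumed per-AIE FP32 MAC width
ops_per_mac = 2

peak_tflops = num_aies * freq_hz * macs_per_cycle * ops_per_mac / 1e12
print(f"theoretical peak: {peak_tflops:.1f} TFLOPS")  # -> 6.4 TFLOPS
```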

With the increase in the computation intensity of the chip, the mismatch between layer shapes and the available resources significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architectures to maximize throughput. However, using them could potentially increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. From the observations, we find...

10.1145/3626202.3637569 preprint EN cc-by-nc-sa 2024-04-01
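
To make the two execution models concrete, here is a hypothetical toy latency model; all layer workloads and accelerator throughputs below are invented for illustration. A monolithic accelerator runs layers back-to-back, while per-layer spatial accelerators form a pipeline, trading single-input latency for steady-state throughput:

```python
# Toy model contrasting the abstract's two execution models.
# All GFLOP counts and throughput numbers are hypothetical.

layers = [("conv1", 4.0), ("conv2", 1.0), ("fc", 0.25)]   # (name, GFLOPs)

def temporal(layers, gflops):
    # (1) One monolithic accelerator launches layers back-to-back.
    return sum(work / gflops for _, work in layers)

def spatial(layers, gflops_each):
    # (2) One accelerator per layer, pipelined: a single input pays every
    # stage's latency, but steady-state throughput is set by the slowest
    # stage (the pipeline initiation interval).
    stages = [work / g for (_, work), g in zip(layers, gflops_each)]
    return sum(stages), max(stages)

print("temporal latency (s):", temporal(layers, gflops=100.0))
lat, interval = spatial(layers, gflops_each=[60.0, 25.0, 15.0])
print("spatial latency (s):", lat, "pipeline interval (s):", interval)
```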

As the increasing complexity of Neural Network (NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., the Versal ACAP architecture, featuring programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs, which has a theoretical throughput of up to 6.4 TFLOPs in FP32, 25.6 TOPs in INT16, and 102.4 TOPs in INT8. However, the higher level of heterogeneity makes it non-trivial to achieve high performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide...

10.1109/dac56929.2023.10247981 article EN 2023-07-09
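
The three peak numbers quoted above differ only in the per-cycle MAC count of each datatype. A sketch, assuming 8/32/128 MACs per AIE per cycle for FP32/INT16/INT8 (figures from AMD's public AIE material; treat them as assumptions here):

```python
# The 6.4 TFLOPs / 25.6 TOPs / 102.4 TOPs figures scale with the per-cycle
# MAC count of each datatype on an AIE (assumed values below).
NUM_AIES, FREQ_HZ, OPS_PER_MAC = 400, 1e9, 2
for dtype, macs_per_cycle in [("FP32", 8), ("INT16", 32), ("INT8", 128)]:
    peak = NUM_AIES * FREQ_HZ * macs_per_cycle * OPS_PER_MAC / 1e12
    print(f"{dtype:>5}: {peak:6.1f}")  # -> 6.4, 25.6, 102.4
```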

While vision transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (<1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce the model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an...

10.1109/tcad.2024.3443692 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2024-11-01

Low-precision quantization in convolutional neural network (CNN) inference has been proved effective for reducing computation complexity and bandwidth requirements. Mixed-precision CNNs manage to benefit from low precision while maintaining accuracy. In this paper, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs. The micro-architecture of MP-OPU considers sharing the computation core with mixed-precision weights and activations to improve efficiency. In addition,...

10.1109/fpl53798.2021.00014 article EN 2021-08-01
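
For readers unfamiliar with the mechanics, the following is a minimal sketch of symmetric linear quantization with different weight/activation bit-widths, the basic operation that mixed-precision inference builds on. The bit-width choices and per-tensor scaling are illustrative, not MP-OPU's actual scheme:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric per-tensor quantization to a signed integer grid.
    qmax = 2 ** (bits - 1) - 1                  # e.g., 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
a = np.random.randn(64, 64).astype(np.float32)
qw, sw = quantize(w, bits=8)   # weights at 8 bits
qa, sa = quantize(a, bits=4)   # activations at 4 bits (mixed precision)

# Integer matmul accumulates exactly; one float rescale recovers the result.
y = (qw @ qa).astype(np.float32) * (sw * sa)
print("max abs quantization error:", np.abs(y - w @ a).max())
```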

Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling the system while reducing the design complexity and cost stemming from traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets...

10.1109/asp-dac58780.2024.10473961 article EN 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC) 2024-01-22
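
As a flavor of the orchestration problem, below is a hypothetical sketch of one cost an interconnect study must reason about: the inter-chiplet traffic induced by a kernel-to-chiplet placement. All kernel names and byte counts are invented:

```python
# Hypothetical placement-cost sketch: only producer/consumer pairs placed
# on different chiplets pay the die-to-die (D2D) link cost; co-located
# kernels communicate on-chip.

# (producer, consumer, MB transferred per inference) -- illustrative
edges = [("embed", "attn", 4.0), ("attn", "ffn", 12.0), ("ffn", "attn", 12.0)]
placement = {"embed": 0, "attn": 0, "ffn": 1}       # kernel -> chiplet id

def cross_chiplet_traffic(edges, placement):
    return sum(mb for src, dst, mb in edges
               if placement[src] != placement[dst])

print(cross_chiplet_traffic(edges, placement), "MB over D2D links")
```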

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic, and AI Engine processors optimized for AI/ML. With 400 AIEs, it provides up to 6.4 TFLOPs performance for 32-bit...

10.48550/arxiv.2301.02359 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Arbitrary-precision integer multiplication is the core kernel of many applications, including scientific computing, cryptographic algorithms, etc. Existing acceleration of arbitrary-precision multiplication includes CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware intrinsics of low-bit function units (32/64-bit), arbitrary-precision multiplication can be calculated using Karatsuba decomposition or Schoolbook decomposition by decomposing the two large operands into several small operands, generating a set of small multiplications that are processed either in...

10.1109/iccad57390.2023.10323754 article EN 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2023-10-28
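
A minimal sketch of the Karatsuba decomposition named in the abstract: one n-bit multiply becomes three roughly n/2-bit multiplies plus shifts and adds, recursing until the operands fit a low-bit hardware unit (64-bit is assumed below):

```python
# Karatsuba decomposition: x*y via three half-width multiplies.
# ASSUMPTION: the hardware's native multiplier is 64-bit wide.
LIMB_BITS = 64

def karatsuba(x, y, bits):
    if bits <= LIMB_BITS:
        return x * y                          # one native hardware multiply
    half = bits // 2
    xh, xl = x >> half, x & ((1 << half) - 1)
    yh, yl = y >> half, y & ((1 << half) - 1)
    hi = karatsuba(xh, yh, half)
    lo = karatsuba(xl, yl, half)
    # The middle term reuses hi and lo, saving the fourth multiply.
    mid = karatsuba(xh + xl, yh + yl, half + 1) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

x, y = 2**1000 + 12345, 2**1000 + 67890
assert karatsuba(x, y, 2048) == x * y
```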

Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling the system while reducing the design complexity and cost stemming from traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets...

10.48550/arxiv.2311.16417 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

There is a growing call for greater amounts of increasingly agile computational power in edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these choices. Unfortunately, decadal efforts to address operational energy efficiency...

10.1145/3634769.3634798 article EN cc-by-sa 2023-10-28

There is a growing call for greater amounts of increasingly agile computational power in edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these choices. Unfortunately, decadal efforts to address operational energy efficiency...

10.48550/arxiv.2312.02991 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores, programmable logic, and AI Engine processors optimized for AI/ML. An array of 400 AIEs executing at 1 GHz can provide up to 6.4 TFLOPS performance...

10.1145/3686163 article EN ACM Transactions on Reconfigurable Technology and Systems 2024-08-05
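
Mapping a large MM onto an array of processing elements starts from a tiling step like the one sketched below; the tile sizes are illustrative, not the paper's actual design points. Each (i, j) output tile is independent, which is what lets tiles run on different engines in parallel:

```python
import numpy as np

TM, TN, TK = 32, 32, 32      # hypothetical per-tile dimensions

def tiled_matmul(A, B):
    # Decompose one large matmul into independent tile-level matmuls,
    # each small enough for a single processing element.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TM == 0 and N % TN == 0 and K % TK == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TM):
        for j in range(0, N, TN):        # (i, j) tiles are independent
            for k in range(0, K, TK):    # k-loop accumulates partial sums
                C[i:i+TM, j:j+TN] += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
    return C

A = np.random.randn(128, 64).astype(np.float32)
B = np.random.randn(64, 96).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```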

As the increasing complexity of Neural Network (NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., the Versal ACAP architecture, featuring programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs, which has a theoretical throughput of up to 6.4 TFLOPs in FP32, 25.6 TOPs in INT16, and 102.4 TOPs in INT8. However, the higher level of heterogeneity makes it non-trivial to achieve high performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide...

10.48550/arxiv.2305.18698 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Arbitrary-precision integer multiplication is the core kernel of many applications in simulation, cryptography, etc. Existing acceleration of arbitrary-precision multiplication includes CPUs, GPUs, FPGAs, and ASICs. Among these accelerators, FPGAs are promised to provide both good energy efficiency and flexibility. Surprisingly, in our implementations, the FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation of fabrication. Therefore, key questions arise: Where do the efficiency gains of CPUs and GPUs come from? Can...

10.48550/arxiv.2309.12275 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
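
For contrast with Karatsuba, here is a sketch of the Schoolbook decomposition this line of work also considers: split each operand into limbs and form all pairwise partial products, a regular n^2 structure that CPUs, GPUs, and FPGAs can parallelize. The 64-bit limb width is an assumption:

```python
# Schoolbook decomposition: n^2 limb-by-limb partial products.
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def schoolbook(x, y, n):
    xs, ys = to_limbs(x, n), to_limbs(y, n)
    acc = 0
    for i, xi in enumerate(xs):          # all n^2 limb multiplies are
        for j, yj in enumerate(ys):      # independent and parallelizable
            acc += (xi * yj) << (LIMB_BITS * (i + j))
    return acc

x, y = 2**255 - 19, 2**255 - 1           # each fits in four 64-bit limbs
assert schoolbook(x, y, n=4) == x * y
```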