Yaohui Cai

ORCID: 0000-0003-3785-3413
Research Areas
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • CCD and CMOS Imaging Sensors
  • Domain Adaptation and Few-Shot Learning
  • Topic Modeling
  • Anomaly Detection Techniques and Applications
  • Natural Language Processing Techniques
  • Adversarial Robustness in Machine Learning
  • Speech Recognition and Synthesis
  • Web Data Mining and Analysis
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Database Systems and Queries
  • Machine Learning and Algorithms
  • Graph Theory and Algorithms
  • Embedded Systems Design Techniques
  • VLSI and FPGA Design Techniques
  • Real-time simulation and control systems
  • Industrial Vision Systems and Defect Detection

Cornell University
2021-2025

Tsinghua University
2023

Peking University
2019-2021

Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we...

10.1109/cvpr42600.2020.01318 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
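
The entry above describes a zero-shot (data-free) quantization method. As a rough illustration of the kind of operation such methods build on, here is a minimal sketch of symmetric uniform weight quantization in PyTorch; the function name `quantize_tensor` and the 4-bit setting are illustrative choices, not taken from the paper, and the paper's key contribution (synthesizing calibration data without the original training set) is only noted in a comment.

```python
import torch

def quantize_tensor(w: torch.Tensor, num_bits: int = 4):
    """Symmetric uniform quantization of a weight tensor.

    Illustrative only: zero-shot methods such as the one above additionally
    synthesize calibration data (e.g., from BatchNorm statistics) to choose
    activation ranges without touching the original training set.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g., 7 for 4-bit signed
    scale = w.abs().max() / qmax            # per-tensor scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale                 # dequantized weights + scale

# Example: quantize a random layer's weights to 4 bits
w = torch.randn(64, 128)
w_q, s = quantize_tensor(w, num_bits=4)
print(f"max abs error: {(w - w_q).abs().max().item():.4f}, scale: {s.item():.4f}")
```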

Deploying deep learning models on embedded systems for computer vision tasks has been challenging due to limited compute resources and strict energy budgets. The majority of existing work focuses on accelerating image classification, while other fundamental problems, such as object detection, have not been adequately addressed. Compared with classification, detection problems are more sensitive to the spatial variance of objects and, therefore, require specialized convolutions to aggregate spatial information. To address this need,...

10.1145/3431920.3439295 preprint EN 2021-02-17
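
The abstract above centers on specialized (deformable) convolutions for object detection on resource-constrained hardware. Below is a small, hedged PyTorch sketch of a deformable 3x3 convolution in which an auxiliary convolution predicts per-location sampling offsets; the module name `DeformBlock` and all layer sizes are illustrative assumptions, not the accelerator design described in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Deformable 3x3 convolution: a small conv predicts per-location
    sampling offsets, letting the kernel adapt to object geometry."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 offsets (dy, dx) per kernel position -> 2 * 3 * 3 channels
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)        # (N, 18, H, W)
        return self.deform_conv(x, offsets)  # (N, out_ch, H, W)

# Example: a 1x64x32x32 feature map
feat = torch.randn(1, 64, 32, 32)
out = DeformBlock(64, 128)(feat)
print(out.shape)  # torch.Size([1, 128, 32, 32])
```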

Recent advancements in large language models (LLMs) have generated significant demand for efficient deployment of inference workloads. Most existing approaches rely on temporal architectures that reuse hardware units across different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves...

10.1145/3626202.3637600 article EN 2024-04-01
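
The paper above contrasts temporal architectures (hardware units reused across layers, with heavy off-chip weight traffic) against model-specific spatial dataflows on FPGAs. The sketch below is a back-of-envelope, roofline-style latency estimate meant only to convey why keeping weights on-chip can change which bound dominates; every number (model size, bandwidth, peak throughput) is an assumed placeholder, not a figure from the paper.

```python
# Back-of-envelope latency model (illustrative assumptions, not paper numbers).
# A temporal design streams weights from off-chip memory every layer, so it is
# often memory-bound; a spatial design pins weights on-chip and pipelines
# layers, so it is closer to compute-bound.

def per_token_latency_s(params: float, bytes_per_weight: float,
                        flops_per_param: float, peak_flops: float,
                        mem_bw_bytes_s: float, weights_on_chip: bool) -> float:
    compute_s = params * flops_per_param / peak_flops
    memory_s = 0.0 if weights_on_chip else params * bytes_per_weight / mem_bw_bytes_s
    return max(compute_s, memory_s)  # roofline-style bound

MODEL_PARAMS = 7e9          # assumed 7B-parameter LLM, 4-bit weights below
temporal = per_token_latency_s(MODEL_PARAMS, 0.5, 2, 1e12, 50e9, weights_on_chip=False)
spatial  = per_token_latency_s(MODEL_PARAMS, 0.5, 2, 1e12, 50e9, weights_on_chip=True)
print(f"temporal ~{temporal*1e3:.1f} ms/token, spatial ~{spatial*1e3:.1f} ms/token")
```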

Recent advancements in large language models (LLMs) boasting billions of parameters have generated significant demand for efficient deployment of inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units across different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper...

10.1145/3656177 article EN ACM Transactions on Reconfigurable Technology and Systems 2024-04-04

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing...

10.48550/arxiv.2307.13304 preprint EN other-oa arXiv (Cornell University) 2023-01-01
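
QuIP's central idea, per the abstract, is that quantization is easier when the weight and Hessian matrices are incoherent, which is encouraged by multiplying the weights with random orthogonal matrices before rounding. The toy sketch below applies a random rotation, rounds to a uniform grid, and rotates back; plain round-to-nearest stands in for the paper's adaptive (Hessian-aware) rounding, and the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n: int) -> np.ndarray:
    # Random rotation via QR of a Gaussian matrix (a stand-in for the
    # structured orthogonal transforms used in practice).
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quip_like_round(W: np.ndarray, bits: int = 2) -> np.ndarray:
    """Toy incoherence processing: rotate, round to a uniform grid, rotate back.
    The paper's adaptive rounding (minimizing a Hessian-weighted proxy loss)
    is replaced here by plain round-to-nearest for brevity."""
    m, n = W.shape
    U, V = random_orthogonal(m), random_orthogonal(n)
    Wt = U @ W @ V.T                       # incoherent representation
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(Wt).max() / qmax
    Wq = np.clip(np.round(Wt / scale), -qmax - 1, qmax) * scale
    return U.T @ Wq @ V                    # back to the original basis

W = rng.standard_normal((64, 64))
W_hat = quip_like_round(W, bits=2)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```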

Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we...

10.48550/arxiv.2001.00281 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Recent advancements in large language models (LLMs) boasting billions of parameters have generated significant demand for efficient deployment of inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units across different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial...

10.48550/arxiv.2312.15159 preprint EN other-oa arXiv (Cornell University) 2023-01-01

FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with classification, these problems are more sensitive to the spatial variance of objects and, therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic...

10.1109/emc2-nips53020.2019.00019 article EN 2019-12-01

FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with classification, these problems are more sensitive to the spatial variance of objects and, therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic...

10.48550/arxiv.2002.08357 preprint EN other-oa arXiv (Cornell University) 2020-01-01

A black-box spectral method is introduced for evaluating the adversarial robustness of a given machine learning (ML) model. Our approach, named SPADE, exploits bijective distance mapping between the input/output graphs constructed by approximating the manifolds corresponding to the input/output data. By leveraging the generalized Courant-Fischer theorem, we propose a SPADE score for the model, which is proved to be an upper bound of the best Lipschitz constant under the manifold setting. To reveal the most non-robust data samples highly vulnerable...

10.48550/arxiv.2102.03716 preprint EN other-oa arXiv (Cornell University) 2021-01-01
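
SPADE, as described above, compares graphs built on a model's inputs and outputs and derives a robustness score via the generalized Courant-Fischer theorem. The sketch below is a loose approximation of that idea: it builds kNN graph Laplacians for input and output embeddings and reports the largest generalized eigenvalue as a stretch-factor proxy. The helper names, the ridge term, and the dense eigensolver are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Unnormalized graph Laplacian of a symmetrized kNN graph."""
    A = kneighbors_graph(X, k, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                      # symmetrize adjacency
    return np.diag(A.sum(axis=1)) - A

def spade_like_score(X_in: np.ndarray, X_out: np.ndarray,
                     k: int = 10, eps: float = 1e-6) -> float:
    """Largest generalized eigenvalue of (L_out, L_in): a rough proxy for how
    much the model stretches input-manifold distances (larger = less robust).
    The ridge term eps keeps the input Laplacian positive definite."""
    L_in = knn_laplacian(X_in, k) + eps * np.eye(len(X_in))
    L_out = knn_laplacian(X_out, k)
    eigvals = eigh(L_out, L_in, eigvals_only=True)
    return float(eigvals[-1])

# Toy example: a random "model" mapping 64-d inputs to 10-d outputs
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))
Y = np.tanh(X @ rng.standard_normal((64, 10)))
print("SPADE-like score:", spade_like_score(X, Y))
```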

Deploying deep learning models on embedded systems has been challenging due to limited computing resources. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with classification, detection problems are more sensitive to the spatial variance of objects, and therefore require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable...

10.48550/arxiv.2006.08357 preprint EN other-oa arXiv (Cornell University) 2020-01-01