- Advanced Neural Network Applications
- Adversarial Robustness in Machine Learning
- Integrated Circuits and Semiconductor Failure Analysis
- Domain Adaptation and Few-Shot Learning
- CCD and CMOS Imaging Sensors
- Visual Attention and Saliency Detection
- Stochastic Gradient Optimization Techniques
- Brain Tumor Detection and Classification
- Physical Unclonable Functions (PUFs) and Hardware Security
- Neural Networks and Applications
- Anomaly Detection Techniques and Applications
- Advanced Neuroimaging Techniques and Applications
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Tensor Decomposition and Applications
- Multimodal Machine Learning Applications
- Advanced Optical Sensing Technologies
- Handwritten Text Recognition Techniques
- Bacillus and Francisella bacterial research
- Geophysical Methods and Applications
- Lattice Boltzmann Simulation Studies
- Machine Learning and Data Classification
- Parallel Computing and Optimization Techniques
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
Shanghai Jiao Tong University
2018-2024
ShangHai JiAi Genetics & IVF Institute
2022-2024
Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracy, sparse models often carry randomly distributed weights, leading to irregular computations. Consequently, they cannot achieve meaningful speedup on commodity hardware (e.g., GPUs) built for dense matrix operations. As such, prior works usually modify existing architectures or design completely new sparsity-optimized architectures to exploit sparsity. We propose an algorithm-software co-designed method that...
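The abstract is truncated above, but the irregularity it refers to is easy to see in a minimal sketch. The example below is an illustration of plain unstructured magnitude pruning (not the paper's co-designed method): the surviving non-zeros land at arbitrary positions, so dense-matrix hardware still processes full tiles.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(8, 8)
w_sparse = magnitude_prune(w, sparsity=0.75)
# The non-zero positions are scattered irregularly, so a GPU built for dense
# matrix operations still has to fetch and multiply the whole 8x8 tile.
print((w_sparse != 0).int())
```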
Recently, researchers have started decomposing deep neural network (DNN) models according to their semantics or functions. Recent work has shown the effectiveness of decomposed functional blocks for defending against adversarial attacks, which add small perturbations to an input image to fool DNN models. This work proposes a profiling-based method to decompose DNN models into different functional blocks, which leads to the effective path as a new approach for exploring DNNs' internal organization. Specifically, per-image effective paths can be aggregated into a class-level path, through which we...
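The abstract above is truncated; as a rough sketch of the profiling idea it describes, the code below marks the most strongly activated neurons of one image per layer and aggregates those per-image masks into a class-level path. The top-k criterion and the union-based aggregation are simplifying assumptions, not the paper's exact extraction rule.

```python
import torch

def per_image_path(activations: dict, keep_ratio: float = 0.1) -> dict:
    """Profile one image: mark the largest-magnitude activations in each layer."""
    path = {}
    for layer, act in activations.items():
        flat = act.flatten().abs()
        k = max(1, int(flat.numel() * keep_ratio))
        mask = torch.zeros(flat.numel(), dtype=torch.bool)
        mask[flat.topk(k).indices] = True
        path[layer] = mask
    return path

def class_level_path(paths: list) -> dict:
    """Aggregate per-image paths of one class by taking the union of their masks."""
    agg = {}
    for p in paths:
        for layer, mask in p.items():
            agg[layer] = agg.get(layer, torch.zeros_like(mask)) | mask
    return agg
```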
Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly distributed weights to maintain accuracy, leading to irregular computations. Consequently, unstructured sparse models cannot achieve meaningful speedup on commodity hardware built for dense matrix operations. Accelerators are therefore usually modified or designed with structured sparsity-optimized architectures to exploit sparsity. For example, the Ampere architecture introduces a sparse tensor core, which...
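For context on the structured-sparsity example, NVIDIA's Ampere sparse tensor cores accelerate the 2:4 pattern (at most two non-zeros in every aligned group of four weights). The sketch below is only a minimal illustration of pruning a weight matrix to that pattern; it is not the paper's method.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the two largest-magnitude weights in every aligned group of four."""
    assert weight.shape[-1] % 4 == 0
    groups = weight.reshape(-1, 4)
    drop = groups.abs().topk(2, dim=1, largest=False).indices  # two smallest per group
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(1, drop, False)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(prune_2_to_4(w))  # every aligned group of four now holds exactly two zeros
```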
Deep learning is vulnerable to adversarial attacks, where carefully crafted input perturbations can mislead a well-trained Deep Neural Network (DNN) into producing incorrect results. Adversarial attacks jeopardize the safety, security, and privacy of DNN-enabled systems. Today's countermeasures either lack the capability to detect adversarial samples at inference time, or introduce prohibitively high overhead to be practical at inference time. We propose Ptolemy, an algorithm-architecture co-designed system that detects...
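The truncated abstract does not spell out Ptolemy's detection rule. As a simplified illustration only, one can compare an input's per-image path (see the sketch after the effective-path abstract above) against the aggregated path of the predicted class and flag low overlap as suspicious; the 0.6 threshold is an arbitrary assumption, not a value from the paper.

```python
import torch

def path_similarity(img_path: dict, class_path: dict) -> float:
    """Fraction of the image's active neurons that also appear in the class-level path."""
    hit = total = 0
    for layer, mask in img_path.items():
        hit += (mask & class_path[layer]).sum().item()
        total += mask.sum().item()
    return hit / max(total, 1)

def looks_adversarial(img_path: dict, class_path: dict, threshold: float = 0.6) -> bool:
    # Low overlap with the predicted class's canonical path is treated as suspicious.
    return path_similarity(img_path, class_path) < threshold
```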
Quantization of deep neural networks (DNNs) has been proven effective for compressing and accelerating DNN models. Data-free quantization (DFQ) is a promising approach when the original datasets are unavailable under privacy-sensitive or confidential scenarios. However, current DFQ solutions degrade accuracy, need synthetic data to calibrate networks, and are time-consuming and costly. This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices with low...
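SQuant's sub-second, synthesis-free algorithm is not reproduced here; the sketch below only shows the data-free baseline such work builds on: symmetric per-channel weight quantization that needs neither real nor synthetic calibration data.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Symmetric per-output-channel weight quantization using no input data at all."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=tuple(range(1, w.dim())), keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

w = torch.randn(16, 3, 3, 3)            # conv weight: [out_ch, in_ch, kh, kw]
q, scale = quantize_weight_per_channel(w)
w_hat = q.float() * scale               # dequantized approximation of w
```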
The success of deep neural networks (DNNs) has sparked efforts to analyze them (e.g., tracing) and optimize them (e.g., pruning). These tasks have specific requirements and ad-hoc implementations in current execution backends like TensorFlow/PyTorch, which require developers to manage fragmented interfaces and adapt their code to diverse models. In this study, we propose a new framework called Amanda to streamline the development of these tasks. We formalize their implementation as network instrumentation, which involves introducing...
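Amanda's own instrumentation interface is not shown in the truncated abstract. The sketch below uses plain PyTorch forward hooks to illustrate the kind of cross-cutting tracing task such a framework would unify; it is not Amanda's API.

```python
import torch
import torch.nn as nn

def trace_output_shapes(model: nn.Module):
    """Attach forward hooks that record each leaf module's output shape."""
    records, handles = {}, []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:        # leaf modules only
            def hook(mod, inputs, output, name=name):
                records[name] = tuple(output.shape)
            handles.append(module.register_forward_hook(hook))
    return records, handles

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
records, handles = trace_output_shapes(model)
model(torch.randn(1, 3, 32, 32))
for h in handles:
    h.remove()
print(records)
```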
Parallel computers now start to adopt bandwidth-asymmetric memory architectures that combine traditional DRAM with new High Bandwidth Memory (HBM) for higher bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality on bandwidth-asymmetric architectures. To solve these two problems, we propose a Bandwidth and Locality Aware Task-Stealing (BATS) system, which consists of an HBM-aware memory allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler....
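The BATS scheduler itself is not described beyond the component list above; the sketch below is a simplified, hypothetical illustration of locality-aware stealing, where a thief prefers victims in its own memory domain before reaching across domains.

```python
import collections
import random

class Worker:
    """A worker thread's state: its memory domain and its local task deque."""
    def __init__(self, wid: int, domain: int):
        self.wid, self.domain = wid, domain
        self.deque = collections.deque()

def steal(thief: Worker, workers: list):
    # Prefer victims in the thief's own memory domain (better data locality),
    # then fall back to victims attached to other domains.
    local = [w for w in workers if w is not thief and w.domain == thief.domain and w.deque]
    remote = [w for w in workers if w is not thief and w.domain != thief.domain and w.deque]
    for pool in (local, remote):
        if pool:
            return random.choice(pool).deque.popleft()   # steal from the victim's head
    return None
```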
Continuous vision is the cornerstone of a diverse range of intelligent applications found on emerging computing platforms such as autonomous machines and Augmented Reality glasses. A critical issue in today's continuous vision systems is their long end-to-end frame latency, which significantly impacts system agility and user experience. We find that the long latency is fundamentally caused by the serialized execution model of the pipeline, whose key stages, including sensing, imaging, and vision computations, execute sequentially, leading to long latency.
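A small worked example of the serialization effect, with hypothetical stage times rather than measurements from the paper:

```python
# Hypothetical per-frame stage times in milliseconds.
stages = {"sensing": 10.0, "imaging": 8.0, "vision": 20.0}

# Serialized execution: one frame's end-to-end latency is the sum of all stages.
serialized_latency = sum(stages.values())     # 38.0 ms

# Pipelining across frames only bounds throughput by the slowest stage;
# a single frame's latency stays at the serialized sum unless stages of the
# *same* frame can be overlapped (e.g., via the speculation proposed below).
throughput_bound = max(stages.values())       # 20.0 ms between frame completions
print(serialized_latency, throughput_bound)
```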
An activation function is an element-wise mathematical function that plays a crucial role in deep neural networks (DNNs). Many novel and sophisticated activation functions have been proposed to improve DNN accuracy, but they also consume massive memory during the training process with back-propagation. In this study, we propose nested forward automatic differentiation (Forward-AD), specifically for activation functions, to enable memory-efficient training. We deploy Forward-AD in two widely-used deep learning frameworks, TensorFlow and PyTorch, which support static and dynamic...
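The truncated abstract does not include code; the sketch below is a PyTorch illustration of the underlying idea: evaluate the activation's derivative during the forward pass and save only that single tensor, instead of the intermediate tensors a composite activation would otherwise record for back-propagation. The SiLU choice is an example, not the paper's implementation.

```python
import torch

class ForwardADSiLU(torch.autograd.Function):
    """SiLU (x * sigmoid(x)) whose derivative is computed during the forward pass.

    A plain composition `x * torch.sigmoid(x)` saves both x and sigmoid(x) for
    backward; here only one tensor, the derivative, is saved.
    """

    @staticmethod
    def forward(ctx, x):
        s = torch.sigmoid(x)
        y = x * s
        grad = s + y * (1.0 - s)        # d/dx [x * sigmoid(x)]
        ctx.save_for_backward(grad)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (grad,) = ctx.saved_tensors
        return grad_output * grad

x = torch.randn(4, requires_grad=True)
ForwardADSiLU.apply(x).sum().backward()
reference = torch.autograd.grad((x * torch.sigmoid(x)).sum(), x)[0]
print(torch.allclose(x.grad, reference))    # True: gradients match autograd
```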
Deep learning is vulnerable to adversarial attacks, where carefully crafted input perturbations can mislead a well-trained Deep Neural Network (DNN) into producing incorrect results. Today's countermeasures to adversarial attacks either lack the capability to detect adversarial samples at inference time, or introduce prohibitively high overhead to be practical at inference time. We propose Ptolemy, an algorithm-architecture co-designed system that detects adversarial samples at inference time with low overhead and high accuracy. We exploit the synergies between DNN inference and imperative program execution:...
Post-training quantization attracts increasing attention due to its convenience in deploying quantized neural networks. Although rounding-to-nearest remains the prevailing method for DNN quantization, prior research has demonstrated its suboptimal nature when applied to weight quantization. These works propose optimizing rounding schemes by leveraging the output error rather than the traditional rounding error. Our study reveals that similar challenges also extend to activation quantization. Despite the easy generalization, the challenges lie in the dynamic...
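As a toy illustration of the weight-rounding idea the abstract refers to (per-weight floor/ceil choices driven by output error rather than rounding error), the sketch below uses a greedy coordinate-wise rule of my own; it is neither the cited rounding scheme nor the paper's activation-quantization method.

```python
import torch

def round_to_nearest(w: torch.Tensor, scale: float) -> torch.Tensor:
    return torch.round(w / scale) * scale

def output_aware_round(w: torch.Tensor, x: torch.Tensor, scale: float) -> torch.Tensor:
    """Pick floor or ceil per weight to reduce the output error ||x @ Wq.T - x @ W.T||,
    rather than the per-weight rounding error minimized by round-to-nearest."""
    q = torch.floor(w / scale)
    ref = x @ w.T
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            errs = []
            for cand in (q[i, j], q[i, j] + 1):
                trial = q.clone()
                trial[i, j] = cand
                errs.append(torch.norm(x @ (trial * scale).T - ref))
            if errs[1] < errs[0]:
                q[i, j] += 1
    return q * scale

w, x = torch.randn(4, 4), torch.randn(32, 4)
w_rtn = round_to_nearest(w, scale=0.1)
w_oar = output_aware_round(w, x, scale=0.1)  # typically lower output error on x than w_rtn
```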
Frame latency in continuous vision significantly impacts the agility of intelligent machines that interact with the environment via cameras. However, today's systems are limited by long frame latency due to their fundamentally sequential execution model. We propose a speculative execution model along with two mechanisms that enable practical speculation. We present SVSoC, a new mobile Systems-on-a-Chip (SoC) architecture that augments conventional SoCs with speculation capability. Under the same energy budget, SVSoC achieves a 14.3-35.4 percent reduction...