- Advanced Neural Network Applications
- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
- Neural Networks and Applications
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Image and Signal Denoising Methods
- Advanced Data Compression Techniques
- Advanced Image and Video Retrieval Techniques
- Big Data and Digital Economy
- Model Reduction and Neural Networks
- Medical Imaging and Analysis
- Image Enhancement Techniques
- Radiation Effects in Electronics
- Advanced Bandit Algorithms Research
- Machine Learning and Algorithms
- Protein Degradation and Inhibitors
- CCD and CMOS Imaging Sensors
- Numerical Methods and Algorithms
- Network Packet Processing and Optimization
- Neural Networks and Reservoir Computing
- Medical Image Segmentation Techniques
Nvidia (United States)
2025
Microsoft (United States)
2020-2023
Cornell University
2015-2019
Microsoft (Finland)
2018
Convolutional neural networks (CNNs) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into FPGA acceleration of these workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage...
To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for real-time AI serving, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency microservices, Brainwave serves state-of-the-art, pre-trained DNN models with high...
Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. Post-training, DNN weights and activations follow a bell-shaped distribution, whereas practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior...
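The tension between a bell-shaped value distribution and a linear quantization grid can be sketched in a few lines. The snippet below is a generic illustration of symmetric linear post-training quantization with percentile clipping, not the specific method proposed in the paper: a single outlier stretches the naive grid, while clipping the range keeps resolution for the bulk of the distribution.

```python
import numpy as np

def linear_quantize(w, num_bits=8, clip_percentile=None):
    """Symmetric uniform (linear-grid) post-training quantization.
    Clipping the range at a percentile instead of the absolute max is
    one simple way to limit the damage outliers do to the grid."""
    if clip_percentile is None:
        max_val = np.abs(w).max()
    else:
        max_val = np.percentile(np.abs(w), clip_percentile)
    qmax = 2 ** (num_bits - 1) - 1
    scale = max_val / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # values snapped back onto the linear grid

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 10_000)   # bell-shaped, like trained DNN weights
w[0] = 10.0                        # a single outlier stretches the grid
err_naive = np.mean((w - linear_quantize(w, num_bits=4)) ** 2)
err_clip = np.mean((w - linear_quantize(w, num_bits=4, clip_percentile=99.9)) ** 2)
```

At 4 bits, clipping trades a large error on the one outlier for much finer resolution everywhere else, so the overall error drops.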
Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances in HLS optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful...
Rapidly emerging workloads require rapidly developed chips. The Celerity 16-nm open-source SoC was implemented in nine months using an architectural trifecta to minimize development time: a general-purpose tier comprised of Linux-capable RISC-V cores, a massively parallel tiled manycore array that can be scaled to arbitrary sizes, and a specialization tier that uses high-level synthesis (HLS) to create an algorithmic neural-network accelerator. These tiers are tied together with efficient heterogeneous remote...
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. Together, these options create an enormous and complex design space that cannot be effectively explored by human effort alone. Instead, we propose to search this parameter space using autotuning, a popular approach in the compiler domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete compilation flow from RTL...
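The bandit framing treats each candidate option set as an arm and each compilation run as a pull. A minimal epsilon-greedy sketch of this idea follows; the arm configurations and the `evaluate` callback are placeholders for real CAD-tool option sets and a real quality-of-result measurement, and epsilon-greedy is just one of several MAB policies.

```python
import random

def mab_autotune(arms, evaluate, budget=100, eps=0.2, seed=0):
    """Epsilon-greedy multi-armed bandit over candidate option sets.
    `arms` are tool-option configurations; `evaluate` returns a
    quality-of-result score (higher is better), standing in for an
    actual compilation-and-measurement run."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for _ in range(budget):
        if 0 in counts:                       # try every arm once first
            i = counts.index(0)
        elif rng.random() < eps:              # explore a random arm
            i = rng.randrange(len(arms))
        else:                                 # exploit the best mean so far
            i = max(range(len(arms)), key=lambda j: means[j])
        reward = evaluate(arms[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # running average
    best = max(range(len(arms)), key=lambda j: means[j])
    return arms[best]

# Toy objective: pretend option set 2 yields the best timing score.
qor = {0: 0.3, 1: 0.5, 2: 0.9, 3: 0.4}
best = mab_autotune(list(qor), lambda a: qor[a], budget=50)
```

The budget caps the number of (expensive) compilation runs, which is exactly why a bandit policy is attractive here.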
The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards, because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on...
We propose unitary group convolutions (UGConvs), a building block for CNNs which composes a group convolution with unitary transforms in the feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e., ShuffleNet) and block-circulant networks (i.e., CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different transforms exhibit...
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive iterations. However, existing HLS techniques provide inadequate support for irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural approach capable of dynamically distributing inner loops to an array of loop processing...
This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art approaches, including floating-point and block floating-point. MX utilizes multiple levels of scaling with ultra-fine scaling factors in the hardware. The effectiveness is demonstrated...
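The baseline that MX improves on can be made concrete. The sketch below implements plain block floating-point, where every value in a block shares a single exponent and keeps only a short mantissa; it is the simple shared-scale idea that BDR generalizes, not the MX formats themselves, which add multiple finer-grained levels of scaling.

```python
import numpy as np

def block_fp_quantize(x, block_size=16, mant_bits=4):
    """Plain block floating-point: one shared exponent per block of
    `block_size` values, each value keeping a `mant_bits`-bit
    magnitude plus sign."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    max_abs[max_abs == 0] = 1.0              # avoid log2(0) on zero blocks
    shared_exp = np.floor(np.log2(max_abs))  # one exponent per block
    scale = 2.0 ** (shared_exp + 1 - mant_bits)
    mant = np.minimum(np.round(np.abs(blocks) / scale), 2 ** mant_bits - 1)
    out = (np.sign(blocks) * mant * scale).reshape(-1)
    return out[: x.size]

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 256)
xq = block_fp_quantize(x, block_size=16, mant_bits=4)
```

Because the exponent is amortized across the whole block, the per-value storage is close to the mantissa width alone; the cost is that small values in a block with one large value lose resolution, which is precisely what finer-grained scaling factors address.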
Quantizing deep neural networks, i.e., reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods have studied "fake quantization", which simulates lower-precision operations during inference but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real...
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce the memory bandwidth and capacity demand of the KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain eviction of input sequence tokens with SnapKV++,...
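The coarse-grain eviction stage belongs to a broad family of score-based KV cache pruning methods. The snippet below is only a generic illustration of that family, not RocketKV's SnapKV++ procedure: it retains the tokens with the highest accumulated attention scores and drops the rest.

```python
import numpy as np

def evict_kv(keys, values, token_scores, keep):
    """Score-based KV cache eviction (generic sketch): retain the
    `keep` tokens with the highest accumulated attention scores,
    preserving their original order in the sequence."""
    idx = np.sort(np.argsort(token_scores)[-keep:])
    return keys[idx], values[idx], idx

# Toy cache: 8 tokens with head dimension 4; scores mark 3 tokens as
# important (in practice scores would be aggregated attention weights).
keys = np.arange(32, dtype=np.float32).reshape(8, 4)
values = keys * 2
scores = np.array([0.1, 0.9, 0.2, 0.1, 0.0, 0.8, 0.7, 0.3])
k2, v2, kept = evict_kv(keys, values, scores, keep=3)
```

After eviction, both the memory footprint and the per-step bandwidth of attention over the cache shrink proportionally to the number of dropped tokens.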
State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we propose binarized CNNs with Separable Filters (BCNNw/SF), which applies Singular Value Decomposition (SVD) on BCNN kernels to further reduce storage complexity. We provide a closed form of the gradient...
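The storage saving from separable filters comes from replacing a k x k kernel with a rank-1 outer product of a column filter and a row filter, so k*k weights shrink to 2*k. The sketch below shows the underlying SVD step on an ordinary real-valued kernel; the paper applies this idea to binarized kernels and derives the corresponding gradients, which this snippet does not attempt.

```python
import numpy as np

def separable_approx(kernel):
    """Rank-1 (separable) approximation of a 2-D filter via SVD.
    Returns a column filter and a row filter whose outer product is
    the best rank-1 approximation of `kernel` in the Frobenius norm."""
    u, s, vt = np.linalg.svd(kernel)
    col = u[:, 0] * np.sqrt(s[0])
    row = vt[0, :] * np.sqrt(s[0])
    return col, row  # kernel ≈ np.outer(col, row)

# A Sobel kernel is exactly separable, so the rank-1 recovery is exact.
sobel = np.array([[-1.0, 0.0, 1.0],
                  [-2.0, 0.0, 2.0],
                  [-1.0, 0.0, 1.0]])
col, row = separable_approx(sobel)
```

Applying the column and row filters in sequence also cuts the multiply count of the convolution itself, not just the storage.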
This letter presents a 16-nm 496-core RISC-V network-on-chip (NoC). The mesh achieves 1.4 GHz at 0.98 V, yielding a peak throughput of 695 Giga instructions/s (GRVIS), an energy efficiency of 314.89 GRVIS/W, and a record 825,320 CoreMark benchmark score. Unlike the previously reported score [1], this new score was obtained without modifying the core code. The main feature is the NoC architecture, which uses only 1881 μm²...
Traditional techniques for pipeline scheduling in high-level synthesis (HLS) for FPGAs assume an additive delay model where each operation incurs a pre-characterized delay. While a good approximation for some operation types, this model fails to consider technology mapping, where a group of logic operations can be mapped to a single look-up table (LUT) and together incur one LUT worth of delay. We propose an exact formulation of the throughput-constrained, mapping-aware scheduling problem for FPGA-targeted HLS, with area minimization being the primary objective. By taking...
We propose precision gating (PG), an end-to-end trainable dynamic dual-precision quantization technique for deep neural networks. PG computes most features in a low precision and only a small proportion of important features in a higher precision to preserve accuracy. The proposed approach is applicable to a variety of DNN architectures and significantly reduces the computational cost of DNN execution with almost no accuracy loss. Our experiments indicate that PG achieves excellent results on CNNs, including statically compressed mobile-friendly...
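The dual-precision idea can be sketched for a single matrix multiply: compute every output feature at low precision first, then recompute only the outputs that cross a gate threshold at high precision. This is an illustrative approximation, not PG itself; in PG the threshold is learned end-to-end and the high-precision pass reuses the low-precision partial result, whereas here the threshold is a fixed constant and the gated outputs are simply recomputed.

```python
import numpy as np

def quantize_act(x, bits):
    """Uniform quantization of activations assumed to lie in [0, 1)."""
    levels = 2.0 ** bits
    return np.floor(x * levels) / levels

def precision_gated_matmul(x, w, low_bits=2, high_bits=8, delta=0.5):
    """Dual-precision sketch: low-precision pass for all outputs, then a
    high-precision pass only for the few 'important' outputs whose
    low-precision value exceeds the gate threshold `delta`."""
    y = quantize_act(x, low_bits) @ w
    hot = y > delta                       # gate: features worth refining
    y_high = quantize_act(x, high_bits) @ w
    y[hot] = y_high[hot]
    return y, hot

rng = np.random.default_rng(2)
x = rng.random(64)                        # activations in [0, 1)
w = rng.random((64, 16)) / 64.0
y, hot = precision_gated_matmul(x, w)
```

When the gate fires rarely, almost all arithmetic runs at the low bit-width, which is where the computational savings come from.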
Loop pipelining is an important optimization in high-level synthesis (HLS) because it allows successive loop iterations to be overlapped during execution. While the current HLS approach achieves high performance for loops with regular and statically analyzable program patterns, it remains challenging to pipeline loops with irregular memory accesses, dependences, and unbalanced workloads. The lack of support for dynamic behaviors results in conservatively synthesized pipelines that sacrifice performance to maintain the presumed regularity. In...
This paper presents a 16 nm 496-core RISC-V network-on-chip (NoC). The mesh achieves 1.4 GHz at 0.98 V, yielding a peak throughput of 695 Giga instructions/s (GRVIS) and a record 812,350 CoreMark benchmark score. The main feature is the NoC architecture, which uses only 1881 μm² per router node, enables highly scalable and dense compute, and provides up to 361 Tb/s of aggregate bandwidth.
Existing high-level synthesis (HLS) tools are mostly effective on algorithm-dominated programs that only use primitive data structures such as fixed-size arrays and queues. However, many widely used data structures such as priority queues, heaps, and trees feature complex member methods with data-dependent work and irregular memory access patterns. These methods can be inlined into their call sites, but this does not address the aforementioned issues and may further complicate conventional HLS optimizations, resulting in a...
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive iterations. While existing HLS techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current techniques require full unrolling of the inner loops, which is either very expensive in resources or even inapplicable due to dynamic loop bounds. To address this major...