- Advanced Neural Network Applications
- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
- Neural Networks and Applications
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Image and Signal Denoising Methods
- Advanced Data Compression Techniques
- Advanced Image and Video Retrieval Techniques
- Big Data and Digital Economy
- Model Reduction and Neural Networks
- Medical Imaging and Analysis
- Image Enhancement Techniques
- Radiation Effects in Electronics
- Advanced Bandit Algorithms Research
- Machine Learning and Algorithms
- Protein Degradation and Inhibitors
- CCD and CMOS Imaging Sensors
- Numerical Methods and Algorithms
- Network Packet Processing and Optimization
- Neural Networks and Reservoir Computing
- Medical Image Segmentation Techniques
Nvidia (United States)
2025
Microsoft (United States)
2020-2023
Cornell University
2015-2019
Microsoft (Finland)
2018
Convolutional neural networks (CNNs) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into FPGA acceleration of these workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage...
To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for real-time AI serving, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency microservices, Brainwave serves state-of-the-art, pre-trained DNN models with high...
Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. Post-training, DNN weights and activations follow a bell-shaped distribution, whereas practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior...
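The tension between a bell-shaped value distribution and a linear quantization grid can be sketched in a few lines. The snippet below is a generic illustration of symmetric linear post-training quantization with percentile clipping, not the specific method proposed in the paper: a single outlier stretches the naive grid, while clipping the range keeps resolution for the bulk of the distribution.

```python
import numpy as np

def linear_quantize(w, num_bits=8, clip_percentile=None):
    """Symmetric uniform (linear-grid) post-training quantization.
    Clipping the range at a percentile instead of the absolute max is
    one simple way to limit the damage outliers do to the grid."""
    if clip_percentile is None:
        max_val = np.abs(w).max()
    else:
        max_val = np.percentile(np.abs(w), clip_percentile)
    qmax = 2 ** (num_bits - 1) - 1
    scale = max_val / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # values snapped back onto the linear grid

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 10_000)   # bell-shaped, like trained DNN weights
w[0] = 10.0                        # a single outlier stretches the grid
err_naive = np.mean((w - linear_quantize(w, num_bits=4)) ** 2)
err_clip = np.mean((w - linear_quantize(w, num_bits=4, clip_percentile=99.9)) ** 2)
```

At 4 bits, clipping trades a large error on the one outlier for much finer resolution everywhere else, so the overall error drops.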
Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances in HLS optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful...
Rapidly emerging workloads require rapidly developed chips. The Celerity 16-nm open-source SoC was implemented in nine months using an architectural trifecta to minimize development time: a general-purpose tier comprised of Linux-capable RISC-V cores, a massively parallel tiled manycore array that can be scaled to arbitrary sizes, and a specialization tier that uses high-level synthesis (HLS) to create an algorithmic neural-network accelerator. These tiers are tied together with efficient heterogeneous remote...
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. Together, these options create an enormous and complex design space that cannot be effectively explored by human effort alone. Instead, we propose to search this parameter space using autotuning, a popular approach in the compiler domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete compilation flow from RTL...
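The bandit framing treats each candidate option set as an arm and each compilation run as a pull. A minimal epsilon-greedy sketch of this idea follows; the arm configurations and the `evaluate` callback are placeholders for real CAD-tool option sets and a real quality-of-result measurement, and epsilon-greedy is just one of several MAB policies.

```python
import random

def mab_autotune(arms, evaluate, budget=100, eps=0.2, seed=0):
    """Epsilon-greedy multi-armed bandit over candidate option sets.
    `arms` are tool-option configurations; `evaluate` returns a
    quality-of-result score (higher is better), standing in for an
    actual compilation-and-measurement run."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for _ in range(budget):
        if 0 in counts:                       # try every arm once first
            i = counts.index(0)
        elif rng.random() < eps:              # explore a random arm
            i = rng.randrange(len(arms))
        else:                                 # exploit the best mean so far
            i = max(range(len(arms)), key=lambda j: means[j])
        reward = evaluate(arms[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # running average
    best = max(range(len(arms)), key=lambda j: means[j])
    return arms[best]

# Toy objective: pretend option set 2 yields the best timing score.
qor = {0: 0.3, 1: 0.5, 2: 0.9, 3: 0.4}
best = mab_autotune(list(qor), lambda a: qor[a], budget=50)
```

The budget caps the number of (expensive) compilation runs, which is exactly why a bandit policy is attractive here.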
The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards, because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on...
We propose unitary group convolutions (UGConvs), a building block for CNNs which composes a group convolution with unitary transforms in the feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e., ShuffleNet) and block-circulant networks (i.e., CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different transforms exhibit...
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive iterations. However, existing HLS techniques provide inadequate support for irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural approach capable of dynamically distributing inner loops to an array of loop processing...
This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art approaches, including floating-point and block floating-point. MX utilizes multiple levels of scaling with ultra-fine scaling factors in the hardware. The effectiveness is demonstrated...
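The baseline that MX improves on can be made concrete. The sketch below implements plain block floating-point, where every value in a block shares a single exponent and keeps only a short mantissa; it is the simple shared-scale idea that BDR generalizes, not the MX formats themselves, which add multiple finer-grained levels of scaling.

```python
import numpy as np

def block_fp_quantize(x, block_size=16, mant_bits=4):
    """Plain block floating-point: one shared exponent per block of
    `block_size` values, each value keeping a `mant_bits`-bit
    magnitude plus sign."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    max_abs[max_abs == 0] = 1.0              # avoid log2(0) on zero blocks
    shared_exp = np.floor(np.log2(max_abs))  # one exponent per block
    scale = 2.0 ** (shared_exp + 1 - mant_bits)
    mant = np.minimum(np.round(np.abs(blocks) / scale), 2 ** mant_bits - 1)
    out = (np.sign(blocks) * mant * scale).reshape(-1)
    return out[: x.size]

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 256)
xq = block_fp_quantize(x, block_size=16, mant_bits=4)
```

Because the exponent is amortized across the whole block, the per-value storage is close to the mantissa width alone; the cost is that small values in a block with one large value lose resolution, which is precisely what finer-grained scaling factors address.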
Quantizing deep neural networks, i.e., reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods have studied "fake quantization", which simulates lower-precision operations during inference but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real...
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce the memory bandwidth and capacity demand of the KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain eviction of input sequence tokens with SnapKV++,...
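The coarse-grain eviction stage belongs to a broad family of score-based KV cache pruning methods. The snippet below is only a generic illustration of that family, not RocketKV's SnapKV++ procedure: it retains the tokens with the highest accumulated attention scores and drops the rest.

```python
import numpy as np

def evict_kv(keys, values, token_scores, keep):
    """Score-based KV cache eviction (generic sketch): retain the
    `keep` tokens with the highest accumulated attention scores,
    preserving their original order in the sequence."""
    idx = np.sort(np.argsort(token_scores)[-keep:])
    return keys[idx], values[idx], idx

# Toy cache: 8 tokens with head dimension 4; scores mark 3 tokens as
# important (in practice scores would be aggregated attention weights).
keys = np.arange(32, dtype=np.float32).reshape(8, 4)
values = keys * 2
scores = np.array([0.1, 0.9, 0.2, 0.1, 0.0, 0.8, 0.7, 0.3])
k2, v2, kept = evict_kv(keys, values, scores, keep=3)
```

After eviction, both the memory footprint and the per-step bandwidth of attention over the cache shrink proportionally to the number of dropped tokens.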
State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we propose binarized CNNs with Separable Filters (BCNNw/SF), which applies Singular Value Decomposition (SVD) on BCNN kernels to further reduce storage complexity. We provide a closed form of the gradient...
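The storage saving from separable filters comes from replacing a k x k kernel with a rank-1 outer product of a column filter and a row filter, so k*k weights shrink to 2*k. The sketch below shows the underlying SVD step on an ordinary real-valued kernel; the paper applies this idea to binarized kernels and derives the corresponding gradients, which this snippet does not attempt.

```python
import numpy as np

def separable_approx(kernel):
    """Rank-1 (separable) approximation of a 2-D filter via SVD.
    Returns a column filter and a row filter whose outer product is
    the best rank-1 approximation of `kernel` in the Frobenius norm."""
    u, s, vt = np.linalg.svd(kernel)
    col = u[:, 0] * np.sqrt(s[0])
    row = vt[0, :] * np.sqrt(s[0])
    return col, row  # kernel ≈ np.outer(col, row)

# A Sobel kernel is exactly separable, so the rank-1 recovery is exact.
sobel = np.array([[-1.0, 0.0, 1.0],
                  [-2.0, 0.0, 2.0],
                  [-1.0, 0.0, 1.0]])
col, row = separable_approx(sobel)
```

Applying the column and row filters in sequence also cuts the multiply count of the convolution itself, not just the storage.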
This letter presents a 16-nm 496-core RISC-V network-on-chip (NoC). The mesh achieves 1.4 GHz at 0.98 V, yielding a peak throughput of 695 Giga instructions/s (GRVIS), an energy efficiency of 314.89 GRVIS/W, and a record 825,320 CoreMark benchmark score. Unlike the previously reported score [1], this new score was obtained without modifying the core code. The main feature is the NoC architecture, which uses only 1881 μm²...
Traditional techniques for pipeline scheduling in high-level synthesis (HLS) for FPGAs assume an additive delay model where each operation incurs a pre-characterized delay. While a good approximation for some operation types, this model fails to consider technology mapping, where a group of logic operations can be mapped to a single look-up table (LUT) and together incur one LUT worth of delay. We propose an exact formulation of the throughput-constrained, mapping-aware scheduling problem for FPGA-targeted HLS, with area minimization being the primary objective. By taking...
We propose precision gating (PG), an end-to-end trainable dynamic dual-precision quantization technique for deep neural networks. PG computes most features in a low precision and only a small proportion of important features in a higher precision to preserve accuracy. The proposed approach is applicable to a variety of DNN architectures and significantly reduces the computational cost of DNN execution with almost no accuracy loss. Our experiments indicate that PG achieves excellent results on CNNs, including statically compressed mobile-friendly...
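The dual-precision idea can be sketched for a single matrix multiply: compute every output feature at low precision first, then recompute only the outputs that cross a gate threshold at high precision. This is an illustrative approximation, not PG itself; in PG the threshold is learned end-to-end and the high-precision pass reuses the low-precision partial result, whereas here the threshold is a fixed constant and the gated outputs are simply recomputed.

```python
import numpy as np

def quantize_act(x, bits):
    """Uniform quantization of activations assumed to lie in [0, 1)."""
    levels = 2.0 ** bits
    return np.floor(x * levels) / levels

def precision_gated_matmul(x, w, low_bits=2, high_bits=8, delta=0.5):
    """Dual-precision sketch: low-precision pass for all outputs, then a
    high-precision pass only for the few 'important' outputs whose
    low-precision value exceeds the gate threshold `delta`."""
    y = quantize_act(x, low_bits) @ w
    hot = y > delta                       # gate: features worth refining
    y_high = quantize_act(x, high_bits) @ w
    y[hot] = y_high[hot]
    return y, hot

rng = np.random.default_rng(2)
x = rng.random(64)                        # activations in [0, 1)
w = rng.random((64, 16)) / 64.0
y, hot = precision_gated_matmul(x, w)
```

When the gate fires rarely, almost all arithmetic runs at the low bit-width, which is where the computational savings come from.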
Loop pipelining is an important optimization in high-level synthesis (HLS) because it allows successive loop iterations to be overlapped during execution. While the current HLS approach achieves high performance for loops with regular and statically analyzable program patterns, it remains challenging to pipeline loops with irregular memory accesses, dependences, and unbalanced workloads. The lack of support for dynamic behaviors results in conservatively synthesized pipelines that sacrifice performance to maintain the presumed regularity. In...
This paper presents a 16 nm 496-core RISC-V network-on-chip (NoC). The mesh achieves 1.4 GHz at 0.98 V, yielding a peak throughput of 695 Giga instructions/s (GRVIS) and a record 812,350 CoreMark benchmark score. The main feature is the NoC architecture, which uses only 1881 μm² per router node, enables highly scalable and dense compute, and provides up to 361 Tb/s of aggregate bandwidth.
Existing high-level synthesis (HLS) tools are mostly effective on algorithm-dominated programs that only use primitive data structures such as fixed-size arrays and queues. However, many widely used data structures such as priority queues, heaps, and trees feature complex member methods with data-dependent work and irregular memory access patterns. These methods can be inlined into their call sites, but this does not address the aforementioned issues and may further complicate conventional HLS optimizations, resulting in a...
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive iterations. While existing HLS techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current techniques require full unrolling of the inner loops, which is either very expensive in resources or even inapplicable due to dynamic loop bounds. To address this major...