- Embedded Systems Design Techniques
- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- CCD and CMOS Imaging Sensors
- Advanced Neural Network Applications
- VLSI and Analog Circuit Testing
- Distributed and Parallel Computing Systems
- Advanced Image and Video Retrieval Techniques
- Advanced Data Storage Technologies
- Advanced Memory and Neural Computing
- Advanced Vision and Imaging
- Speech and Audio Processing
- Hearing Loss and Rehabilitation
- Real-Time Systems Scheduling
- Advanced Adaptive Filtering Techniques
- Algorithms and Data Compression
- Video Surveillance and Tracking Methods
- Real-time simulation and control systems
- VLSI and FPGA Design Techniques
- Machine Learning in Bioinformatics
- Radiation Effects in Electronics
- Evolutionary Algorithms and Applications
- Formal Methods in Verification
- Cloud Computing and Resource Management
- Software Testing and Debugging Techniques
Advanced Digital Sciences Center
2011-2017
University of Illinois Urbana-Champaign
2012-2015
Digital Science (United States)
2015
Nanyang Technological University
2013-2014
Agency for Science, Technology and Research
2011-2012
University of Wisconsin–Madison
2006-2011
Sandia National Laboratories
2007-2011
Sandia National Laboratories California
2006
While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a hardware-oriented DNN model search for accuracy, and an accelerator design considering DNN-specific characteristics. We build an automatic flow, including an Auto-DNN engine...
FPGA is a promising candidate for the acceleration of Deep Neural Networks (DNNs) with improved latency and energy consumption compared to CPU and GPU-based implementations. DNNs use sequences of layers of regular computation that are well suited to HLS-based design for FPGA. However, optimizing large neural networks under resource constraints is still a key challenge. HLS must manage on-chip computation, buffering resources, and off-chip memory accesses to minimize total latency. In this paper, we present a framework that uses...
Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading device for application acceleration. With tremendous potential for data-parallel applications, the emergence of GPUs has led to a proliferation of GPU-accelerated applications. This has also led to systems in which many applications compete for access to GPU resources, where efficient utilization of those resources is critical to system performance. Prior techniques for temporal multitasking can be employed here as well, but not all kernels make...
High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises performance and energy efficiency for hardware designs with a lower barrier to entry in design expertise and a shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can perform some of the most important features generally performed in manual designs, including parallel units, pipelining of execution both within a unit and between units, and fine-grained data...
FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, the design effort for FPGA implementations remains high, often an order of magnitude larger than development using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge...
While scientific applications in the past were limited by floating point computations, modern applications use more unstructured formulations. These have a significant percentage of integer computation - an increasingly limiting factor in application performance. In real applications employed at Sandia National Labs, integer computations constitute on average 37% of operations, forming large and complex dataflow graphs. Reconfigurable functional units (RFUs) are a particularly attractive accelerator for these graphs because they can...
High level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and flows have also advanced significantly; as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to optimize large, complex reference code. Typical benchmarks contain somewhere between 100 and 1400...
Object detection and tracking are challenging tasks for resource-constrained embedded systems. While these tasks are among the most compute-intensive from the artificial intelligence domain, they are only allowed to use limited computation and memory resources on embedded devices. Meanwhile, such implementations are often required to satisfy additional demanding requirements such as real-time response, high-throughput performance, and reliable inference accuracy. To overcome these challenges, we propose SkyNet, a hardware-efficient neural...
This paper presents a real-time three-dimensional (3D) wideband sound localization system designed with a miniature XYZO microphone array. Unlike conventional arrays using only omnidirectional microphones, the presented array comprises both bidirectional (pressure gradient) and omnidirectional microphones. Therefore, it has significantly reduced size and is known as the world's smallest design for 3D sound source localization in air. In this paper, we describe the array configuration and perform calibration. For localization, we provide studies on the output model of...
FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs) with improved latency and energy efficiency compared to CPU and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to its productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints is still a key challenge. In this paper, we present a series of techniques for implementing DNNs on FPGAs with high performance and efficiency. These include the use...
The rise of the Internet of Things (IoT) has led to an explosion of sensor computing platforms. The complexity and applications of IoT devices range from simple sensing in vending machines to complex, interactive artificial intelligence in smart vehicles and drones. Developers target more aggressive design objectives to protect market share through feature differentiation; they no longer simply choose between low-cost, low-performance CPU-based systems and high-performance custom platforms with hardware accelerators including GPUs and FPGAs. Both designs...
With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large data volume involved, even a modestly complex pipeline would require a prohibitively long time to complete. Hence, the need of the hour is to explore non-conventional implementation platforms to accelerate genomics research. In this work, we present an...
A wide variety of application domains such as networking, computer vision, and cryptography target FPGA platforms to meet computation demand and energy consumption constraints. However, the design effort for FPGA implementations in hardware description languages (HDLs) remains high - often an order of magnitude larger than using high-level languages (HLLs). Instead of development in HDLs, high-level synthesis (HLS) tools generate hardware from algorithm descriptions in HLLs such as C/C++/SystemC. HLS tools promise reduced design effort without the detailed knowledge...
Graphics processing units (GPUs) are increasingly critical for general-purpose parallel performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications. However, due to this SIMD execution style, resource utilization and thus overall performance can be significantly...
Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs), because of its impact on design latency and throughput. Fmax is limited by the critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement, and routing. However, in high-level synthesis (HLS) flows, it is challenging to evaluate the real delay at the behavioral level. Current HLS flows typically use module pre-characterization...
Verification of modern day electronic circuits has become the bottleneck for timely delivery of complex SoC designs. We develop a novel cross-layer hardware/software co-simulation framework that can effectively debug and verify an SoC design. We combine high-level C/C++ software simulation with cycle-accurate SystemC hardware simulation, uniquely identify various types of bugs, and help the hardware designer localize them. Experimental results show we are able to detect and aid in the localization of logic bugs from both specifications as well...
Save and restore of context data is traditionally used for process preemption in multi-tasking operating systems. Multi-tasking and, by consequence, preemption are key to effective CPU sharing. However, it is much more expensive to save reconfigurable hardware state than traditional software state. The configuration and current state comprise a large amount of data, making the transfer a long operation. In this paper, we explore alternatives to the save and restore operation for reconfigurable hardware multi-tasking. We compare the system performance of three alternate policies...
High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads present in data-parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, off-chip memory bandwidth may be insufficient to meet the demand. Hence, a...
High-level synthesis (HLS) tools generate register transfer level (RTL) hardware descriptions through a process of resource allocation, scheduling, and binding. Intuitively, the RTL quality influences the final logic quality. Specifically, the achievable clock rate, area, and latency in cycles will be determined by the RTL description. However, not all paths should receive equal effort - multi-cycle paths represent an opportunity to spend effort elsewhere to achieve a better design. In this paper, we perform optimisation on chained...
Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements, such as real-time processing, high throughput performance, and high inference accuracy. To overcome these challenges, we propose SkyNet, an extremely lightweight DNN with 12 convolutional (Conv) layers and only 1.82 megabytes (MB) of parameters, following a bottom-up design approach. SkyNet is demonstrated in the 56th IEEE/ACM...
Graphics processing units (GPUs) are composed of a group of single-instruction multiple-data (SIMD) streaming multiprocessors (SMs). GPUs are able to efficiently execute highly parallel tasks through SIMD execution on the SMs. However, if threads take diverging control paths, all divergent paths are executed serially. In the worst case, every thread takes a different path and the architecture is used serially by each thread. This control flow divergence problem is well known in GPU development; code transformation,...