Kyle Rupnow

ORCID: 0000-0003-2908-2225
Research Areas
  • Embedded Systems Design Techniques
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • CCD and CMOS Imaging Sensors
  • Advanced Neural Network Applications
  • VLSI and Analog Circuit Testing
  • Distributed and Parallel Computing Systems
  • Advanced Image and Video Retrieval Techniques
  • Advanced Data Storage Technologies
  • Advanced Memory and Neural Computing
  • Advanced Vision and Imaging
  • Speech and Audio Processing
  • Hearing Loss and Rehabilitation
  • Real-Time Systems Scheduling
  • Advanced Adaptive Filtering Techniques
  • Algorithms and Data Compression
  • Video Surveillance and Tracking Methods
  • Real-time simulation and control systems
  • VLSI and FPGA Design Techniques
  • Machine Learning in Bioinformatics
  • Radiation Effects in Electronics
  • Evolutionary Algorithms and Applications
  • Formal Methods in Verification
  • Cloud Computing and Resource Management
  • Software Testing and Debugging Techniques

Advanced Digital Sciences Center
2011-2017

University of Illinois Urbana-Champaign
2012-2015

Digital Science (United States)
2015

Nanyang Technological University
2013-2014

Agency for Science, Technology and Research
2011-2012

University of Wisconsin–Madison
2006-2011

Sandia National Laboratories
2007-2011

Sandia National Laboratories California
2006

While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a hardware-oriented DNN model search for high accuracy, and an accelerator design considering DNN-specific characteristics. We build an automatic flow, including an Auto-DNN engine...

10.1145/3316781.3317829 article EN 2019-05-23

FPGA is a promising candidate for the acceleration of Deep Neural Networks (DNNs), with improved latency and energy consumption compared to CPU- and GPU-based implementations. DNNs use sequences of layers of regular computation that are well suited to HLS-based design for FPGA. However, optimizing large neural networks under resource constraints is still a key challenge. HLS must manage on-chip computation, buffering resources, and off-chip memory accesses to minimize total latency. In this paper, we present a framework that uses...

10.23919/fpl.2017.8056833 article EN 2017-09-01

Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading device for application acceleration. Given the tremendous potential of data-parallel applications, the emergence of GPUs has led to a proliferation of GPU-accelerated applications. This has also led to systems in which many applications compete for access to GPU resources, so efficient utilization of those resources is critical to system performance. Prior techniques for temporal multitasking can be employed with GPUs as well, but not all kernels make...

10.1109/tpds.2014.2313342 article EN IEEE Transactions on Parallel and Distributed Systems 2014-03-25

High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual design, including parallel processing units, pipelining of execution both within a unit and between units, and fine-grained data...

10.1145/2435264.2435271 article EN 2013-02-11

FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high - often an order of magnitude larger than for designs using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from the software designer's knowledge...

10.1155/2012/649057 article EN cc-by Journal of Electrical and Computer Engineering 2012-01-01

While scientific applications in the past were limited by floating point computations, modern applications use more unstructured formulations. These have a significant percentage of integer computation - an increasingly limiting factor in application performance. In real applications employed at Sandia National Labs, integer computations constitute on average 37% of operations, forming large and complex dataflow graphs. Reconfigurable functional units (RFUs) are a particularly attractive accelerator for these graphs because they can...

10.1109/fccm.2007.14 article EN 2007-04-01

High level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and flows have also advanced significantly; as a result, many new FPGA designs are developed with HLS. However, despite studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to optimize large, complex reference code. Typical benchmarks contain somewhere between 100 and 1400...

10.1145/2847263.2847274 article EN 2016-02-04

Object detection and tracking are challenging tasks for resource-constrained embedded systems. While these tasks are among the most compute-intensive in the artificial intelligence domain, they are only allowed limited computation and memory resources on embedded devices. Meanwhile, such implementations are often required to satisfy additional demanding requirements such as real-time response, high-throughput performance, and reliable inference accuracy. To overcome these challenges, we propose SkyNet, a hardware-efficient neural...

10.48550/arxiv.1909.09709 preprint EN other-oa arXiv (Cornell University) 2019-01-01

FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high - often an order of magnitude larger than for designs using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from high-level languages (HLLs) such as C/C++/SystemC. Such tools reduce design effort: HLL descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from the software designer's knowledge of the...

10.1109/fpt.2011.6132716 article EN 2011-12-01

This paper presents a real-time three-dimensional (3D) wideband sound localization system designed with a miniature XYZO microphone array. Unlike conventional arrays using only omnidirectional microphones, the presented array includes both bidirectional (pressure gradient) and omnidirectional microphones. It therefore has a significantly reduced size and is known as the world's smallest design for localizing 3D sound sources in air. In this paper, we describe the array configuration and how to perform calibration. For localization, we provide studies on the output model of...

10.1109/iciea.2012.6361029 article EN 2012-07-01

FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs), with improved latency and energy efficiency compared to CPU- and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to its productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints is still a key challenge. In this paper, we present a series of techniques for implementing DNNs on FPGAs with high performance and energy efficiency. These include the use...

10.5555/3199700.3199810 article EN International Conference on Computer Aided Design 2017-11-13

The rise of the Internet of Things has led to an explosion of sensor computing platforms. The complexity and applications of IoT devices range from simple sensors in vending machines to complex, interactive artificial intelligence in smart vehicles and drones. Developers target more aggressive objectives to protect market share through feature differentiation; they must choose between low-cost, low-performance CPU-based systems and high-performance custom platforms with hardware accelerators including GPUs and FPGAs. Both designs...

10.1049/iet-cps.2016.0020 article EN cc-by-nd IET Cyber-Physical Systems Theory & Applications 2016-12-01

With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large data volume involved, even a modestly complex analysis would require a prohibitively long time to complete. Hence the need of the hour is to explore non-conventional implementation platforms to accelerate genomics research. In this work, we present an...

10.1145/3020078.3021749 article EN 2017-02-02

A wide variety of application domains such as networking, computer vision, and cryptography target FPGA platforms to meet computation demand and energy consumption constraints. However, design effort for implementations in hardware description languages (HDLs) remains high - often an order of magnitude larger than for designs using high-level languages (HLLs). Instead of development in HDLs, high-level synthesis (HLS) tools generate hardware from algorithm descriptions in HLLs such as C/C++/SystemC. HLS tools promise reduced design effort without the detailed knowledge...

10.1109/asicon.2011.6157401 article EN 2011-10-01

Graphics processing units (GPUs) are increasingly critical for general-purpose parallel performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications. However, due to this SIMD execution style, resource utilization and thus overall performance can be significantly...

10.1109/ipdps.2012.18 article EN 2012-05-01

Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs), because of its impact on design latency and throughput. Fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement, and routing. However, in high-level synthesis (HLS) flows, it is challenging to evaluate the real critical path delay at the behavioral level. Current HLS flows typically use module pre-characterization...

10.1145/2554688.2554775 article EN 2014-02-18

Verification of modern-day electronic circuits has become the bottleneck for timely delivery of complex SoC designs. We develop a novel cross-layer hardware/software co-simulation framework that can effectively debug and verify an SoC design. We combine high-level C/C++ software simulation with cycle-accurate SystemC hardware simulation, uniquely identify various types of bugs, and help the hardware designer localize them. Experimental results show that we are able to detect and aid in the localization of logic bugs from both specifications as well...

10.1145/2897937.2898002 article EN 2016-05-25

Save and restore of context data is traditionally used in process preemption in multi-tasking operating systems. Multi-tasking and, by consequence, preemption are key to effective CPU sharing. However, it is much more expensive to save context in reconfigurable hardware than in traditional software: the configuration and current state comprise a large amount of data, making the transfer a long operation. In this paper, we explore alternatives to the save and restore operation for multi-tasking. We compare system performance of three alternate policies...

10.1109/fccm.2009.30 article EN 2009-01-01

High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads present in parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, off-chip memory bandwidth may be insufficient to meet the demand. Hence, a...

10.1109/tvlsi.2015.2497259 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2015-12-09

High-level synthesis (HLS) tools generate register transfer level (RTL) hardware descriptions through a process of resource allocation, scheduling, and binding. Intuitively, RTL quality influences the synthesized logic quality. Specifically, the achievable clock rate, area, and latency in cycles will be determined by the RTL description. However, not all paths should receive equal effort - multi-cycle paths represent an opportunity to spend effort elsewhere and achieve a better design. In this paper, we perform optimisation on chained...

10.1109/fpl.2013.6645541 article EN 2013-09-01

Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements such as real-time processing, high throughput performance, and high inference accuracy. To overcome these challenges, we propose SkyNet, an extremely lightweight DNN with 12 convolutional (Conv) layers and only 1.82 megabytes (MB) of parameters, following a bottom-up design approach. SkyNet was demonstrated in the 56th IEEE/ACM...

10.48550/arxiv.1906.10327 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Graphics processing units (GPUs) are composed of a group of single-instruction multiple-data (SIMD) streaming multiprocessors (SMs). GPUs are able to efficiently execute highly parallel tasks through SIMD execution on the SMs. However, if threads take diverging control paths, all divergent paths are executed serially. In the worst case, every thread takes a different path and the SIMD architecture is used serially by each thread. This control flow divergence problem is well known in GPU development; code transformation,...

10.1109/tcad.2015.2501303 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2015-11-17