- Embedded Systems Design Techniques
- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- CCD and CMOS Imaging Sensors
- Advanced Neural Network Applications
- VLSI and Analog Circuit Testing
- Distributed and Parallel Computing Systems
- Advanced Image and Video Retrieval Techniques
- Advanced Data Storage Technologies
- Advanced Memory and Neural Computing
- Advanced Vision and Imaging
- Speech and Audio Processing
- Hearing Loss and Rehabilitation
- Real-Time Systems Scheduling
- Advanced Adaptive Filtering Techniques
- Algorithms and Data Compression
- Video Surveillance and Tracking Methods
- Real-time simulation and control systems
- VLSI and FPGA Design Techniques
- Machine Learning in Bioinformatics
- Radiation Effects in Electronics
- Evolutionary Algorithms and Applications
- Formal Methods in Verification
- Cloud Computing and Resource Management
- Software Testing and Debugging Techniques
Advanced Digital Sciences Center
2011-2017
University of Illinois Urbana-Champaign
2012-2015
Digital Science (United States)
2015
Nanyang Technological University
2013-2014
Agency for Science, Technology and Research
2011-2012
University of Wisconsin–Madison
2006-2011
Sandia National Laboratories
2007-2011
Sandia National Laboratories California
2006
While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a hardware-oriented DNN model search for accuracy, and an accelerator design considering DNN-specific characteristics. We build an automatic flow, including an Auto-DNN engine...
FPGA is a promising candidate for the acceleration of Deep Neural Networks (DNNs) with improved latency and energy consumption compared to CPU and GPU-based implementations. DNNs use sequences of layers of regular computation that are well suited to HLS-based design for FPGA. However, optimizing large neural networks under resource constraints is still a key challenge. HLS must manage on-chip computation, buffering resources, and off-chip memory accesses to minimize total latency. In this paper, we present a framework that uses...
Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading device for application acceleration. With tremendous potential for data-parallel applications, the emergence of GPUs has led to a proliferation of GPU-accelerated applications. This has also led to systems in which many applications compete for access to GPU resources, where efficient utilization of those resources is critical to system performance. Prior techniques for temporal multitasking can be employed here as well, but not all kernels make...
High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises performance and energy efficiency for hardware designs with a lower barrier to entry in design expertise and a shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can perform some of the most important features generally performed in manual designs, including parallel units, pipelining of execution both within a unit and between units, and fine-grained data...
FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, the design effort for FPGA implementations remains high, often an order of magnitude larger than development using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge...
While scientific applications in the past were limited by floating point computations, modern applications use more unstructured formulations. These have a significant percentage of integer computation - an increasingly limiting factor in application performance. In real applications employed at Sandia National Labs, integer computations constitute on average 37% of operations, forming large and complex dataflow graphs. Reconfigurable functional units (RFUs) are a particularly attractive accelerator for these graphs because they can...
High level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and flows have also advanced significantly; as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to optimize large, complex reference code. Typical benchmarks contain somewhere between 100 and 1400...
Object detection and tracking are challenging tasks for resource-constrained embedded systems. While these tasks are among the most compute-intensive from the artificial intelligence domain, they are only allowed to use limited computation and memory resources on embedded devices. Meanwhile, such implementations are often required to satisfy additional demanding requirements such as real-time response, high-throughput performance, and reliable inference accuracy. To overcome these challenges, we propose SkyNet, a hardware-efficient neural...
This paper presents a real-time three-dimensional (3D) wideband sound localization system designed with a miniature XYZO microphone array. Unlike conventional arrays using only omnidirectional microphones, the presented array comprises both bidirectional (pressure gradient) and omnidirectional microphones. Therefore, it has significantly reduced size and is known as the world's smallest design for 3D sound source localization in air. In this paper, we describe the array configuration and perform calibration. For localization, we provide studies on the output model of...
FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs) with improved latency and energy efficiency compared to CPU and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to its productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints is still a key challenge. In this paper, we present a series of techniques for implementing DNNs on FPGAs with high performance and efficiency. These include the use...
The rise of the Internet of Things (IoT) has led to an explosion of sensor computing platforms. The complexity and applications of IoT devices range from simple sensing in vending machines to complex, interactive artificial intelligence in smart vehicles and drones. Developers target more aggressive design objectives to protect market share through feature differentiation; they no longer simply choose between low-cost, low-performance CPU-based systems and high-performance custom platforms with hardware accelerators including GPUs and FPGAs. Both designs...
With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large data volume involved, even a modestly complex pipeline would require a prohibitively long time to complete. Hence, the need of the hour is to explore non-conventional implementation platforms to accelerate genomics research. In this work, we present an...
A wide variety of application domains such as networking, computer vision, and cryptography target FPGA platforms to meet computation demand and energy consumption constraints. However, the design effort for FPGA implementations in hardware description languages (HDLs) remains high - often an order of magnitude larger than using high-level languages (HLLs). Instead of development in HDLs, high-level synthesis (HLS) tools generate hardware from algorithm descriptions in HLLs such as C/C++/SystemC. HLS tools promise reduced design effort without the detailed knowledge...
Graphics processing units (GPUs) are increasingly critical for general-purpose parallel performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications. However, due to this SIMD execution style, resource utilization and thus overall performance can be significantly...
Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs), because of its impact on design latency and throughput. Fmax is limited by the critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement, and routing. However, in high-level synthesis (HLS) flows, it is challenging to evaluate the real delay at the behavioral level. Current HLS flows typically use module pre-characterization...
Verification of modern day electronic circuits has become the bottleneck for timely delivery of complex SoC designs. We develop a novel cross-layer hardware/software co-simulation framework that can effectively debug and verify an SoC design. We combine high-level C/C++ software simulation with cycle-accurate SystemC hardware simulation, uniquely identify various types of bugs, and help the hardware designer localize them. Experimental results show we are able to detect and aid in the localization of logic bugs from both specifications as well...
Save and restore of context data is traditionally used for process preemption in multi-tasking operating systems. Multi-tasking and, by consequence, preemption are key to effective CPU sharing. However, it is much more expensive to save reconfigurable hardware state than traditional software state. The configuration and current state comprise a large amount of data, making the transfer a long operation. In this paper, we explore alternatives to the save and restore operation for reconfigurable hardware multi-tasking. We compare the system performance of three alternate policies...
High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads present in data-parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, off-chip memory bandwidth may be insufficient to meet the demand. Hence, a...
High-level synthesis (HLS) tools generate register transfer level (RTL) hardware descriptions through a process of resource allocation, scheduling, and binding. Intuitively, the RTL quality influences the final logic quality. Specifically, the achievable clock rate, area, and latency in cycles will be determined by the RTL description. However, not all paths should receive equal effort - multi-cycle paths represent an opportunity to spend effort elsewhere to achieve a better design. In this paper, we perform optimisation on chained...
Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements, such as real-time processing, high throughput performance, and high inference accuracy. To overcome these challenges, we propose SkyNet, an extremely lightweight DNN with 12 convolutional (Conv) layers and only 1.82 megabytes (MB) of parameters, following a bottom-up design approach. SkyNet is demonstrated in the 56th IEEE/ACM...
Graphics processing units (GPUs) are composed of a group of single-instruction multiple-data (SIMD) streaming multiprocessors (SMs). GPUs are able to efficiently execute highly parallel tasks through SIMD execution on the SMs. However, if threads take diverging control paths, all divergent paths are executed serially. In the worst case, every thread takes a different path and the architecture is used serially by each thread. This control flow divergence problem is well known in GPU development; code transformation,...