- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- VLSI and FPGA Design Techniques
- CCD and CMOS Imaging Sensors
- VLSI and Analog Circuit Testing
- Real-Time Systems Scheduling
- Low-Power High-Performance VLSI Design
- Advanced Image and Video Retrieval Techniques
- Robotics and Sensor-Based Localization
- Plasma Diagnostics and Applications
- Cellular Automata and Applications
- Advanced Signal Processing Techniques
- Petri Nets in System Modeling
- Robotics and Automated Systems
- Innovation Diffusion and Forecasting
- Economic and Technological Innovation
- Network Security and Intrusion Detection
- Network Packet Processing and Optimization
- Sparse and Compressive Sensing Techniques
- Machine Learning and Data Classification
- Domain Adaptation and Few-Shot Learning
- Fusion Materials and Technologies
- Network Time Synchronization Technologies
- China Telecom (China), 2025
- China Telecom, 2025
- University of South China, 2025
- University of California, Los Angeles, 2018-2023
- Amazon (United States), 2023
- Beijing Institute of Technology, 2022
- Laboratoire d'Analyse et d'Architecture des Systèmes, 2021
- University of California System, 2020
- East China University of Technology, 2020
- Hebei University, 2019
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and are among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus hard to integrate into embedded systems such as smart phones, smart glasses, and robots. The FPGA is one of the most promising platforms for accelerating CNNs, but its limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNNs.
With the pursuit of improving compute performance under strict power constraints, there is an increasing need to deploy applications to heterogeneous hardware architectures with accelerators, such as GPUs and FPGAs. However, although these computing platforms are becoming widely available, they are very difficult to program, especially FPGAs. As a result, their use has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure...
Automatic systolic array generation has long been an interesting topic due to the need to reduce the lengthy development cycles of manual designs. The existing automatic approach builds dependency graphs from algorithms and iteratively maps the computation nodes in the graph onto processing elements (PEs), with time stamps that specify the sequences in which they operate within each PE. A number of previous works have implemented this idea and generated designs for ASICs. However, all of them relied on human intervention and the resulting designs are usually inferior...
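For illustration only (a textbook-style example, not a mapping taken from the work above), matrix multiplication C[i][j] += A[i][k] * B[k][j] over the iteration domain (i, j, k) can be turned into a systolic design by such a space-time assignment:

```latex
% Illustrative space-time mapping for C[i][j] += A[i][k] * B[k][j];
% an automatic generator may derive a different projection and schedule.
\[
  \mathrm{PE}(i,j,k) = (i,\; j), \qquad t(i,j,k) = i + j + k .
\]
% Iteration (i, j, k) then executes on PE (i, j) at cycle i + j + k:
% A[i][k] flows along the j axis, B[k][j] along the i axis, and each PE
% accumulates its own C[i][j] locally.
```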
While systolic array architectures have the potential to deliver tremendous performance, it is notoriously challenging to customize an efficient systolic array processor for a target application. Designing systolic arrays requires knowledge of both the high-level characteristics of the application and the low-level hardware details, making it a demanding and inefficient process. To relieve users from this manual, iterative trial-and-error process, we present AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA. AutoSA is based on the polyhedral...
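As a rough sketch of the kind of input such a polyhedral flow consumes (the scop markers below follow the common PPCG-style convention; treating them as AutoSA's exact interface is an assumption here, and the kernel size is made up):

```cpp
// Hypothetical annotated C kernel handed to a polyhedral systolic-array
// compiler; sizes and pragma markers are illustrative only.
void mm(float A[64][64], float B[64][64], float C[64][64]) {
#pragma scop
  for (int i = 0; i < 64; ++i)
    for (int j = 0; j < 64; ++j) {
      C[i][j] = 0.f;
      for (int k = 0; k < 64; ++k)
        C[i][j] += A[i][k] * B[k][j];
    }
#pragma endscop
}
```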
Despite the increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in achievable clock frequency between HLS-generated designs and handcrafted RTL ones. A key factor that limits the timing quality of HLS outputs is the difficulty of accurately estimating interconnect delay at the HLS level. Unfortunately, this problem becomes even worse when large designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects incur a high delay penalty.
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for the following reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers, especially transposed and dilated convolutions, and (3) the CNN's complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into...
FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing and then stitch the separate partitions together. We outline a number of technical challenges and address them by breaking the boundaries between different...
The irregularity of recent Convolutional Neural Network (CNN) models, such as less data reuse and parallelism due to extensive network pruning and simplification, creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there could be significant overheads when integrating FPGAs into existing machine learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous studies. However, our study shows that a naive integration of FPGAs into TensorFlow could lead to up...
Systolic algorithms are one of the killer applications on spatial architectures such as FPGAs and CGRAs. However, it requires a tremendous amount of human effort to design and implement a high-performance systolic array for a given algorithm using the traditional RTL-based methodology. On the other hand, existing high-level synthesis (HLS) tools either (1) force programmers to do "micro-coding", where too many optimizations must be carried out through tedious code restructuring and the insertion of vendor-specific pragmas,...
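To make the "micro-coding" burden concrete, below is a small hypothetical kernel written in the usual vendor-HLS style, where performance hinges on manual code restructuring and pragma insertion; the pragmas and factors are illustrative, not a design from the paper:

```cpp
// Illustrative only: the kind of manual "micro-coding" a programmer does
// in vendor HLS tools. The pragmas and factors below are typical of such
// tuning, not taken from the paper.
constexpr int N = 128;
constexpr int UF = 8;  // hypothetical unroll / partition factor

void matvec(const float A[N][N], const float x[N], float y[N]) {
#pragma HLS array_partition variable=A cyclic factor=8 dim=2
#pragma HLS array_partition variable=x cyclic factor=8
  for (int i = 0; i < N; ++i) {
    // Restructure the reduction so UF partial sums accumulate in parallel.
    float partial[UF] = {0.f};
#pragma HLS array_partition variable=partial complete
    for (int j = 0; j < N; j += UF) {
#pragma HLS pipeline
      for (int u = 0; u < UF; ++u) {
#pragma HLS unroll
        partial[u] += A[i][j + u] * x[j + u];
      }
    }
    // Final reduction of the partial sums.
    float acc = 0.f;
    for (int u = 0; u < UF; ++u) {
#pragma HLS unroll
      acc += partial[u];
    }
    y[i] = acc;
  }
}
```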
In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several...
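A minimal sketch of what such a task-parallel program looks like, assuming TAPA's publicly documented C++ APIs (tapa::stream, tapa::istream/ostream, tapa::mmap, tapa::task); the stream depths and task bodies are illustrative:

```cpp
// Sketch of a TAPA-style task-parallel dataflow program.
#include <cstdint>
#include <tapa.h>

// Read from off-chip memory and push into a FIFO.
void Producer(tapa::mmap<const float> in, uint64_t n, tapa::ostream<float>& q) {
  for (uint64_t i = 0; i < n; ++i) q.write(in[i]);
}

// Simple streaming computation between two FIFOs.
void Scale(tapa::istream<float>& in_q, tapa::ostream<float>& out_q, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) out_q.write(in_q.read() * 2.f);
}

// Drain the FIFO back to off-chip memory.
void Consumer(tapa::istream<float>& q, tapa::mmap<float> out, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) out[i] = q.read();
}

// Top-level task: declares the FIFOs and launches the three tasks in parallel.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n) {
  tapa::stream<float, 8> q0("q0");
  tapa::stream<float, 8> q1("q1");
  tapa::task()
      .invoke(Producer, in, n, q0)
      .invoke(Scale, q0, q1, n)
      .invoke(Consumer, q1, out, n);
}
```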
C/C++/OpenCL-based high-level synthesis (HLS) has become more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive programming approach in other application domains, where coarse-grained tasks run in parallel and communicate...
Using a sample of 58 million $J/\psi$ events collected with the BESII detector at BEPC, more than 100 000 $J/\psi \rightarrow p\bar{p}\pi^{0}$ events are selected, and a detailed partial wave analysis is performed. The branching fraction is determined to be...
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is due to the limitations of existing HLS tools in accessing the HBM board's large number of independent external memory channels. In this paper, we measure the performance of three representative...
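For context, a common HLS-level workaround is to give each HBM pseudo-channel its own AXI master bundle. A hypothetical Vitis-HLS-style sketch is shown below; the port names, bundle count, and exact pragma set are assumptions, not taken from the paper:

```cpp
// Sketch: one AXI master bundle per HBM pseudo-channel so that the three
// arrays can be served by independent channels in parallel.
extern "C" void vadd(const float* a, const float* b, float* c, int n) {
#pragma HLS interface m_axi port=a offset=slave bundle=gmem0
#pragma HLS interface m_axi port=b offset=slave bundle=gmem1
#pragma HLS interface m_axi port=c offset=slave bundle=gmem2
#pragma HLS interface s_axilite port=n
#pragma HLS interface s_axilite port=return
  for (int i = 0; i < n; ++i) {
#pragma HLS pipeline II=1
    c[i] = a[i] + b[i];
  }
}
```

At link time, each bundle would then be bound to a distinct pseudo-channel, e.g. with a connectivity entry along the lines of `sp=vadd_1.a:HBM[0]` in the linker configuration.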
Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the degradation is caused by broadcast structures generated by the HLS compiler. (2) We classify three major types of broadcasts in HLS-generated designs, including high-fanout data signals, pipeline flow control signals, and synchronization signals for concurrent modules. (3) We reveal a number...
Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. We study the timing issues in a diverse set of nine realistic HLS designs and observe that in most cases the degradation is related to signal broadcast structures. In this work, we classify the common types of broadcasts in HLS-generated designs, including data broadcast and two types of control broadcast: pipeline flow control broadcast and synchronization broadcast. We further identify several limitations of current HLS tools, which lead to improper handling of broadcasts. First,...
C/C++/OpenCL-based high-level synthesis (HLS) has become more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive programming approach in other application domains, where coarse-grained tasks run in parallel...
Many researchers studying the performance tuning of systolic arrays have based their work on oversimplified assumptions, like considering only divisors of the problem size for loop tiling or pruning off-chip data communication to reduce the design space. In this paper, we present a comprehensive design space exploration tool named Odyssey for systolic array optimization. Our results show that limiting the tiling factors to divisors of the problem size can cause up to a 39% performance loss, and that pruning off-chip data movement can miss optimal designs. We tested Odyssey using various matrix multiplication and convolution...
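As a toy illustration of that point (not the Odyssey tool itself), the sketch below sweeps all tile sizes for a tiled matrix multiplication, rather than only divisors of the problem size, and ranks them with a simple off-chip traffic model; the problem size and on-chip buffer budget are made up:

```cpp
// Toy tile-size sweep with a crude off-chip traffic model (illustrative only).
#include <cstdint>
#include <iostream>

int main() {
  const int64_t dim_i = 1000, dim_j = 1000, dim_k = 1000;  // hypothetical problem size
  const int64_t on_chip_words = 512 * 1024;                // hypothetical buffer budget
  int64_t best_ti = 1, best_tj = 1, best_traffic = -1;

  for (int64_t ti = 1; ti <= dim_i; ++ti) {    // NOT restricted to divisors of dim_i
    for (int64_t tj = 1; tj <= dim_j; ++tj) {  // NOT restricted to divisors of dim_j
      // On-chip: a ti x k tile of A, a k x tj tile of B, a ti x tj tile of C.
      if (ti * dim_k + dim_k * tj + ti * tj > on_chip_words) continue;
      // Off-chip words moved under output-stationary tiling:
      // A is re-read once per column tile, B once per row tile, C written once.
      const int64_t num_ti = (dim_i + ti - 1) / ti;
      const int64_t num_tj = (dim_j + tj - 1) / tj;
      const int64_t traffic =
          num_tj * (dim_i * dim_k) + num_ti * (dim_k * dim_j) + dim_i * dim_j;
      if (best_traffic < 0 || traffic < best_traffic) {
        best_traffic = traffic;
        best_ti = ti;
        best_tj = tj;
      }
    }
  }
  std::cout << "best tile " << best_ti << " x " << best_tj
            << ", est. off-chip words = " << best_traffic << "\n";
  return 0;
}
```

Even this crude model suggests why a divisor-only search can exclude better points when the problem size has few useful divisors.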
This paper introduces a whole project of an augmented reality interaction mode based on the Netty communication method, determines the main technical difficulties in the implementation process, and finds a solution by means of system architecture design and protocol development. In this paper, the Netty framework is adopted to handle the network IO and business logic; long connections are established between each terminal device, the virtual environment, and the server, so as to realize fast and natural...
As deep learning becomes pervasive in modern applications, many frameworks have been presented for practitioners to develop and train DNN models rapidly. Meanwhile, as training large models has become a trend in recent years, training throughput and memory footprint are getting crucial. Accordingly, optimizing training workloads with compiler optimizations is inevitable and is drawing more attention. However, existing deep learning compilers (DLCs) mainly target inference and do not incorporate holistic optimizations, such as automatic differentiation and mixed precision,...