Jie Wang

ORCID: 0009-0005-4657-7977
Research Areas
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Parallel Computing and Optimization Techniques
  • Advanced Neural Network Applications
  • VLSI and FPGA Design Techniques
  • CCD and CMOS Imaging Sensors
  • VLSI and Analog Circuit Testing
  • Real-Time Systems Scheduling
  • Low-power high-performance VLSI design
  • Advanced Image and Video Retrieval Techniques
  • Robotics and Sensor-Based Localization
  • Plasma Diagnostics and Applications
  • Cellular Automata and Applications
  • Advanced Signal Processing Techniques
  • Petri Nets in System Modeling
  • Robotics and Automated Systems
  • Innovation Diffusion and Forecasting
  • Economic and Technological Innovation
  • Network Security and Intrusion Detection
  • Network Packet Processing and Optimization
  • Sparse and Compressive Sensing Techniques
  • Machine Learning and Data Classification
  • Domain Adaptation and Few-Shot Learning
  • Fusion materials and technologies
  • Network Time Synchronization Technologies

China Telecom (China)
2025

China Telecom
2025

University of South China
2025

University of California, Los Angeles
2018-2023

Amazon (United States)
2023

Beijing Institute of Technology
2022

Laboratoire d'Analyse et d'Architecture des Systèmes
2021

University of California System
2020

East China University of Technology
2020

Hebei University
2019

In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerators for CNN.

10.1145/2847263.2847265 article EN 2016-02-04

With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with accelerators, such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure...

10.1145/3289602.3293910 article EN 2019-02-20

Automatic systolic array generation has long been an interesting topic due to the need to reduce the lengthy development cycles of manual designs. The existing automatic approach builds dependency graphs from algorithms and iteratively maps computation nodes in the graph into processing elements (PEs) with time stamps that specify the sequences of nodes that operate within the PE. There are a number of previous works that implemented the idea and generated designs for ASICs. However, all of these works relied on human intervention and usually generated inferior...

10.1145/3240765.3240838 article EN 2018-11-05

While systolic array architectures have the potential to deliver tremendous performance, it is notoriously challenging to customize an efficient systolic array processor for a target application. Designing systolic arrays requires knowledge of both the high-level characteristics of the application and low-level hardware details, thus making it a demanding and inefficient process. To relieve users from the manual iterative trial-and-error process, we present AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA. AutoSA is based on the polyhedral...

10.1145/3431920.3439292 article EN 2021-02-17

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable clock frequency between an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty of accurately estimating the interconnect delay at the HLS level. Unfortunately, this problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects incur a high timing penalty.

10.1145/3431920.3439289 article EN 2021-02-17

With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers, especially transposed and dilated convolutions, and (3) the CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into...

10.1145/3570928 article EN ACM Transactions on Reconfigurable Technology and Systems 2022-12-20

FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing and then stitch the separate partitions together. We outline a number of technical challenges and address them by breaking the boundaries between different...

10.1145/3490422.3502361 article EN 2022-02-11

The irregularity of recent Convolutional Neural Network (CNN) models, such as less data reuse and parallelism due to extensive network pruning and simplification, creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there could be significant overheads when integrating FPGAs into existing machine learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous studies. However, our study shows that a naive FPGA integration into TensorFlow could lead to up...

10.1145/3373087.3375321 article EN 2020-02-23

Systolic algorithms are one of the killer applications on spatial architectures such as FPGAs and CGRAs. However, it requires a tremendous amount of human effort to design and implement a high-performance systolic array for a given algorithm using the traditional RTL-based design methodology. On the other hand, existing high-level synthesis (HLS) tools either (1) force the programmers to do "micro-coding", where too many optimizations must be carried out through tedious code restructuring and insertion of vendor-specific pragmas,...

10.1145/3400302.3415644 article EN 2020-11-02

In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allows users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several...

10.1145/3609335 article EN cc-by ACM Transactions on Reconfigurable Technology and Systems 2023-09-18

C/C++/OpenCL-based high-level synthesis (HLS) has become more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate...

10.1109/fccm51124.2021.00032 article EN 2021-05-01

Using a sample of 58 million $J/\psi$ events collected with the BESII detector at the BEPC, more than 100 000 $J/\psi \rightarrow p\overline{p}\pi^{0}$ events are selected, and a detailed partial wave analysis is performed. The branching fraction is determined to be...

10.1103/physrevd.80.052004 article EN Physical review. D, Particles, fields, gravitation, and cosmology 2009-09-17

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is due to the limitation of existing HLS tools in accessing the HBM board's large number of independent memory channels. In this paper, we measure the performance of three representative...

10.48550/arxiv.2010.06075 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the frequency degradation is caused by the broadcast structures generated by the HLS compiler. (2) We classify three major types of broadcasts in HLS-generated designs, including high-fanout data signals, pipeline flow control signals, and synchronization signals for concurrent modules. (3) We reveal a number...

10.1109/dac18072.2020.9218718 article EN 2020-07-01

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. We study the timing issues in a diverse set of nine realistic HLS designs and observe that in most cases the frequency degradation is related to the signal broadcast structures. In this work, we classify the common broadcast types in HLS designs, including the data broadcast and two types of control broadcast: the pipeline control broadcast and the synchronization signal broadcast. We further identify several limitations of the current HLS tools, which lead to improper handling of the broadcasts. First,...

10.1145/3373087.3375332 article EN 2020-02-23

C/C++/OpenCL-based high-level synthesis (HLS) has become more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of result (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel...

10.1145/3431920.3439470 preprint EN 2021-02-17

Many researchers studying the performance tuning of systolic arrays have based their works on oversimplified assumptions, like considering only divisors for loop tiling or pruning based on off-chip data communication to reduce the design space. In this paper, we present a comprehensive design space exploration tool named Odyssey for systolic array optimization. The results show that limiting the tiling factors to only divisors of the problem size can cause up to 39% performance loss, and pruning the design space based on off-chip data movement can miss optimal designs. We tested Odyssey using various matrix multiplication and convolution...

10.1109/dac56929.2023.10248016 article EN 2023-07-09

This paper introduces a whole project of an augmented reality interaction mode based on the netty communication method, determines the main technical difficulties in the implementation process, and finds the solution by means of system architecture design and communication protocol development. In this paper, the netty framework is adopted to deal with the network IO and business logic, the communication between each terminal device, the virtual environment, and the server is established through a long connection, adopts establishes device so as to realize fast and natural...

10.1088/1742-6596/1575/1/012015 article EN Journal of Physics Conference Series 2020-06-01

As deep learning is pervasive in modern applications, many deep learning frameworks are presented for deep learning practitioners to develop and train DNN models rapidly. Meanwhile, as training large deep learning models has become a trend in recent years, the training throughput and memory footprint are getting crucial. Accordingly, optimizing training workloads with compiler optimizations is inevitable and is getting more and more attention. However, existing deep learning compilers (DLCs) mainly target inference and do not incorporate holistic optimizations, such as automatic differentiation and mixed precision,...

10.48550/arxiv.2303.04759 preprint EN cc-by arXiv (Cornell University) 2023-01-01