- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Embedded Systems Design Techniques
- Advanced Measurement and Detection Methods
- Radiation Effects in Electronics
- Advanced Sensor and Control Systems
- Matrix Theory and Algorithms
- Underwater Acoustics Research
- Low-power high-performance VLSI design
- Water Systems and Optimization
- Seismic Waves and Analysis
- Advanced Neural Network Applications
- Optimization and Packing Problems
- Calcium Carbonate Crystallization and Inhibition
- High-Voltage Power Transmission Systems
- Inertial Sensor and Navigation
- Spacecraft and Cryogenic Technologies
- Real-Time Systems Scheduling
- Advanced Numerical Analysis Techniques
- Water Quality Monitoring Technologies
- Optical Systems and Laser Technology
- Smart Grid and Power Systems
- Computer Graphics and Visualization Techniques
Shanghai University of Engineering Science
2025
Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou)
2025
NARI Group (China)
2020
Luoyang Institute of Science and Technology
2019
North Carolina State University
2014-2018
SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO),...
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain...
Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize the vast GPU resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources. However, it fails to optimize the resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although achieving better...
Distributed Acoustic Sensing (DAS) offers numerous advantages, including resistance to electromagnetic interference, long-range dynamic monitoring, dense spatial sensing, and low deployment costs. We initially deployed a water–land DAS system at the Xinfengjiang (XFJ) Reservoir in Guangdong Province, China, to monitor earthquake events. Environmental noise analysis identified three distinct zones based on conditions: periodic 18 Hz signals near surface-laid segments, attenuated low-frequency...
The scaled spherical wave expansion (SSWE) method was effectively applied to reconstruct the target sound field from measurements conducted in non-anechoic environments. Unlike the traditional spherical wave expansion (SWE), which requires careful selection of the optimal cutoff order to balance accuracy and computational efficiency, the SSWE approach eliminates this challenge by introducing a scaling factor that adjusts the expansion coefficients. In addition, a simplified formulation was developed specifically for sources with rigid surfaces,...
On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-art accelerators, such as the NVIDIA Fermi and Kepler GPUs and Intel's forthcoming MIC "Knights Landing" (KNL), support both software-managed caches, aka. shared memory (GPUs) or near memory (KNL), and hardware-managed L1 data caches (D-caches). Furthermore, shared memory and the L1 D-cache on a GPU utilize the same physical storage, and their capacity can be configured at runtime (same for KNL). In this paper,...
Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU resources is a non-trivial task for developers. More importantly, as these resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with...
The high amount of memory requests from massive threads may easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) to bypass the cache. Then we design a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme, which achieves 1.68x speedup on average on a set of memory-intensive benchmarks over the baseline. Compared to prior works, our scheme achieves 21.6%...
Interrupt-driven embedded software is widely used in aerospace, automotive electronics, medical equipment, IoT, and other industrial fields. This type of software is usually programmed with interrupts to interact with hardware and respond to external stimuli on time. However, uncertain interleaving execution of interrupts may cause concurrency bugs, resulting in task failure or serious safety issues. A deep understanding of real-world concurrency bugs will significantly improve the ability of techniques for combating such bugs, such as bug detection, testing, and fixing.
This paper describes a 3D computer architecture designed to achieve the lowest possible power consumption for "embedded applications" like radar and signal processing. It introduces several unique concepts including a low-power SIMD tile, memories, and a 2.5D interconnect that is circuit switched so that it can be tuned at run-time for a specific application. When conservatively projected to the 7 nm node, simulations of the architecture show the potential for exceeding 75 GFLOPS/W, about 20x better than today's CPUs and GPUs. This translates to 13...
Caches are universally used in computing systems to hide long off-chip memory access latencies. Unlike CPUs, massive threads running simultaneously on GPUs bring a tremendous pressure on the memory hierarchy. As a result, the limitation of cache resources becomes a bottleneck for a GPU to exploit thread-level parallelism (TLP) and memory-level parallelism (MLP) to achieve high performance. In this paper, we propose a mechanism to bypass the L1D and L2 caches based on the availability of cache resources. Our proposed mechanism is based on the observation that a huge number of stalls coming...
Concurrency bugs are common in interrupt-driven programs, which are widely used in safety-critical areas. These bugs are often caused by incorrect data sharing among tasks and interrupts. Therefore, data sharing analysis is crucial to reason about the concurrency behaviours of interrupt-driven programs. Due to the variety of data access forms, existing tools suffer from both extensive false positives and false negatives when applied to interrupt-driven programs. This paper presents SpecChecker-ISA, a tool that provides sound and precise data sharing analysis for interrupt-driven embedded software. The tool uses a memory model...
In this study, we demonstrate that the performance may be undermined in state-of-the-art intra-SM sharing schemes for concurrent kernel execution (CKE) on GPUs, due to the interference among concurrent kernels. We highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. Then we propose to balance memory accesses and limit the number of inflight memory instructions issued from concurrent kernels to reduce memory pipeline stalls. Our proposed schemes significantly improve the performance of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK.
Because structural health monitoring involves many sensors and signal types, a data collection system was designed with a DSP and an FPGA as the core controller. The designed system can realize multi-channel parallel collection, flexible port configuration, and good scalability, which meets the requirements of high-speed data processing and real-time online monitoring. The acquisition system has the advantages of high performance, low cost, and convenient application. In addition, the system has a wide sampling frequency range,...