NFDI4DS | UHH-SEMS - Publication Details

yaSpMV

OPENALEX - Publications

Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO),...

10.1145/2555243.2555255 article EN 2014-02-06

Locality-Driven Dynamic GPU Cache Bypassing

OPENALEX - Publications

Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and 1 more

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain...

10.1145/2751205.2751237 article EN 2015-06-02

yaSpMV

OPENALEX - Publications

Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO),...

10.1145/2692916.2555255 article EN ACM SIGPLAN Notices 2014-02-06

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

OPENALEX - Publications

Hongwen Dai, Zhen Lin, Chao Li, Chen Zhao, Fei Wang, and 2 more

Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize the vast GPU resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources. However, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although achieving better...

10.1109/hpca.2018.00027 article EN 2018-02-01

Deploying an Integrated Fiber Optic Sensing System for Seismo-Acoustic Monitoring: A Two-Year Continuous Field Trial in Xinfengjiang

OPENALEX - Publications

Siyuan Cang, Min Xu, J W Chen, Chao Li, Kan Gao, and 8 more

Distributed Acoustic Sensing (DAS) offers numerous advantages, including resistance to electromagnetic interference, long-range dynamic monitoring, dense spatial sensing, and low deployment costs. We initially deployed a water–land DAS system at the Xinfengjiang (XFJ) Reservoir in Guangdong Province, China, to monitor earthquake events. Environmental noise analysis identified three distinct zones based on conditions: periodic 18 Hz signals near surface-laid segments, attenuated low-frequency...

10.3390/jmse13020368 article EN cc-by Journal of Marine Science and Engineering 2025-02-17

A High-Responsivity Subsurface Buoy System With a Fiber-Optic Acoustic Vector Sensor for Continuous Low-Frequency Acoustic Monitoring in the Deep Ocean: Development and Sea Trials

OPENALEX - Publications

Chao Li, Xiaoming Cui, Haocai Huang, Yong Zhou, Min Xu, and 7 more

10.1109/joe.2025.3545248 article EN IEEE Journal of Oceanic Engineering 2025-01-01

Between racialized and racializing: Chinese students’ dual experiences of racism in Tanzania

OPENALEX - Publications

Y. X. Jiang, Kun Dai, Chao Li

10.1007/s10734-025-01462-8 article EN cc-by Higher Education 2025-05-14

On the use of the scaled spherical wave expansions for recovering the target sound field in non-anechoic spaces

OPENALEX - Publications

Chao Li, D. Hu, Yongchang Li, Yuan Liu

The scaled spherical wave expansion (SSWE) method was effectively applied to reconstruct the target sound field from measurements conducted in non-anechoic environments. Unlike the traditional spherical wave expansion (SWE), which requires careful selection of the optimal cutoff order to balance accuracy and computational efficiency, the SSWE approach eliminates this challenge by introducing a scaling factor that adjusts the coefficients. In addition, a simplified formulation was developed specifically for sources with rigid surfaces,...

10.1063/5.0272873 article EN cc-by AIP Advances 2025-05-01

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

OPENALEX - Publications

Chao Li, Yi Yang, Hongwen Dai, Shengen Yan, Frank Mueller, and 1 more

On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-the-art accelerators, such as the NVIDIA Fermi and Kepler GPUs and Intel's forthcoming MIC "Knights Landing" (KNL), support both software-managed caches, aka. shared memory (GPUs) or near memory (KNL), and hardware-managed L1 data caches (D-caches). Furthermore, shared memory and the L1 D-cache on a GPU utilize the same physical storage, and their capacity can be configured at runtime (same for KNL). In this paper,...

10.1109/ispass.2014.6844487 article EN 2014-03-01

Automatic data placement into GPU on-chip memory resources

OPENALEX - Publications

Chao Li, Yi Yang, Zhen Lin, Huiyang Zhou

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with...

10.5555/2738600.2738604 article EN Symposium on Code Generation and Optimization 2015-02-07

A model-driven approach to warp/thread-block level GPU cache bypassing

OPENALEX - Publications

Hongwen Dai, Chao Li, Huiyang Zhou, Saurabh Gupta, Christos Kartsaklis, and 1 more

The high amount of memory requests from massive threads may easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) to bypass the cache. Then we design a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme, which achieves a 1.68x speedup on average on a set of memory-intensive benchmarks over the baseline. Compared to prior works, our scheme achieves 21.6%...

10.1145/2897937.2897966 article EN 2016-05-25

Automatic data placement into GPU on-chip memory resources

OPENALEX - Publications

Chao Li, Yi Yang, Zhen Lin, Huiyang Zhou

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with...

10.1109/cgo.2015.7054184 article EN 2015-02-01

An Empirical Study on Concurrency Bugs in Interrupt-Driven Embedded Software

OPENALEX - Publications

Chao Li, Rui Chen, Boxiang Wang, Zhixuan Wang, Tingting Yu, and 3 more

Interrupt-driven embedded software is widely used in aerospace, automotive electronics, medical equipment, IoT, and other industrial fields. This type of software is usually programmed with interrupts to interact with hardware and respond to external stimuli on time. However, uncertain interleaving execution of interrupts may cause concurrency bugs, resulting in task failure or serious safety issues. A deep understanding of real-world concurrency bugs will significantly improve the ability of techniques for combating such bugs, such as bug detection, testing, and fixing.

10.1145/3597926.3598140 article EN 2023-07-12

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

OPENALEX - Publications

Yi Yang, Chao Li, Huiyang Zhou

10.1007/s11390-015-1500-y article EN Journal of Computer Science and Technology 2015-01-01

3D-enabled customizable embedded computer (3DECC)

OPENALEX - Publications

Paul D. Franzon, Eric Rotenberg, James Tuck, Huiyang Zhou, W. Rhett Davis, and 7 more

This paper describes a 3D computer architecture designed to achieve the lowest possible power consumption for "embedded applications" like radar and signal processing. It introduces several unique concepts, including a low-power SIMD tile, memories, and a 2.5D interconnect that is circuit switched so it can be tuned at run-time for a specific application. When conservatively projected to the 7 nm node, simulations of the architecture show the potential for exceeding 75 GFLOPS/W, about 20x better than today's CPUs and GPUs. This translates to 13...

10.1109/3dic.2014.7152143 article EN 2014-12-01

RACB: Resource Aware Cache Bypass on GPUs

OPENALEX - Publications

Hongwen Dai, Christos Kartsaklis, Chao Li, Tomislav Janjusic, Huiyang Zhou

Caches are universally used in computing systems to hide long off-chip memory access latencies. Unlike CPUs, massive threads running simultaneously on GPUs bring tremendous pressure on the memory hierarchy. As a result, the limitation of cache resources becomes a bottleneck for a GPU to exploit thread-level parallelism (TLP) and memory-level parallelism (MLP) to achieve high performance. In this paper, we propose a mechanism to bypass the L1D and L2 caches based on the availability of cache resources. Our proposed mechanism is based on the observation that a huge number of stalls coming...

10.1109/sbac-padw.2014.14 article EN 2014-10-01

SpecChecker-ISA: a data sharing analyzer for interrupt-driven embedded software

OPENALEX - Publications

Boxiang Wang, Rui Chen, Chao Li, Tingting Yu, Dongdong Gao, and 1 more

Concurrency bugs are common in interrupt-driven programs, which are widely used in safety-critical areas. These bugs are often caused by incorrect data sharing among tasks and interrupts. Therefore, data sharing analysis is crucial to reason about the concurrency behaviours of interrupt-driven programs. Due to the variety of data access forms, existing tools suffer from both extensive false positives and false negatives when applied to interrupt-driven programs. This paper presents SpecChecker-ISA, a tool that provides sound and precise data sharing analysis for interrupt-driven embedded software. The tool uses a memory model...

10.1145/3533767.3543295 article EN 2022-07-15

POSTER: Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

OPENALEX - Publications

Hongwen Dai, Zhen Lin, Chao Li, Chen Zhao, Fei Wang, and 2 more

In this study, we demonstrate that performance may be undermined in state-of-the-art intra-SM sharing schemes for concurrent kernel execution (CKE) on GPUs, due to interference among concurrent kernels. We highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose to balance memory accesses and limit the number of inflight memory instructions issued from concurrent kernels to reduce memory pipeline stalls. Our proposed schemes significantly improve the performance of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK.

10.1109/pact.2017.30 article EN 2017-09-01

Research on Data Acquisition System of EMU Structure Health Monitoring Based on DSP and FPGA

OPENALEX - Publications

Chao Li, Bin Zhang

Because structural health monitoring involves many sensors and signal types, a data collection system was designed with a DSP and an FPGA as the core controller. The designed system can realize multi-channel parallel data collection, flexible port configuration, and good scalability, which meets the requirements of high-speed data processing and real-time online monitoring. The data acquisition system has the advantages of high performance, low cost, and convenient application. In addition, wide sampling frequency range,...

10.1088/1742-6596/1544/1/012172 article EN Journal of Physics Conference Series 2020-05-01

Multi-Sensor Space Target Orbit Forecast Data Fusion Algorithm

OPENALEX - Publications

Chao Li, Liu Yun-jiang, Xiaopeng Yang, Hengyang Zhang, Zengping Chen

10.1166/sl.2011.1673 article EN Sensor Letters 2011-08-01