NFDI4DS | UHH-SEMS - Publication Details

Sheng Ma

ORCID: 0000-0003-1710-4060

About

Contact & Profiles

A5100760813

Research Areas

Parallel Computing and Optimization Techniques
Interconnection Networks and Systems
Advanced Memory and Neural Computing
Advanced Neural Network Applications
Advanced Data Storage Technologies
Embedded Systems Design Techniques
Low-power high-performance VLSI design
Brain Tumor Detection and Classification
Distributed and Parallel Computing Systems
Data Mining Algorithms and Applications
Ferroelectric and Negative Capacitance Devices
Advancements in Battery Materials
Data Management and Algorithms
VLSI and FPGA Design Techniques
Numerical Methods and Algorithms
Software-Defined Networks and 5G
CCD and CMOS Imaging Sensors
3D IC and TSV technologies
Graphene research and applications
Cloud Computing and Resource Management
Radiation Effects in Electronics
Network Packet Processing and Optimization
Adversarial Robustness in Machine Learning
Time Series Analysis and Forecasting
Landslides and related hazards

National University of Defense Technology
2015-2024

Shenyang Institute of Engineering
2024

China University of Geosciences (Beijing)
2023-2024

University of Macau
2022

Changsha University
2021

Harbin Engineering University
2019

Centre for High Performance Computing
2016

University of Toronto
2012

IBM Research - Thomas J. Watson Research Center
2002-2003

IBM (United States)
2002-2003

Mining partially periodic event patterns with unknown periods

OPENALEX - Publications

Sheng Ma, Joseph L. Hellerstein

Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal...

10.1109/icde.2001.914829 article EN 2002-11-13

DBAR

Sheng Ma, Natalie Enright Jerger, Zhiying Wang

With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing congestion propagation to obtain status information beyond neighboring nodes. However, they may suffer from intra- and inter-application...

10.1145/2000064.2000113 article EN 2011-06-04

Predicting rare events in temporal domains

Ricardo Vilalta, Sheng Ma

Temporal data mining aims at finding patterns in historical data. Our work proposes an approach to extract temporal patterns from data to predict the occurrence of target events, such as computer attacks on host networks, or fraudulent transactions in financial institutions. Our problem formulation exhibits two major challenges: 1) we assume events being characterized by categorical features and displaying uneven inter-arrival times; such an assumption falls outside the scope of classical time-series analysis, 2) target events are highly...

10.1109/icdm.2002.1183991 article EN 2003-06-26

Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip

Sheng Ma, Natalie Enright Jerger, Zhiying Wang

Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that...

10.1109/hpca.2012.6169049 article EN 2012-02-01

Optimizing value prediction for ILP processors: A design space exploration approach

Ling Yang, Zhong Zheng, Libo Huang, Run Yan, Sheng Ma, and 2 more

10.1016/j.vlsi.2025.102402 article EN Integration 2025-03-01

Leaving One Slot Empty: Flit Bubble Flow Control for Torus Cache-Coherent NoCs

Sheng Ma, Zhiying Wang, Zonglin Liu, Natalie Enright Jerger

Short and long packets co-exist in cache-coherent NoCs. Existing designs for torus networks do not efficiently handle variable-size packets. For deadlock-free operation, a typical design uses two VCs, which negatively affects the router frequency. Some optimizations use one VC. Yet, they regard all packets as maximum-length packets, inefficiently utilizing the precious buffers. We propose flit bubble flow control (FBFC), which maintains one free flit-size buffer slot to avoid deadlock. FBFC uses one VC, and does not treat short packets as long ones. It...

10.1109/tc.2013.2295523 article EN IEEE Transactions on Computers 2014-01-31

Supporting efficient collective communication in NoCs

Sheng Ma, Natalie Enright Jerger, Zhiying Wang

Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before...

10.1109/hpca.2012.6168953 article EN 2012-02-01

Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support

Libo Huang, Sheng Ma, Li Shen, Zhiying Wang, Nong Xiao

Binary64 arithmetic is rapidly becoming inadequate to cope with today's large-scale computations due to an accumulation of errors. Therefore, binary128 arithmetic is now required to increase the accuracy and reliability of these computations. At the same time, an obvious trend emerging in modern processors is to extend their instruction sets by allowing single instruction multiple data (SIMD) execution, which can significantly accelerate data-parallel applications. To address the combined demands mentioned above, this paper presents...

10.1109/tc.2011.77 article EN IEEE Transactions on Computers 2011-04-06

Novel Flow Control for Fully Adaptive Routing in Cache-Coherent NoCs

Sheng Ma, Zhiying Wang, Natalie Enright Jerger, Li Shen, Nong Xiao

Routing algorithms for cache-coherent NoCs only have limited VCs at their disposal, which poses challenges to the design of routing algorithms. Existing fully adaptive routing algorithms apply conservative VC re-allocation: only empty VCs can be re-allocated, which limits performance. We propose two novel flow control designs. First, whole packet forwarding (WPF) re-allocates a nonempty VC if the VC has enough free buffers for an entire packet. WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. It is an important...

10.1109/tpds.2013.166 article EN IEEE Transactions on Parallel and Distributed Systems 2013-06-28
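
The re-allocation rule contrasted in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VirtualChannel` class, its fields, and both function names are hypothetical, and the full WPF design may impose further conditions that the truncated abstract does not show.

```python
from dataclasses import dataclass


@dataclass
class VirtualChannel:
    """Hypothetical model of one VC buffer, measured in flit-sized slots."""
    depth: int     # total buffer slots in the VC
    occupied: int  # slots currently holding flits

    @property
    def free_slots(self) -> int:
        return self.depth - self.occupied


def can_reallocate_conservative(vc: VirtualChannel) -> bool:
    # Conservative scheme: only a completely empty VC may be re-allocated.
    return vc.occupied == 0


def can_reallocate_wpf(vc: VirtualChannel, packet_len: int) -> bool:
    # WPF-style check (as described in the abstract): a non-empty VC may be
    # re-allocated if its free buffers can hold the entire incoming packet.
    return vc.free_slots >= packet_len
```

For example, a 5-slot VC holding 2 flits is rejected by the conservative check but accepted by the WPF-style check for a 3-flit packet, which is the extra buffer utilization the abstract refers to.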

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

Rui Xu, Sheng Ma, Yaohua Wang, Xinhai Chen, Yang Guo

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array decreases sharply. The main reason...

10.1145/3460776 article EN ACM Transactions on Architecture and Code Optimization 2021-07-17

DBAR

Sheng Ma, Natalie Enright Jerger, Zhiying Wang

10.1145/2024723.2000113 article EN ACM SIGARCH Computer Architecture News 2011-06-04

A high performance reliable NoC router

Lu Wang, Sheng Ma, Chen Li, Wei Chen, Zhiying Wang

10.1016/j.vlsi.2016.10.016 article EN Integration 2016-10-19

Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators

Rui Xu, Sheng Ma, Yaohua Wang, Yang Guo, Dongsheng Li, and 1 more

Compact convolutional neural networks have become a hot research topic. However, we find that systolic array accelerators are extremely inefficient in dealing with compact models, especially when processing the depthwise convolutional layers in these networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple modes of dataflow, which can further exploit the data reuse opportunities of the models without changing the scale or structure of the naive systolic array. By increasing...

10.1109/tpds.2021.3129647 article EN cc-by IEEE Transactions on Parallel and Distributed Systems 2021-01-01

RHS-TRNG: A Resilient High-Speed True Random Number Generator Based on STT-MTJ Device

Siqing Fu, Tiejun Li, Chunyuan Zhang, Hanqing Li, Sheng Ma, and 3 more

High-quality random numbers are critical to many fields such as cryptography, finance, and scientific simulation, which calls for the design of reliable true random number generators (TRNGs). Limited by entropy source, throughput, reliability, and system integration, existing TRNG designs are difficult to deploy in real computing systems to greatly accelerate target applications. This study proposes a TRNG circuit named resilient high-speed (RHS)-TRNG based on the spin-transfer torque magnetic tunnel junction...

10.1109/tvlsi.2023.3298327 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2023-08-21

Landslide Susceptibility Prediction Using Machine Learning Methods: A Case Study of Landslides in the Yinghu Lake Basin in Shaanxi

Sheng Ma, Jian Chen, Saier Wu, Yurou Li

Landslide susceptibility prediction (LSP) is the basis for risk management and plays an important role in social sustainability. However, the modeling process of LSP is constrained by various factors. This paper examines the effect of landslide data integrity, machine-learning (ML) models, and non-landslide sample-selection methods on the accuracy of LSP, taking the Yinghu Lake Basin in Ankang City, Shaanxi Province, as an example. First, the previous landslide inventory (totaling 46) and the updated landslide inventory (totaling 46 + 176) were established through...

10.3390/su152215836 article EN Sustainability 2023-11-10

EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs

Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang

Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieve artificial intelligence (AI). The former have been widely used in academia and industry; the latter, SNNs, are more similar to biological neural networks and can realize ultra-low power consumption, and have thus received widespread research attention. However, due to their fundamental differences in computation formula and information coding, the two methods often require different and incompatible platforms. Alongside the development of AI, a platform that...

10.1145/3643134 article EN cc-by ACM Transactions on Design Automation of Electronic Systems 2024-01-25

Inventory and Spatial Distribution of Landslides on the Eastern Slope of Gongga Mountain, Southwest China

Runze Ge, Jian Chen, Sheng Ma, Huarong Tan

The eastern slope of Gongga Mountain is located in the mountainous region of Southwestern China, which has strong geologic tectonics that lead to frequent landslide hazards. A large number of such landslides were induced by the 2022 Luding Ms 6.8 earthquake. Therefore, it is necessary to identify the spatial distribution of landslides in the region. In this paper, the Google Earth platform and GF-1 and GF-6 satellite imagery were used to construct new inventories of pre-earthquake and co-seismic landslides. Then, we analyzed the relationship between the conditioning...

10.3390/rs16183360 article EN cc-by Remote Sensing 2024-09-10

A low-cost conflict-free NoC for GPGPUs

Xia Zhao, Sheng Ma, Yu-xi Liu, Lieven Eeckhout, Zhiying Wang

As integrated circuits are limited by hardware resources, reducing cost while maintaining performance becomes especially important. In this article, we propose a conflict-free NoC (cfNoC) for the GPGPU request network. The cfNoC eliminates (i) conflicts among different columns by deploying an exclusive subnet for each column, and (ii) conflicts inside the same column by using a token-based mechanism. The elimination of conflicts allows cfNoC to exploit channel widths to maintain cost. Compared with the baseline mesh with 1 VC, our work reduces...

10.1145/2897937.2897963 article EN 2016-05-25

A comprehensive comparison between virtual cut-through and wormhole routers for cache coherent Network on-Chips

Peng Wang, Sheng Ma, Hongyi Lu, Zhiying Wang

A basic design aspect of cache coherent Networks-on-Chip (NoCs) is the flow control mechanism. Since the minimum buffer size of virtual cut-through (VCT) switching is larger than that of wormhole switching, VCT is traditionally regarded as an inefficient NoC flow control type. Yet, the scaling of semiconductor technology shrinks the transistor size and reduces the criticality of the buffer amount for NoC designs; VCT may become a promising candidate. This paper performs a comprehensive comparison between VCT and wormhole switching. Based on detailed RTL-level implementations, we...

10.1587/elex.11.20140496 article EN IEICE Electronics Express 2014-01-01

Verification of application of the 2.5D method in high-speed trimaran vertical motion and added resistance prediction

Wenyang Duan, S.M. Wang, Sheng Ma

10.1016/j.oceaneng.2019.106177 article EN Ocean Engineering 2019-07-13

A heterogeneous low-cost and low-latency Ring-Chain network for GPGPUs

Xia Zhao, Sheng Ma, Chen Li, Lieven Eeckhout, Zhiying Wang

To achieve high throughput, the core count in compute accelerators such as General-Purpose Graphics Processing Units (GPGPUs) increases continuously. The communication demand of these cores boosts the demand for a low-latency packet switched network. As the latency is mainly composed of per-hop latency, contention latency, and serialization latency, a favorable Network-on-Chip (NoC) design should efficiently decrease these three contributors to meet the demand while keeping the hardware cost low. In this paper, we first make two observations about...

10.1109/iccd.2016.7753329 article EN 2016 IEEE 34th International Conference on Computer Design (ICCD) 2016-10-01

Priority-Based PCIe Scheduling for Multi-Tenant Multi-GPU Systems

Chen Li, Jun Yang, Yifan Sun, Lingling Jin, Lingjie Xu, and 5 more

Multi-GPU systems are widely used in data centers to provide significant speedups to compute-intensive workloads such as deep neural network training. However, limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that relying on a traditional Round-Robin-based PCIe scheduling policy can result in severe bandwidth competition and stall the execution of multiple GPUs. In this article, we propose a priority-based PCIe scheduling policy which aims to overlap the data transfers and GPU execution for different applications to alleviate...

10.1109/lca.2019.2955119 article EN IEEE Computer Architecture Letters 2019-07-01

SIF: Overcoming the limitations of SIMD devices via implicit permutation

Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, and 1 more

SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance in multimedia applications. However, there are three remaining limitations to the efficient utilization of SIMD devices in general-purpose computer systems: memory alignment, data reorganization, and control flow. This paper presents SIF, an interface framework that addresses these shortcomings without modifying the existing ISA. It is designed around a permutation vector register file (PVRF) and it adds new...

10.1109/hpca.2010.5416631 article EN 2010-01-01

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs

Sheng Ma, Natalie Enright Jerger, Zhiying Wang, Mingche Lai, Libo Huang

To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, the Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidated workloads. In the process of this design, we holistically consider all aspects to...

10.1109/tc.2012.201 article EN IEEE Transactions on Computers 2012-08-20