Sheng Ma

ORCID: 0000-0003-1710-4060
Research Areas
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • Advanced Memory and Neural Computing
  • Advanced Neural Network Applications
  • Advanced Data Storage Technologies
  • Embedded Systems Design Techniques
  • Low-power high-performance VLSI design
  • Brain Tumor Detection and Classification
  • Distributed and Parallel Computing Systems
  • Data Mining Algorithms and Applications
  • Ferroelectric and Negative Capacitance Devices
  • Advancements in Battery Materials
  • Data Management and Algorithms
  • VLSI and FPGA Design Techniques
  • Numerical Methods and Algorithms
  • Software-Defined Networks and 5G
  • CCD and CMOS Imaging Sensors
  • 3D IC and TSV technologies
  • Graphene research and applications
  • Cloud Computing and Resource Management
  • Radiation Effects in Electronics
  • Network Packet Processing and Optimization
  • Adversarial Robustness in Machine Learning
  • Time Series Analysis and Forecasting
  • Landslides and related hazards

National University of Defense Technology
2015-2024

Shenyang Institute of Engineering
2024

China University of Geosciences (Beijing)
2023-2024

University of Macau
2022

Changsha University
2021

Harbin Engineering University
2019

Centre for High Performance Computing
2016

University of Toronto
2012

IBM Research - Thomas J. Watson Research Center
2002-2003

IBM (United States)
2002-2003

Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure the mining into two sub-tasks: (1) finding the periods and (2) temporal...

10.1109/icde.2001.914829 article EN 2002-11-13

With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive algorithms limits performance due to poor network congestion avoidance. Globally adaptive algorithms attack this issue by introducing congestion propagation to obtain status information beyond neighboring nodes. However, they may suffer from intra- and inter-application...

10.1145/2000064.2000113 article EN 2011-06-04

Temporal data mining aims at finding patterns in historical data. Our work proposes an approach to extract temporal patterns from data to predict the occurrence of target events, such as computer attacks on host networks, or fraudulent transactions in financial institutions. Our problem formulation exhibits two major challenges: 1) we assume events being characterized by categorical features and displaying uneven inter-arrival times; such an assumption falls outside the scope of classical time-series analysis, 2) target events are highly...

10.1109/icdm.2002.1183991 article EN 2003-06-26

Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can be re-allocated only when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that...

10.1109/hpca.2012.6169049 article EN 2012-02-01

Short and long packets co-exist in cache-coherent NoCs. Existing designs for torus networks do not efficiently handle variable-size packets. For deadlock-free operations, a design uses two VCs, which negatively affects the router frequency. Some optimizations use one VC. Yet, they regard all packets as maximum-length packets, inefficiently utilizing the precious buffers. We propose flit bubble flow control (FBFC), which maintains one free flit-size buffer slot to avoid deadlock. FBFC uses one VC, and does not treat short packets as long ones. It...

10.1109/tc.2013.2295523 article EN IEEE Transactions on Computers 2014-01-31

Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before...

10.1109/hpca.2012.6168953 article EN 2012-02-01

Binary64 arithmetic is rapidly becoming inadequate to cope with today's large-scale computations due to an accumulation of errors. Therefore, binary128 arithmetic is now required to increase the accuracy and reliability of these computations. At the same time, an obvious trend emerging in modern processors is to extend their instruction sets by allowing single instruction multiple data (SIMD) execution, which can significantly accelerate data-parallel applications. To address the combined demands mentioned above, this paper presents...

10.1109/tc.2011.77 article EN IEEE Transactions on Computers 2011-04-06
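
The accumulation of binary64 rounding errors mentioned in the abstract above can be seen on a very small scale in a few lines of Python, whose `float` type is binary64 on common platforms. This is only an illustrative sketch and is not taken from the paper; the helper `naive_sum` is a hypothetical name, and `math.fsum` from the standard library serves as a correctly rounded reference.

```python
import math


def naive_sum(values):
    """Left-to-right binary64 accumulation; every addition rounds once."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 10             # 0.1 is not exactly representable in binary64
naive = naive_sum(values)       # rounding errors accumulate across the additions
reference = math.fsum(values)   # tracks partial sums to return a correctly rounded result

# Compare both results against 1.0 and show the accumulated difference.
print(naive == 1.0, reference == 1.0, abs(naive - reference))
```

With only ten terms the discrepancy is about one unit in the last place; the abstract's concern is that such errors grow over large-scale computations, which is the motivation it gives for binary128 arithmetic.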

Routing algorithms for cache-coherent NoCs only have limited VCs at their disposal, which poses challenges to the design of routing algorithms. Existing fully adaptive routing algorithms apply conservative VC re-allocation: only empty VCs can be re-allocated, which limits performance. We propose two novel flow control designs. First, whole packet forwarding (WPF) re-allocates a nonempty VC if the VC has enough free buffers for an entire packet. WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. It is an important...

10.1109/tpds.2013.166 article EN IEEE Transactions on Parallel and Distributed Systems 2013-06-28
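
The two VC re-allocation conditions stated in the abstract above can be contrasted in a minimal sketch. The `VirtualChannel` class, its fields, and both function names are hypothetical and chosen for illustration only; the sketch covers just the buffer-space check and does not model the routing algorithm or the deadlock-freedom precondition the abstract mentions.

```python
from dataclasses import dataclass


@dataclass
class VirtualChannel:
    depth: int      # total buffer slots in the VC, counted in flits (assumed unit)
    occupied: int   # flits currently held in the VC

    @property
    def free_slots(self) -> int:
        return self.depth - self.occupied


def can_reallocate_conservative(vc: VirtualChannel) -> bool:
    """Conservative scheme: only an empty VC may be re-allocated."""
    return vc.occupied == 0


def can_reallocate_wpf(vc: VirtualChannel, packet_len: int) -> bool:
    """WPF condition as worded in the abstract: the VC may be non-empty,
    provided its free buffers can hold the entire incoming packet."""
    return vc.free_slots >= packet_len


# A 5-flit VC holding a 1-flit packet: conservative re-allocation must wait,
# while the WPF check admits any packet of up to 4 flits.
vc = VirtualChannel(depth=5, occupied=1)
print(can_reallocate_conservative(vc), can_reallocate_wpf(vc, packet_len=4))
```

Under these assumptions an empty VC passes both checks, so the WPF condition only widens the set of VCs that can be re-allocated.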

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate decreases sharply. The main reason...

10.1145/3460776 article EN ACM Transactions on Architecture and Code Optimization 2021-07-17

With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive algorithms limits performance due to poor network congestion avoidance. Globally adaptive algorithms attack this issue by introducing congestion propagation to obtain status information beyond neighboring nodes. However, they may suffer from intra- and inter-application...

10.1145/2024723.2000113 article EN ACM SIGARCH Computer Architecture News 2011-06-04

Compact convolutional neural networks have become a hot research topic. However, we find that the systolic array accelerators are extremely inefficient in dealing with compact models, especially when processing depthwise convolutional layers in the networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple modes of dataflow, which can further exploit data reuse opportunities without changing the scale or structure of the naïve systolic array. By increasing...

10.1109/tpds.2021.3129647 article EN cc-by IEEE Transactions on Parallel and Distributed Systems 2021-01-01

High-quality random numbers are critical to many fields such as cryptography, finance, and scientific simulation, which calls for the design of reliable true random number generators (TRNGs). Limited by entropy source, throughput, reliability, and system integration, existing TRNG designs are difficult to deploy in real computing systems to greatly accelerate target applications. This study proposes a TRNG circuit named resilient high-speed (RHS)-TRNG based on spin-transfer torque magnetic tunnel junction...

10.1109/tvlsi.2023.3298327 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2023-08-21

Landslide susceptibility prediction (LSP) is the basis for risk management and plays an important role in social sustainability. However, the modeling process of LSP is constrained by various factors. This paper approaches the effect of landslide data integrity, machine-learning (ML) models, and non-landslide sample-selection methods on the accuracy of LSP, taking the Yinghu Lake Basin in Ankang City, Shaanxi Province, as an example. First, the previous landslide inventory (totaling 46) and the updated landslide inventory (totaling 46 + 176) were established through...

10.3390/su152215836 article EN Sustainability 2023-11-10

Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieve artificial intelligence (AI). The former have been widely used in academia and industry; the latter, SNNs, are more similar to biological neural networks and can realize ultra-low power consumption, and have thus received widespread research attention. However, due to their fundamental differences in computation formula and information coding, the two methods often require different and incompatible platforms. Alongside the development of AI, a platform that...

10.1145/3643134 article EN cc-by ACM Transactions on Design Automation of Electronic Systems 2024-01-25

The eastern slope of Gongga Mountain is located in the mountainous region of Southwestern China, which has strong geologic tectonics that lead to frequent landslide hazards. A large number of such landslides were induced by the 2022 Luding Ms 6.8 earthquake. Therefore, it is necessary to identify the spatial distribution of landslides in the region. In this paper, the Google Earth platform and GF-1 and GF-6 satellite imagery were used to construct new pre-earthquake and co-seismic landslide inventories. Then, we analyzed the relationship between the conditioning...

10.3390/rs16183360 article EN cc-by Remote Sensing 2024-09-10

As integrated circuits are limited by hardware resources, reducing cost while maintaining performance becomes especially important. In this article, we propose a conflict-free NoC (cfNoC) for the GPGPU request network. The cfNoC eliminates (i) conflicts among different columns by deploying an exclusive subnet for each column, and (ii) conflicts inside the same column by using a token-based mechanism. The elimination of conflicts allows cfNoC to exploit channel widths to maintain cost. Compared with the baseline mesh with 1 VC, our work reduces...

10.1145/2897937.2897963 article EN 2016-05-25

A basic design aspect of cache-coherent Networks-on-Chip (NoCs) is the flow control mechanism. Since the minimum buffer size of virtual cut-through (VCT) switching is larger than that of the wormhole one, VCT is traditionally regarded as an inefficient NoC type. Yet, the scaling of semiconductor technology shrinks the transistor size, and reduces the criticality of buffer amount for NoC designs; VCT may become a promising candidate. This paper performs a comprehensive comparison between VCT and wormhole switching. Based on detailed RTL-level implementations, we...

10.1587/elex.11.20140496 article EN IEICE Electronics Express 2014-01-01

To achieve high throughput, the core count in compute accelerators such as General-Purpose Graphics Processing Units (GPGPUs) increases continuously. The communication demand of these cores boosts the demand for a low-latency packet-switched network. As the network latency is mainly composed of per-hop latency, contention latency and serialization latency, a favorable Network-on-Chip (NoC) design should efficiently decrease these three contributors to meet the demand while keeping hardware cost low. In this paper, we first make two observations about...

10.1109/iccd.2016.7753329 article EN 2016 IEEE 34th International Conference on Computer Design (ICCD) 2016-10-01

Multi-GPU systems are widely used in data centers to provide significant speedups to compute-intensive workloads such as deep neural network training. However, limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that relying on a traditional Round-Robin-based scheduling policy can result in severe bandwidth competition and stall the execution of multiple GPUs. In this article, we propose a priority-based scheduling policy which aims to overlap the data transfers and GPU execution for different applications to alleviate...

10.1109/lca.2019.2955119 article EN IEEE Computer Architecture Letters 2019-07-01

SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance in multimedia applications. However, there are three remaining limitations to the efficient utilization of SIMD devices in general-purpose computer systems: memory alignment, data reorganization and control flow. This paper presents SIF, an interface framework that addresses these shortcomings without modifying the existing ISA. It is designed around a permutation vector register file (PVRF) and it adds new...

10.1109/hpca.2010.5416631 article EN 2010-01-01

To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the novel Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to...

10.1109/tc.2012.201 article EN IEEE Transactions on Computers 2012-08-20