- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Low-power high-performance VLSI design
- Brain Tumor Detection and Classification
- Distributed and Parallel Computing Systems
- Data Mining Algorithms and Applications
- Ferroelectric and Negative Capacitance Devices
- Advancements in Battery Materials
- Data Management and Algorithms
- VLSI and FPGA Design Techniques
- Numerical Methods and Algorithms
- Software-Defined Networks and 5G
- CCD and CMOS Imaging Sensors
- 3D IC and TSV technologies
- Graphene research and applications
- Cloud Computing and Resource Management
- Radiation Effects in Electronics
- Network Packet Processing and Optimization
- Adversarial Robustness in Machine Learning
- Time Series Analysis and Forecasting
- Landslides and related hazards
National University of Defense Technology
2015-2024
Shenyang Institute of Engineering
2024
China University of Geosciences (Beijing)
2023-2024
University of Macau
2022
Changsha University
2021
Harbin Engineering University
2019
Centre for High Performance Computing
2016
University of Toronto
2012
IBM Research - Thomas J. Watson Research Center
2002-2003
IBM (United States)
2002-2003
Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods and (2) temporal...
With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing congestion propagation to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application...
Temporal data mining aims at finding patterns in historical data. Our work proposes an approach to extract temporal patterns from data to predict the occurrence of target events, such as computer attacks on host networks, or fraudulent transactions in financial institutions. Our problem formulation exhibits two major challenges: 1) we assume events being characterized by categorical features and displaying uneven inter-arrival times; such an assumption falls outside the scope of classical time-series analysis, 2) we assume target events are highly...
Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that...
Short and long packets co-exist in cache-coherent NoCs. Existing designs for torus networks do not efficiently handle variable-size packets. For deadlock-free operations, a design uses two VCs, which negatively affects the router frequency. Some optimizations use one VC. Yet, they regard all packets as maximum-length packets, inefficiently utilizing the precious buffers. We propose flit bubble flow control (FBFC), which maintains one free flit-size buffer slot to avoid deadlock. FBFC uses one VC, and does not treat short packets as long ones. It...
Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before...
Binary64 arithmetic is rapidly becoming inadequate to cope with today's large-scale computations due to an accumulation of errors. Therefore, binary128 arithmetic is now required to increase the accuracy and reliability of these computations. At the same time, an obvious trend emerging in modern processors is to extend their instruction sets by allowing single instruction multiple data (SIMD) execution, which can significantly accelerate data-parallel applications. To address the combined demands mentioned above, this paper presents...
Routing algorithms for cache-coherent NoCs only have limited VCs at their disposal, which poses challenges to the design of routing algorithms. Existing fully adaptive routing algorithms apply conservative VC re-allocation: only empty VCs can be re-allocated, which limits performance. We propose two novel flow control designs. First, whole packet forwarding (WPF) re-allocates a nonempty VC if the VC has enough free buffers for an entire packet. WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. It is an important...
The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array decreases sharply. The main reason...
Compact convolutional neural networks have become a hot research topic. However, we find that the systolic array accelerators are extremely inefficient in dealing with compact models, especially when processing depthwise convolutional layers in the neural networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple modes of dataflow, which can further exploit data reuse opportunities without changing the scale or structure of the naïve systolic array. By increasing...
High-quality random numbers are critical to many fields such as cryptography, finance, and scientific simulation, which calls for the design of reliable true random number generators (TRNGs). Limited by entropy source, throughput, reliability, and system integration, existing TRNG designs are difficult to deploy in real computing systems to greatly accelerate target applications. This study proposes a TRNG circuit named resilient high-speed (RHS)-TRNG based on spin-transfer torque magnetic tunnel junction...
Landslide susceptibility prediction (LSP) is the basis for risk management and plays an important role in social sustainability. However, the modeling process of LSP is constrained by various factors. This paper examines the effect of landslide data integrity, machine-learning (ML) models, and non-landslide sample-selection methods on the accuracy of LSP, taking the Yinghu Lake Basin in Ankang City, Shaanxi Province, as an example. First, the previous landslide inventory (totaling 46) and the updated landslide inventory (totaling 46 + 176) were established through...
Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieve artificial intelligence (AI). The former have been widely used in academia and industry fields; the latter, SNNs, are more similar to biological neural networks and can realize ultra-low power consumption, and have thus received widespread research attention. However, due to their fundamental differences in computation formula and information coding, the two methods often require different and incompatible platforms. Alongside the development of AI, a platform that...
The eastern slope of Gongga Mountain is located in the mountainous region of Southwestern China, which has strong geologic tectonics that lead to frequent landslide hazards. A large number of such landslides were induced by the 2022 Luding Ms 6.8 earthquake. Therefore, it is necessary to identify the spatial distribution of landslides in the region. In this paper, the Google Earth platform and GF-1 and GF-6 satellite imagery were used to newly map pre-earthquake and co-seismic landslides. Then, we analyzed the relationship between the conditioning...
As integrated circuits are limited by hardware resources, reducing cost while maintaining performance becomes especially important. In this article, we propose a conflict-free NoC (cfNoC) for the GPGPU request network. The cfNoC eliminates (i) conflicts among different columns by deploying an exclusive subnet for each column, and (ii) conflicts inside the same column by using a token-based mechanism. The elimination of conflicts allows the cfNoC to exploit channel widths to maintain performance while reducing cost. Compared with the baseline mesh with 1 VC, our work reduces...
A basic design aspect of cache-coherent Networks-on-Chip (NoCs) is the flow control mechanism. Since the minimum buffer size of virtual cut-through (VCT) switching is larger than that of the wormhole one, VCT is traditionally regarded as an inefficient NoC flow control type. Yet, the scaling of semiconductor technology shrinks the transistor size and reduces the criticality of the buffer amount for NoC designs; VCT may become a promising candidate. This paper performs a comprehensive comparison between VCT and wormhole switching. Based on detailed RTL-level implementations, we...
To achieve high throughput, the core count in compute accelerators such as General-Purpose Graphics Processing Units (GPGPUs) increases continuously. The communication demand of these cores boosts the demand for a low-latency packet-switched network. As the packet latency is mainly composed of per-hop latency, contention latency, and serialization latency, a favorable Network-on-Chip (NoC) design should efficiently decrease these three contributors to meet the demand while keeping the hardware cost low. In this paper, we first make two observations about...
Multi-GPU systems are widely used in data centers to provide significant speedups to compute-intensive workloads such as deep neural network training. However, limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that relying on a traditional Round-Robin-based PCIe scheduling policy can result in severe bandwidth competition and stall the execution of multiple GPUs. In this article, we propose a priority-based scheduling policy, which aims to overlap the data transfers and GPU execution for different applications to alleviate...
SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance in multimedia applications. However, there are three remaining limitations to the efficient utilization of SIMD devices in general-purpose computer systems: memory alignment, data reorganization, and control flow. This paper presents SIF, an interface framework that addresses these shortcomings without modifying the existing ISA. It is designed around a permutation vector register file (PVRF) and it adds new...
To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the novel Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidated workloads. In the process of this design, we holistically consider all aspects to...