Chunshu Wu

ORCID: 0009-0006-2039-0853
Research Areas
  • Advanced Graph Neural Networks
  • Parallel Computing and Optimization Techniques
  • Advanced Neural Network Applications
  • Nanopore and Nanochannel Transport Studies
  • Advanced Memory and Neural Computing
  • Graph Theory and Algorithms
  • Ferroelectric and Negative Capacitance Devices
  • Adversarial Robustness in Machine Learning
  • Interconnection Networks and Systems
  • Quantum-Dot Cellular Automata
  • Quantum Computing Algorithms and Architecture
  • Low-power high-performance VLSI design
  • Machine Learning in Materials Science
  • Embedded Systems Design Techniques
  • Human Pose and Action Recognition
  • Radiation Effects in Electronics
  • Indoor and Outdoor Localization Technologies
  • Advanced Computational Techniques and Applications
  • Facial Trauma and Fracture Management
  • Gait Recognition and Analysis
  • Software-Defined Networks and 5G
  • Economic theories and models
  • Neural Networks and Reservoir Computing
  • Human Mobility and Location-Based Analysis
  • Data Management and Algorithms

University of Rochester
2024-2025

Boston University
2019-2024

Dalian University of Technology
2014

Deep learning systems have been successfully applied to Euclidean data such as images, video, and audio. In many applications, however, information and its relationships are better expressed with graphs. Graph Convolutional Networks (GCNs) appear to be a promising approach to efficiently learn from graph structures, having shown advantages in critical applications. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this...

10.1109/micro50266.2020.00079 preprint EN 2020-10-01

Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is just as critical but even more challenging. The hurdles arise from poor data locality and redundant computation caused by the large size, high sparsity, and irregular non-zero distribution of real-world graphs.

10.1145/3466752.3480113 article EN 2021-10-17
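
As context for the GCN acceleration work in the two entries above, the following is a minimal reference formulation of a single GCN layer, X' = ReLU(A_hat · X · W), written in Python with SciPy sparse matrices. It only illustrates the sparse-times-dense workload whose irregular non-zero distribution the abstracts describe; the names and sizes are illustrative and are not taken from the papers.

# Reference formulation of one GCN layer, X' = ReLU(A_hat @ X @ W).
# Illustrative only; not code from the papers above.
import numpy as np
import scipy.sparse as sp

def gcn_layer(adj: sp.csr_matrix, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One GCN propagation step; `adj` would normally be the normalized adjacency A_hat."""
    # Dense feature transform first (a regular dense GEMM) ...
    transformed = feats @ weight
    # ... then sparse aggregation over the graph; the non-zero pattern of
    # `adj` is what is highly irregular and unbalanced in real-world graphs.
    aggregated = adj @ transformed
    return np.maximum(aggregated, 0.0)  # ReLU

# Toy usage: a 5-node star graph with strong degree imbalance.
rows = np.array([0, 0, 0, 0, 1, 2, 3, 4])
cols = np.array([1, 2, 3, 4, 0, 0, 0, 0])
vals = np.ones(len(rows), dtype=np.float32)
adj = sp.csr_matrix((vals, (rows, cols)), shape=(5, 5))
x = np.random.rand(5, 8).astype(np.float32)
w = np.random.rand(8, 4).astype(np.float32)
print(gcn_layer(adj, x, w).shape)  # (5, 4)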

The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted either of proof-of-concept implementations of components, usually the range-limited force; of full systems, but with much of the work shared by the host CPU; or of prototype demonstrations, e.g., using OpenCL, that neither implement a whole system nor have competitive performance. In this paper, we present what we believe to be the first full-scale FPGA-based simulation engine, and show that its...

10.1145/3295500.3356179 article EN 2019-11-07

Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from the all-to-all communication bottleneck, which limits scalability.

10.1145/3577193.3593724 article EN 2023-06-20
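
The all-to-all bottleneck mentioned above comes from model-parallel embedding tables: each node owns a few tables, but every node's batch needs rows from all of them. The following toy, single-process simulation (assumed for illustration, not the system described in the paper) shows why every step implies an all-to-all exchange of looked-up embeddings.

# Toy illustration of sharded DLRM embedding lookups forcing an all-to-all.
# Names and sizes are hypothetical; not code from the paper above.
import numpy as np

NUM_NODES, TABLES_PER_NODE, ROWS, DIM = 4, 2, 100, 8
rng = np.random.default_rng(0)

# Each node holds its own shard of the embedding tables (model parallelism).
shards = [
    [rng.standard_normal((ROWS, DIM)) for _ in range(TABLES_PER_NODE)]
    for _ in range(NUM_NODES)
]

def all_to_all_lookup(batch_indices):
    """batch_indices[node][table_id] -> row ids requested by that node's batch."""
    # Every node sends index lists to the table owners, owners look up rows,
    # and a reverse all-to-all returns the embeddings to the requesters.
    out = [[None] * (NUM_NODES * TABLES_PER_NODE) for _ in range(NUM_NODES)]
    for requester in range(NUM_NODES):
        for owner in range(NUM_NODES):
            for t in range(TABLES_PER_NODE):
                table_id = owner * TABLES_PER_NODE + t
                idx = batch_indices[requester][table_id]
                out[requester][table_id] = shards[owner][t][idx]  # "wire" transfer
    return out

batch = [[rng.integers(0, ROWS, size=16) for _ in range(NUM_NODES * TABLES_PER_NODE)]
         for _ in range(NUM_NODES)]
emb = all_to_all_lookup(batch)
print(len(emb), emb[0][0].shape)  # 4 nodes, (16, 8) embeddings per table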

Binarized Neural Networks (BNNs), which significantly reduce computational complexity and memory demand, have shown potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is sufficient and real-time performance is highly desired. In this article, we demonstrate that the highly condensed BNN model can be shrunk by dynamically pruning irregular redundant edges. Based on two new observations of BNN-specific properties, an out-of-order (OoO) architecture,...

10.1109/tpds.2020.3013637 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2020-08-03
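
For reference, the kernel that makes BNNs so cheap is the XNOR-popcount dot product over {-1, +1} values packed as bits. The sketch below shows that standard BNN arithmetic in plain Python; it is background for the article above, not the out-of-order pruning architecture it proposes.

# Minimal XNOR-popcount sketch of the core BNN dot product (standard BNN
# arithmetic, not code from the article above).
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    """Map real values to {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def pack(v: np.ndarray) -> int:
    """Pack a {-1, +1} vector into an integer bit mask (+1 -> bit set)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def bnn_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as integers."""
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")  # XNOR + popcount
    return 2 * matches - n  # matches contribute +1, mismatches -1

a = binarize(np.random.randn(32))
w = binarize(np.random.randn(32))
assert bnn_dot(pack(a), pack(w), 32) == int(np.dot(a, w))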

High inference latency seriously limits the deployment of DNNs in real-time domains such as autonomous driving, robotic control, and many others. To address this emerging challenge, researchers have proposed approximate DNNs with reduced precision, e.g., Binarized Neural Networks (BNNs). While BNNs can be built with little loss of accuracy, the latency reduction still has much room for improvement. In this paper, we propose a single-FPGA-based BNN accelerator that achieves microsecond-level ultra-low latency on ImageNet,...

10.1109/asap.2019.00-43 article EN 2019-07-01

Binarized Neural Networks (BNNs) have drawn tremendous attention due to their significantly reduced computational complexity and memory demand. They have especially shown great potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired.

10.1145/3330345.3330386 article EN 2019-06-18

In the last decade, Artificial Intelligence (AI) through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Many types of DNNs have been and continue to be developed, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). The overall problem for all of these neural networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, while also having strict accuracy requirements. There have been many previous efforts in...

10.1109/hpec49654.2021.9622877 article EN 2021-09-20

Some communication switches, e.g., the Mellanox SHArP and those in IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages such as P4, has emerged as a possible candidate. P4-based processors fall short for certain applications, including machine learning, where the required capabilities are not...

10.1145/3577193.3593739 article EN 2023-06-20

Conducting long-timescale simulations of small molecules using Molecular Dynamics (MD) is crucial in drug design. However, traditional methods to accelerate the process, including ASICs or GPUs, have limitations. ASIC solutions are not always generally available, while GPUs may not scale well when processing small molecules. FPGAs are both communication processors and accelerators, with tight coupling between these capabilities, and so could be used to address strong scaling in this domain.

10.1145/3581784.3607100 article EN 2023-10-30

Long-timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. To accelerate a data set that is executed for a large number of iterations, high efficiency is required. Recent work in this domain has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors. The problem addressed here is that, as the number of on-chip processors is increased from fewer than 10 into the hundreds, previous intra-chip routing solutions are no longer viable. We...

10.1109/tc.2024.3375613 article EN IEEE Transactions on Computers 2024-03-14

Quantized Neural Networks (QNNs) have drawn tremendous attention since, when compared with Convolutional Neural Networks (CNNs), they often dramatically reduce computation, communication, and storage demands with negligible loss in accuracy. To find an optimal balance between performance and accuracy, developers use different data widths for different layers and channels. Given this large parameter space, it is challenging to design a QNN accelerator that is generally efficient for various flexible model configurations. In this paper we...

10.1109/hpec43674.2020.9286194 article EN 2020-09-22
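
The per-layer data-width flexibility the abstract refers to can be pictured with ordinary symmetric uniform quantization, sketched below in Python. This is the generic technique, with hypothetical layer names and bit widths, not the accelerator design from the paper.

# Illustrative per-layer symmetric uniform quantization; generic technique,
# not code from the paper above.
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Quantize x to signed `bits`-bit integers with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Different layers can trade accuracy for cost with different widths.
layer_weights = {"conv1": np.random.randn(64, 3, 3, 3),
                 "conv2": np.random.randn(128, 64, 3, 3)}
layer_bits = {"conv1": 8, "conv2": 4}  # hypothetical per-layer widths
for name, w in layer_weights.items():
    q, s = quantize(w, layer_bits[name])
    err = np.mean(np.abs(w - dequantize(q, s)))
    print(f"{name}: {layer_bits[name]}-bit, mean abs error {err:.4f}")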

With the current pandemic, the central role that Molecular Dynamics simulation (MD) plays in drug discovery makes advances in MD performance urgent. Recent work has demonstrated that among COTS devices only FPGA-centric clusters can scale beyond a few processors for relevant targets; other work has shown that a single FPGA compares favorably to a GPU. In this study we demonstrate that an additional factor of 4× can be achieved, which results in a 5× speedup over the GPU. The problem addressed is that the designs of the last decade are no longer viable when the number...

10.1109/fccm51124.2021.00024 article EN 2021-05-01

FPGA-accelerated molecular dynamics (MD) research dates back almost two decades and is still being actively studied. MD on FPGA clusters, however, is in its initial phase, with only small systems built and limited performance studies. Given the cost of building an accelerator and (as we show) the number of plausible architectures, a thorough study is needed. In particular, we investigate both FPGA and GPU/FPGA hybrid clusters. The latter are potentially attractive given the broad availability of GPU clusters and the use of GPUs for MD,...

10.1109/hpec49654.2021.9622838 article EN 2021-09-20

FPGA-based SmartNICs offer great potential to significantly improve the performance of high-performance computing and warehouse data processing by tightly coupling support for reconfigurable data-intensive computation with cross-node communication, thereby mitigating the von Neumann bottleneck. Existing work, however, has generally been limited in that it assumes an accelerator model where kernels are offloaded and most control tasks are left to CPUs. This leads to frequent waiting and poses scaling challenges. In...

10.1109/fpl57034.2022.00071 article EN 2022-08-01

Molecular Dynamics simulation (MD) has been considered a promising FPGA application for many years, especially with clusters of tightly coupled FPGAs, where the large-scale, general-purpose, low-latency interconnects provide a communication capability not available in any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that direct replication...

10.1109/hpec43674.2020.9286146 article EN 2020-09-22

Deep learning systems have been successfully applied to Euclidean data such as images, video, and audio. In many applications, however, information and its relationships are better expressed with graphs. Graph Convolutional Networks (GCNs) appear to be a promising approach to efficiently learn from graph structures, having shown advantages in critical applications. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this...

10.48550/arxiv.1908.10834 preprint EN other-oa arXiv (Cornell University) 2019-01-01

In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing pair-wise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can result in greatly reduced data movement and computation. Despite FPGAs having a distinct compute model (BRAMs/network/pipelines) from CPUs and ASICs, such mappings have not previously...

10.1109/fpl57034.2022.00026 article EN 2022-08-01
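
As a plain software reference for the constraints named above, the sketch below computes range-limited Lennard-Jones forces with a cut-off radius and applies Newton's Third Law so each pair is evaluated once. It is a direct O(N^2) version for clarity; it is not the FPGA particle-to-memory mapping the paper studies, and production codes use cell lists for the same mathematics.

# Reference range-limited pair forces with cut-off and Newton's Third Law.
# Illustrative only; not code from the paper above.
import numpy as np

def lj_forces(pos: np.ndarray, cutoff: float, eps: float = 1.0, sigma: float = 1.0):
    """O(N^2) Lennard-Jones forces; cell lists would restrict the pair search."""
    n = len(pos)
    forces = np.zeros_like(pos)
    cutoff2 = cutoff * cutoff
    for i in range(n):
        for j in range(i + 1, n):          # each pair visited exactly once
            dr = pos[i] - pos[j]
            r2 = float(dr @ dr)
            if r2 >= cutoff2 or r2 == 0.0:  # range-limited: skip distant pairs
                continue
            inv_r2 = sigma * sigma / r2
            inv_r6 = inv_r2 ** 3
            f = 24.0 * eps * (2.0 * inv_r6 * inv_r6 - inv_r6) / r2 * dr
            forces[i] += f                  # Newton's Third Law: equal and
            forces[j] -= f                  # opposite force on the partner
    return forces

pos = np.random.rand(64, 3) * 5.0
print(lj_forces(pos, cutoff=2.5).shape)  # (64, 3)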

Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find that there is a great acceleration opportunity through further augmentation to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first are fully-fused collectives, built by fusing multiple existing collectives like Allreduce and Alltoall. The second are semi-fused collectives, combining a collective with another function; the third are...

10.1145/3650200.3656616 article EN cc-by 2024-05-30
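
To make the fusion idea concrete, the toy single-process sketch below contrasts an unfused "reduce, then compute" sequence with a variant where the elementwise computation rides along with the reduction pass. It only illustrates the concept of combining communication with computation; it is not the in-switch design evaluated in the paper.

# Toy contrast of unfused vs. fused reduce+compute (conceptual illustration
# only; not code from the paper above).
import numpy as np

ranks = [np.random.rand(1024).astype(np.float32) for _ in range(8)]

def allreduce_then_scale(bufs, alpha):
    """Unfused baseline: one full reduction pass, then a second scaling pass."""
    total = np.zeros_like(bufs[0])
    for b in bufs:
        total += b
    return [alpha * total for _ in bufs]      # second pass over the data

def fused_allreduce_scale(bufs, alpha):
    """Fused variant: the scaling is folded into the reduction pass."""
    total = np.zeros_like(bufs[0])
    for b in bufs:
        total += alpha * b                    # compute rides with the reduce
    return [total.copy() for _ in bufs]

a = allreduce_then_scale(ranks, 0.5)
b = fused_allreduce_scale(ranks, 0.5)
assert np.allclose(a[0], b[0], atol=1e-5)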

10.1109/tpds.2024.3434347 article EN IEEE Transactions on Parallel and Distributed Systems 2024-07-26

Network communication is increasingly becoming the performance bottleneck for scaled-out HPC and warehouse applications, as enormous CPU processing is devoted to packet handling, contributing to long latencies. To reduce this latency, advanced network interface cards known as SmartNICs have been introduced to handle networking functions. Dozens of commercial FPGA-based SmartNICs have been released (e.g., [1]–[3]; see surveys [4], [5]). Others have been developed with the aim of near-network computing [6]–[9]. There is prior art that uses...

10.1109/fccm53951.2022.9786193 article EN 2022-05-15