- Advanced Graph Neural Networks
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Nanopore and Nanochannel Transport Studies
- Advanced Memory and Neural Computing
- Graph Theory and Algorithms
- Ferroelectric and Negative Capacitance Devices
- Adversarial Robustness in Machine Learning
- Interconnection Networks and Systems
- Quantum-Dot Cellular Automata
- Quantum Computing Algorithms and Architecture
- Low-power high-performance VLSI design
- Machine Learning in Materials Science
- Embedded Systems Design Techniques
- Human Pose and Action Recognition
- Radiation Effects in Electronics
- Indoor and Outdoor Localization Technologies
- Advanced Computational Techniques and Applications
- Facial Trauma and Fracture Management
- Gait Recognition and Analysis
- Software-Defined Networks and 5G
- Economic theories and models
- Neural Networks and Reservoir Computing
- Human Mobility and Location-Based Analysis
- Data Management and Algorithms
University of Rochester
2024-2025
Boston University
2019-2024
Dalian University of Technology
2014
Deep learning systems have been successfully applied to Euclidean data such as images, video, and audio. In many applications, however, information and its relationships are better expressed with graphs. Graph Convolutional Networks (GCNs) appear to be a promising approach to learning efficiently from graph structures, having shown advantages in critical applications. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this...
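For reference, the propagation rule that such accelerators target is the standard graph-convolution layer H' = σ(Â H W), where Â is the normalized adjacency matrix with self-loops. The NumPy sketch below illustrates that rule on a toy graph; the sizes and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of one GCN layer, H' = ReLU(A_hat @ H @ W), where A_hat is the
# symmetrically normalized adjacency matrix with self-loops (standard rule).
# Graph size, feature widths, and names are illustrative only.

def normalize_adjacency(A):
    """Symmetric normalization: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One graph-convolution layer: aggregate neighbor features, then transform."""
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU

# Toy graph: 4 nodes, 3 input features, 2 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(gcn_layer(normalize_adjacency(A), H, W))
```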
Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is equally critical but even more challenging. The hurdles arise from poor data locality and redundant computation due to the large size, high sparsity, and irregular non-zero distribution of real-world graphs.
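The irregular non-zero distribution can be made concrete by comparing the busiest adjacency row with the average one. The short sketch below uses a synthetic heavy-tailed degree distribution (a Zipf draw, an assumption made only for illustration) to show why statically assigning one row per processing element leaves most of them idle.

```python
import numpy as np

# Illustration of irregular non-zero distribution in a graph adjacency matrix:
# draw per-node degrees from a heavy-tailed (Zipf-like) law, then compare the
# densest row with the average row. Parameters are arbitrary, for illustration.
rng = np.random.default_rng(0)
num_nodes = 10_000
degrees = np.minimum(rng.zipf(a=2.0, size=num_nodes), num_nodes - 1)

avg = degrees.mean()
peak = degrees.max()
print(f"average non-zeros per row: {avg:.1f}")
print(f"maximum non-zeros per row: {peak}")
print(f"imbalance (max / avg): {peak / avg:.1f}x")
# A PE array that statically assigns one row per PE sits mostly idle while the
# PE holding the densest row finishes -- hence the need for workload rebalancing.
```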
The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted of either proof-of-concept implementations of components, usually the range-limited force; full systems, but with much of the work shared by the host CPU; or prototype demonstrations, e.g., using OpenCL, that neither implement a whole system nor have competitive performance. In this paper, we present what we believe to be the first full-scale FPGA-based simulation engine, and show that its...
Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from an all-to-all communication bottleneck, which limits scalability.
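The bottleneck arises because embedding tables are sharded across devices (model parallel) while the dense part runs data parallel, so every iteration each device must exchange looked-up embedding rows with every other device. The following single-process NumPy model only mimics that exchange pattern; the table sizes, batch split, and variable names are invented for illustration and no real communication library is used.

```python
import numpy as np

# Toy model of the DLRM all-to-all pattern: each of P "devices" owns one
# embedding table (model parallel), but every device needs the lookups for its
# own mini-batch shard (data parallel), so lookup results are exchanged
# all-to-all each iteration. Sizes are invented for illustration.
P, batch_per_dev, dim = 4, 8, 16
tables = [np.random.rand(1000, dim) for _ in range(P)]             # one table per device
indices = np.random.randint(0, 1000, size=(P, P, batch_per_dev))   # [owner][requester][batch]

# Step 1: each device looks up the rows requested by every other device.
send = [[tables[owner][indices[owner][req]] for req in range(P)] for owner in range(P)]

# Step 2: all-to-all exchange -- device `req` gathers one chunk from every owner.
recv = [[send[owner][req] for owner in range(P)] for req in range(P)]

# After the exchange each device holds, for its own batch shard, one embedding
# vector from every table; this traffic grows with P and limits scaling.
per_device = [np.stack(chunks, axis=1) for chunks in recv]          # (batch, P, dim)
print(per_device[0].shape)
```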
Binarized Neural Networks (BNN), which significantly reduce computational complexity and memory demand, have shown potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired. In this article, we demonstrate that the highly-condensed BNN model can be shrunk significantly by dynamically pruning irregular redundant edges. Based on two new observations of BNN-specific properties, an out-of-order (OoO) architecture,...
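For context, a binarized dot product is an XNOR followed by a popcount compared against a threshold, so one way to view "pruning redundant edges" is to stop evaluating inputs once the partial popcount can no longer change the comparison. The sketch below shows that generic threshold-based early-termination idea in plain Python; it is not the paper's OoO architecture.

```python
import numpy as np

# Binarized neuron: output = +1 if popcount(xnor(x, w)) > threshold, else -1.
# Early-exit "pruning": stop scanning inputs once the outcome is already decided,
# i.e., the remaining inputs cannot flip the comparison. Generic illustration only.
def bnn_neuron_early_exit(x_bits, w_bits, threshold):
    n = len(x_bits)
    matches = 0
    for i, (x, w) in enumerate(zip(x_bits, w_bits)):
        matches += int(x == w)                 # XNOR + popcount, one bit at a time
        remaining = n - (i + 1)
        if matches > threshold:                # already above threshold: output is +1
            return +1, i + 1
        if matches + remaining <= threshold:   # can never exceed threshold: output is -1
            return -1, i + 1
    return (+1 if matches > threshold else -1), n

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=256)
w = rng.integers(0, 2, size=256)
out, evaluated = bnn_neuron_early_exit(x, w, threshold=128)
print(f"output={out}, inputs actually evaluated: {evaluated} of 256")
```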
High inference latency seriously limits the deployment of DNNs in real-time domains such as autonomous driving, robotic control, and many others. To address this emerging challenge, researchers have proposed approximate DNNs with reduced precision, e.g., Binarized Neural Networks (BNNs). While BNNs can be built with little loss in accuracy, the latency reduction still has much room for improvement. In this paper, we propose a single-FPGA-based BNN accelerator that achieves microsecond-level ultra-low-latency inference on ImageNet,...
Binarized Neural Networks (BNN) have drawn tremendous attention due to their significantly reduced computational complexity and memory demand. They have especially shown great potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired.
In the last decade, Artificial Intelligence (AI) through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Many types of DNNs have been and continue to be developed, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput while also having strict accuracy requirements. There have been many previous efforts in...
Some communication switches, e.g., the Mellanox SHArP and those in IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages such as P4, has emerged as a possible candidate. P4-based processors, however, fall short for certain applications, including machine learning, where such capabilities are not...
Conducting long-timescale simulations of small molecules using Molecular Dynamics (MD) is crucial in drug design. However, traditional methods to accelerate the process, including ASICs or GPUs, have limitations. ASIC solutions are not always generally available, while GPUs may not scale well when processing small molecules. FPGAs are both communication processors and accelerators, with tight coupling between these capabilities, and so could be used to address strong scaling in this domain.
Long-timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. To accelerate a data set that is executed for a large number of iterations, high efficiency is required. Recent work in this domain has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors. The problem addressed here is that, as the number of on-chip processors has increased from fewer than 10 into the hundreds, previous intra-chip routing solutions are no longer viable. We...
Quantized Neural Networks (QNNs) have drawn tremendous attention since, when compared with Convolutional Neural Networks (CNNs), they often dramatically reduce computation, communication, and storage demands with negligible loss in accuracy. To find an optimal balance between performance and accuracy, developers use different data-widths for different layers and channels. Given this large parameter space, it is challenging to design a QNN accelerator that is generally efficient for various flexible model configurations. In this paper we...
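To make the per-layer data-width choice concrete, the sketch below applies plain uniform symmetric quantization to the same tensor at several bit widths; the scaling scheme and the particular widths are assumptions for illustration, not the configurations studied in the paper.

```python
import numpy as np

# Uniform symmetric quantization of a tensor to `bits` bits. Each layer or
# channel may pick a different width, which is the "large parameter space"
# that a flexible QNN accelerator must cover.
def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1                  # e.g., 7 for 4-bit signed
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int32), scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(2)
activations = rng.standard_normal(1024)
for bits in (8, 4, 2):                          # example per-layer width choices
    q, s = quantize(activations, bits)
    err = np.abs(dequantize(q, s) - activations).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
```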
With the current pandemic, the central role that Molecular Dynamics simulation (MD) plays in drug discovery makes advances in MD performance urgent. Recent work has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors for relevant targets; other work has shown that a single FPGA compares favorably to a GPU. In this study we demonstrate that an additional factor of 4× can be achieved, resulting in a 5× speedup overall. The problem addressed is that designs of the last decade are no longer viable when the number...
FPGA-accelerated molecular dynamics (MD) research dates back almost two decades and is still being actively studied. MD on FPGA clusters, however, is in its initial phase, with only small systems built and limited performance studies. Given the cost of building an accelerator cluster (as we show) and the number of plausible architectures, a thorough study is needed. In particular, we investigate both FPGA-only and GPU/FPGA hybrid clusters. The latter are potentially attractive given the broad availability of GPU clusters and the use of GPUs for MD,...
FPGA-based SmartNICs offer great potential to significantly improve the performance of high-performance computing and warehouse data processing by tightly coupling support for reconfigurable data-intensive computation with cross-node communication, thereby mitigating the von Neumann bottleneck. Existing work, however, has generally been limited in that it assumes an accelerator model where kernels are offloaded and most control tasks are left to CPUs. This leads to frequent waiting and reduced scaling, among other challenges. In...
Molecular Dynamics simulation (MD) has been thought a promising FPGA application for many years, especially with clusters of tightly coupled FPGAs, where the large-scale, general-purpose, low-latency interconnects provide communication capability not available in any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that direct replication...
In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing pair-wise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can result in greatly reduced data movement and computation. Despite FPGAs having a distinct compute model (BRAMs/network/pipelines) from CPUs and ASICs, such mappings have not previously...
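The constraints mentioned here are concrete: only pairs within the cut-off radius interact, and Newton's Third Law lets each pair be evaluated once with equal and opposite forces applied to both particles. The plain-Python cell-list sketch below demonstrates that pair evaluation with a toy force law; it implies no particular mapping to BRAMs, networks, or pipelines.

```python
import numpy as np
from itertools import product

# Range-limited pair evaluation with a cell list, a cut-off radius, and
# Newton's Third Law (each pair evaluated once, force applied to both).
# Toy force law and parameters; illustrative only.
rng = np.random.default_rng(3)
N, box, rc = 500, 10.0, 1.5
pos = rng.uniform(0, box, size=(N, 3))
forces = np.zeros_like(pos)

ncell = int(box // rc)                          # cells at least rc wide
cell_of = np.clip(np.floor(pos / (box / ncell)).astype(int), 0, ncell - 1)
cells = {}
for i, c in enumerate(map(tuple, cell_of)):
    cells.setdefault(c, []).append(i)

for c, members in cells.items():
    # Scan this cell and its 26 neighbors (no periodic wrap, for brevity).
    for dc in product((-1, 0, 1), repeat=3):
        nb = tuple(c[k] + dc[k] for k in range(3))
        for i in members:
            for j in cells.get(nb, []):
                if j <= i:                      # each pair counted exactly once
                    continue
                r = pos[i] - pos[j]
                d2 = float(r @ r)
                if d2 < rc * rc:                # cut-off radius
                    f = r / d2                  # toy repulsive force
                    forces[i] += f              # Newton's Third Law:
                    forces[j] -= f              # equal and opposite
print("net force (should be ~0):", forces.sum(axis=0))
```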
Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find that there is a great acceleration opportunity in further augmenting switches to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first are fully-fused collectives, built by fusing multiple existing collectives such as Allreduce and Alltoall. The second are semi-fused functions, combining a collective with another operation; the third are...
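To illustrate what fusing collectives buys: composing an Allreduce with an Alltoall as two separate steps moves the full reduced buffer to every rank first, whereas the composed result can be produced directly from per-rank chunks. The single-process NumPy model below checks that equivalence; it is only an illustration of the combined semantics, not the switch design proposed in the paper.

```python
import numpy as np

# Single-process model of fusing two collectives. `bufs[r]` is rank r's input.
# Unfused: Allreduce (every rank gets the full sum) followed by Alltoall
# (rank r's d-th output chunk comes from rank d's r-th input chunk).
# Fused: compute rank r's final result directly from per-rank chunks, so far
# less data needs to traverse the network/switch.
P, chunk = 4, 3
rng = np.random.default_rng(4)
bufs = [rng.integers(0, 10, size=P * chunk) for _ in range(P)]

def allreduce(bufs):
    s = np.sum(bufs, axis=0)
    return [s.copy() for _ in range(P)]

def alltoall(bufs):
    chunks = [np.split(b, P) for b in bufs]       # chunks[src][dst]
    return [np.concatenate([chunks[src][dst] for src in range(P)]) for dst in range(P)]

def fused_allreduce_alltoall(bufs):
    chunks = [np.split(b, P) for b in bufs]
    out = []
    for r in range(P):
        reduced_r = np.sum([chunks[src][r] for src in range(P)], axis=0)
        out.append(np.tile(reduced_r, P))         # same result, single pass
    return out

assert all(np.array_equal(a, b)
           for a, b in zip(alltoall(allreduce(bufs)), fused_allreduce_alltoall(bufs)))
print("fused and unfused results match")
```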
Network communication is increasingly becoming the performance bottleneck for scaled-out HPC and warehouse applications, as enormous CPU processing is devoted to packet processing, contributing to long latencies. To reduce this latency, advanced network interface cards known as SmartNICs have been introduced to handle networking functions. Dozens of commercial FPGA-based SmartNICs have been released (e.g., [1]–[3]; see surveys [4], [5]). Others have also been developed with the aim of near-network computing [6]–[9]. There is prior art that uses...