- Advanced Graph Neural Networks
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Nanopore and Nanochannel Transport Studies
- Advanced Memory and Neural Computing
- Graph Theory and Algorithms
- Ferroelectric and Negative Capacitance Devices
- Adversarial Robustness in Machine Learning
- Interconnection Networks and Systems
- Quantum-Dot Cellular Automata
- Quantum Computing Algorithms and Architecture
- Low-power high-performance VLSI design
- Machine Learning in Materials Science
- Embedded Systems Design Techniques
- Human Pose and Action Recognition
- Radiation Effects in Electronics
- Indoor and Outdoor Localization Technologies
- Advanced Computational Techniques and Applications
- Facial Trauma and Fracture Management
- Gait Recognition and Analysis
- Software-Defined Networks and 5G
- Economic theories and models
- Neural Networks and Reservoir Computing
- Human Mobility and Location-Based Analysis
- Data Management and Algorithms
University of Rochester
2024-2025
Boston University
2019-2024
Dalian University of Technology
2014
Deep learning systems have been successfully applied to Euclidean data such as images, video, and audio. In many applications, however, information and its relationships are better expressed with graphs. Graph Convolutional Networks (GCNs) appear to be a promising approach to learning efficiently from graph structures, having shown advantages in critical applications. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this...
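For reference, the propagation rule that such accelerators target is the standard graph-convolution layer H' = σ(Â H W), where Â is the normalized adjacency matrix with self-loops. The NumPy sketch below illustrates that rule on a toy graph; the sizes and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of one GCN layer, H' = ReLU(A_hat @ H @ W), where A_hat is the
# symmetrically normalized adjacency matrix with self-loops (standard rule).
# Graph size, feature widths, and names are illustrative only.

def normalize_adjacency(A):
    """Symmetric normalization: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One graph-convolution layer: aggregate neighbor features, then transform."""
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU

# Toy graph: 4 nodes, 3 input features, 2 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(gcn_layer(normalize_adjacency(A), H, W))
```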
Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is equally critical but even more challenging. The hurdles arise from poor data locality and redundant computation due to the large size, high sparsity, and irregular non-zero distribution of real-world graphs.
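The irregular non-zero distribution can be made concrete by comparing the busiest adjacency row with the average one. The short sketch below uses a synthetic heavy-tailed degree distribution (a Zipf draw, an assumption made only for illustration) to show why statically assigning one row per processing element leaves most of them idle.

```python
import numpy as np

# Illustration of irregular non-zero distribution in a graph adjacency matrix:
# draw per-node degrees from a heavy-tailed (Zipf-like) law, then compare the
# densest row with the average row. Parameters are arbitrary, for illustration.
rng = np.random.default_rng(0)
num_nodes = 10_000
degrees = np.minimum(rng.zipf(a=2.0, size=num_nodes), num_nodes - 1)

avg = degrees.mean()
peak = degrees.max()
print(f"average non-zeros per row: {avg:.1f}")
print(f"maximum non-zeros per row: {peak}")
print(f"imbalance (max / avg): {peak / avg:.1f}x")
# A PE array that statically assigns one row per PE sits mostly idle while the
# PE holding the densest row finishes -- hence the need for workload rebalancing.
```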
The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted of either proof-of-concept implementations of components, usually the range-limited force; full systems, but with much of the work shared by the host CPU; or prototype demonstrations, e.g., using OpenCL, that neither implement a whole system nor have competitive performance. In this paper, we present what we believe to be the first full-scale FPGA-based simulation engine, and show that its...
Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from an all-to-all communication bottleneck, which limits scalability.
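The bottleneck arises because embedding tables are sharded across devices (model parallel) while the dense part runs data parallel, so every iteration each device must exchange looked-up embedding rows with every other device. The following single-process NumPy model only mimics that exchange pattern; the table sizes, batch split, and variable names are invented for illustration and no real communication library is used.

```python
import numpy as np

# Toy model of the DLRM all-to-all pattern: each of P "devices" owns one
# embedding table (model parallel), but every device needs the lookups for its
# own mini-batch shard (data parallel), so lookup results are exchanged
# all-to-all each iteration. Sizes are invented for illustration.
P, batch_per_dev, dim = 4, 8, 16
tables = [np.random.rand(1000, dim) for _ in range(P)]             # one table per device
indices = np.random.randint(0, 1000, size=(P, P, batch_per_dev))   # [owner][requester][batch]

# Step 1: each device looks up the rows requested by every other device.
send = [[tables[owner][indices[owner][req]] for req in range(P)] for owner in range(P)]

# Step 2: all-to-all exchange -- device `req` gathers one chunk from every owner.
recv = [[send[owner][req] for owner in range(P)] for req in range(P)]

# After the exchange each device holds, for its own batch shard, one embedding
# vector from every table; this traffic grows with P and limits scaling.
per_device = [np.stack(chunks, axis=1) for chunks in recv]          # (batch, P, dim)
print(per_device[0].shape)
```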
Binarized Neural Networks (BNN), which significantly reduce computational complexity and memory demand, have shown potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired. In this article, we demonstrate that the highly-condensed BNN model can be shrunk significantly by dynamically pruning irregular redundant edges. Based on two new observations of BNN-specific properties, an out-of-order (OoO) architecture,...
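For context, a binarized dot product is an XNOR followed by a popcount compared against a threshold, so one way to view "pruning redundant edges" is to stop evaluating inputs once the partial popcount can no longer change the comparison. The sketch below shows that generic threshold-based early-termination idea in plain Python; it is not the paper's OoO architecture.

```python
import numpy as np

# Binarized neuron: output = +1 if popcount(xnor(x, w)) > threshold, else -1.
# Early-exit "pruning": stop scanning inputs once the outcome is already decided,
# i.e., the remaining inputs cannot flip the comparison. Generic illustration only.
def bnn_neuron_early_exit(x_bits, w_bits, threshold):
    n = len(x_bits)
    matches = 0
    for i, (x, w) in enumerate(zip(x_bits, w_bits)):
        matches += int(x == w)                 # XNOR + popcount, one bit at a time
        remaining = n - (i + 1)
        if matches > threshold:                # already above threshold: output is +1
            return +1, i + 1
        if matches + remaining <= threshold:   # can never exceed threshold: output is -1
            return -1, i + 1
    return (+1 if matches > threshold else -1), n

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=256)
w = rng.integers(0, 2, size=256)
out, evaluated = bnn_neuron_early_exit(x, w, threshold=128)
print(f"output={out}, inputs actually evaluated: {evaluated} of 256")
```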
High inference latency seriously limits the deployment of DNNs in real-time domains such as autonomous driving, robotic control, and many others. To address this emerging challenge, researchers have proposed approximate DNNs with reduced precision, e.g., Binarized Neural Networks (BNNs). While BNNs can be built with little loss in accuracy, the latency reduction still has much room for improvement. In this paper, we propose a single-FPGA-based BNN accelerator that achieves microsecond-level ultra-low-latency inference on ImageNet,...
Binarized Neural Networks (BNN) have drawn tremendous attention due to their significantly reduced computational complexity and memory demand. They have especially shown great potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired.
In the last decade, Artificial Intelligence (AI) through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Many types of DNNs have been and continue to be developed, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput while also having strict accuracy requirements. There have been many previous efforts in...
Some communication switches, e.g., the Mellanox SHArP and those in IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages such as P4, has emerged as a possible candidate. P4-based processors, however, fall short for certain applications, including machine learning, where such capabilities are not...
Conducting long-timescale simulations of small molecules using Molecular Dynamics (MD) is crucial in drug design. However, traditional methods to accelerate the process, including ASICs or GPUs, have limitations. ASIC solutions are not always generally available, while GPUs may not scale well when processing small molecules. FPGAs are both communication processors and accelerators, with tight coupling between these capabilities, and so could be used to address strong scaling in this domain.
Long-timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. To accelerate a data set that is executed for a large number of iterations, high efficiency is required. Recent work in this domain has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors. The problem addressed here is that, as the number of on-chip processors has increased from fewer than 10 into the hundreds, previous intra-chip routing solutions are no longer viable. We...
Quantized Neural Networks (QNNs) have drawn tremendous attention since, when compared with Convolutional Neural Networks (CNNs), they often dramatically reduce computation, communication, and storage demands with negligible loss in accuracy. To find an optimal balance between performance and accuracy, developers use different data-widths for different layers and channels. Given this large parameter space, it is challenging to design a QNN accelerator that is generally efficient for various flexible model configurations. In this paper we...
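To make the per-layer data-width choice concrete, the sketch below applies plain uniform symmetric quantization to the same tensor at several bit widths; the scaling scheme and the particular widths are assumptions for illustration, not the configurations studied in the paper.

```python
import numpy as np

# Uniform symmetric quantization of a tensor to `bits` bits. Each layer or
# channel may pick a different width, which is the "large parameter space"
# that a flexible QNN accelerator must cover.
def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1                  # e.g., 7 for 4-bit signed
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int32), scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(2)
activations = rng.standard_normal(1024)
for bits in (8, 4, 2):                          # example per-layer width choices
    q, s = quantize(activations, bits)
    err = np.abs(dequantize(q, s) - activations).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
```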
With the current pandemic, the central role that Molecular Dynamics simulation (MD) plays in drug discovery makes advances in MD performance urgent. Recent work has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors for relevant targets; other work has shown that a single FPGA compares favorably to a GPU. In this study we demonstrate that an additional factor of 4× can be achieved, resulting in a 5× speedup overall. The problem addressed is that designs of the last decade are no longer viable when the number...
FPGA-accelerated molecular dynamics (MD) research dates back almost two decades and is still being actively studied. MD on FPGA clusters, however, is in its initial phase, with only small systems built and limited performance studies. Given the cost of building an accelerator cluster (as we show) and the number of plausible architectures, a thorough study is needed. In particular, we investigate both FPGA-only and GPU/FPGA hybrid clusters. The latter are potentially attractive given the broad availability of GPU clusters and the use of GPUs for MD,...
FPGA-based SmartNICs offer great potential to significantly improve the performance of high-performance computing and warehouse data processing by tightly coupling support for reconfigurable data-intensive computation with cross-node communication, thereby mitigating the von Neumann bottleneck. Existing work, however, has generally been limited in that it assumes an accelerator model where kernels are offloaded and most control tasks are left to CPUs. This leads to frequent waiting and reduced scaling, among other challenges. In...
Molecular Dynamics simulation (MD) has been thought a promising FPGA application for many years, especially with clusters of tightly coupled FPGAs, where the large-scale, general-purpose, low-latency interconnects provide communication capability not available in any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that direct replication...
In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing pair-wise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can result in greatly reduced data movement and computation. Despite FPGAs having a distinct compute model (BRAMs/network/pipelines) from CPUs and ASICs, such mappings have not previously...
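The constraints mentioned here are concrete: only pairs within the cut-off radius interact, and Newton's Third Law lets each pair be evaluated once with equal and opposite forces applied to both particles. The plain-Python cell-list sketch below demonstrates that pair evaluation with a toy force law; it implies no particular mapping to BRAMs, networks, or pipelines.

```python
import numpy as np
from itertools import product

# Range-limited pair evaluation with a cell list, a cut-off radius, and
# Newton's Third Law (each pair evaluated once, force applied to both).
# Toy force law and parameters; illustrative only.
rng = np.random.default_rng(3)
N, box, rc = 500, 10.0, 1.5
pos = rng.uniform(0, box, size=(N, 3))
forces = np.zeros_like(pos)

ncell = int(box // rc)                          # cells at least rc wide
cell_of = np.clip(np.floor(pos / (box / ncell)).astype(int), 0, ncell - 1)
cells = {}
for i, c in enumerate(map(tuple, cell_of)):
    cells.setdefault(c, []).append(i)

for c, members in cells.items():
    # Scan this cell and its 26 neighbors (no periodic wrap, for brevity).
    for dc in product((-1, 0, 1), repeat=3):
        nb = tuple(c[k] + dc[k] for k in range(3))
        for i in members:
            for j in cells.get(nb, []):
                if j <= i:                      # each pair counted exactly once
                    continue
                r = pos[i] - pos[j]
                d2 = float(r @ r)
                if d2 < rc * rc:                # cut-off radius
                    f = r / d2                  # toy repulsive force
                    forces[i] += f              # Newton's Third Law:
                    forces[j] -= f              # equal and opposite
print("net force (should be ~0):", forces.sum(axis=0))
```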
Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find that there is a great acceleration opportunity in further augmenting switches to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first are fully-fused collectives, built by fusing multiple existing collectives such as Allreduce and Alltoall. The second are semi-fused functions, combining a collective with another operation; the third are...
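To illustrate what fusing collectives buys: composing an Allreduce with an Alltoall as two separate steps moves the full reduced buffer to every rank first, whereas the composed result can be produced directly from per-rank chunks. The single-process NumPy model below checks that equivalence; it is only an illustration of the combined semantics, not the switch design proposed in the paper.

```python
import numpy as np

# Single-process model of fusing two collectives. `bufs[r]` is rank r's input.
# Unfused: Allreduce (every rank gets the full sum) followed by Alltoall
# (rank r's d-th output chunk comes from rank d's r-th input chunk).
# Fused: compute rank r's final result directly from per-rank chunks, so far
# less data needs to traverse the network/switch.
P, chunk = 4, 3
rng = np.random.default_rng(4)
bufs = [rng.integers(0, 10, size=P * chunk) for _ in range(P)]

def allreduce(bufs):
    s = np.sum(bufs, axis=0)
    return [s.copy() for _ in range(P)]

def alltoall(bufs):
    chunks = [np.split(b, P) for b in bufs]       # chunks[src][dst]
    return [np.concatenate([chunks[src][dst] for src in range(P)]) for dst in range(P)]

def fused_allreduce_alltoall(bufs):
    chunks = [np.split(b, P) for b in bufs]
    out = []
    for r in range(P):
        reduced_r = np.sum([chunks[src][r] for src in range(P)], axis=0)
        out.append(np.tile(reduced_r, P))         # same result, single pass
    return out

assert all(np.array_equal(a, b)
           for a, b in zip(alltoall(allreduce(bufs)), fused_allreduce_alltoall(bufs)))
print("fused and unfused results match")
```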
Network communication is increasingly becoming the performance bottleneck for scaled-out HPC and warehouse applications, as enormous CPU processing is devoted to packet processing, contributing to long latencies. To reduce this latency, advanced network interface cards known as SmartNICs have been introduced to handle networking functions. Dozens of commercial FPGA-based SmartNICs have been released (e.g., [1]–[3]; see surveys [4], [5]). Others have also been developed with the aim of near-network computing [6]–[9]. There is prior art that uses...