Guohao Dai

ORCID: 0000-0003-0849-3252
Research Areas
  • Advanced Neural Network Applications
  • Advanced Memory and Neural Computing
  • Advanced Graph Neural Networks
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Image and Video Retrieval Techniques
  • Graph Theory and Algorithms
  • Parallel Computing and Optimization Techniques
  • CCD and CMOS Imaging Sensors
  • Topic Modeling
  • Natural Language Processing Techniques
  • Stochastic Gradient Optimization Techniques
  • Tensor decomposition and applications
  • Algorithms and Data Compression
  • Data Management and Algorithms
  • Image Retrieval and Classification Techniques
  • Advanced Data Storage Technologies
  • Speech Recognition and Synthesis
  • Complex Network Analysis Techniques
  • Recommender Systems and Techniques
  • Caching and Content Delivery
  • Machine Learning in Materials Science
  • Cloud Computing and Resource Management
  • Brain Tumor Detection and Classification
  • Advanced Data Processing Techniques
  • Particle Detector Development and Performance

Shanghai Jiao Tong University
2022-2025

Tsinghua University
2014-2023

National Engineering Research Center for Information Technology in Agriculture
2020

The acceleration of Graph Neural Networks (GNNs) requires efficient and framework-compatible Sparse-Dense Matrix-Matrix Multiplication (SpMM). From the compatibility perspective, the sophisticated sparse matrix representations in state-of-the-art SpMM designs cause heavy preprocessing overhead for the framework. From the efficiency perspective, optimizations for SpMV (Sparse Matrix-Vector multiplication) do not apply well to SpMM, leading to redundant and uncoalesced global memory access. We propose GE-SpMM, which takes the CSR format consistent...
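The CSR-based computation the abstract builds on can be illustrated with a minimal row-wise SpMM over the standard CSR arrays (a NumPy sketch of the operation, not the GE-SpMM GPU kernel itself):

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Row-wise SpMM: C = A @ B, with sparse A in CSR form.

    indptr/indices/data follow the standard CSR layout; each output row
    is the weighted sum of the dense rows of B selected by that row's
    nonzeros.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            C[row] += data[k] * B[indices[k]]
    return C

# A = [[1, 0], [2, 3]] in CSR form
indptr = [0, 1, 3]
indices = [0, 0, 1]
data = [1.0, 2.0, 3.0]
B = np.array([[1.0, 1.0], [1.0, 2.0]])
print(spmm_csr(indptr, indices, data, B))  # [[1. 1.] [5. 8.]]
```

Because the kernel consumes CSR directly, a framework can call it on its existing graph representation with no format-conversion preprocessing.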

10.1109/sc41405.2020.00076 article EN 2020-11-01

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLM efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between an LLM's computation/memory overheads and hardware capacity. However, existing GPUs and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency,...
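As a point of reference for the quantization the abstract mentions, a minimal symmetric per-tensor int8 scheme looks like this (a generic sketch, not the paper's compression method):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~ scale * q.

    The scale maps the largest-magnitude weight to 127, so every weight
    is stored in one byte instead of four.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# rounding error per weight is at most scale / 2
print(np.max(np.abs(w - w_hat)))
```

The 4x memory reduction is what narrows the gap to hardware capacity; the hardware challenge the abstract raises is executing such compressed formats efficiently.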

10.1145/3626202.3637562 article EN cc-by 2024-04-01

Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit accelerators only focus on binary or ternary CNNs, which...

10.1145/3316781.3317739 article EN 2019-05-23

Memristor-based neuromorphic computing systems offer alternative solutions to boost the energy efficiency of Neural Network (NN) algorithms. Because of the large scale of applications and the large architecture design space, many factors will affect the accuracy and the system's performance. In this work, we propose a behavior-level modeling tool for memristor-based systems, MNSIM 2.0, to model their performance and help researchers realize an early-stage design space exploration. Compared with the former version and other benchmarks, MNSIM 2.0 has...

10.1145/3386263.3407647 article EN 2020-09-04

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes...

10.48550/arxiv.2404.14294 preprint EN arXiv (Cornell University) 2024-04-22

Sparse Matrix-Matrix Multiplication (SpMM) has served as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because they provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal...
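The auto-tuning idea, selecting a kernel per input rather than statically, can be sketched as benchmarking candidates on the actual data and caching the winner per input class (an illustrative selection loop under assumed interfaces, not the paper's tuner):

```python
import time

def autotune(kernels, sample_args, cache, key):
    """Pick the fastest kernel for one input class and memoize the choice.

    A static choice can lose badly on some inputs, so each candidate is
    timed once on the real data; later calls with the same key reuse the
    cached winner.
    """
    if key not in cache:
        timings = []
        for kernel in kernels:
            start = time.perf_counter()
            kernel(*sample_args)
            timings.append((time.perf_counter() - start, kernel))
        cache[key] = min(timings, key=lambda t: t[0])[1]
    return cache[key]

# two stand-in kernels with very different costs
def fast_kernel(x):
    return sum(x)

def slow_kernel(x):
    time.sleep(0.01)
    return sum(x)

cache = {}
best = autotune([slow_kernel, fast_kernel], ([1, 2, 3],), cache, key="small")
```

A real tuner would key the cache on features such as matrix shape and sparsity pattern rather than a fixed label.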

10.1145/3489517.3530508 article EN Proceedings of the 59th ACM/IEEE Design Automation Conference 2022-07-10

Circuit representation learning has become pivotal in electronic design automation, enabling critical tasks such as testability analysis, logic reasoning, power estimation, and SAT solving. However, existing models face significant challenges scaling to large circuits due to limitations like over-squashing in graph neural networks and the quadratic complexity of transformer-based models. To address these issues, we introduce DeepGate4, a scalable and efficient transformer specifically designed for...

10.48550/arxiv.2502.01681 preprint EN arXiv (Cornell University) 2025-02-02

Video Generation Models (VGMs), as representatives of multi-modal large models, have revolutionized the productivity of video content creation. VGMs are compute-bound due to adopting the Diffusion Transformer (i.e., DiT) structure. Sparsification is a common method for accelerating compute-intensive models. Still, sparse models cannot fully exploit the effective throughput (TOPS) of GPUs. FPGAs are good candidates for accelerating deep learning models. However, existing FPGA accelerators still face low throughput (< 2 TOPS) and a significant gap in peak...

10.1145/3706628.3708864 article EN 2025-02-26

For large language model (LLM) acceleration, FPGAs face two challenges: insufficient peak computing performance and unacceptable accuracy loss from compression. This paper proposes FMC-LLM to enable efficient batched decoding of 70B+ LLMs on FPGAs.

10.1145/3706628.3708863 article EN 2025-02-26

10.1145/3658617.3697645 article EN Proceedings of the 28th Asia and South Pacific Design Automation Conference 2025-01-20

10.1145/3658617.3697692 article EN Proceedings of the 28th Asia and South Pacific Design Automation Conference 2025-01-20

Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR and autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while dataflows with overlapped computation and memory access (e.g., implicit GEMM) are highly performant...
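The gather-GEMM-scatter dataflow mentioned above can be sketched for a 1D submanifold sparse convolution (a NumPy illustration of the dataflow, not a GPU kernel; the pair-building map is an assumed simplification):

```python
import numpy as np

def sparse_conv_ggs(coords, feats, weights):
    """Gather-GEMM-scatter sparse convolution on active sites only.

    coords:  (N,) integer coordinates of active sites
    feats:   (N, Cin) features at those sites
    weights: dict mapping kernel offset -> (Cin, Cout) weight matrix
    Output is defined on the same active sites (submanifold style).
    """
    index = {c: i for i, c in enumerate(coords)}
    Cout = next(iter(weights.values())).shape[1]
    out = np.zeros((len(coords), Cout))
    for off, W in weights.items():
        # build (input, output) pairs for this kernel offset
        pairs = [(index[c - off], i) for i, c in enumerate(coords) if c - off in index]
        if not pairs:
            continue
        src, dst = zip(*pairs)
        gathered = feats[list(src)]   # gather: irregular reads
        partial = gathered @ W        # GEMM: dense, regular compute
        out[list(dst)] += partial     # scatter: irregular accumulate
    return out

coords = [0, 1, 3]
feats = np.array([[1.0], [2.0], [3.0]])
weights = {0: np.array([[1.0]]), 1: np.array([[10.0]])}
print(sparse_conv_ggs(coords, feats, weights))  # [[1.] [12.] [3.]]
```

The separation makes the GEMM trivially fast but forces the gather and scatter through global memory, which is exactly the overhead that overlapped dataflows such as implicit GEMM avoid.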

10.1145/3613424.3614303 article EN cc-by 2023-10-28

FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications. Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way with multiple users sharing a single FPGA, and require re-compilation with ~100 s of overhead. Such designs lead to poor isolation and heavy performance loss for users, which are far away from efficient and flexible FPGA virtualization in either public or private cloud scenarios. To...

10.1109/fccm48280.2020.00023 article EN 2020-05-01

Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup dominates the computational cost of model inference due to its intensive and irregular memory accesses. Applying a resistive random access memory (ReRAM) based processing-in-memory (PIM) architecture to accelerate embedding processing can avoid the data movements caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources,...
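The memory-bound core that dominates this workload is easy to see in isolation: irregular gathers over a large embedding table followed by a cheap pooling reduction (a generic sketch of the operation, not the paper's PIM mapping):

```python
import numpy as np

def embedding_pool(table, indices_per_sample):
    """Sparse embedding lookup with sum pooling.

    Each sample gathers a small, data-dependent set of rows from a large
    table and reduces them; the gathers, not the arithmetic, dominate
    recommendation inference cost.
    """
    return np.array([table[idx].sum(axis=0) for idx in indices_per_sample])

table = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 rows, dim 2
batch = [[0, 3], [5]]
print(embedding_pool(table, batch))  # [[ 6.  8.] [10. 11.]]
```

Because each sample touches arbitrary rows, off-chip bandwidth is wasted on scattered reads; performing the reduction inside the memory arrays is what motivates the PIM approach.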

10.1109/iccad51958.2021.9643573 article EN IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2021-11-01

Graph neural networks (GNNs) have attracted tremendous attention from the graph learning community in recent years. They have been widely adopted in various real-world applications across diverse domains, such as social and biological graphs. The research of deep graph learning presents new challenges, including the sparse nature of graph data, the complicated training of GNNs, and non-standard evaluation tasks. To tackle these issues, we present CogDL, a comprehensive library for graph learning that allows researchers and practitioners to conduct experiments, compare...

10.1145/3543507.3583472 article EN cc-by Proceedings of the ACM Web Conference 2022 2023-04-26

The popularization and application of Cloud Computing have provided a new approach for users to obtain computing resources in recent years. Meanwhile, due to advantages including programmability and power-efficiency, FPGAs have been applied to custom computing in many domains. Previous work has made FPGAs available under the cloud environment. However, effective usage requires efficient online task scheduling: properly assigning as many tasks from different tenants as possible to the FPGAs. In this paper, we propose a benefit-based scheduling...
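A benefit-driven assignment can be sketched as a greedy loop over tasks in decreasing benefit-per-resource order (a hypothetical policy for illustration; the paper's actual algorithm and its benefit metric are not reproduced here):

```python
def schedule(tasks, fpgas):
    """Greedy benefit-based placement sketch.

    tasks: list of (task_id, resource_need, benefit) tuples
    fpgas: dict of fpga_id -> free capacity
    Tasks are placed in decreasing benefit density on the first FPGA
    with enough free capacity; tasks that fit nowhere stay unplaced.
    """
    placement = {}
    free = dict(fpgas)
    for tid, need, benefit in sorted(tasks, key=lambda t: t[2] / t[1], reverse=True):
        for fid in free:
            if need <= free[fid]:
                placement[tid] = fid
                free[fid] -= need
                break
    return placement

tasks = [("a", 6, 12), ("b", 5, 5), ("c", 4, 6)]
print(schedule(tasks, {"f0": 10}))  # {'a': 'f0', 'c': 'f0'}; 'b' does not fit
```

An online scheduler would additionally handle task arrival and completion times; this static version only shows the benefit-density ordering.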

10.1109/fpt.2014.7082811 article EN 2014-12-01

Graph Neural Networks (GNNs) have been widely used in various domains, and GNNs with sophisticated computational graphs lead to higher latency and larger memory consumption. Optimizing the GNN computational graph suffers from: (1) Redundant neural operator computation. The same data are propagated through the graph structure to perform the same neural operation multiple times in GNNs, leading to redundant computation which accounts for 92.4% of total operators. (2) Inconsistent thread mapping. Efficient thread mapping schemes for vertex-centric and edge-centric...

10.48550/arxiv.2110.09524 preprint EN other-oa arXiv (Cornell University) 2021-01-01

The 3D point cloud neural networks, including point-based and voxel-based ones, play an essential role in various applications. Many previous works have proposed dedicated accelerators to speed up point cloud network processing. Yet, two major challenges still exist: (1) Inefficient memory access due to large off-chip data volume. The point-based method visits massive redundant points, while the voxel-based method fails to reuse on-chip voxel data, leading to 983× data volume compared with the original input data. (2) Poor scalability and low computing unit utilization...

10.1109/dac56929.2023.10247806 article EN 2023-07-09

The memristor-based Processing-In-Memory (PIM) architectures have shown great potential to boost the computing energy efficiency of Neural Networks (NNs). Existing work concentrates on hardware architecture design and algorithm-hardware co-optimization, but neglects the non-negligible impact of the correlation between NN models and PIM architectures. To ensure high accuracy and efficiency, it is important to co-design the model and the architecture. However, on one hand, the co-exploration space is extremely tremendous, making...

10.23919/date54114.2022.9774605 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2022-03-14

As a new class of graph embedding algorithms, graph neural networks (GNNs) have been widely used in many fields. However, GNN computing has the characteristics of both sparse graph processing and dense neural network computing, which make it difficult to deploy efficiently on existing sparse accelerators or neural network accelerators. Recently, some GNN accelerators have been proposed, but the following challenges are not fully solved: 1) the minibatch inference scenario has potential for software-hardware co-design, which can bring a 30% reduction in computation amount, but this is not well utilized...

10.1109/tcad.2023.3279302 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2023-05-23