Guohao Dai

ORCID: 0000-0003-0849-3252
Research Areas
  • Advanced Neural Network Applications
  • Advanced Memory and Neural Computing
  • Advanced Graph Neural Networks
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Image and Video Retrieval Techniques
  • Graph Theory and Algorithms
  • Parallel Computing and Optimization Techniques
  • CCD and CMOS Imaging Sensors
  • Topic Modeling
  • Natural Language Processing Techniques
  • Stochastic Gradient Optimization Techniques
  • Tensor decomposition and applications
  • Algorithms and Data Compression
  • Data Management and Algorithms
  • Image Retrieval and Classification Techniques
  • Advanced Data Storage Technologies
  • Speech Recognition and Synthesis
  • Complex Network Analysis Techniques
  • Recommender Systems and Techniques
  • Caching and Content Delivery
  • Machine Learning in Materials Science
  • Cloud Computing and Resource Management
  • Brain Tumor Detection and Classification
  • Advanced Data Processing Techniques
  • Particle Detector Development and Performance

Shanghai Jiao Tong University
2022-2025

Tsinghua University
2014-2023

National Engineering Research Center for Information Technology in Agriculture
2020

The acceleration of Graph Neural Networks (GNNs) requires efficient and framework-compatible Sparse-Dense Matrix-Matrix Multiplication (SpMM). From the compatibility perspective, the sophisticated sparse matrix representations in state-of-the-art SpMM designs cause heavy preprocessing overhead for the framework. From the efficiency perspective, optimizations for SpMV (Sparse Matrix-Vector multiplication) do not apply well to SpMM, leading to redundant and uncoalesced global memory access. We propose GE-SpMM, which takes the CSR format consistent...
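The CSR-based computation the abstract builds on can be illustrated with a minimal row-wise SpMM over the standard CSR arrays (a NumPy sketch of the operation, not the GE-SpMM GPU kernel itself):

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Row-wise SpMM: C = A @ B, with sparse A in CSR form.

    indptr/indices/data follow the standard CSR layout; each output row
    is the weighted sum of the dense rows of B selected by that row's
    nonzeros.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            C[row] += data[k] * B[indices[k]]
    return C

# A = [[1, 0], [2, 3]] in CSR form
indptr = [0, 1, 3]
indices = [0, 0, 1]
data = [1.0, 2.0, 3.0]
B = np.array([[1.0, 1.0], [1.0, 2.0]])
print(spmm_csr(indptr, indices, data, B))  # [[1. 1.] [5. 8.]]
```

Because the kernel consumes CSR directly, a framework can call it on its existing graph representation with no format-conversion preprocessing.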

10.1109/sc41405.2020.00076 article EN 2020-11-01

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLM efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between an LLM's computation/memory overheads and hardware capacity. However, existing GPUs and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency,...
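As a point of reference for the quantization the abstract mentions, a minimal symmetric per-tensor int8 scheme looks like this (a generic sketch, not the paper's compression method):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~ scale * q.

    The scale maps the largest-magnitude weight to 127, so every weight
    is stored in one byte instead of four.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# rounding error per weight is at most scale / 2
print(np.max(np.abs(w - w_hat)))
```

The 4x memory reduction is what narrows the gap to hardware capacity; the hardware challenge the abstract raises is executing such compressed formats efficiently.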

10.1145/3626202.3637562 article EN cc-by 2024-04-01

Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit accelerators only focus on binary or ternary CNNs, which...

10.1145/3316781.3317739 article EN 2019-05-23

Memristor-based neuromorphic computing systems offer alternative solutions to boost the energy efficiency of Neural Network (NN) algorithms. Because of the large scale of applications and the large architecture design space, many factors will affect the accuracy and the system's performance. In this work, we propose a behavior-level modeling tool for memristor-based systems, MNSIM 2.0, to model their performance and help researchers realize an early-stage design space exploration. Compared with the former version and other benchmarks, MNSIM 2.0 has...

10.1145/3386263.3407647 article EN 2020-09-04

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes...

10.48550/arxiv.2404.14294 preprint EN arXiv (Cornell University) 2024-04-22

Sparse Matrix-Matrix Multiplication (SpMM) has served as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because they provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal...
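The auto-tuning idea, selecting a kernel per input rather than statically, can be sketched as benchmarking candidates on the actual data and caching the winner per input class (an illustrative selection loop under assumed interfaces, not the paper's tuner):

```python
import time

def autotune(kernels, sample_args, cache, key):
    """Pick the fastest kernel for one input class and memoize the choice.

    A static choice can lose badly on some inputs, so each candidate is
    timed once on the real data; later calls with the same key reuse the
    cached winner.
    """
    if key not in cache:
        timings = []
        for kernel in kernels:
            start = time.perf_counter()
            kernel(*sample_args)
            timings.append((time.perf_counter() - start, kernel))
        cache[key] = min(timings, key=lambda t: t[0])[1]
    return cache[key]

# two stand-in kernels with very different costs
def fast_kernel(x):
    return sum(x)

def slow_kernel(x):
    time.sleep(0.01)
    return sum(x)

cache = {}
best = autotune([slow_kernel, fast_kernel], ([1, 2, 3],), cache, key="small")
```

A real tuner would key the cache on features such as matrix shape and sparsity pattern rather than a fixed label.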

10.1145/3489517.3530508 article EN Proceedings of the 59th ACM/IEEE Design Automation Conference 2022-07-10

Circuit representation learning has become pivotal in electronic design automation, enabling critical tasks such as testability analysis, logic reasoning, power estimation, and SAT solving. However, existing models face significant challenges scaling to large circuits due to limitations like over-squashing in graph neural networks and the quadratic complexity of transformer-based models. To address these issues, we introduce DeepGate4, a scalable and efficient transformer specifically designed for...

10.48550/arxiv.2502.01681 preprint EN arXiv (Cornell University) 2025-02-02

Video Generation Models (VGMs), as representatives of multi-modal large models, have revolutionized the productivity of video content creation. VGMs are compute-bound due to adopting the Diffusion Transformer (i.e., DiT) structure. Sparsification is a common method for accelerating compute-intensive models. Still, sparse models cannot fully exploit the effective throughput (TOPS) of GPUs. FPGAs are good candidates for accelerating deep learning models. However, existing FPGA accelerators still face low throughput (< 2 TOPS) and a significant gap in peak...

10.1145/3706628.3708864 article EN 2025-02-26

For large language model (LLM) acceleration, FPGAs face two challenges: insufficient peak computing performance and unacceptable accuracy loss from compression. This paper proposes FMC-LLM to enable efficient batched decoding of 70B+ LLMs on FPGAs.

10.1145/3706628.3708863 article EN 2025-02-26

10.1145/3658617.3697645 article EN Proceedings of the 28th Asia and South Pacific Design Automation Conference 2025-01-20

10.1145/3658617.3697692 article EN Proceedings of the 28th Asia and South Pacific Design Automation Conference 2025-01-20

Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR and autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while dataflows with overlapped computation and memory access (e.g., implicit GEMM) are highly performant...
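The gather-GEMM-scatter dataflow mentioned above can be sketched for a 1D submanifold sparse convolution (a NumPy illustration of the dataflow, not a GPU kernel; the pair-building map is an assumed simplification):

```python
import numpy as np

def sparse_conv_ggs(coords, feats, weights):
    """Gather-GEMM-scatter sparse convolution on active sites only.

    coords:  (N,) integer coordinates of active sites
    feats:   (N, Cin) features at those sites
    weights: dict mapping kernel offset -> (Cin, Cout) weight matrix
    Output is defined on the same active sites (submanifold style).
    """
    index = {c: i for i, c in enumerate(coords)}
    Cout = next(iter(weights.values())).shape[1]
    out = np.zeros((len(coords), Cout))
    for off, W in weights.items():
        # build (input, output) pairs for this kernel offset
        pairs = [(index[c - off], i) for i, c in enumerate(coords) if c - off in index]
        if not pairs:
            continue
        src, dst = zip(*pairs)
        gathered = feats[list(src)]   # gather: irregular reads
        partial = gathered @ W        # GEMM: dense, regular compute
        out[list(dst)] += partial     # scatter: irregular accumulate
    return out

coords = [0, 1, 3]
feats = np.array([[1.0], [2.0], [3.0]])
weights = {0: np.array([[1.0]]), 1: np.array([[10.0]])}
print(sparse_conv_ggs(coords, feats, weights))  # [[1.] [12.] [3.]]
```

The separation makes the GEMM trivially fast but forces the gather and scatter through global memory, which is exactly the overhead that overlapped dataflows such as implicit GEMM avoid.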

10.1145/3613424.3614303 article EN cc-by 2023-10-28

FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications. Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way with multiple users sharing a single FPGA, and require re-compilation with ~100 s of overhead. Such designs lead to poor isolation and heavy performance loss for users, which are far away from efficient and flexible FPGA virtualization in either public or private cloud scenarios. To...

10.1109/fccm48280.2020.00023 article EN 2020-05-01

Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup dominates the computational cost of model inference due to its intensive and irregular memory accesses. Applying a resistive random access memory (ReRAM) based processing-in-memory (PIM) architecture to accelerate embedding processing can avoid the data movements caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources,...
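The memory-bound core that dominates this workload is easy to see in isolation: irregular gathers over a large embedding table followed by a cheap pooling reduction (a generic sketch of the operation, not the paper's PIM mapping):

```python
import numpy as np

def embedding_pool(table, indices_per_sample):
    """Sparse embedding lookup with sum pooling.

    Each sample gathers a small, data-dependent set of rows from a large
    table and reduces them; the gathers, not the arithmetic, dominate
    recommendation inference cost.
    """
    return np.array([table[idx].sum(axis=0) for idx in indices_per_sample])

table = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 rows, dim 2
batch = [[0, 3], [5]]
print(embedding_pool(table, batch))  # [[ 6.  8.] [10. 11.]]
```

Because each sample touches arbitrary rows, off-chip bandwidth is wasted on scattered reads; performing the reduction inside the memory arrays is what motivates the PIM approach.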

10.1109/iccad51958.2021.9643573 article EN IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2021-11-01

Graph neural networks (GNNs) have attracted tremendous attention from the graph learning community in recent years. They have been widely adopted in various real-world applications across diverse domains, such as social and biological graphs. The research of deep graph learning presents new challenges, including the sparse nature of graph data, the complicated training of GNNs, and non-standard evaluation tasks. To tackle these issues, we present CogDL, a comprehensive library for graph learning that allows researchers and practitioners to conduct experiments, compare...

10.1145/3543507.3583472 article EN cc-by Proceedings of the ACM Web Conference 2022 2023-04-26

The popularization and application of Cloud Computing have provided a new approach for users to obtain computing resources in recent years. Meanwhile, due to advantages including programmability and power-efficiency, FPGAs have been applied to custom computing in many domains. Previous work has made FPGAs available under the cloud environment. However, effective usage requires efficient online task scheduling: properly assigning as many tasks from different tenants as possible to the FPGAs. In this paper, we propose a benefit-based scheduling...
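A benefit-driven assignment can be sketched as a greedy loop over tasks in decreasing benefit-per-resource order (a hypothetical policy for illustration; the paper's actual algorithm and its benefit metric are not reproduced here):

```python
def schedule(tasks, fpgas):
    """Greedy benefit-based placement sketch.

    tasks: list of (task_id, resource_need, benefit) tuples
    fpgas: dict of fpga_id -> free capacity
    Tasks are placed in decreasing benefit density on the first FPGA
    with enough free capacity; tasks that fit nowhere stay unplaced.
    """
    placement = {}
    free = dict(fpgas)
    for tid, need, benefit in sorted(tasks, key=lambda t: t[2] / t[1], reverse=True):
        for fid in free:
            if need <= free[fid]:
                placement[tid] = fid
                free[fid] -= need
                break
    return placement

tasks = [("a", 6, 12), ("b", 5, 5), ("c", 4, 6)]
print(schedule(tasks, {"f0": 10}))  # {'a': 'f0', 'c': 'f0'}; 'b' does not fit
```

An online scheduler would additionally handle task arrival and completion times; this static version only shows the benefit-density ordering.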

10.1109/fpt.2014.7082811 article EN 2014-12-01

Graph Neural Networks (GNNs) have been widely used in various domains, and GNNs with sophisticated computational graphs lead to higher latency and larger memory consumption. Optimizing the GNN computational graph suffers from: (1) Redundant neural operator computation. The same data are propagated through the graph structure to perform the same neural operation multiple times in GNNs, leading to redundant computation which accounts for 92.4% of total operators. (2) Inconsistent thread mapping. Efficient thread mapping schemes for vertex-centric and edge-centric...

10.48550/arxiv.2110.09524 preprint EN other-oa arXiv (Cornell University) 2021-01-01

The 3D point cloud neural networks, including point-based and voxel-based ones, play an essential role in various applications. Many previous works have proposed dedicated accelerators to speed up point cloud network processing. Yet, two major challenges still exist: (1) Inefficient memory access due to large off-chip data volume. The point-based method visits massive redundant points, while the voxel-based method fails to reuse on-chip voxel data, leading to 983× data volume compared with the original input data. (2) Poor scalability and low computing unit utilization...

10.1109/dac56929.2023.10247806 article EN 2023-07-09

The memristor-based Processing-In-Memory (PIM) architectures have shown great potential to boost the computing energy efficiency of Neural Networks (NNs). Existing work concentrates on hardware architecture design and algorithm-hardware co-optimization, but neglects the non-negligible impact of the correlation between NN models and PIM architectures. To ensure high accuracy and efficiency, it is important to co-design the model and the architecture. However, on one hand, the co-exploration space is extremely tremendous, making...

10.23919/date54114.2022.9774605 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2022-03-14

As a new class of graph embedding algorithms, graph neural networks (GNNs) have been widely used in many fields. However, GNN computing has the characteristics of both sparse graph processing and dense neural network computing, which make it difficult to deploy efficiently on existing sparse accelerators or neural network accelerators. Recently, some GNN accelerators have been proposed, but the following challenges are not fully solved: 1) the minibatch inference scenario has potential for software-hardware co-design, which can bring a 30% reduction in computation amount, but this is not well utilized...

10.1109/tcad.2023.3279302 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2023-05-23