- Parallel Computing and Optimization Techniques
- Advanced Memory and Neural Computing
- Advanced Data Storage Technologies
- Ferroelectric and Negative Capacitance Devices
- Brain Tumor Detection and Classification
- Advanced Neural Network Applications
- Network Packet Processing and Optimization
- Tensor Decomposition and Applications
- Neural Networks and Applications
- Advanced Vision and Imaging
- Semiconductor Materials and Devices
- Distributed and Parallel Computing Systems
- Advanced Image and Video Retrieval Techniques
- Image and Signal Denoising Methods
- Algorithms and Data Compression
- Generative Adversarial Networks and Image Synthesis
- Manufacturing Process and Optimization
- Advanced Data Compression Techniques
- Semiconductor Materials and Interfaces
- Industrial Vision Systems and Defect Detection
- Analog and Mixed-Signal Circuit Design
- Smart Grid Security and Resilience
- Integrated Circuits and Semiconductor Failure Analysis
- Embedded Systems Design Techniques
- Elevator Systems and Control
Tsinghua University
2019-2025
Global Energy Interconnection Research Institute North America
2020
This work presents a 65nm CMOS speech recognition processor, named Thinker-IM, which employs 16 computing-in-memory (SRAM-CIM) macros for binarized recurrent neural network (RNN) computation. Its major contributions are: 1) A novel digital-CIM mixed architecture that runs an output-weight dual stationary (OWDS) dataflow, reducing memory accesses by 85.7%; 2) Multi-bit XNOR SRAM-CIM and corresponding CIM-aware weight adaptation, reducing energy consumption by 9.9% on average; 3) Predictive early...
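For context on binarized RNN arithmetic, the sketch below shows how a multiply-and-accumulate over ±1 weights and activations reduces to an XNOR plus popcount, which is the class of operation SRAM-CIM macros parallelize; the bit-packing helper and vector sizes are illustrative, not taken from Thinker-IM.

```python
import numpy as np

def binarize(x):
    """Map real values to {+1, -1}, stored as bits (1 -> +1, 0 -> -1)."""
    return (x >= 0).astype(np.uint8)

def xnor_popcount_mac(w_bits, x_bits):
    """Binarized dot product: sum of w_i * x_i with w, x in {+1, -1}.

    XNOR gives 1 where the signs agree; popcount counts agreements, and
    the signed result is 2 * (#agreements) - vector length.
    """
    agree = np.logical_not(np.logical_xor(w_bits, x_bits))
    return 2 * int(agree.sum()) - len(w_bits)

# Illustrative check against the full-precision +/-1 dot product.
rng = np.random.default_rng(0)
w = rng.standard_normal(256)
x = rng.standard_normal(256)
ref = int(np.sign(w) @ np.sign(x))
assert xnor_popcount_mac(binarize(w), binarize(x)) == ref
```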
Computing-in-memory (CIM) improves energy efficiency by enabling parallel multiply-and-accumulate (MAC) operations and reducing memory accesses [1-4]. However, today's typical neural networks (NNs) usually exceed on-chip capacity, so a CIM-based processor may encounter a bottleneck [5]. Tensor-train (TT) is a tensor decomposition method that decomposes a d-dimensional tensor into d 4D tensor-cores (TCs: G...
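As a reference for the TT format mentioned above, here is a minimal NumPy sketch of tensor-train decomposition by sequential truncated SVD (the TT-SVD idea); the tensor shape and rank cap are arbitrary illustrations, not the configuration used in this work.

```python
import numpy as np

def tt_decompose(tensor, max_rank):
    """Decompose a d-dimensional tensor into d TT-cores via sequential SVD.

    Core k has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1. A truncated SVD
    caps every intermediate TT-rank at max_rank.
    """
    shape = tensor.shape
    d = len(shape)
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        mat = (np.diag(s[:r]) @ vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT-cores back into the full tensor (for checking only)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=(out.ndim - 1, 0))
    return out.squeeze(axis=(0, -1))

# With ranks left untruncated, the reconstruction is exact up to float error.
rng = np.random.default_rng(1)
t = rng.standard_normal((4, 4, 4, 4))
cores = tt_decompose(t, max_rank=16)
assert np.allclose(tt_reconstruct(cores), t)
```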
Diffusion models (DMs) have emerged as a powerful category of generative models with record-breaking performance in image synthesis [1]. A noisy image created from pure Gaussian random variables needs to be denoised by iterative DM inference to ensure quality. For DMs, quantizing activations to integers (INT) degrades quality due to changes in activation distributions and the accumulation of quantization errors across iterations. A GPU (Nvidia A100) requires 2560 ms and 250 W to generate a $256 \times 256$ image through 50 iterations...
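To make the "accumulation of quantization errors across iterations" concrete, here is a toy loop (not a real diffusion model) that runs the same iterative map in full precision and with naively fake-quantized INT8 activations, tracking how the two trajectories drift apart over 50 steps; the layer, nonlinearity, and sizes are arbitrary illustrations.

```python
import numpy as np

def fake_quant(x, n_bits=8):
    """Uniform symmetric fake-quantization of activations to n_bits integers."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale if scale > 0 else x

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)) / np.sqrt(64)   # toy "denoiser" layer
x_fp = rng.standard_normal(64)                     # start from pure Gaussian noise
x_q = x_fp.copy()

drift = []
for step in range(50):                             # 50 iterations, as in the abstract
    x_fp = np.tanh(w @ x_fp)
    x_q = np.tanh(fake_quant(w @ x_q))             # quantize the pre-activation
    drift.append(np.linalg.norm(x_fp - x_q))       # divergence of the two trajectories

print(f"error after 1 step: {drift[0]:.4f}, after 50 steps: {drift[-1]:.4f}")
```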
Computing-in-memory (CIM) is an attractive approach for energy-efficient deep neural network (DNN) processing, especially for low-power edge devices. However, today's typical DNNs usually exceed CIM static random access memory (SRAM) capacity. The resulting off-chip communication offsets the benefits of the CIM technique, meaning that such processors still encounter a bottleneck. To eliminate this bottleneck, we propose a processor, called TT@CIM, which applies the tensor-train decomposition (TTD) method to...
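To see why TTD relieves the on-chip capacity problem, the snippet below works out the parameter count of an illustrative 4096x4096 fully-connected weight stored as four TT-matrix (MPO) cores with 8x8 mode sizes and all internal TT-ranks equal to 8, roughly an 1800x reduction; the shapes and ranks are chosen for illustration and are not TT@CIM's configuration.

```python
import numpy as np

# Illustrative only: a dense 4096x4096 weight vs. its TT-matrix storage.
modes_in, modes_out = [8, 8, 8, 8], [8, 8, 8, 8]
ranks = [1, 8, 8, 8, 1]

dense_params = np.prod(modes_in) * np.prod(modes_out)           # 4096 * 4096
tt_params = sum(ranks[k] * modes_in[k] * modes_out[k] * ranks[k + 1]
                for k in range(len(modes_in)))                   # 9216
print(dense_params, tt_params, dense_params / tt_params)         # ~1820x smaller
```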
Rapidly expanding artificial intelligence (AI) models for complex AI tasks drive high energy-efficiency and high-precision requirements for AI processors [1–6]. Floating-point CIM (FP-CIM) is a promising technique to improve energy efficiency while maintaining accuracy. However, FP-CIM with FP32/FP16/BF16 suffers from a performance bottleneck due to its large storage and considerable MAC power. The emerging POSIT data format, exploiting a dynamic bit width that adapts to varied data distributions, can use low bit widths to achieve nearly the...
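For readers unfamiliar with the POSIT format, the sketch below decodes an 8-bit posit (assuming es = 2 exponent bits) into a float, exposing the sign / run-length regime / exponent / fraction fields whose dynamically shared widths give posits their adaptive precision; this is a generic software decode for illustration, not the FP-CIM datapath.

```python
def decode_posit8(bits: int, es: int = 2) -> float:
    """Decode an 8-bit posit with es exponent bits into a Python float."""
    n = 8
    if bits == 0:
        return 0.0
    if bits == 0x80:                       # sign bit set, all others zero -> NaR
        return float("nan")
    sign = bits >> (n - 1)
    if sign:                               # negative posits are two's complements
        bits = (-bits) & 0xFF
    body = bits & 0x7F
    # Regime: run of identical bits after the sign, ended by the opposite bit.
    first = (body >> (n - 2)) & 1
    run = 0
    for i in range(n - 2, -1, -1):
        if (body >> i) & 1 == first:
            run += 1
        else:
            break
    regime = run - 1 if first else -run
    consumed = 1 + run + 1                 # sign + regime run + terminating bit
    remaining = max(n - consumed, 0)
    rest = body & ((1 << remaining) - 1) if remaining else 0
    # Exponent: up to es bits; missing low bits are treated as zero.
    exp_bits = min(es, remaining)
    exponent = (rest >> (remaining - exp_bits)) if exp_bits else 0
    exponent <<= (es - exp_bits)
    # Fraction: whatever bits are left, with an implied leading 1.
    frac_bits = remaining - exp_bits
    fraction = rest & ((1 << frac_bits) - 1) if frac_bits else 0
    frac_val = 1.0 + fraction / (1 << frac_bits) if frac_bits else 1.0
    useed = 1 << (1 << es)                 # 2^(2^es)
    value = (useed ** regime) * (2 ** exponent) * frac_val
    return -value if sign else value

assert decode_posit8(0x40) == 1.0 and decode_posit8(0x50) == 4.0
assert decode_posit8(0xC0) == -1.0
```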
Transformer models have achieved impressive performance in various artificial intelligence (AI) applications. However, their high computation cost and memory footprint make inference inefficient. Although digital compute-in-memory (CIM) is a promising hardware architecture with high accuracy, the Transformer's attention mechanism raises three challenges for CIM access: 1) involving Query...
Cost-volume construction, which accurately computes the similarities between pixels in paired images, is a fundamental kernel of stereo vision processing and has been directly used in robotics, autopilot, and AR/VR applications. However, the large parameter size and consecutive data accesses of real-time cost-volume construction (>30fps) exert high demands on memory bandwidth (0.254Tb/s) and operations (391GOPs). A promising candidate to resolve this bottleneck is computation-in-memory (CIM), which provides computing parallelism...
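As a reference for the kernel itself, here is a minimal NumPy sketch of absolute-difference cost-volume construction for a rectified grayscale stereo pair, followed by a winner-take-all disparity readout; the image size, disparity range, and toy input are illustrative and unrelated to the bandwidth/GOPs figures above.

```python
import numpy as np

def cost_volume(left, right, max_disparity):
    """Absolute-difference cost volume for a rectified stereo pair.

    cost[d, y, x] compares left[y, x] with right[y, x - d]; a lower cost
    means the two pixels are more similar at disparity d.
    """
    h, w = left.shape
    cost = np.full((max_disparity, h, w), np.inf, dtype=np.float32)
    for d in range(max_disparity):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return cost

# Toy pair: the right image is the left image shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.random((48, 64)).astype(np.float32)
right = np.roll(left, shift=-3, axis=1)
disparity = cost_volume(left, right, max_disparity=8).argmin(axis=0)
assert (disparity[:, 3:] == 3).all()       # winner-take-all recovers the shift
```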
This work presents a 65nm RNN processor with computing-in-memory (CIM) macros. The main contributions include: 1) A similarity analyzer (SimAyz) that fully leverages the temporal stability of input sequences for a 1.52× performance speedup; 2) An attention-based context-breaking (AttenBrk) method with output speculation to reduce off-chip data accesses by up to 30.3%; 3) A double-buffering scheme for CIM macros to hide write latency and a pipelined processing element (PE) array to increase system throughput. Measured...
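As a software-level illustration of exploiting temporal stability (not the SimAyz hardware itself), the sketch below reuses a cached output whenever the current input frame differs from the last computed frame by less than a threshold; the feature sizes, threshold, and synthetic input are arbitrary.

```python
import numpy as np

def run_with_similarity_skip(frames, weight, threshold):
    """Skip the matrix-vector product when consecutive frames barely change."""
    outputs, skipped = [], 0
    last_frame, last_out = None, None
    for x in frames:
        if last_frame is not None and np.abs(x - last_frame).sum() < threshold:
            outputs.append(last_out)       # reuse the cached result
            skipped += 1
            continue
        last_out = weight @ x              # the "expensive" computation
        last_frame = x
        outputs.append(last_out)
    return np.stack(outputs), skipped

# Speech-like input: consecutive feature frames drift slowly.
rng = np.random.default_rng(0)
frames = np.cumsum(0.01 * rng.standard_normal((100, 40)), axis=0)
weight = rng.standard_normal((64, 40))
outs, skipped = run_with_similarity_skip(frames, weight, threshold=0.5)
print(f"skipped {skipped} of {len(frames)} frames")
```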
This paper proposes an energy-efficient Transformer processor exploiting dynamic similarity in global attention computing. It has three features: 1) A principal-component-prior speculation unit (PCSU) removes 28.4% of redundant computations. 2) A similar-vector tracked computing engine (STCE) saves 42.2% of multiplications. 3) A bit-wise stationary processing element (BSPE) reduces multiplication energy by $1.47\times$. The proposed processor achieves a peak efficiency of 77.35 TOPS/W, $2.81\times$ ..., and offers...
Computing-in-memory (CIM) is an attractive approach for energy-efficient neural network (NN) processors. Attention mechanisms show great performance in NLP and CV by capturing contextual knowledge from the entire token sequence (X). The attention mechanism is essentially a content-based similarity search that computes probabilities (P) and final results (Att). For P, first, the query (Q) and key (K) are computed from X and the weight matrices $(\text{W}_{Q}, \text{W}_{K})$, respectively. Then, Q is multiplied by $\text{K}^{T}$ (Q×K...
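A minimal NumPy sketch of the attention computation described above: Q and K come from X and the weight matrices, P is a row-wise softmax of $\text{Q}\text{K}^{T}$, and Att is the probability-weighted values. The dimensions and the scaling by $\sqrt{d}$ follow the standard formulation rather than any specific CIM mapping.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head attention: P = softmax(Q K^T / sqrt(d)), Att = P V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    P = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (tokens, tokens) probabilities
    return P @ V                                  # (tokens, d_v) results

# Toy shapes: 16 tokens, 32-dimensional embeddings, 8-dimensional head.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))
W_Q, W_K, W_V = (rng.standard_normal((32, 8)) for _ in range(3))
Att = attention(X, W_Q, W_K, W_V)
assert Att.shape == (16, 8)
```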
The residual (2+1)-dimensional convolutional neural network (R(2+1)D CNN) has achieved great success in video recognition due to its spatiotemporal structure. However, R(2+1)D CNN incurs large energy and latency overhead because of intensive computation and frequent memory accesses. To solve these issues, we propose a digital SRAM-CIM based accelerator with two key features: (1) A systolic CIM array to efficiently match massive computations with a regular architecture; (2) Digital circuit design exploiting output sparsity...
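For reference, the (2+1)D idea factorizes a full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution with a nonlinearity in between. The PyTorch sketch below shows that factorization; the channel counts and the intermediate width are chosen arbitrarily for illustration and do not reflect the accelerator mapping.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorized (2+1)D convolution: spatial (1,k,k) then temporal (k,1,1)."""

    def __init__(self, in_ch, out_ch, mid_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):                 # x: (batch, channels, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

# Toy clip: batch of 2, 3 channels, 8 frames of 32x32 pixels.
x = torch.randn(2, 3, 8, 32, 32)
block = Conv2Plus1D(in_ch=3, out_ch=16, mid_ch=12)
assert block(x).shape == (2, 16, 8, 32, 32)
```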
Transformer models show state-of-the-art results in natural language processing and computer vision, leveraging a multi-headed self-attention mechanism. In each head, the operation is defined as $\text{Attn}=\text{Softmax}(\mathrm{Q}\cdot \mathrm{K}^{\top})\cdot \mathrm{V}$, where $\mathrm{Q}=\mathrm{X}\cdot \mathrm{W}_{\mathrm{Q}},\ldots$...