- Error Correcting Code Techniques
- Advanced Wireless Communication Techniques
- Advanced Neural Network Applications
- Coding Theory and Cryptography
- Cooperative Communication and Network Coding
- CCD and CMOS Imaging Sensors
- Advanced Memory and Neural Computing
- Cryptographic Implementations and Security
- Cryptography and Residue Arithmetic
- Image Processing Techniques and Applications
- Advanced Vision and Imaging
- Cryptography and Data Security
- Parallel Computing and Optimization Techniques
- Analog and Mixed-Signal Circuit Design
- Low-Power High-Performance VLSI Design
- Neural Networks and Applications
- Advanced Image and Video Retrieval Techniques
- Numerical Methods and Algorithms
- Advanced MIMO Systems Optimization
- Wireless Communication Networks Research
- DNA and Biological Computing
- Digital Filter Design and Implementation
- Advanced Data Storage Technologies
- Algorithms and Data Compression
- Topic Modeling
Nanjing University
2016-2025
Sun Yat-sen University
2023-2025
Chinese Academy of Sciences
2012-2024
Shenyang Institute of Automation
2012-2024
University of California, Santa Barbara
2023-2024
China Electronics Technology Group Corporation
2020-2022
Xijing University
2020
Wanfang Data (China)
2019
Hefei University of Technology
2017
Broadcom (United States)
2008-2016
The Internet of Things (IoT) has become part of everyday life across the globe; its nodes are able to sense, store, and transmit information wirelessly. However, IoT nodes based on von Neumann architectures realize memory, computing, and communication functions with physically separated devices, which results in severe power consumption and computation latency. In this study, a wireless multiferroic memristor consisting of a Metglas/Pb(Zr0.3Ti0.7)O3-1 mol% Mn/Metglas laminate is proposed, which integrates...
Recently, significant improvements have been achieved in the hardware architecture design of deep neural networks (DNNs). However, the implementation of the widely used softmax function in DNNs, which involves expensive division and exponentiation units, has not been much investigated. This paper presents an efficient hardware implementation of this function. Mathematical transformations and linear fitting are used to simplify the computation. Multiple algorithmic strength-reduction strategies and fast addition methods are employed to optimize the architecture. By using these...
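A minimal sketch of the strength-reduction idea described in this abstract, in NumPy; the max-subtraction, base-2 decomposition, and `1 + f` linear fit are common choices in such designs and are assumptions here, not necessarily the paper's exact scheme:

```python
import numpy as np

def softmax_hw_sketch(x):
    """Hardware-oriented softmax sketch (illustrative, not the paper's design).

    Typical strength reductions:
      1. Subtract the max so every exponent is <= 0.
      2. Replace e^x with 2^(x * log2(e)); 2^u then splits into an integer
         part (a pure shift in hardware) and a fractional part.
      3. Approximate 2^f for f in [0, 1) by the linear fit 2^f ~ 1 + f.
    """
    x = np.asarray(x, dtype=np.float64)
    u = (x - x.max()) * np.log2(np.e)   # e^x -> 2^u
    k = np.floor(u).astype(int)         # integer part: a shift
    f = u - k                           # fractional part in [0, 1)
    pow2 = (1.0 + f) * np.exp2(k)       # linear fit of 2^f, then shift
    return pow2 / pow2.sum()            # the division is itself optimized in hardware

print(softmax_hw_sketch(np.array([1.0, 2.0, 3.0])))
# ~[0.091, 0.255, 0.654] vs. exact softmax [0.090, 0.245, 0.665]
```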
The convolutional neural network (CNN) is a state-of-the-art deep learning approach employed in various applications. Real-time CNN implementations on resource-limited embedded systems have recently become highly desired. To ensure programmable flexibility and shorten the development period, the field-programmable gate array (FPGA) is an appropriate platform on which to implement CNN models. However, memory bandwidth and on-chip storage are the bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate CNN models. The theoretical...
The Transformer has become an indispensable staple in deep learning. However, for real-life applications, it is very challenging to deploy efficient Transformers due to the immense parameters and operations of these models. To relieve this burden, exploiting sparsity is an effective approach to accelerate Transformers. The newly emerging Ampere graphics processing units (GPUs) leverage a 2:4 sparsity pattern to achieve model acceleration, yet this pattern can hardly meet the diverse algorithm and hardware constraints encountered in deployment. By contrast,...
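For concreteness, this is what the 2:4 pattern mentioned above enforces; a small NumPy sketch, where the keep-the-two-largest-magnitudes selection is the usual heuristic and an assumption here:

```python
import numpy as np

def prune_2_4(w):
    """Apply a 2:4 structured-sparsity pattern (illustrative sketch):
    in every group of 4 consecutive weights along a row, keep the 2 with
    the largest magnitude and zero the rest."""
    w = np.asarray(w, dtype=np.float64)
    rows, cols = w.shape
    assert cols % 4 == 0, "columns must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    # Indices of the 2 smallest-magnitude weights in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

w = np.random.randn(2, 8)
print(prune_2_4(w))   # exactly 2 non-zeros in every group of 4
```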
Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful modeling capability. However, RNNs usually require a large number of parameters and high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for the widely used RNN variant, long short-term memory (LSTM), to tackle these...
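Since the abstract is truncated before the method details, here is a hedged sketch of what a hybrid compression pipeline for an LSTM weight matrix typically combines, magnitude pruning plus uniform quantization; the `keep_ratio` and bit-width values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def compress_weight(w, keep_ratio=0.3, bits=4):
    """Hybrid compression sketch (assumed pipeline, not the paper's exact
    method): magnitude pruning followed by uniform fixed-point quantization
    of the surviving weights."""
    w = np.asarray(w, dtype=np.float64)
    # 1. Magnitude pruning: keep only the largest |w| entries.
    thresh = np.quantile(np.abs(w), 1.0 - keep_ratio)
    mask = np.abs(w) >= thresh
    # 2. Uniform quantization of survivors to `bits` bits.
    scale = np.abs(w[mask]).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * scale * mask, mask

w = np.random.randn(16, 16)              # e.g. one gate matrix of an LSTM cell
w_hat, mask = compress_weight(w)
print(mask.mean(), np.abs(w - w_hat).max())
```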
Binary-weight convolutional neural networks (BCNNs) can achieve near state-of-the-art classification accuracy and have far lower computational complexity compared with traditional CNNs that use high-precision weights. Due to their binary weights, BCNNs are well suited for vision-based Internet-of-Things systems that are sensitive to power consumption, and they make very high throughput possible at moderate power dissipation. In this paper, an energy-efficient BCNN architecture is proposed. It fully exploits the binary weights...
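To illustrate why binary weights remove multipliers: a sketch of XNOR-Net-style weight binarization (sign plus a per-filter scale), so each multiply-accumulate reduces to add/subtract in hardware; the per-output-channel mean-of-absolute-values scale is the common choice and an assumption here:

```python
import numpy as np

def binarize(w):
    """Binarize a convolution weight tensor: each filter becomes
    sign(w) times one per-filter scale alpha (illustrative sketch)."""
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)  # per output channel
    signs = np.sign(w)                                     # +1 / -1 weights
    return signs * alpha, signs, alpha

w = np.random.randn(8, 3, 3, 3)          # (out_ch, in_ch, kH, kW)
w_bin, signs, alpha = binarize(w)
print(np.unique(signs), float(np.abs(w - w_bin).mean()))
```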
The softmax function has been widely used in deep neural networks (DNNs), and studies on efficient hardware accelerators for DNNs have also attracted tremendous attention. However, it is very challenging to design such architectures because of the expensive exponentiation and division calculations involved. In this brief, the softmax function is firstly simplified by exploring algorithmic strength reductions. Afterwards, a hardware-friendly, precision-adjustable calculation method is proposed, which can meet different precision...
Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model has been replacing the RNN in the natural language processing (NLP) area. However, because of the intensive matrix computations and complicated data flow involved, such an accelerator design had never been reported. In this paper, we propose the first accelerator for the two key components, i.e., the multi-head attention (MHA) ResBlock...
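For reference, the MHA dataflow such an accelerator must support, as a minimal NumPy sketch; the weight names and sizes are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Minimal multi-head attention (illustrative of the dataflow only).
    x: (seq_len, d_model); wq/wk/wv/wo: (d_model, d_model)."""
    seq, d = x.shape
    dh = d // heads
    # Project, then split the model dimension into independent heads.
    q, k, v = (x @ w for w in (wq, wk, wv))
    split = lambda t: t.reshape(seq, heads, dh).transpose(1, 0, 2)
    q, k, v = map(split, (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)      # (heads, seq, seq)
    p = np.exp(scores - scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)                        # row-wise softmax
    out = (p @ v).transpose(1, 0, 2).reshape(seq, d)     # concatenate heads
    return out @ wo

rng = np.random.default_rng(0)
ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
print(multi_head_attention(rng.standard_normal((4, 8)), *ws, heads=2).shape)
```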
Long Short-Term Memory (LSTM) and its variants have been widely adopted in many sequential learning tasks, such as speech recognition and machine translation. Significant accuracy improvements can be achieved using a complex LSTM model with a large memory requirement and high computational complexity, which is time-consuming and energy-demanding. The low-latency and energy-efficiency requirements of real-world applications make model compression and hardware acceleration for LSTMs an urgent need. In this paper, several...
The training of Deep Neural Networks (DNNs) brings enormous memory requirements and computational complexity, which makes it a challenge to train DNN models on resource-constrained devices. Training DNNs with a reduced-precision data representation is crucial to mitigate this problem. In this article, we conduct a thorough investigation of low-bit posit numbers, a Type-III universal number (Unum). Through a comprehensive analysis of quantization with various numeric formats, it is demonstrated that the posit format shows great potential...
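As background for the posit discussion, a minimal decoder for an n-bit posit with es exponent bits (sign, regime, exponent, fraction fields); this follows the standard Type-III Unum layout but omits rounding and handles NaR crudely:

```python
def posit_to_float(bits, n=8, es=1):
    """Decode an n-bit posit bit pattern to a float (illustrative sketch)."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float('nan')                     # NaR (not a real)
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & ((1 << n) - 1)         # two's complement
    body = bits & ((1 << (n - 1)) - 1)          # drop the sign bit
    # Regime: run length of identical leading bits, then a terminator bit.
    first = (body >> (n - 2)) & 1
    run, i = 1, n - 3
    while i >= 0 and ((body >> i) & 1) == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run
    rest_bits = max(i, 0)                       # bits after the terminator
    rest = body & ((1 << rest_bits) - 1)
    e_bits = min(es, rest_bits)                 # exponent field may be cut off
    e = (rest >> (rest_bits - e_bits)) << (es - e_bits) if e_bits else 0
    f_bits = rest_bits - e_bits
    frac = 1.0 + (rest & ((1 << f_bits) - 1)) / (1 << f_bits) if f_bits else 1.0
    return sign * 2.0 ** (k * (1 << es) + e) * frac

print([posit_to_float(b) for b in (0x40, 0x60, 0x20, 0xC0)])  # [1.0, 4.0, 0.25, -1.0]
```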
To enable efficient deployment of convolutional neural networks (CNNs) on embedded platforms for different computer vision applications, several convolution variants have been introduced, such as depthwise convolution (DWCV), transposed convolution (TPCV), and dilated convolution (DLCV). To address the utilization degradation issue that occurs when a general convolution engine runs these emerging operators, a highly flexible reconfigurable hardware accelerator is proposed to efficiently support various CNN-based tasks. Firstly, to avoid workload imbalance...
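The utilization issue stems from how differently these variants map onto one engine; the standard output-size formulas below show how dilation stretches the effective kernel and how transposed convolution upsamples (the helper function is illustrative, not from the paper):

```python
def conv_out_size(n, k, stride=1, pad=0, dilation=1, transposed=False):
    """Output size along one spatial dimension for the convolution variants
    named above (standard formulas; output_padding omitted for TPCV)."""
    ke = dilation * (k - 1) + 1                  # effective kernel (DLCV)
    if transposed:                               # TPCV: upsampling convolution
        return (n - 1) * stride - 2 * pad + ke
    return (n + 2 * pad - ke) // stride + 1

n, k = 16, 3
print(conv_out_size(n, k, pad=1))                             # standard/DWCV: 16
print(conv_out_size(n, k, pad=2, dilation=2))                 # DLCV: 16
print(conv_out_size(n, k, stride=2, pad=1, transposed=True))  # TPCV: 31
```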
The Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications. Specifically, ViTs' multi-head attention layers make it possible to embed information globally across the overall image. Nevertheless, computing and storing the attention matrices incurs a cost that is quadratic in the number of patches, limiting the achievable efficiency and scalability and prohibiting more extensive real-world ViT applications on resource-constrained devices...
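A quick back-of-the-envelope on that quadratic dependency, with typical ViT-Base settings (16x16 patches, 12 heads) assumed for illustration:

```python
def attn_matrix_elems(image, patch, heads):
    """Elements in the per-layer attention score matrices: one (n x n)
    matrix per head, where n is the number of patches."""
    n = (image // patch) ** 2
    return heads * n * n

for size in (224, 384, 512):     # common input resolutions
    print(size, attn_matrix_elems(size, 16, 12))
# 224 -> ~0.46M, 384 -> ~3.98M, 512 -> ~12.6M elements: quadratic growth
```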
Prior research efforts have focused on using BCH codes for error correction in multi-level cell (MLC) NAND flash memory. However, they often require highly parallel implementations to meet the throughput requirement. As a result, a large area is needed. In this paper, we propose using Reed-Solomon (RS) codes for MLC NAND flash. An (828, 820) RS code has almost the same rate and length in terms of bits as an (8248, 8192) BCH code. Moreover, it achieves at least the same error-correcting performance in memory applications. Nevertheless, with 70% of the area,...
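The rate/length comparison can be checked directly, assuming the RS code uses 10-bit symbols over GF(2^10), which is consistent with the stated lengths:

```python
# RS symbols are 10 bits (n = 828 <= 2^10 - 1); the BCH code is binary.
rs_n, rs_k, sym = 828, 820, 10
bch_n, bch_k = 8248, 8192

print(rs_n * sym, rs_k * sym)        # 8280 total bits, 8200 info bits
print(rs_k / rs_n, bch_k / bch_n)    # rates: ~0.9903 vs ~0.9932
```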
Power consumption is a major bottleneck of system performance and is listed as one of the top three challenges in the International Technology Roadmap for Semiconductors 2008. In practice, a large portion of on-chip power is consumed by the clock system, which is made up of the clock distribution network and flip-flops. In this paper, various design techniques for a low-power clocking system are surveyed. Among them, an effective way to reduce the capacitive load is minimizing the number of clocked transistors. To approach this, we propose a novel pair-shared flip-flop that reduces the local...
In the era of artificial intelligence (AI), deep neural networks (DNNs) have emerged as the most important and powerful AI technique. However, large DNN models are both storage and computation intensive, posing significant challenges for adopting DNNs in resource-constrained scenarios. Thus, model compression becomes a crucial technique to ensure the wide deployment of DNNs.
This paper proposes a generalized hyperbolic COordinate Rotation DIgital Computer (GH CORDIC) to directly compute logarithms and exponentials with an arbitrary fixed base. In hardware implementation, it is more efficient than the state of the art, which requires both a CORDIC and a constant multiplier. More specifically, we develop the theory of GH CORDIC by adding a new parameter, called the base, to the conventional hyperbolic CORDIC; it can be used to specify the base with respect to which logarithms and exponentials are computed. As a result, a constant multiplier is no longer needed to convert e...
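A software sketch of the idea: hyperbolic CORDIC in vectoring mode computes artanh((w-1)/(w+1)) = ln(w)/2, and the base parameter amounts to pre-scaling the angle table so the result comes out directly in the target base. The floating-point version below is illustrative only; a hardware design would use fixed-point shifts and a stored table:

```python
import math

def gh_cordic_log(w, base=2.0, iters=24):
    """log_base(w) via hyperbolic CORDIC, vectoring mode (sketch).
    The angle table stores atanh(2^-i)/ln(base), folding in the base
    conversion so no constant multiplier is needed afterwards."""
    assert w > 0
    x, y, z = w + 1.0, w - 1.0, 0.0
    i, repeat = 1, 4                  # hyperbolic CORDIC repeats i = 4, 13, 40, ...
    for _ in range(iters):
        d = 1.0 if y < 0 else -1.0    # vectoring mode drives y -> 0
        t = 2.0 ** -i
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t) / math.log(base)   # pre-scaled angle table
        if i == repeat:               # mandatory repeated iteration
            repeat = 3 * repeat + 1
        else:
            i += 1
    return 2.0 * z                    # 2 * artanh((w-1)/(w+1)) = ln(w), pre-scaled

print(gh_cordic_log(8.0, base=2.0), math.log2(8.0))   # both ~3.0
```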
Designing hardware accelerators for convolutional neural networks (CNNs) has recently attracted tremendous attention. Plenty of existing accelerators are built for dense CNNs or structured sparse CNNs. By contrast, unstructured sparse CNNs can achieve a higher compression ratio with equivalent accuracy. However, their corresponding implementations generally suffer from load imbalance and conflicting accesses to on-chip buffers, which result in under-utilization of the processing elements (PEs). To tackle these issues, we propose a...
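A tiny experiment showing the load-imbalance problem named above, assuming the naive mapping of one PE per output row of an unstructured sparse weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w[np.abs(w) < 1.2] = 0.0                    # roughly 3/4 unstructured sparsity

# Naive mapping: one PE per output row. Work per PE = nonzeros in its row,
# so the slowest PE sets the latency while the others idle.
work = (w != 0).sum(axis=1)
print(work.min(), work.max(), work.mean())  # spread => PE under-utilization
```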
Deformable convolutional networks (DCNs) have shown outstanding potential in video super-resolution with their powerful inter-frame feature alignment. However, deploying DCNs on resource-limited devices is challenging due to their high computational complexity and irregular memory accesses. In this work, an algorithm-hardware co-optimization framework is proposed to accelerate DCNs on a field-programmable gate array (FPGA). Firstly, at the algorithm level, an anchor-based lightweight deformable network (ALDNet)...
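The irregular memory accesses come from the bilinear sampling at learned fractional offsets that deformable convolution performs for every kernel tap; a minimal sketch (the offset values here are made up for illustration):

```python
import numpy as np

def deform_sample(feat, py, px):
    """Bilinear sampling at a fractional location: the core of deformable
    convolution. Each tap reads 4 neighbors at a data-dependent address,
    which is what makes the memory access pattern irregular."""
    h, w = feat.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    def at(y, x):                     # zero padding outside the feature map
        return feat[y, x] if 0 <= y < h and 0 <= x < w else 0.0
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0) + dy * dx * at(y0 + 1, x0 + 1))

feat = np.arange(16, dtype=float).reshape(4, 4)
# One 3x3 tap at (1, 1) displaced by a learned offset of (+0.5, -0.25):
print(deform_sample(feat, 1 + 0.5, 1 - 0.25))   # 6.75
```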
Extreme edge platforms, such as in-vehicle smart devices, require efficient deployment of quantized deep neural networks (DNNs) to enable intelligent applications with limited amounts of energy, memory, and computing resources. However, many devices struggle to boost inference throughput across various DNNs due to the varying quantization levels, and these devices lack floating-point (FP) support for on-device learning, which prevents them from improving model accuracy while ensuring data privacy. To tackle these challenges...
The Swin Transformer achieves greater efficiency than the Vision Transformer by utilizing local self-attention and shifted windows. However, existing hardware accelerators designed for Transformers have not been optimized for the unique computation flow and data-reuse properties of the Swin Transformer, resulting in lower utilization and extra memory accesses. To address this issue, we develop SWAT, an efficient accelerator based on an FPGA. Firstly, to eliminate redundant computations between windows, a novel tiling strategy is employed, which helps the developed...
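For context, (shifted) window partitioning as introduced by Swin, in NumPy; the cyclic shift via np.roll follows the original Swin formulation, while SWAT's own tiling strategy is not shown here:

```python
import numpy as np

def windows(x, ws, shift=0):
    """Partition an (H, W) feature map into non-overlapping ws x ws windows,
    optionally cyclically shifted as in Swin, so self-attention is computed
    only within each window."""
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    h, w = x.shape
    return x.reshape(h // ws, ws, w // ws, ws).swapaxes(1, 2).reshape(-1, ws, ws)

x = np.arange(64).reshape(8, 8)
print(windows(x, 4).shape, windows(x, 4, shift=2).shape)  # (4, 4, 4) each
```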
All-in-one image restoration (IR) recovers images from various unknown distortions, such as rain, haze, and blur, with a single model. Transformer-based IR methods have significantly improved the visual quality of restored images. However, deploying such complex models on edge devices is challenging due to their massive parameters and intensive computations. Moreover, existing accelerators are typically customized for a single task, resulting in severe resource underutilization when executing multiple tasks...
Recently, large models, such as the Vision Transformer and BERT, have garnered significant attention due to their exceptional performance. However, their extensive computational requirements lead to considerable power and hardware resource consumption. Brain-inspired computing, characterized by its spike-driven methods, has emerged as a promising approach for low-power implementation. In this paper, we propose an efficient sparse accelerator for the Spike-driven Transformer. We first design a novel encoding method that...