- Advanced Neural Network Applications
- Advanced Data Compression Techniques
- Advanced Image and Video Retrieval Techniques
- CCD and CMOS Imaging Sensors
- Video Coding and Compression Technologies
- Advanced Vision and Imaging
- Digital Filter Design and Implementation
- Brain Tumor Detection and Classification
- Embedded Systems Design Techniques
- Advancements in Semiconductor Devices and Circuit Design
- Ferroelectric and Negative Capacitance Devices
- Semiconductor Materials and Devices
- Image and Signal Denoising Methods
- Parallel Computing and Optimization Techniques
- Adversarial Robustness in Machine Learning
- Integrated Circuits and Semiconductor Failure Analysis
- Robotics and Sensor-Based Localization
- Interconnection Networks and Systems
- Advanced Manufacturing and Logistics Optimization
- Algorithms and Data Compression
- VLSI and FPGA Design Techniques
- Multimodal Machine Learning Applications
- Gaussian Processes and Bayesian Inference
- User Authentication and Security Systems
- Cryptographic Implementations and Security
Fudan University
2013-2025
Shanghai Fudan Microelectronics (China)
2010-2021
China Southern Power Grid (China)
2021
Zhejiang Gongshang University
2020
Changsha University of Science and Technology
2013-2017
Huaqiao University
2017
State Key Laboratory of ASIC and System
2008-2015
Education Department of Hunan Province
2015
Central South University
2013-2015
China Information Technology Security Evaluation Center
2014
Deep learning-based radiomics (DLR) was developed to extract deep information from multiple modalities of magnetic resonance (MR) images. The performance of DLR for predicting the mutation status of isocitrate dehydrogenase 1 (IDH1) was validated in a dataset of 151 patients with low-grade glioma. A modified convolutional neural network (CNN) structure with 6 layers and a fully connected layer with 4096 neurons was used to segment tumors. Instead of calculating image features from segmented images, as typically performed for normal...
In recent years, convolutional neural network (CNN) based machine learning algorithms have been widely applied in computer vision applications. However, for large-scale CNNs, the computation-intensive, memory-intensive and resource-consuming features have brought many challenges to CNN implementations. This work proposes an end-to-end FPGA-based CNN accelerator with all the layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput. A methodology which can find...
One of the major problems of the p-i-n tunneling field-effect transistor (TFET) is reliability, due to the strong electric field near the junction. In this paper, using technology computer-aided design simulation, we show that the insertion of a thin n-layer into the junction of the TFET (p-n-i-n TFET) not only enhances its drive current, as has been previously reported, but also improves its reliability. As compared with the conventional TFET, we demonstrate the following properties of the p-n-i-n TFET: 1) The normal component of the electric field is reduced, and...
Convolutional Neural Networks (CNNs) can achieve high classification accuracy while they require complex computation. Binarized Neural Networks (BNNs) with binarized weights and activations can simplify computation but suffer from obvious accuracy loss. In this paper, low bit-width CNNs, BNNs and standard CNNs are compared to show that the low bit-width CNN is better suited for embedded systems. An architecture based on the two-stage arithmetic unit (TSAU) as the basic processing element is proposed to process each layer iteratively for low bit-width CNN accelerators. Then...
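To illustrate why binarized weights and activations simplify computation, here is a minimal sketch of the XNOR-popcount dot product commonly used in BNN implementations. It is not taken from the paper: the function names and the bit encoding (1 for +1, 0 for -1) are assumptions made for this example.

```python
def binarize(values):
    """Map real values to bits: 1 encodes +1 (value >= 0), 0 encodes -1."""
    return [1 if v >= 0 else 0 for v in values]


def pack_bits(bits):
    """Pack a list of 0/1 bits into one Python integer (bit i = element i)."""
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word


def xnor_popcount_dot(a_word, w_word, n):
    """Dot product of two {-1,+1} vectors of length n stored as bit words.

    XNOR marks the positions where the signs agree; each agreement adds +1
    and each disagreement adds -1, so the result is 2 * matches - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_word ^ w_word) & mask
    matches = bin(agree).count("1")
    return 2 * matches - n


def reference_dot(a, w):
    """Same dot product computed with explicit +1/-1 multiplications."""
    def sign(v):
        return 1 if v >= 0 else -1
    return sum(sign(x) * sign(y) for x, y in zip(a, w))


# Hypothetical activations and weights, used only to exercise the functions.
activations = [0.5, -1.2, 0.3, -0.7, 2.0]
weights = [-0.1, -0.4, 0.9, 0.2, 1.5]
a_word = pack_bits(binarize(activations))
w_word = pack_bits(binarize(weights))
result = xnor_popcount_dot(a_word, w_word, len(activations))
```

The multiply-accumulate loop collapses into one XOR, one inversion and a bit count, which is the source of the hardware savings; the accuracy loss noted in the abstract comes from discarding everything except the sign.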
In this letter, we report for the first time the degradation mechanism of drain current in tunneling field-effect transistors (TFETs). Using positive-bias and hot-carrier (HC) stress experiments and TCAD simulation, we show that the degradation is mainly induced by the interface traps and/or oxide charge located above the tunneling region, causing a reduction of the tunneling field and current. The interface traps induce degradation in transconductance, while the oxide charge essentially causes a threshold-voltage shift in TFETs. The results show that interface-trap generation is dominant under stress, while oxide-charge creation...
Background: Brain tumor segmentation is a challenging problem in medical image processing and analysis. It is a very time-consuming and error-prone task. In order to reduce the burden on physicians and improve the segmentation accuracy, computer-aided detection (CAD) systems need to be developed. Due to the powerful feature learning ability of deep learning technology, many deep learning-based methods have been applied to brain tumor segmentation CAD systems and achieved satisfactory accuracy. However, deep neural networks have high computational complexity, and the process consumes...
Many convolutional neural network (CNN) accelerators are proposed to exploit the sparsity of the networks recently to enjoy the benefits of both computation and memory reduction. However, most accelerators cannot exploit the sparsity of both activations and weights. For those works that exploit both sparsity opportunities, they cannot achieve stable load balance through a static scheduling (SS) strategy, which is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associate structure is also presented to tradeoff...
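For readers unfamiliar with the format named above, the following is a minimal sketch of the plain compressed sparse row (CSR) encoding. It does not reproduce the balanced variant or the dynamic scheduling proposed in the paper; the function names and the example matrix are assumptions made for illustration.

```python
def to_csr(dense):
    """Encode a dense matrix (list of rows) as (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r ends inside values/col_idx
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector, touching only non-zeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def row_nnz(row_ptr):
    """Number of non-zeros per row, i.e. the work each row would generate."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# Hypothetical sparse weight matrix with an uneven non-zero distribution.
dense = [
    [0, 2, 0, 0],
    [1, 0, 3, 4],
    [0, 0, 0, 0],
]
values, col_idx, row_ptr = to_csr(dense)
work_per_row = row_nnz(row_ptr)
product = csr_matvec(values, col_idx, row_ptr, [1, 1, 1, 1])
```

If each row were statically assigned to one processing element, the per-row work here would be 1, 3 and 0 multiplications, which illustrates how a static schedule becomes sensitive to the sparsity distribution.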
VBSME (variable block size motion estimation) is adopted in the MPEG-4 AVC/H.264 standard. In order to increase the hardware utilization for VBSME with FSBMA (full search block matching algorithm), this paper proposes a new high-performance reconfigurable VLSI architecture to support a "meander"-like scan format for high data reuse of the search area. The architecture can support three data flows through the reconfigurable computing array and memory to achieve 100% processing element (PE) utilization, and can reuse the smaller blocks' SADs to calculate 41 motion vectors (MVs) of a 16×16 block in parallel. The design is implemented...
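The figure of 41 follows from the H.264 partition sizes of one 16×16 macroblock: 16 (4×4) + 8 (8×4) + 8 (4×8) + 4 (8×8) + 2 (16×8) + 2 (8×16) + 1 (16×16). The sketch below shows the SAD-reuse idea in software: the sixteen 4×4 SADs are computed once and the larger partitions are obtained by adding them. It is an illustration under assumed names and test data, not the paper's hardware data flow.

```python
def sad_4x4_grid(cur, ref):
    """SADs of the sixteen 4x4 sub-blocks of a 16x16 block, as a 4x4 grid."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge(grid, h, w):
    """Sum 4x4 SADs over non-overlapping tiles of h x w grid cells."""
    return [
        sum(grid[r + dr][c + dc] for dr in range(h) for dc in range(w))
        for r in range(0, 4, h)
        for c in range(0, 4, w)
    ]


def all_partition_sads(cur, ref):
    """Return the SADs of all 41 partitions, keyed by 'width x height'."""
    grid = sad_4x4_grid(cur, ref)
    # (rows, cols) measured in 4x4 cells; names follow width x height.
    shapes = {
        "4x4": (1, 1), "8x4": (1, 2), "4x8": (2, 1), "8x8": (2, 2),
        "16x8": (2, 4), "8x16": (4, 2), "16x16": (4, 4),
    }
    return {name: merge(grid, h, w) for name, (h, w) in shapes.items()}


# Hypothetical current and reference blocks with deterministic pixel values.
cur = [[(7 * x + 13 * y) % 256 for x in range(16)] for y in range(16)]
ref = [[(11 * x + 3 * y) % 256 for x in range(16)] for y in range(16)]
sads = all_partition_sads(cur, ref)
total_partitions = sum(len(v) for v in sads.values())
```

Only the 4×4 level reads pixels; every other level is a handful of additions, which is why an adder tree over the small-block SADs can produce all partition sizes for one candidate position together.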
CGRA, as a coprocessor in SoCs, has been widely studied. However, there is limited research on how to efficiently debug and verify SoCs composed of CGRAs and processors during the design process. To address this gap, we introduce DVHetero. DVHetero incorporates a simulation and validation framework, SoCDiff, which enables comprehensive SoC simulation, debugging, and rapid error localization. Using this verification framework, we successfully implemented and validated the entire SoC. The framework includes a Chisel-based CGRA generator and provides...
This brief presents a reconfigurable VLSI architecture which is designed for a multi-transform codec in several video coding standards of MPEG-2/4, VC-1, H.264/AVC and AVS. The multiple constant multiplication algorithm with two fusing strategies is provided to generate the multipliers in the matrix calculation blocks. Additionally, an adder-sharing strategy is adopted in the unified preprocessing/postprocessing block to save circuit areas. The proposed architecture can support different standards through static reconfiguration and forward/inverse...
With the development of machine learning technology, the exploration of energy-efficient and flexible architectures for object inference algorithms is of growing interest in recent years. However, not many publications concentrate on a coarse-grained reconfigurable architecture (CGRA) for object inference algorithms. This paper provides a stream processing, dual-track programming CGRA-based approach to address the inherent computing characteristics of algorithms in object inference. Based on the proposed approach, an architecture called stream dual-track CGRA (SDT-CGRA) is presented as...
In the domain of password recovery, deep learning has emerged as a pivotal technology for enhancing recovery efficiency. Despite its effectiveness, the inherent computation complexity of deep learning-based password generation algorithms poses substantial challenges, particularly in achieving synergistic acceleration between deep learning inference and the plaintext encryption process. In this paper, we introduce PassRecover, a multi-FPGA-based computing system that can simultaneously accelerate deep learning-driven password generation and plaintext encryption in an end-to-end manner....
This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve a high processing frame rate, the architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost of feature extraction can be greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce memory access times. In this way, 3.87x and 2.25x...
With the continuous refinement of Deep Neural Networks (DNNs), a series of deep and complex networks such as Residual Networks (ResNets) show impressive prediction accuracy in image classification tasks. Unfortunately, the structural complexity and computational cost of residual networks make hardware implementation difficult. In this paper, we present the quantized and reconstructed deep neural network (QR-DNN) technique, which first inserts batch normalization (BN) layers during training, and later removes them to facilitate efficient...
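The truncated abstract does not say how the BN layers are removed. One widely used way to eliminate a BN layer after training is to fold its scale and shift into the preceding layer's weights and bias; the sketch below shows that folding for a single output neuron. Treat it as an assumption about the general idea, not as the QR-DNN procedure; all names and numeric values are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Plain weighted sum with bias for one output neuron."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def linear_then_bn(weights, bias, x, gamma, beta, mean, var, eps=1e-5):
    """Reference path: linear layer followed by inference-time batch norm."""
    z = linear(weights, bias, x)
    return gamma * (z - mean) / math.sqrt(var + eps) + beta


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into the weights and bias so BN can be dropped."""
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias


# Hypothetical trained parameters and one input vector.
weights, bias = [0.4, -1.1, 0.25], 0.3
gamma, beta, mean, var = 1.7, -0.2, 0.05, 2.3
x = [1.0, 2.0, -3.0]

with_bn = linear_then_bn(weights, bias, x, gamma, beta, mean, var)
fw, fb = fold_bn(weights, bias, gamma, beta, mean, var)
without_bn = linear(fw, fb, x)
```

Because inference-time BN is an affine transform with fixed statistics, the folded layer produces the same output as the two-layer reference up to floating-point rounding, so the hardware never has to implement the normalization itself.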
Due to the fact that FPGA on-chip memory capacity increases significantly, the feature maps and weights of convolutional layers can be stored on chip, which can reduce the data movement between on-chip and off-chip memory. Hence, the bottleneck can shift from memory bandwidth to computing resources in convolutional layers, which will improve performance dramatically. Under this circumstance, this paper quantitatively analyzes how to design the hardware architecture based on the roofline model to optimize performance under the constraints of available resources, and proposes an efficient architecture. Our...
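The roofline model mentioned above bounds attainable performance by the smaller of the compute peak and the product of memory bandwidth and operational intensity. The sketch below is a generic illustration of that bound, not the paper's analysis: the platform numbers are hypothetical, and the traffic estimate assumes a stride-1 convolution with equal input and output spatial size in which every input, weight and output value crosses the off-chip interface exactly once.

```python
def attainable_gops(peak_gops, bandwidth_gbs, ops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


def ridge_point(peak_gops, bandwidth_gbs):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_gops / bandwidth_gbs


def conv_intensity(out_h, out_w, out_c, in_c, k, bytes_per_value=2):
    """Operations per byte of off-chip traffic for one convolutional layer.

    Assumes stride 1 with same-size input and output maps, and that each
    input, weight and output value is transferred exactly once.
    """
    ops = 2 * out_h * out_w * out_c * in_c * k * k  # multiply + add
    values_moved = in_c * out_h * out_w + out_c * in_c * k * k + out_c * out_h * out_w
    return ops / (values_moved * bytes_per_value)


# Hypothetical platform: 100 GOPS compute peak, 10 GB/s off-chip bandwidth.
PEAK, BW = 100.0, 10.0
memory_bound = attainable_gops(PEAK, BW, 4.0)
compute_bound = attainable_gops(PEAK, BW, 50.0)
small_layer = conv_intensity(out_h=8, out_w=8, out_c=4, in_c=2, k=3)
```

Keeping feature maps and weights on chip removes most of the off-chip traffic in the denominator, which raises the operational intensity and moves a layer from the bandwidth-limited slope toward the compute roof, the shift described in the abstract.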
In this brief, an FPGA-based solution is proposed to show the computing efficiency on rotated object detection based on the R³Det algorithm. The key idea of our approach is to firstly design reconfigurable neural processing units (NPU) for convolutional neural networks (CNN) and a specific architecture for spatial operations, and then adopt a novel scheduling scheme to deal with the data dependency of these modules. When...
Many models combining Transformers with convolutional neural networks (CNNs) for computer vision tasks have achieved state-of-the-art results. However, due to the different computation patterns between attention and convolution, using a dedicated Transformer or CNN accelerator will inevitably reduce the computing efficiency of the other. To overcome this problem, we propose a unified architecture for attention and convolution on FPGA. We reduce the runtime overhead by offloading part of the self-attention computations offline before...
This paper presents a new hardware architecture that calculates SAD for variable block-size motion estimation (VBSME). The proposed architecture with a 16×1-PE array, a 4-stage adder tree and two flexible register arrays supports 16×16, 16×8, 8×8, 8×4, 4×8, 4×4 block's SAD calculation. The architecture can be used in the encoder for enhanced variable block size motion estimation of MPEG-4 AVC (advanced video coding) and the emerging H.264 standard. Our design was described in Verilog-HDL and implemented in Altera FPGA APEX20K with a clock frequency of 120 MHz, allowing the processing of 29296...
In this paper, a high-performance match engine for content-based image retrieval (CBIR) is proposed. Highly customized floating-point (FP) units are designed to provide the dynamic range and precision of standard FP units, but with considerably less area than standard FP units. Match calculation arrays with various architectures and scales are designed and evaluated. A CBIR system is built on a 12-FPGA cluster. Inter-FPGA connections are based on 10-Gigabyte Ethernet. The whole FPGA cluster can compare a query image against 150 million library...
The Winograd algorithm is an efficient approach to alleviate the computation burden of deep CNNs. Firstly, we introduce a fast matrix algorithm to combine with the Winograd algorithm to further reduce the computation complexity, and adapt to large-stride convolution with a kernel-partitioning method. Secondly, the efficiency improvement due to fast algorithms aggravates the off-chip communication burden. The DRAM access of different data-flows varies significantly with CNN patterns. Dynamic configurations of both data-flows and on-chip shared memory can reduce DRAM access effectively. A quantitative analysis is established on the design...
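As background for the first sentence, the smallest Winograd case, F(2,3), produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs. The sketch below shows that one-dimensional case only; it does not cover the fast matrix algorithm, kernel partitioning or data-flow configuration described in the abstract, and the function names and sample values are assumptions.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side transform; can be precomputed once per filter.
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    # The four element-wise multiplications.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * g2
    # Output transform uses additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Same two outputs computed directly with 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


# Hypothetical input tile and filter taps.
tile = [1.0, 2.0, 3.0, 4.0]
taps = [2.0, 0.0, -2.0]
fast = winograd_f23(tile, taps)
slow = direct_f23(tile, taps)
```

The saving in multiplications is paid for with extra additions and transformed operands, so a layer runs faster for the same amount of data fetched, which is consistent with the abstract's remark that faster arithmetic shifts pressure onto off-chip communication.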