Wei Cao

ORCID: 0000-0003-0339-7093
Research Areas
  • Advanced Neural Network Applications
  • Advanced Data Compression Techniques
  • Advanced Image and Video Retrieval Techniques
  • CCD and CMOS Imaging Sensors
  • Video Coding and Compression Technologies
  • Advanced Vision and Imaging
  • Digital Filter Design and Implementation
  • Brain Tumor Detection and Classification
  • Embedded Systems Design Techniques
  • Advancements in Semiconductor Devices and Circuit Design
  • Ferroelectric and Negative Capacitance Devices
  • Semiconductor materials and devices
  • Image and Signal Denoising Methods
  • Parallel Computing and Optimization Techniques
  • Adversarial Robustness in Machine Learning
  • Integrated Circuits and Semiconductor Failure Analysis
  • Robotics and Sensor-Based Localization
  • Interconnection Networks and Systems
  • Advanced Manufacturing and Logistics Optimization
  • Algorithms and Data Compression
  • VLSI and FPGA Design Techniques
  • Multimodal Machine Learning Applications
  • Gaussian Processes and Bayesian Inference
  • User Authentication and Security Systems
  • Cryptographic Implementations and Security

Fudan University
2013-2025

Shanghai Fudan Microelectronics (China)
2010-2021

China Southern Power Grid (China)
2021

Zhejiang Gongshang University
2020

Changsha University of Science and Technology
2013-2017

Huaqiao University
2017

State Key Laboratory of ASIC and System
2008-2015

Education Department of Hunan Province
2015

Central South University
2013-2015

China Information Technology Security Evaluation Center
2014

Deep learning-based radiomics (DLR) was developed to extract deep information from multiple modalities of magnetic resonance (MR) images. The performance of DLR for predicting the mutation status of isocitrate dehydrogenase 1 (IDH1) was validated in a dataset of 151 patients with low-grade glioma. A modified convolutional neural network (CNN) structure with 6 layers and a fully connected layer with 4096 neurons was used to segment tumors. Instead of calculating image features from segmented images, as typically performed normal...

10.1038/s41598-017-05848-2 article EN cc-by Scientific Reports 2017-07-10

In recent years, convolutional neural networks (CNNs) based machine learning algorithms have been widely applied in computer vision applications. However, for large-scale CNNs, the computation-intensive, memory-intensive and resource-consuming features have brought many challenges to CNN implementations. This work proposes an end-to-end FPGA-based accelerator with all the layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput. A methodology which find...

10.1109/fpl.2016.7577308 article EN 2016-08-01

One of the major problems of the p-i-n tunneling field-effect transistor (TFET) is reliability due to the strong electric field near the junction. In this paper, using technology computer-aided design simulation, we show that the insertion of a thin n-layer into the junction of the TFET (p-n-i-n TFET) not only enhances its drive current, as has been previously reported, but also improves its reliability. As compared with the conventional TFET, we demonstrate the following properties of the p-n-i-n TFET: 1) The normal component reduced, and...

10.1109/ted.2011.2144987 article EN IEEE Transactions on Electron Devices 2011-05-20

Convolutional Neural Networks (CNNs) can achieve high classification accuracy while they require complex computation. Binarized Neural Networks (BNNs) with binarized weights and activations simplify computation but suffer from obvious loss. In this paper, low bit-width CNNs, BNNs and standard CNNs are compared to show that is better suited for embedded systems. An architecture based on the two-stage arithmetic unit (TSAU) as the basic processing element is proposed to process each layer iteratively in CNN accelerators. Then...

10.23919/fpl.2017.8056820 article EN 2017-09-01

In this letter, we report for the first time the degradation mechanism of drain current in tunneling field-effect transistors (TFETs). Using positive-bias and hot-carrier (HC) stress experiments and TCAD simulation, we show that the degradation is mainly induced by the interface traps and/or oxide charge located above the tunneling region, causing a reduction of the tunneling field and current. The interface traps induce degradation of the transconductance, while the oxide charge essentially causes a threshold-voltage shift in TFETs. The results show that interface-trap generation is dominant under stress, oxide-charge creation...

10.1109/led.2010.2050456 article EN IEEE Electron Device Letters 2010-06-25

Background: Brain tumor segmentation is a challenging problem in medical image processing and analysis. It is a very time-consuming and error-prone task. In order to reduce the burden on physicians and improve the segmentation accuracy, computer-aided detection (CAD) systems need to be developed. Due to the powerful feature learning ability of deep learning technology, many deep learning-based methods have been applied to brain tumor segmentation CAD systems and achieved satisfactory accuracy. However, deep neural networks have high computational complexity, and the segmentation process consumes...

10.1186/s12859-021-04347-6 article EN cc-by BMC Bioinformatics 2021-09-07

Many convolutional neural network (CNN) accelerators have recently been proposed to exploit the sparsity of the networks and enjoy the benefits of both computation and memory reduction. However, most accelerators cannot exploit the sparsity of both activations and weights. For those works that exploit both sparsity opportunities, they cannot achieve stable load balance through a static scheduling (SS) strategy, which is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associate structure is also presented to tradeoff...

10.1109/tvlsi.2021.3060041 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2021-03-09

VBSME (variable block size motion estimation) is adopted in the MPEG-4 AVC/H.264 standard. In order to increase the hardware utilization for VBSME with FSBMA (full search block matching algorithm), this paper proposed a new high-performance reconfigurable VLSI architecture to support a "meander"-like scan format for high data reuse of the search area. The architecture can support three data flows through the computing array and memory to achieve 100% processing element (PE) utilization, and reuses the smaller blocks' SADs to calculate 41 motion vectors (MVs) of a 16×16 block in parallel. The design is implemented...

10.1109/tce.2008.4637625 article EN IEEE Transactions on Consumer Electronics 2008-08-01

CGRA, as a coprocessor in SoCs, has been widely studied. However, there is limited research on how to efficiently debug and verify SoCs composed of CGRAs and processors during the design process. To address this gap, we introduce DVHetero. DVHetero incorporates a simulation and validation framework, SoCDiff, which enables comprehensive SoC simulation, debugging, and rapid error localization. Using the verification framework, we successfully implemented and validated the entire SoC. The includes Chisel-based CGRA generator provides...

10.1145/3733721 article EN ACM Transactions on Reconfigurable Technology and Systems 2025-05-02

This brief presents a reconfigurable VLSI architecture which is designed for a multi-transform codec in several video coding standards of MPEG-2/4, VC-1, H.264/AVC and AVS. The multiple constant multiplication algorithm with two fusing strategies is provided to generate the multipliers for the matrix calculation blocks. Additionally, an adder-sharing strategy is adopted in the unified preprocessing/postprocessing block to save circuit areas. The proposed architecture can support different transforms through static reconfiguration for forward/inverse...

10.1109/tcsii.2011.2158265 article EN IEEE Transactions on Circuits & Systems II Express Briefs 2011-07-01

With the development of machine learning technology, the exploration of energy-efficient and flexible architectures for object inference algorithms is of growing interest in recent years. However, not many publications concentrate on a coarse-grained reconfigurable architecture (CGRA) for object inference algorithms. This paper provides a stream processing, dual-track programming CGRA-based approach to address the inherent computing characteristics of object inference algorithms. Based on the proposed approach, an architecture called stream dual-track CGRA (SDT-CGRA) is presented as...

10.1109/tvlsi.2018.2797600 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2018-02-12

In the domain of password recovery, deep learning has emerged as a pivotal technology for enhancing recovery efficiency. Despite its effectiveness, the inherent computation complexity of deep learning-based password generation algorithms poses substantial challenges, particularly in achieving synergistic acceleration between inference, and plaintext encryption process. In this paper, we introduce PassRecover, a multi-FPGA-based computing system that can simultaneously accelerate deep learning-driven password generation and plaintext encryption in an end-to-end manner....

10.3390/electronics14071415 article EN Electronics 2025-03-31

This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve a high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost of feature extraction can be greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce memory access times. In this way, 3.87x 2.25X...

10.1109/fpt.2013.6718346 article EN 2013-12-01

With the continuous refinement of Deep Neural Networks (DNNs), a series of deep and complex networks such as Residual Networks (ResNets) show impressive prediction accuracy in image classification tasks. Unfortunately, the structural complexity and computational cost of residual networks make hardware implementation difficult. In this paper, we present the quantized and reconstructed deep neural network (QR-DNN) technique, which first inserts batch normalization (BN) layers during training, and later removes them to facilitate efficient...

10.1587/transinf.2018rcp0008 article EN IEICE Transactions on Information and Systems 2019-04-30

Due to the fact that FPGA on-chip memory capacity increases significantly, the feature maps and weights of convolutional layers can be stored on chip, which can reduce the data movement between on-chip and off-chip memory. Hence, the bottleneck will shift from memory bandwidth to computing resources in convolutional layers, which will improve performance dramatically. Under this circumstance, this paper quantitatively analyzes how to design the hardware architecture based on the roofline model to optimize performance under the constraints of available resources, and proposes an efficient hardware architecture. Our...

10.1109/fpt.2018.00052 article EN 2018-12-01

In this brief, an FPGA-based solution is proposed to show the computing efficiency on rotated object detection based on the R³Det algorithm. The key idea of our approach is to firstly design reconfigurable neural processing units (NPU) for convolutional neural networks (CNN) and a specific architecture for spatial operations, and then adopt a novel scheduling scheme to deal with the data dependency of these modules. When...

10.1109/tcsii.2022.3142807 article EN IEEE Transactions on Circuits & Systems II Express Briefs 2022-01-13

Many models combining Transformers with convolutional neural networks (CNNs) for computer vision tasks have achieved state-of-the-art results. However, due to the different computation patterns between attention and convolution, using a dedicated Transformer accelerator or a CNN accelerator will inevitably reduce the computing efficiency of the other. To overcome this problem, we propose a unified architecture for attention and convolution on FPGA. We reduce the runtime overhead by offloading part of the self-attention computations offline before...

10.1109/iscas46773.2023.10182145 article EN 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2023-05-21

With the continuous refinement of Deep Neural Networks (DNNs), a series of deep and complex networks such as Residual Networks (ResNets) show impressive prediction accuracy in image classification tasks. Unfortunately, the structural complexity and computational cost of residual networks make hardware implementation difficult. In this paper, we present the quantized and reconstructed deep neural network (QR-DNN) technique, which first inserts batch normalization (BN) layers during training, and later removes them to facilitate efficient...

10.1109/fpl.2018.00018 article EN 2018-08-01

This paper presents a new hardware architecture that calculates SAD for variable block-size motion estimation (VBSME). The proposed architecture with a 16×1-PE array, a 4-stage adder tree and two flexible register arrays supports 16×16, 16×8, 8×8, 8×4, 4×8 and 4×4 block's SAD calculation. The architecture can be used in the encoder for the enhanced block size of MPEG-4 AVC (advanced video coding) and the emerging H.264 standard. Our design was described in Verilog-HDL and implemented on an Altera FPGA APEX20K with a clock frequency of 120MHz allowing processing 29296...

10.1109/icasic.2003.1277368 article EN 2003-01-01

In this paper, a high-performance match engine for content-based image retrieval is proposed. Highly customized floating-point (FP) units are designed to provide the dynamic range and precision of standard FP units, but with considerably less area than standard FP units. Match calculation arrays with various architectures and scales are designed and evaluated. A CBIR system is built on a 12-FPGA cluster. Inter-FPGA connections are based on 10-Gigabit Ethernet. The whole FPGA cluster can compare a query image against 150 million library...

10.1109/fpt.2013.6718404 article EN 2013-12-01

10.1016/j.physa.2013.12.037 article EN Physica A Statistical Mechanics and its Applications 2014-01-03

The Winograd algorithm is an efficient approach to alleviate the computation burden of deep CNNs. Firstly, we introduce a fast matrix combine with further reduce complexity and adapt large-stride convolution kernel-partitioning method. Secondly, the efficiency improvement due to the algorithms aggravates off-chip communication. DRAM access of different data-flows varies significantly with CNN patterns. Dynamic configurations both on-chip shared memory can effectively. A quantitative analysis established on design...

10.1109/fpl.2018.00019 article EN 2018-08-01