NFDI4DS | UHH-SEMS - Publication Details

Wei Cao

ORCID: 0000-0003-0339-7093

About

Contact & Profiles

A5100694034

Research Areas

Advanced Neural Network Applications
Advanced Data Compression Techniques
Advanced Image and Video Retrieval Techniques
CCD and CMOS Imaging Sensors
Video Coding and Compression Technologies
Advanced Vision and Imaging
Digital Filter Design and Implementation
Brain Tumor Detection and Classification
Embedded Systems Design Techniques
Advancements in Semiconductor Devices and Circuit Design
Ferroelectric and Negative Capacitance Devices
Semiconductor Materials and Devices
Image and Signal Denoising Methods
Parallel Computing and Optimization Techniques
Adversarial Robustness in Machine Learning
Integrated Circuits and Semiconductor Failure Analysis
Robotics and Sensor-Based Localization
Interconnection Networks and Systems
Advanced Manufacturing and Logistics Optimization
Algorithms and Data Compression
VLSI and FPGA Design Techniques
Multimodal Machine Learning Applications
Gaussian Processes and Bayesian Inference
User Authentication and Security Systems
Cryptographic Implementations and Security

Fudan University
2013-2025

Shanghai Fudan Microelectronics (China)
2010-2021

China Southern Power Grid (China)
2021

Zhejiang Gongshang University
2020

Changsha University of Science and Technology
2013-2017

Huaqiao University
2017

State Key Laboratory of ASIC and System
2008-2015

Education Department of Hunan Province
2015

Central South University
2013-2015

China Information Technology Security Evaluation Center
2014

Deep Learning based Radiomics (DLR) and its usage in noninvasive IDH1 prediction for low grade glioma

OPENALEX - Publications

Zeju Li Yuanyuan Wang Jinhua Yu Yi Guo Wei Cao

Deep learning-based radiomics (DLR) was developed to extract deep information from multiple modalities of magnetic resonance (MR) images. The performance of DLR for predicting the mutation status of isocitrate dehydrogenase 1 (IDH1) was validated in a dataset of 151 patients with low-grade glioma. A modified convolutional neural network (CNN) structure with 6 convolutional layers and a fully connected layer with 4096 neurons was used to segment tumors. Instead of calculating image features from segmented images, as typically performed for normal...

10.1038/s41598-017-05848-2 article EN cc-by Scientific Reports 2017-07-10

A high performance FPGA-based accelerator for large-scale convolutional neural networks

Huimin Li Xitian Fan Jiao Li Wei Cao Xuegong Zhou and 1 more

In recent years, convolutional neural network (CNN)-based machine learning algorithms have been widely applied in computer vision applications. However, for large-scale CNNs, the computation-intensive, memory-intensive and resource-consuming features have brought many challenges to CNN implementations. This work proposes an end-to-end FPGA-based accelerator with all layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput. A methodology which find...

10.1109/fpl.2016.7577308 article EN 2016-08-01

Improvement in Reliability of Tunneling Field-Effect Transistor With p-n-i-n Structure

Wei Cao Chengjun Yao G. F. Jiao Daming Huang H.Y. Yu and 1 more

One of the major problems of the p-i-n tunneling field-effect transistor (TFET) is its reliability due to the strong electric field near the junction. In this paper, using technology computer-aided design simulation, we show that the insertion of a thin n-layer into the junction of the TFET (p-n-i-n TFET) not only enhances its drive current, as has been previously reported, but also improves its reliability. As compared with the conventional TFET, we demonstrate the following properties of the p-n-i-n TFET: 1) The normal component reduced, and...

10.1109/ted.2011.2144987 article EN IEEE Transactions on Electron Devices 2011-05-20

Accelerating low bit-width convolutional neural networks with embedded FPGA

Jiao Li Cheng Luo Wei Cao Xuegong Zhou Lingli Wang

Convolutional Neural Networks (CNNs) can achieve high classification accuracy while they require complex computation. Binarized Neural Networks (BNNs) with binarized weights and activations can simplify computation but suffer from obvious accuracy loss. In this paper, low bit-width CNNs, BNNs and standard CNNs are compared to show that the low bit-width CNN is better suited for embedded systems. An architecture based on the two-stage arithmetic unit (TSAU) as the basic processing element is proposed to process each layer iteratively for low bit-width CNN accelerators. Then...

10.23919/fpl.2017.8056820 article EN 2017-09-01

Effect of Interface Traps and Oxide Charge on Drain Current Degradation in Tunneling Field-Effect Transistors

X. Huang G. F. Jiao Wei Cao Daming Huang H.Y. Yu and 5 more

In this letter, we report for the first time the degradation mechanism of drain current in tunneling field-effect transistors (TFETs). Using positive-bias and hot-carrier (HC) stress experiments and TCAD simulation, we show that the degradation is mainly induced by the interface traps and/or oxide charge located above the tunneling region, causing a reduction of the tunneling field and current. The interface traps induce the degradation of transconductance, while the oxide charge essentially causes a threshold-voltage shift in TFETs. The results show that interface-trap generation is dominant under stress, while oxide-charge creation...

10.1109/led.2010.2050456 article EN IEEE Electron Device Letters 2010-06-25

MRI-based brain tumor segmentation using FPGA-accelerated neural network

Siyu Xiong Guoqing Wu Xitian Fan Xuan Feng Zhongcheng Huang and 6 more

Background: Brain tumor segmentation is a challenging problem in medical image processing and analysis. It is a very time-consuming and error-prone task. In order to reduce the burden on physicians and improve the segmentation accuracy, computer-aided detection (CAD) systems need to be developed. Due to the powerful feature learning ability of deep learning technology, many deep learning-based methods have been applied to brain tumor segmentation CAD systems and achieved satisfactory accuracy. However, deep neural networks have high computational complexity, and the process consumes...

10.1186/s12859-021-04347-6 article EN cc-by BMC Bioinformatics 2021-09-07

SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Di Wu Xitian Fan Wei Cao Lingli Wang

Many convolutional neural network (CNN) accelerators are proposed to exploit the sparsity of networks recently to enjoy the benefits of both computation and memory reduction. However, most accelerators cannot exploit the sparsity of both activations and weights. For those works that exploit both sparsity opportunities, they cannot achieve stable load balance through a static scheduling (SS) strategy, which is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associate structure is also presented to tradeoff...

10.1109/tvlsi.2021.3060041 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2021-03-09

A high-performance reconfigurable VLSI architecture for VBSME in H.264

Wei Cao Hou Hui Jiarong Tong Jinmei Lai Hao Min

VBSME (variable block size motion estimation) is adopted in the MPEG-4 AVC/H.264 standard. In order to increase the hardware utilization for VBSME with FSBMA (full search block matching algorithm), this paper proposed a new high-performance reconfigurable VLSI architecture to support a "meander"-like scan format for high data reuse of the search area. The architecture can support three data flows through the reconfigurable computing array and memory to achieve 100% processing element (PE) utilization and reuse the smaller blocks' SADs to calculate 41 motion vectors (MVs) for a 16X16 block in parallel. The design is implemented...
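The reuse of smaller blocks' SADs mentioned in this abstract can be illustrated in software. The sketch below is not the paper's hardware design; it only assumes the standard H.264 partition sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4, written as width x height) and shows that the sixteen 4x4 SADs of one macroblock are enough to derive the SADs of all 41 partitions. All function names are hypothetical.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 block."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def combine(grid, cells_high, cells_wide):
    """Sum neighbouring 4x4 SADs into SADs of larger blocks (sizes in grid cells)."""
    return [
        sum(grid[r + i][c + j] for i in range(cells_high) for j in range(cells_wide))
        for r in range(0, 4, cells_high)
        for c in range(0, 4, cells_wide)
    ]


def all_block_sads(cur, ref):
    """Derive the SADs of all 41 partitions from the sixteen 4x4 SADs."""
    grid = sad_4x4_grid(cur, ref)
    # Partition name (width x height) -> (height, width) measured in 4x4 cells.
    shapes = {
        "4x4": (1, 1), "8x4": (1, 2), "4x8": (2, 1), "8x8": (2, 2),
        "16x8": (2, 4), "8x16": (4, 2), "16x16": (4, 4),
    }
    return {name: combine(grid, h, w) for name, (h, w) in shapes.items()}


if __name__ == "__main__":
    cur = [[1] * 16 for _ in range(16)]
    ref = [[0] * 16 for _ in range(16)]
    sads = all_block_sads(cur, ref)
    print(sum(len(v) for v in sads.values()))  # number of partition SADs
```

Each partition SAD corresponds to one candidate motion vector cost per search position, which is why a single pass over the pixel differences can serve all block sizes.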

10.1109/tce.2008.4637625 article EN IEEE Transactions on Consumer Electronics 2008-08-01

Image reconstruction algorithm from compressed sensing measurements by dictionary learning

Yanfei Shen Jintao Li Zhenmin Zhu Wei Cao Yun Song

10.1016/j.neucom.2014.06.082 article EN Neurocomputing 2014-10-30

DVHetero: A Framework for Designing and Validating Heterogeneous SoC with RISC-V Processor and CGRA

Guowei Zhu Liming Deng Kaisen Zhang Wang Fan Boyin Jin and 5 more

CGRA, as a coprocessor in SoCs, has been widely studied. However, there is limited research on how to efficiently debug and verify SoCs composed of CGRAs and processors during the design process. To address this gap, we introduce DVHetero. DVHetero incorporates a simulation and validation framework, SoCDiff, which enables comprehensive SoC simulation, debugging, and rapid error localization. Using this verification framework, we successfully implemented and validated the entire SoC. The includes Chisel-based CGRA generator provides...

10.1145/3733721 article EN ACM Transactions on Reconfigurable Technology and Systems 2025-05-02

A Reconfigurable Multi-Transform VLSI Architecture Supporting Video Codec Design

Kanwen Wang Jialin Chen Wei Cao Ying Wang Lingli Wang and 1 more

This brief presents a reconfigurable VLSI architecture which is designed for a multi-transform codec in several video coding standards of MPEG-2/4, VC-1, H.264/AVC and AVS. The multiple constant multiplication algorithm with two fusing strategies is provided to generate the multipliers for the matrix calculation blocks. Additionally, the adder-sharing strategy is adopted in the unified preprocessing/postprocessing block to save circuit areas. The proposed architecture can support different standards through static reconfiguration forward/inverse...

10.1109/tcsii.2011.2158265 article EN IEEE Transactions on Circuits & Systems II Express Briefs 2011-07-01

Stream Processing Dual-Track CGRA for Object Inference

Xitian Fan Di Wu Wei Cao Wayne Luk Lingli Wang

With the development of machine learning technology, the exploration of energy-efficient and flexible architectures for object inference algorithms is of growing interest in recent years. However, not many publications concentrate on a coarse-grained reconfigurable architecture (CGRA) for object inference algorithms. This paper provides a stream processing, dual-track programming CGRA-based approach to address the inherent computing characteristics of algorithms in object inference. Based on the proposed approach, an architecture called stream dual-track CGRA (SDT-CGRA) is presented as...

10.1109/tvlsi.2018.2797600 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2018-02-12

Wear Debris Image Processing and Feature Recognition Based on Boundary Energy Extraction

Wei Cao Xu Chen Jianying Yan Rui Su Li Ding and 2 more

10.2139/ssrn.5084917 preprint EN 2025-01-01

PassRecover: A Multi-FPGA System for End-to-End Offline Password Recovery Acceleration

Guangwei Xie Xitian Fan Zhongchen Huang Wei Cao Fan Zhang

In the domain of password recovery, deep learning has emerged as a pivotal technology for enhancing recovery efficiency. Despite its effectiveness, the inherent computation complexity of deep learning-based password generation algorithms poses substantial challenges, particularly in achieving synergistic acceleration between deep learning inference and the plaintext encryption process. In this paper, we introduce PassRecover, a multi-FPGA-based computing system that can simultaneously accelerate deep learning-driven password generation and plaintext encryption in an end-to-end manner....

10.3390/electronics14071415 article EN Electronics 2025-03-31

Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA

Xitian Fan Chenlu Wu Wei Cao Xuegong Zhou Shengye Wang and 1 more

This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve a high processing frame rate, the architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at the selected scale levels. As a result, the time cost of feature extraction can be greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce memory access times. In this way, 3.87x 2.25X...

10.1109/fpt.2013.6718346 article EN 2013-12-01

Real-time order scheduling and execution monitoring in public warehouses based on radio frequency identification

Wei Cao Pingyu Jiang Bin Liu Kaiyong Jiang

10.1007/s00170-017-1381-z article EN The International Journal of Advanced Manufacturing Technology 2017-11-24

RNA: An Accurate Residual Network Accelerator for Quantized and Reconstructed Deep Neural Networks

Cheng Luo Wei Cao Lingli Wang Philip H. W. Leong

With the continuous refinement of Deep Neural Networks (DNNs), a series of deep and complex networks such as Residual Networks (ResNets) show impressive prediction accuracy in image classification tasks. Unfortunately, the structural complexity and computational cost of residual networks make hardware implementation difficult. In this paper, we present the quantized and reconstructed deep neural network (QR-DNN) technique, which first inserts batch normalization (BN) layers during training, and later removes them to facilitate efficient...
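The abstract does not say how the BN layers are removed. One widely used way to drop a BN layer after training is to fold its per-channel scale and shift into the weights and bias of the preceding linear or convolutional layer; the sketch below shows that generic folding step only, as an assumption, and is not claimed to be the QR-DNN procedure. The function names and the small dense layer are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one weight row and one bias value per output channel."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored running statistics."""
    return [g * (y_i - m) / math.sqrt(v + eps) + bt
            for y_i, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return weights and bias of a layer equivalent to linear followed by BN."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel BN scale
        folded_w.append([w_i * scale for w_i in row])
        folded_b.append((b - m) * scale + bt)   # shift absorbed into the bias
    return folded_w, folded_b


if __name__ == "__main__":
    w, b = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
    gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [1.0, 4.0]
    x = [1.0, 2.0]
    fw, fb = fold_bn(w, b, gamma, beta, mean, var)
    print(batch_norm(linear(x, w, b), gamma, beta, mean, var))
    print(linear(x, fw, fb))  # should match the line above up to rounding
```

After folding, inference needs only the single fused layer, which is the usual reason this kind of removal helps hardware implementations.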

10.1587/transinf.2018rcp0008 article EN IEICE Transactions on Information and Systems 2019-04-30

High Throughput CNN Accelerator Design Based on FPGA

Liang Xie Xitian Fan Wei Cao Lingli Wang

Due to the fact that FPGA on-chip memory capacity increases significantly, feature maps and weights of convolutional layers can be stored on chip, which can reduce the data movement between on-chip and off-chip memory. Hence, the bottleneck can shift from memory bandwidth to computing resources in convolutional layers, which will improve the performance dramatically. Under this circumstance, this paper quantitatively analyzes how to design the hardware architecture based on the roofline model to optimize performance under the constraints of available resources and propose an efficient architecture. Our...
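For readers unfamiliar with the roofline model referred to here, its core bound is that attainable performance is the smaller of the compute peak and the product of memory bandwidth and operational intensity. The sketch below states only that general bound; the function names and all numbers are hypothetical and are not taken from the paper.

```python
def attainable_gops(peak_gops, bandwidth_gbs, ops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


def ridge_point(peak_gops, bandwidth_gbs):
    """Operational intensity (ops/byte) at which a design becomes compute-bound."""
    return peak_gops / bandwidth_gbs


if __name__ == "__main__":
    peak, bw = 1000.0, 10.0  # hypothetical: 1000 GOP/s peak, 10 GB/s off-chip
    for intensity in (20.0, 100.0, 500.0):
        bound = attainable_gops(peak, bw, intensity)
        kind = "compute-bound" if intensity >= ridge_point(peak, bw) else "memory-bound"
        print(intensity, bound, kind)
```

Keeping feature maps and weights on chip raises the effective operational intensity with respect to off-chip traffic, which moves a layer to the right of the ridge point and makes computing resources the limiting factor, as the abstract describes.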

10.1109/fpt.2018.00052 article EN 2018-12-01

Acceleration of Rotated Object Detection on FPGA

Xitian Fan Guangwei Xie Zhongchen Huang Wei Cao Lingli Wang

In this brief, an FPGA-based solution is proposed to show the computing efficiency on rotated object detection based on the R3Det algorithm. The key idea of our approach is to firstly design reconfigurable neural processing units (NPU) for convolutional neural networks (CNN) and a specific architecture for spatial operations, and then adopt a novel scheduling scheme to deal with the data dependency of these modules. When...

10.1109/tcsii.2022.3142807 article EN IEEE Transactions on Circuits & Systems II Express Briefs 2022-01-13

Unified Accelerator for Attention and Convolution in Inference Based on FPGA

Tianyang Li Fan Zhang Xitian Fan Jianliang Shen Wei Guo and 1 more

Many models combining Transformers with convolutional neural networks (CNNs) for computer vision tasks have achieved state-of-the-art results. However, due to the different computation patterns between attention and convolution, using a dedicated Transformer or CNN accelerator will inevitably reduce the computing efficiency of the other. To overcome this problem, we propose a unified architecture for attention and convolution on FPGA. We reduce the runtime overhead by offloading part of the self-attention computations offline before...

10.1109/iscas46773.2023.10182145 article EN 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2023-05-21

RNA: An Accurate Residual Network Accelerator for Quantized and Reconstructed Deep Neural Networks

Cheng Luo Yuhua Wang Wei Cao Philip H. W. Leong Lingli Wang

10.1109/fpl.2018.00018 article EN 2018-08-01

A novel SAD computing hardware architecture for variable-size block motion estimation and its implementation with FPGA

Wei Cao Mao Zhi Gang

This paper presents a new hardware architecture that calculates SAD for variable block-size motion estimation (VBSME). The proposed architecture with a 16×1-PE array, a 4-stage adder tree and two flexible register arrays supports 16×16, 16×8, 8×8, 8×4, 4×8, 4×4 block's SAD calculation. The architecture can be used in the encoder with enhanced variable block size for MPEG-4 AVC (advanced video coding) and the emerging H.264 standard. Our design was described in Verilog-HDL and implemented with Altera FPGA APEX20K with a clock frequency of 120MHz allowing the processing of 29296...

10.1109/icasic.2003.1277368 article EN 2003-01-01

An FPGA-cluster-accelerated match engine for content-based image retrieval

Liang Chen Chenlu Wu Xuegong Zhou Wei Cao Shengye Wang and 1 more

In this paper, a high-performance match engine for content-based image retrieval is proposed. Highly customized floating-point (FP) units are designed to provide the dynamic range and precision of standard FP units, but with considerably less area than standard FP units. Match calculation arrays with various architectures and scales are designed and evaluated. A CBIR system is built on a 12-FPGA cluster. Inter-FPGA connections are based on 10-Gigabyte Ethernet. The whole FPGA cluster can compare a query image against 150 million library...

10.1109/fpt.2013.6718404 article EN 2013-12-01

Compare two community-based personalized information recommendation algorithms

Yuan Wen Yun Liu Zhenjiang Zhang Fei Xiong Wei Cao

10.1016/j.physa.2013.12.037 article EN Physica A Statistical Mechanics and its Applications 2014-01-03

A Novel Low-Communication Energy-Efficient Reconfigurable CNN Acceleration Architecture

Di Wu Chen Jin Wei Cao Lingli Wang

The Winograd algorithm is an efficient approach to alleviate the computation burden of deep CNNs. Firstly, we introduce a fast matrix algorithm to combine with Winograd to further reduce the computation complexity and adapt to large-stride convolution with a kernel-partitioning method. Secondly, efficiency improvement due algorithms aggravates off-chip communication. DRAM access different data-flows varies significantly CNN patterns. Dynamic configurations both on-chip shared memory can effectively. A quantitative analysis established on design...
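As background for the Winograd algorithm named in this abstract, the smallest standard case, F(2,3), produces two outputs of a 3-tap 1-D filter with four multiplications instead of six. The sketch below shows that textbook case only and checks it against direct computation; the tile size used in the paper is not stated in the abstract, and the function names are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs using four multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms can be precomputed once per filter.
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference result: sliding dot product, six multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
    print(winograd_f23(d, g), direct_f23(d, g))
```

The saving in multiplications comes at the cost of extra additions and transformed operands, which is consistent with the abstract's concern that faster arithmetic shifts pressure onto memory traffic.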

10.1109/fpl.2018.00019 article EN 2018-08-01