- Advanced Neural Network Applications
- Robotics and Sensor-Based Localization
- Video Surveillance and Tracking Methods
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Indoor and Outdoor Localization Technologies
- Adversarial Robustness in Machine Learning
- Visual Attention and Saliency Detection
Zhejiang University
2022-2025
Shanghai Artificial Intelligence Laboratory
2022
University of Nottingham Ningbo China
2022
While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only...
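The abstract is cut off before it explains how LSDA forms its two attention groups, so the following is only a minimal 1D sketch under stated assumptions: short-distance attention is assumed to group adjacent tokens, and long-distance attention is assumed to group tokens sampled at a fixed interval. The function names, `group_size`, and `interval` are hypothetical and not taken from the source.

```python
def short_distance_groups(num_tokens, group_size):
    """Assumed short-distance grouping: consecutive runs of adjacent token indices."""
    if num_tokens % group_size != 0:
        raise ValueError("num_tokens must be divisible by group_size")
    return [list(range(start, start + group_size))
            for start in range(0, num_tokens, group_size)]


def long_distance_groups(num_tokens, interval):
    """Assumed long-distance grouping: token indices sampled every `interval` steps."""
    if num_tokens % interval != 0:
        raise ValueError("num_tokens must be divisible by interval")
    return [list(range(offset, num_tokens, interval))
            for offset in range(interval)]


if __name__ == "__main__":
    # Self-attention would then be computed inside each group only.
    print(short_distance_groups(8, 4))
    print(long_distance_groups(8, 2))
```

In both cases every token belongs to exactly one group, so attention cost depends on the group size rather than on the full sequence length.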
The width of a neural network matters since increasing the width will necessarily increase the model capacity. However, the performance of a network does not improve linearly with the width and soon gets saturated. In this case, we argue that increasing the number of networks (ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width. To prove it, one large network is divided into several small ones regarding its parameters and regularization components. Each of these small networks has a fraction of the original one's parameters. We then train these small networks together and make them...
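To make the width-versus-ensemble trade-off concrete, here is a small arithmetic sketch. It assumes a single square fully connected layer (parameters grow roughly with the square of the width) and assumes the split keeps the total parameter count about constant by dividing the width by the square root of the number of networks. Both the split rule and the helper names are illustrative assumptions, not details stated in the abstract.

```python
import math


def dense_params(width):
    """Weights plus biases of one square fully connected layer (width -> width)."""
    return width * width + width


def split_width(width, num_networks):
    """Assumed rule: shrink width by sqrt(S) so S small layers match one large layer."""
    return round(width / math.sqrt(num_networks))


def ensemble_average(predictions):
    """Average the per-class scores produced by several small networks."""
    count = len(predictions)
    return [sum(scores) / count for scores in zip(*predictions)]


if __name__ == "__main__":
    large = dense_params(1024)
    small_width = split_width(1024, 4)
    small_total = 4 * dense_params(small_width)
    print(large, small_width, small_total)
    print(ensemble_average([[0.2, 0.8], [0.4, 0.6]]))
```

With a width of 1024 split four ways, each small layer has width 512 and the four together hold about the same number of parameters as the large one, which is the budget under which the accuracy comparison is meaningful.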
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such...
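The depth ambiguity described above follows from basic pinhole projection, where the projected height in pixels is the focal length times the physical height divided by the depth. The sketch below illustrates only that geometric fact; the focal length and object sizes are made-up values, and the function name is hypothetical.

```python
def projected_height_px(focal_length_px, object_height_m, depth_m):
    """Pinhole camera model: image height = f * H / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * object_height_m / depth_m


if __name__ == "__main__":
    focal = 700.0  # hypothetical focal length in pixels
    near = projected_height_px(focal, 1.5, 15.0)  # 1.5 m object at 15 m
    far = projected_height_px(focal, 3.0, 30.0)   # 3.0 m object at 30 m
    # Both objects project to the same box height despite different depths.
    print(near, far)
```

Because only the ratio of size to depth reaches the image, a 2D box alone cannot separate these two cases, which is why a monocular detector needs additional cues to estimate depth.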