- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Video Surveillance and Tracking Methods
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Autonomous Vehicle Technology and Safety
- Face and Expression Recognition
- Hearing Impairment and Communication
- Advanced Vision and Imaging
- Hand Gesture Recognition Systems
- Face Recognition and Analysis
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Neural Networks and Applications
- Brain Tumor Detection and Classification
- Image Processing Techniques and Applications
- Digital Media and Philosophy
- Fuzzy Logic and Control Systems
- Video Analysis and Summarization
- Fire Detection and Safety Systems
- Robotics and Automated Systems
- Reinforcement Learning in Robotics
- Adversarial Robustness in Machine Learning
- Educational Research and Pedagogy
University of Science and Technology of China
2019-2025
University Medical Center Freiburg
2025
University of Freiburg
2025
Heidelberg University
2025
University Hospital Heidelberg
2025
Nvidia (United States)
2022-2024
Beijing Institute of Graphic Communication
2020-2024
Yantai University
2023
Trier University of Applied Sciences
2023
Meizu (China)
2023
Urban traffic optimization using traffic cameras as sensors is driving the need to advance state-of-the-art multi-target multi-camera (MTMC) tracking. This work introduces CityFlow, a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment. The dataset contains more than 200K...
Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts. However, current research on video attention generally focuses on adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or global context) to refine the features and neglects their underlying correlation when computing attentions. This leads to incomplete context utilization and hence bears the weakness of limited performance improvement. To tackle the problem, this paper proposes an...
The 6th edition of the AI City Challenge specifically focuses on problems in two domains where there is tremendous unlocked potential at the intersection of computer vision and artificial intelligence: Intelligent Traffic Systems (ITS), and brick and mortar retail businesses. The four challenge tracks of the 2022 AI City Challenge received participation requests from 254 teams across 27 countries. Track 1 addressed city-scale multi-target multi-camera (MTMC) vehicle tracking. Track 2 addressed natural-language-based vehicle track retrieval. Track 3 was a brand new...
Zero-shot learning (ZSL) suffers intensely from the domain shift issue, i.e., the mismatch (or misalignment) between the true and learned data distributions for classes without training data (unseen classes). By additionally learning from unlabelled data collected for the unseen classes, transductive ZSL (TZSL) could reduce the shift but only to a certain extent. To improve TZSL, we propose a novel approach, Bi-VAEGAN, which strengthens the distribution alignment between the visual space and an auxiliary space. As a result, it can largely reduce the domain shift. The proposed key...
Cross-modal hashing intends to project data from two modalities into a common Hamming space to perform cross-modal retrieval efficiently. Despite the satisfactory performance achieved on real applications, existing methods are incapable of simultaneously preserving the semantic structure to maintain inter-class relationships and improving discriminability to make intra-class samples aggregated, which limits retrieval performance. To handle this problem, we propose Equally-Guided...
Face recognition has achieved significant progress in the deep learning era due to ultra-large-scale and well-labeled datasets. However, training on outsize datasets is time-consuming and takes up a lot of hardware resources. Therefore, designing an efficient training approach is indispensable. The heavy computational and memory costs mainly result from the million-level dimensionality of the fully connected (FC) layer. To this end, we propose a novel training approach, termed Faster Face Classification (F <inf...
Transformer-based detectors (DETRs) are becoming popular for their simple framework, but the large model size and heavy time consumption hinder their deployment in the real world. Knowledge distillation (KD) can be an appealing technique to compress giant detectors into small ones with comparable detection performance and low inference cost. Since DETRs formulate object detection as a set prediction problem, existing KD methods designed for classic convolution-based detectors may not be directly applicable. In this paper, we propose...
Few-shot learning (FSL) aims at recognizing a novel object under limited training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of the FSL model. However, training an effective backbone is a challenging issue since 1) designing and validating the structures of backbones are time-consuming and expensive processes, and 2) a backbone trained on the known (base) categories is more inclined to focus on the textures of the objects it learns, which makes it hard to describe novel samples. To solve these problems, we propose a mixture...
As a long-standing problem in computer vision, face detection has attracted much attention in recent decades for its practical applications. With the availability of the face detection benchmark WIDER FACE dataset, much progress has been made by various algorithms in recent years. Among them, the Selective Refinement Network (SRN) face detector introduces two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. Moreover, it designs a receptive...
Model quantization uses low bit-width values to represent the weight matrices of models, which is a promising approach to reduce both the storage and computational overheads of deploying highly anticipated LLMs. However, existing quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. For this target, we introduce a 1-bit quantization-aware training (QAT)...
Multi-view 3D object detection (MV3D-Det) in Bird-Eye-View (BEV) has drawn extensive attention due to its low cost and high efficiency. Although new algorithms for camera-only 3D object detection have been continuously proposed, most of them may risk drastic performance degradation when the domain of the input images differs from that of training. In this paper, we first analyze the causes of the domain gap for the MV3D-Det task. Based on the covariate shift assumption, we find that the gap is mainly attributable to the feature distribution of BEV, which is determined by the quality...
Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages and to cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. In the process of sign pose generation, the textual semantic priors carried...
Few-shot learning (FSL) aims to classify a novel object into a specific category under limited training samples. This is a challenging task since (1) the features expressed by pre-trained knowledge introduce perceived bias and then constrain the classification space, and (2) the use of general feature hallucination techniques based on global features fails to escape the perceived bias, resulting in suboptimal improvements. To solve these issues, this paper proposes an interventional feature generation (IFG) method. Specifically, we first...
Existing works mainly focus on the crowd and ignore the confusion regions, which contain an extremely similar appearance to the crowd in the background, while crowd counting needs to face these two sides at the same time. To address this issue, we propose a novel end-to-end trainable confusion region discriminating and erasing network called CDENet. Specifically, CDENet is composed of two modules: a confusion region mining module (CRM) and a guided erasing module (GEM). CRM consists of a basic density estimation (BDE) network, a confusion region aware bridge, and a confusion region discriminating network. The BDE network first generates a primary density map,...
Efficient action recognition aims to classify a video clip into a specific action category with a low computational cost. It is challenging since the integrated spatial-temporal calculation (e.g., 3D convolution) introduces intensive operations and increases complexity. This paper explores the feasibility of integrating channel splitting and filter decoupling for efficient architecture design and feature refinement by proposing a novel spatio-temporal collaborative (STC) module. STC splits the channels into two groups...
Pseudo-Labeling (PL) is a critical approach in semi-supervised 3D object detection (SSOD). In PL, delicately selected pseudo-labels, generated by the teacher model, are provided for the student model to supervise the semi-supervised detection framework. However, such a paradigm may introduce misclassified labels or loosely localized box predictions, resulting in a sub-optimal solution for detection performance. In this paper, we take PL from a noisy learning perspective: instead of directly applying vanilla pseudo-labels, we design a noise-resistant instance supervision...
Unpaired Image Captioning (UIC) is designed to describe an image without relying on matched vision-language training data. It is a challenging task since (1) the implicit and unpaired data nature of UIC limits the captioning model's ability to represent diverse scene representations, and (2) it is difficult for the model to discern intrinsic relationships among objects, potentially leading to misinterpretation of the image content. To solve these issues, we propose a pseudo content hallucination (PCH) to help the model enlarge the perception of objects...