- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Advanced Image Processing Techniques
- Anomaly Detection Techniques and Applications
- Visual Attention and Saliency Detection
- Video Surveillance and Tracking Methods
- Video Analysis and Summarization
- Image Processing Techniques and Applications
- Image Retrieval and Classification Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- 3D Shape Modeling and Analysis
- Computer Graphics and Visualization Techniques
- Image and Video Quality Assessment
- Advanced Computational Techniques and Applications
- Robotics and Sensor-Based Localization
- Image Processing and 3D Reconstruction
- Web Data Mining and Analysis
- Image Enhancement Techniques
- Human Motion and Animation
- Computational Geometry and Mesh Generation
- Advanced Image Fusion Techniques
- Simulation and Modeling Applications
Nanjing University
2016-2025
Jiangsu Vocational College of Medicine
2024
ETH Zurich
2022
Ya'an Polytechnic College
2007-2021
Jiangsu Agri-animal Husbandry Vocational College
2019
Nanjing University of Science and Technology
2005-2017
Novel (United States)
2005-2015
United States Government Accountability Office
2010
Nanjing Tech University
2003-2006
Nanjing University of Posts and Telecommunications
2002
Recently, very deep convolutional neural networks (CNNs) have shown great power in single image super-resolution (SISR) and achieved significant improvements over traditional methods. Among these CNN-based methods, residual connections play a critical role in boosting network performance. As the network depth grows, the residual features gradually focus on different aspects of the input image, which is useful for reconstructing spatial details. However, existing methods neglect to fully utilize hierarchical...
Most previous works on saliency detection are dedicated to 2D images. Recently it has been shown that 3D visual information supplies a powerful cue for saliency analysis. In this paper, we propose a novel saliency detection method for depth images based on anisotropic center-surround difference. Instead of depending on absolute depth, we measure the saliency of a point by how much it stands out from its surroundings, which takes the global depth structure into consideration. Besides, two common priors based on depth and location are used for refinement. The proposed method works within a complexity...
Tracking often uses a multistage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive...
Temporal modeling remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition. The core of our TDN is to devise an efficient temporal module (TDM) by explicitly leveraging a temporal difference operator, and to systematically assess its effect on short-term and long-term motion modeling. To fully capture temporal information over the entire video, our TDN is established with a two-level...
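The temporal difference operator named in the abstract above can be illustrated with a minimal sketch. It assumes each frame is a flattened list of grayscale pixel values and simply subtracts consecutive frames. The names `temporal_difference` and `motion_energy` are hypothetical, and this is a generic frame-differencing example, not the TDM described in the paper.

```python
from typing import List

Frame = List[float]  # assumption: one grayscale frame flattened to a list


def temporal_difference(frames: List[Frame]) -> List[Frame]:
    """Return per-pixel differences between each pair of consecutive frames."""
    return [
        [curr - prev for prev, curr in zip(frames[i], frames[i + 1])]
        for i in range(len(frames) - 1)
    ]


def motion_energy(diff: Frame) -> float:
    """Mean absolute difference: a crude scalar summary of short-term motion."""
    return sum(abs(d) for d in diff) / len(diff)


# Three tiny frames: one pixel changes between the first two, none afterwards.
frames = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
diffs = temporal_difference(frames)
energies = [motion_energy(d) for d in diffs]
```

A sequence of N frames yields N - 1 difference maps. A static pair of frames produces an all-zero map, which is why differences are a cheap cue for where motion occurs.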
Modeling the relation between actors is important for recognizing group activity in a multi-person scene. This paper aims at learning discriminative relations between actors efficiently using deep models. To this end, we propose to build a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture the appearance and position relations between actors. Thanks to the Graph Convolutional Network, the connections in the ARG could be automatically learned from group activity videos in an end-to-end manner, and inference on the ARG could be efficiently performed with standard matrix operations. Furthermore,...
Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essential visual difference between time and space,...
Runtime and memory consumption are two important aspects for efficient image super-resolution (EISR) models to be deployed on resource-constrained devices. Recent advances in EISR [16], [32] exploit distillation and aggregation strategies with plenty of channel split and concatenation operations to fully use limited hierarchical features. In contrast, sequential network operations avoid frequently accessing preceding states and extra nodes, and are thus beneficial to reducing the memory consumption and runtime overhead. Following this idea, we design...
Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or devise separate modules for each type of information, which leads to representation ambiguity and low efficiency. In this paper, we propose a new module to explicitly extract motion and appearance information via a unified operation. Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, our proposed module could be...
Multi-object tracking (MOT) in sports scenes plays a critical role in gathering player statistics and supporting further applications, such as automatic tactical analysis. Yet existing MOT benchmarks pay little attention to this domain. In this work, we present a new large-scale multi-object tracking dataset in multiple sports scenes, coined SportsMOT, where all the players on the court are supposed to be tracked. It consists of 240 video sequences, over 150K frames (almost 15x MOT17) and over 1.6M bounding boxes (3x MOT17) collected from 3...
Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative...
Spatial downsampling layers are favored in convolutional neural networks (CNNs) to downscale feature maps for larger receptive fields and less memory consumption. However, for discriminative tasks, there is a possibility that these layers lose the discriminative details due to improper pooling strategies, which could hinder the learning process and eventually result in suboptimal models. In this paper, we present a unified framework over the existing downsampling layers (e.g., average pooling, max pooling, and strided convolution) from a local importance view. In this framework,...
Deep learning has achieved remarkable progress for visual recognition on large-scale balanced datasets but still performs poorly on real-world long-tailed data. Previous methods often adopt class re-balanced training strategies to effectively alleviate the imbalance issue, but this might carry a risk of over-fitting tail classes. The recent decoupling method overcomes over-fitting issues by using a multi-stage training scheme, yet it is still incapable of capturing tail class information in the feature learning stage. In this paper, we show that soft label can...
Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in aspects of small numbers of instances in a trimmed video or low-level atomic actions. This paper aims to present a new multi-person dataset of spatio-temporal localized sports actions, coined MultiSports. We first analyze the important ingredients of constructing a realistic and challenging dataset for spatio-temporal action detection by proposing three criteria: (1) multi-person scenes and motion dependent identification, (2) with well-defined boundaries, (3)...
Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables...
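Scoring moments against a query in a joint embedding space, as the abstract above describes, can be sketched as follows. This example assumes the query and each candidate moment have already been embedded as non-zero vectors of equal length, and it uses cosine similarity as the matching score. The similarity measure and the names `cosine_similarity` and `best_moment` are assumptions for illustration, not details taken from the abstract.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_moment(query: Sequence[float], moments: List[Sequence[float]]) -> int:
    """Return the index of the moment embedding most similar to the query."""
    scores = [cosine_similarity(query, m) for m in moments]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 2-D embeddings: the second moment points almost the same way
# as the query, so it should be selected.
query_embedding = [1.0, 0.0]
moment_embeddings = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best = best_moment(query_embedding, moment_embeddings)
```

Because the score depends only on the two embeddings, the candidate moments can be embedded once and compared against any number of queries.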
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with a focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of ×4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on the DIV2K validation set. IMDN is set as...
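The PSNR threshold quoted above can be made concrete with a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). It assumes images are flattened lists of pixel values with a peak value of 255. The challenge's exact evaluation protocol (color space, border cropping) is not stated in the abstract, so this is the generic formula only, and the function name `psnr` is hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images, used only to exercise the formula.
reference = [0.0, 255.0, 128.0, 64.0]
restored = [0.0, 250.0, 130.0, 64.0]
score = psnr(reference, restored)
meets_threshold = score >= 29.00  # the 29.00 dB floor mentioned in the abstract
```

Higher PSNR means lower mean squared error, so a fixed PSNR floor lets efficiency metrics be compared at roughly equal reconstruction quality.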
Object modeling has become a core part of recent tracking frameworks. Current popular trackers use Transformer attention to extract the template feature separately or interactively with the search region. However, separate template learning lacks communication between the template and search regions, which brings difficulty in extracting discriminative target-oriented features. On the other hand, interactive template learning produces hybrid template features, which may introduce potential distractors to the template via the cluttered search regions. To enjoy the merits of both methods, we...
Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem, as partial facial information is commonly missing in occluded and large pose face images. Large head pose variations also increase the solution space and make the modeling more difficult. Our key idea is to model occlusion and pose to decompose this challenging task into several relatively more manageable subtasks. To this end, we propose an end-to-end framework, termed Self-aligned Dual face Regression Network (SADRNet), which predicts a pose-dependent face,...