- Image and Video Quality Assessment
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Video Coding and Compression Technologies
- Multimedia Communication and Technology
- Advanced Vision and Imaging
- Human Motion and Animation
- Visual Attention and Saliency Detection
- Image Retrieval and Classification Techniques
- Advanced Image and Video Retrieval Techniques
- Neural Dynamics and Brain Function
- Advanced Optical Imaging Technologies
- Domain Adaptation and Few-Shot Learning
- Image and Object Detection Techniques
- Anomaly Detection Techniques and Applications
- Artificial Immune Systems Applications
- Digital Marketing and Social Media
- Advanced Image Processing Techniques
- Virtual Reality Applications and Impacts
- Geographic Information Systems Studies
- Advanced Computing and Algorithms
- Face Recognition and Analysis
- Advanced Neural Network Applications
Wuhan University
2012-2024
Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video (TI2V), is proposed. With controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of this task lie in aligning appearance and motion across different modalities and handling the uncertainty of text descriptions. To address these challenges, we propose a Motion...
In this paper, a Hierarchical Temporal Model (HTM) is proposed for the video captioning task, based on exploring global and local temporal structure to better recognize fine-grained objects and actions. In our HTM, the encoder and decoder are hierarchically aligned according to different levels of features. The encoder applies two LSTM layers to construct temporal structures at both the frame level and the object level, where an attention mechanism is applied to locate the content of interest, and the decoder uses the corresponding levels to extract pivotal features through multi-level...
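The abstract above is truncated, but the encoder it describes (two LSTM layers over frame-level and object-level features, with attention) can be illustrated by a minimal PyTorch sketch. All module names, dimensions, and the attention form below are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-level temporal encoder with attention, loosely
# following the HTM description above. Shapes and attention form are assumed.
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):
    def __init__(self, frame_dim=2048, obj_dim=1024, hidden=512):
        super().__init__()
        self.frame_lstm = nn.LSTM(frame_dim, hidden, batch_first=True)  # frame-level temporal structure
        self.obj_lstm = nn.LSTM(obj_dim, hidden, batch_first=True)      # object-level temporal structure
        self.attn = nn.Linear(hidden, 1)                                # soft attention over time steps

    def attend(self, states):
        # states: (B, T, H) -> attention-weighted sum over time
        weights = torch.softmax(self.attn(states), dim=1)               # (B, T, 1)
        return (weights * states).sum(dim=1)                            # (B, H)

    def forward(self, frame_feats, obj_feats):
        frame_states, _ = self.frame_lstm(frame_feats)                  # (B, T, H)
        obj_states, _ = self.obj_lstm(obj_feats)                        # (B, T, H)
        return self.attend(frame_states), self.attend(obj_states)


# Toy usage: 8 frames of CNN features plus pooled object features per frame.
enc = HierarchicalEncoder()
frame_ctx, obj_ctx = enc(torch.randn(2, 8, 2048), torch.randn(2, 8, 1024))
print(frame_ctx.shape, obj_ctx.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```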
Recently, an impressive development in immersive technologies, such as Augmented Reality (AR), Virtual Reality (VR), and 360° video, has been witnessed. However, methods for quality assessment have not been keeping up. This paper studies the quality assessment of 360° video from the cross-lab tests (involving ten laboratories and more than 300 participants) carried out by the Immersive Media Group (IMG) of the Video Quality Experts Group (VQEG). These tests were addressed to...
The complexity of scenes and variations in image quality result in significant variability in the performance of supervised semantic segmentation methods for remote sensing imagery (RSI) in real-world scenarios. This makes the evaluation of such methods in these scenarios an issue to be resolved. However, most existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such settings. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on a vision language model (VLM). The framework leverages...
Text-driven Image to Video Generation (TI2V) aims to generate a controllable video given the first frame and a corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure consistency between their movement trajectory and the textual description; (ii) how to improve the subjective quality of the generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise...
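The notion of object-centric textual-visual alignment can be pictured as cross-attention from object (visual) tokens to text tokens, so that each object is tied to the phrase describing its motion. The sketch below is only a generic illustration under that assumption; it is not TIV-Diffusion's actual architecture.

```python
# Generic cross-attention between object tokens and text tokens. Token counts
# and dimensions are made up for illustration.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

obj_tokens = torch.randn(1, 5, 256)    # e.g. 5 object-centric visual tokens
text_tokens = torch.randn(1, 12, 256)  # e.g. 12 encoded words of the description

aligned, attn_weights = cross_attn(query=obj_tokens, key=text_tokens, value=text_tokens)
print(aligned.shape)       # torch.Size([1, 5, 256]) -- text-conditioned object tokens
print(attn_weights.shape)  # torch.Size([1, 5, 12]) -- which words each object attends to
```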
Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generated data on the performance of downstream perception tasks and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed...
Predicting the popularity of a micro-video is a challenging task, due to the number of factors impacting its popularity distribution, such as the diversity of video content and user interests, complex online interactions, etc. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework that considers the uncertain randomness in the mapping from multimodal features to popularity. Specifically, MMVED first encodes the multiple modalities in the observation space into latent representations and learns their probability distributions...
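The variational core of such a framework can be sketched as follows: each modality is encoded into a Gaussian in latent space and sampled with the reparameterization trick before decoding to a popularity score. Fusing by averaging the per-modality samples, and the specific dimensions, are assumptions made for brevity, not MMVED's exact design.

```python
# Sketch of a multimodal variational encoder with reparameterized sampling.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar


visual_enc = ModalityEncoder(in_dim=2048)   # e.g. visual features
text_enc = ModalityEncoder(in_dim=300)      # e.g. textual features
decoder = nn.Linear(64, 1)                  # fused latent code -> popularity score

zv, *_ = visual_enc(torch.randn(4, 2048))
zt, *_ = text_enc(torch.randn(4, 300))
popularity = decoder((zv + zt) / 2)         # simple average fusion of modality samples
print(popularity.shape)                     # torch.Size([4, 1])
```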
For a typical Scene Graph Generation (SGG) method in image understanding, there usually exists a large gap between the performance of the predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, Predicate Correlation Learning (PCL) for SGG is proposed to address the above problems by taking predicate correlation into consideration. To measure highly correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined...
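A generic predicate correlation matrix of this kind can be built from confusion counts between predicate classes, with each row normalized to sum to one; high off-diagonal entries then flag semantically overlapping predicates. The toy NumPy sketch below shows only this generic construction, not PCL's exact definition or how the matrix is used in training.

```python
# Row-normalized confusion counts as a stand-in for a predicate correlation matrix.
import numpy as np

confusion = np.array([                 # made-up (true, predicted) counts over a validation set
    [90,  8,  1,  1],
    [12, 70,  9,  9],
    [ 2, 10, 30,  8],
    [ 1,  6,  7, 26],
], dtype=float)                        # tiny toy predicate set, e.g. on/near/riding/holding

pcm = confusion / confusion.sum(axis=1, keepdims=True)   # each row sums to 1
print(np.round(pcm, 2))

# Large off-diagonal entries indicate overlapping predicate pairs that could be
# down-weighted or treated jointly during training.
overlap = pcm - np.diag(np.diag(pcm))
i, j = np.unravel_index(overlap.argmax(), overlap.shape)
print("most confused pair:", i, j)
```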
This study proposes a multiresolution Markov random field model with a fuzzy constraint in the wavelet domain (MRMRF-F). In this model, the fuzzy constraint is introduced into the parameter estimation, by which the spatial relationship between neighbouring features can be reflected. There are three subfields at each resolution of the MRMRF-F model: one feature field, one label field, and one fuzzy field. Among these fields, a three-step iteration scheme is designed to realise image segmentation. Namely, the scheme first renews the fuzzy field; it then estimates the parameters of the renewed field; the obtained...
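As a rough, single-resolution illustration of such a three-step loop, the toy NumPy sketch below alternates (1) renewing a fuzzy membership field, (2) re-estimating class parameters from it, and (3) updating a label field with a simple neighbourhood vote. The wavelet-domain and multiresolution aspects of MRMRF-F are omitted entirely, and every quantity here is an assumption for illustration.

```python
# Toy three-step iteration on a two-region synthetic "feature field".
import numpy as np

rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(0.2, 0.05, (16, 32)),
                      rng.normal(0.8, 0.05, (16, 32))], axis=0)   # 32x32 image, two regions
beta = 1.0
means = np.array([0.0, 1.0])                                      # initial class parameters

for _ in range(10):
    # Step 1: renew the fuzzy field (soft memberships from feature likelihoods).
    dist = (img[..., None] - means) ** 2                          # (H, W, K)
    fuzzy = np.exp(-dist / 0.01)
    fuzzy /= fuzzy.sum(-1, keepdims=True)
    # Step 2: estimate parameters of the renewed field (membership-weighted means).
    means = (fuzzy * img[..., None]).sum((0, 1)) / fuzzy.sum((0, 1))
    # Step 3: update the label field, encouraging agreement with 4-neighbours.
    smooth = sum(np.roll(fuzzy, s, axis=a) for a in (0, 1) for s in (-1, 1))
    labels = (fuzzy + beta * smooth / 4).argmax(-1)

print(means.round(2), labels[0, 0], labels[-1, -1])  # means near 0.2 / 0.8, labels 0 and 1
```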
Generating coherent and natural movement is the key challenge in video generation. This research proposes to condense video generation into a problem of motion generation, to improve expressiveness and make the task more manageable. This can be achieved by breaking the generation process down into latent motion generation and video reconstruction. We present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the autoencoder compresses motion patterns into a concise latent representation. Meanwhile, the generator...
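The two-stage idea can be reduced to a very small sketch: an autoencoder compresses a crude motion signal (here, a frame difference) into a compact latent, which a separate generative model would later sample; only a one-step forward noising is shown in place of the diffusion generator. Everything below is an illustrative assumption, not LaMD's architecture.

```python
# Two-stage sketch: motion autoencoding followed by a single noising step.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # motion -> compact latent
dec = nn.Linear(128, 3 * 64 * 64)                                # latent -> motion residual

frames = torch.randn(1, 2, 3, 64, 64)                            # two consecutive frames
motion = frames[:, 1] - frames[:, 0]                             # crude motion signal
z = enc(motion)                                                   # (1, 128) latent motion code
recon = frames[:, 0] + dec(z).view(1, 3, 64, 64)                  # reconstruct the next frame

# A diffusion-based generator would be trained to produce such latents from
# noise; the forward noising at one timestep looks like:
alpha_bar = 0.5
z_noisy = alpha_bar ** 0.5 * z + (1 - alpha_bar) ** 0.5 * torch.randn_like(z)
print(z.shape, recon.shape, z_noisy.shape)
```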
In this paper, we propose a two-stream refinement network for RGB-D saliency detection. A fusion module is designed to fuse output features from different resolutions and modalities. The structure information of depth helps distinguish between foreground and background at the lower levels, while features with higher resolution can be adopted to refine the boundary of the detected targets. The proposed model predicts a high-resolution saliency map and then uses a propagation-based method to further refine the object boundary. Experimental results demonstrate that the method performs well...
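A cross-modal, cross-resolution fusion module of the kind described can be sketched as bringing the depth features to the RGB resolution and fusing them with a 1x1 convolution. The layer choices below are assumptions for illustration, not the paper's design.

```python
# Minimal RGB-D feature fusion: upsample depth features, concatenate, 1x1 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionModule(nn.Module):
    def __init__(self, rgb_ch=256, depth_ch=128, out_ch=128):
        super().__init__()
        self.fuse = nn.Conv2d(rgb_ch + depth_ch, out_ch, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        # Upsample the lower-resolution depth features to match the RGB stream.
        depth_feat = F.interpolate(depth_feat, size=rgb_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        return torch.relu(self.fuse(torch.cat([rgb_feat, depth_feat], dim=1)))


fused = FusionModule()(torch.randn(1, 256, 64, 64), torch.randn(1, 128, 32, 32))
print(fused.shape)  # torch.Size([1, 128, 64, 64])
```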
Automatic video generation is a challenging research topic, attracting interest from different perspectives, including Image-to-Video (I2V), Video-to-Video (V2V), and Text-to-Video (T2V). To pursue more controllable and fine-grained generation, a novel task, named Text-Image-to-Video (TI2V), and a corresponding baseline solution, the Motion Anchor-based Generator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of...
Video-telephony applications have been widely used in people's daily life, such as for online conferences, education, and socialization. Especially during the COVID-19 pandemic, the business volume of video-telephony services has generally increased rapidly. This leads to a growing need for service quality assessment and monitoring. This paper presents the subjective tests conducted for the 'Computational model for QoE/QoS monitoring to assess video telephony services' (G.CMVTQS) project, which is under study in ITU-T SG12 Q.15. Two types of tests are...
The past few years have witnessed the surprising popularization of micro-videos. Various micro-video applications have occupied a dominant portion of the mobile application market. To enhance user experience, it is crucial to explore the perceptual quality of micro-videos. In this paper, we establish a new subjective quality assessment database for micro-videos. The database consists of 121 user-captured videos and mean opinion scores (MOS) generated from 2541 ratings by 21 naive subjects. The videos are chosen to be representative of micro-videos, including different capture...
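For reference, a mean opinion score is simply the average of a video's individual subjective ratings, often reported with a confidence interval. The snippet below uses made-up ratings on a 1-5 scale, not data from the database.

```python
# MOS and its 95% confidence interval for one hypothetical video.
import numpy as np

ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4])               # 8 hypothetical raters
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))  # normal-approximation CI
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```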
Video captioning is considered to be challenging due to the combination of video understanding and text generation. Recent progress in video captioning has been made mainly using methods of visual feature extraction and sequential learning. However, the syntax structure and semantic consistency of generated captions are not fully explored. Thus, in our work, we propose a novel multimodal attention based framework with Part-of-Speech (POS) sequence guidance to generate more accurate captions. In general, word generation and POS prediction...
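Coupling word generation with POS prediction can be illustrated by a shared decoder state feeding two output heads, one over the vocabulary and one over POS tags; the actual guidance and attention mechanism of the proposed framework is not reproduced here, and all sizes are assumptions.

```python
# One decoding step with shared hidden state and two prediction heads.
import torch
import torch.nn as nn

hidden, vocab, num_pos = 512, 10000, 15
decoder = nn.LSTMCell(300 + 2048, hidden)           # input: word embedding + video context
word_head = nn.Linear(hidden, vocab)
pos_head = nn.Linear(hidden, num_pos)

h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
step_input = torch.randn(1, 300 + 2048)             # [previous word embedding; attended video feature]
h, c = decoder(step_input, (h, c))
word_logits, pos_logits = word_head(h), pos_head(h)
print(word_logits.shape, pos_logits.shape)          # torch.Size([1, 10000]) torch.Size([1, 15])
```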
Recently, quality assessment for user-generated content (UGC) videos has become a challenging task due to the absence of reference videos and the presence of complex distortions. Prior methods have highlighted the effectiveness of semantic features for quality assessment. However, these models are incapable of real-time prediction and efficient computation in practical applications. In this paper, we design a lightweight no-reference video quality assessment model by leveraging a pretrained network for semantic understanding and utilizing low-level CNN distortion features. The...
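The overall recipe (frozen pretrained semantic features plus cheap low-level CNN features, concatenated and mapped to a quality score by a small regressor) can be sketched as below. The backbone choice, dimensions, and pooling are illustrative assumptions, not the paper's exact design.

```python
# Lightweight NR-VQA sketch: frozen semantic backbone + shallow distortion CNN + regressor.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.mobilenet_v3_small(weights=None)  # stand-in semantic extractor; weights=None keeps the example offline
backbone.classifier = nn.Identity()                 # expose the 576-d pooled features
for p in backbone.parameters():
    p.requires_grad = False                         # keep the backbone frozen

lowlevel = nn.Sequential(                           # shallow CNN for low-level distortion cues
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
regressor = nn.Sequential(nn.Linear(576 + 16, 64), nn.ReLU(), nn.Linear(64, 1))

frames = torch.randn(4, 3, 224, 224)                # a few sampled frames from one video
feats = torch.cat([backbone(frames), lowlevel(frames)], dim=1)
score = regressor(feats).mean()                     # pool per-frame predictions into a video score
print(score.item())
```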
In this paper, we present the first video decomposition framework, named SyCoMo, that factorizes a video into style, content, and motion. Such fine-grained decomposition enables flexible video editing and, for the first time, allows tripartite video synthesis. SyCoMo is a unified, domain-agnostic learning framework which can process videos of various object categories without domain-specific design or supervision. Different from other content and motion decomposition work, SyCoMo derives style-free content by isolating style in place. Content is organized into subchannels, each...