Yaosi Hu

ORCID: 0000-0003-2784-6738
Research Areas
  • Image and Video Quality Assessment
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Video Coding and Compression Technologies
  • Multimedia Communication and Technology
  • Advanced Vision and Imaging
  • Human Motion and Animation
  • Visual Attention and Saliency Detection
  • Image Retrieval and Classification Techniques
  • Advanced Image and Video Retrieval Techniques
  • Neural dynamics and brain function
  • Advanced Optical Imaging Technologies
  • Domain Adaptation and Few-Shot Learning
  • Image and Object Detection Techniques
  • Anomaly Detection Techniques and Applications
  • Artificial Immune Systems Applications
  • Digital Marketing and Social Media
  • Advanced Image Processing Techniques
  • Virtual Reality Applications and Impacts
  • Geographic Information Systems Studies
  • Advanced Computing and Algorithms
  • Face recognition and analysis
  • Advanced Neural Network Applications

Wuhan University
2012-2024

Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video (TI2V) generation, is proposed. With both controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of the task lie in aligning appearance and motion across different modalities, and in handling the uncertainty of text descriptions. To address these challenges, we propose a Motion...

10.1109/cvpr52688.2022.01768 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
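
The proposed solution, per the TMM 2023 entry later in this list, is the Motion Anchor-based video GEnerator (MAGE). As a purely illustrative sketch (module sizes, names, and the GRU rollout below are my assumptions, not the paper's architecture), the TI2V setup can be read as: fuse image and text features into a motion anchor, then roll that anchor out into per-frame latents.

```python
# Minimal TI2V sketch: fuse a static image with a text description into a
# "motion anchor" that conditions frame-by-frame generation. All modules
# and sizes here are illustrative stand-ins, not the paper's actual design.
import torch
import torch.nn as nn

class TinyTI2V(nn.Module):
    def __init__(self, img_dim=256, txt_dim=128, hid=256, n_frames=8):
        super().__init__()
        self.n_frames = n_frames
        # Cross-modal fusion: the anchor ties appearance (image) to the
        # intended motion (text) in a shared latent space.
        self.anchor = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        # A recurrent generator rolls the anchor out into per-frame latents.
        self.rnn = nn.GRU(hid, hid, batch_first=True)
        self.to_frame = nn.Linear(hid, img_dim)  # decode latent -> frame feature

    def forward(self, img_feat, txt_feat):
        anchor = self.anchor(torch.cat([img_feat, txt_feat], dim=-1))
        # Feed the anchor at every step; the GRU state carries the motion.
        steps = anchor.unsqueeze(1).expand(-1, self.n_frames, -1).contiguous()
        latents, _ = self.rnn(steps)
        return self.to_frame(latents)  # (batch, n_frames, img_dim)

video = TinyTI2V()(torch.randn(2, 256), torch.randn(2, 128))
print(video.shape)  # torch.Size([2, 8, 256])
```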

In this paper, a Hierarchical Temporal Model (HTM) is proposed for the video captioning task, based on exploring global and local temporal structure to better recognize fine-grained objects and actions. In our HTM, the encoder and decoder are hierarchically aligned according to different levels of features. The encoder applies two LSTM layers to construct temporal structures at both frame-level and object-level, where an attention mechanism is applied to locate objects of interest, and the decoder uses the corresponding levels to extract pivotal features through multi-level...

10.1145/3343031.3351072 article EN Proceedings of the 30th ACM International Conference on Multimedia 2019-10-15
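
To make the two-level encoder concrete, here is a minimal sketch in the spirit of the abstract: one LSTM over frame features for global structure, one over object features for local structure, with soft attention picking out salient objects. Dimensions and the single-score attention head are my illustrative choices, not the published model.

```python
# Sketch of a two-level temporal encoder: frame-level and object-level LSTMs
# plus soft attention over object hidden states. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierEncoder(nn.Module):
    def __init__(self, feat=512, hid=256):
        super().__init__()
        self.frame_lstm = nn.LSTM(feat, hid, batch_first=True)  # global structure
        self.obj_lstm = nn.LSTM(feat, hid, batch_first=True)    # local structure
        self.att = nn.Linear(hid, 1)                            # attention scores

    def forward(self, frame_feats, obj_feats):
        # frame_feats: (B, T, feat); obj_feats: (B, N, feat)
        frame_ctx, _ = self.frame_lstm(frame_feats)
        obj_ctx, _ = self.obj_lstm(obj_feats)
        # Soft attention over objects: weight each object's hidden state.
        w = F.softmax(self.att(obj_ctx), dim=1)      # (B, N, 1)
        obj_summary = (w * obj_ctx).sum(dim=1)       # (B, hid)
        return frame_ctx[:, -1], obj_summary         # global + attended local

g, l = HierEncoder()(torch.randn(2, 20, 512), torch.randn(2, 10, 512))
```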

Recently, an impressive development in immersive technologies, such as Augmented Reality (AR), Virtual Reality (VR) and 360° video, has been witnessed. However, methods for quality assessment have not been keeping up. This paper studies the quality assessment of 360° video from the cross-lab tests (involving ten laboratories and more than 300 participants) carried out by the Immersive Media Group (IMG) of the Video Quality Experts Group (VQEG). These tests were addressed to...

10.1109/tmm.2021.3093717 article EN cc-by IEEE Transactions on Multimedia 2021-07-05

The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods on remote sensing imagery (RSI) in supervised real-world scenarios. This makes the evaluation of performance in such scenarios an issue to be resolved. However, most existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such scenarios. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on a vision language model (VLM). The framework leverages...

10.48550/arxiv.2502.13990 preprint EN arXiv (Cornell University) 2025-02-18
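
One common recipe for VLM-based, annotation-free quality scoring (an assumption of mine for illustration, not necessarily RS-SQA's mechanism) compares an image embedding against antonym text prompts, CLIP-IQA style. The sketch below uses random projections as stand-ins for real encoders so it runs self-contained.

```python
# Prompt-contrast quality sketch: similarity to a "good" vs "bad" prompt,
# softmaxed into a score in (0, 1). Encoders are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
embed_image = lambda img: rng.standard_normal(512)  # stand-in VLM image encoder
embed_text = lambda txt: rng.standard_normal(512)   # stand-in VLM text encoder

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def quality_score(img):
    """Softmax over similarities to antonym prompts -> quality in (0, 1)."""
    v = embed_image(img)
    s_good = cosine(v, embed_text("a clear, segmentation-friendly image"))
    s_bad = cosine(v, embed_text("a degraded, low quality image"))
    e = np.exp([s_good, s_bad])
    return e[0] / e.sum()

print(quality_score(None))
```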

10.1007/s11263-025-02386-7 article EN International Journal of Computer Vision 2025-03-03

Text-driven Image to Video Generation (TI2V) aims to generate a controllable video given the first frame and a corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure consistency between the movement trajectory and the textual description; (ii) how to improve the subjective quality of the generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise...

10.1609/aaai.v39i8.32861 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11
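
Object-centric textual-visual alignment can be illustrated with cross-attention: object queries from the first frame attend to text tokens, so each object latent is tied to the phrase describing its motion. This is a minimal sketch under my own assumptions about shapes and conditioning, not the TIV-Diffusion implementation.

```python
# Object queries cross-attend to text tokens; the aligned tokens would then
# condition the diffusion denoiser. Shapes and names are illustrative.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
obj_queries = torch.randn(2, 5, 256)    # 5 object-centric tokens per image
text_tokens = torch.randn(2, 12, 256)   # encoded textual description
aligned, weights = attn(obj_queries, text_tokens, text_tokens)
# 'aligned' carries per-object text context, e.g. for the U-Net's
# cross-attention layers at each denoising step.
print(aligned.shape, weights.shape)  # (2, 5, 256) (2, 5, 12)
```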

Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed...

10.1609/aaai.v39i4.32376 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Predicting the popularity of a micro-video is a challenging task, due to the number of factors impacting its distribution, such as the diversity of video content and user interests, complex online interactions, etc. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework that considers the uncertainty and randomness of the mapping from multimodal features to popularity. Specifically, MMVED first encodes multiple modalities in the observation space into latent representations and learns their probability distributions...

10.1145/3366423.3380004 article EN 2020-04-20
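
The core variational idea, encoding each modality to a Gaussian latent and sampling via the reparameterization trick before decoding to a popularity score, can be sketched compactly. Layer sizes, the two-modality setup, and the fusion-by-concatenation are my illustrative assumptions.

```python
# Minimal multimodal variational encoder-decoder sketch: per-modality
# Gaussian encoders, reparameterized sampling, fused decoder.
import torch
import torch.nn as nn

class MMVEDSketch(nn.Module):
    def __init__(self, dims=(128, 64), z=32):
        super().__init__()
        # One Gaussian encoder head per modality (e.g., visual, acoustic).
        self.heads = nn.ModuleList([nn.Linear(d, 2 * z) for d in dims])
        self.decoder = nn.Sequential(nn.Linear(z * len(dims), 64),
                                     nn.ReLU(), nn.Linear(64, 1))

    def forward(self, *modalities):
        zs = []
        for head, x in zip(self.heads, modalities):
            mu, logvar = head(x).chunk(2, dim=-1)
            eps = torch.randn_like(mu)                  # reparameterization:
            zs.append(mu + eps * (0.5 * logvar).exp())  # z = mu + eps * sigma
        return self.decoder(torch.cat(zs, dim=-1))      # predicted popularity

pop = MMVEDSketch()(torch.randn(4, 128), torch.randn(4, 64))
```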

For a typical Scene Graph Generation (SGG) method in image understanding, there usually exists a large gap between the performance of predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, Predicate Correlation Learning (PCL) for SGG is proposed to address the above problems by taking the correlation between predicates into consideration. To measure the correlation between highly correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined...

10.1109/tip.2022.3181511 article EN IEEE Transactions on Image Processing 2022-01-01
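
One plausible way to realize a predicate correlation matrix (my reading for illustration, not necessarily the paper's exact construction) is to count how often predicate i is predicted as predicate j, row-normalize, and use the rows to soften one-hot targets so semantically overlapping predicates are penalized less.

```python
# Correlation-matrix sketch: confusion counts -> row-normalized PCM ->
# label softening for the training loss.
import numpy as np

def build_pcm(true_labels, pred_labels, n_classes):
    pcm = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        pcm[t, p] += 1
    pcm /= pcm.sum(axis=1, keepdims=True).clip(min=1)  # row-normalize
    return pcm

def soften_targets(labels, pcm, alpha=0.2):
    """Mix one-hot targets with each class's correlation row."""
    onehot = np.eye(pcm.shape[0])[labels]
    return (1 - alpha) * onehot + alpha * pcm[labels]

pcm = build_pcm([0, 0, 1, 2], [0, 1, 1, 2], n_classes=3)
print(soften_targets([0, 1], pcm))
```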

This study proposes a multiresolution Markov random field model with a fuzzy constraint in the wavelet domain (MRMRF-F). In this model, the fuzzy constraint is introduced to estimate the model parameters, by which the spatial interaction between neighbouring features can be reflected. There are three subfields on each resolution of the MRMRF-F model: one feature field, one label field and one fuzzy field. Among these fields, a three-step iteration scheme is designed to realise image segmentation. Namely, the scheme first renews the fuzzy field; it then estimates the parameters from the renewed fuzzy field, and the obtained...

10.1049/iet-ipr.2010.0176 article EN IET Image Processing 2012-03-27
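
As a heavily simplified, single-resolution illustration of the three-step loop (the real model works on wavelet features across resolutions; the Gaussian likelihood and ICM relabeling below are my stand-ins): renew fuzzy memberships, re-estimate class parameters from them, then relabel pixels under a Potts-style neighbour prior.

```python
# Simplified fuzzy-MRF segmentation loop: memberships -> parameters -> ICM.
import numpy as np

def iterate(img, labels, n_classes, beta=1.0, iters=5):
    mu = np.array([img[labels == k].mean() if (labels == k).any() else 0.5
                   for k in range(n_classes)])
    for _ in range(iters):
        # Step 1: renew fuzzy memberships from current class means.
        lik = -((img[..., None] - mu) ** 2)            # per-class log-likelihood
        memb = np.exp(lik - lik.max(-1, keepdims=True))
        memb /= memb.sum(-1, keepdims=True)            # soft (fuzzy) assignment
        # Step 2: re-estimate parameters as membership-weighted means.
        mu = (memb * img[..., None]).sum((0, 1)) / memb.sum((0, 1))
        # Step 3: ICM relabel: likelihood plus neighbour agreement (Potts prior).
        pad = np.pad(labels, 1, mode='edge')
        for k in range(n_classes):
            agree = sum((pad[1 + dy:pad.shape[0] - 1 + dy,
                             1 + dx:pad.shape[1] - 1 + dx] == k).astype(float)
                        for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)])
            lik[..., k] += beta * agree
        labels = lik.argmax(-1)
    return labels

img = np.random.rand(16, 16)
print(iterate(img, (img > 0.5).astype(int), n_classes=2).shape)  # (16, 16)
```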

Generating coherent and natural movement is the key challenge in video generation. This research proposes to condense video generation into a problem of motion generation, to improve the expressiveness of motion and make video generation more manageable. This can be achieved by breaking down the generation process into latent motion generation and video reconstruction. We present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the autoencoder can compress motion patterns into a concise latent representation. Meanwhile, the diffusion-based motion generator...

10.48550/arxiv.2304.11603 preprint EN cc-by arXiv (Cornell University) 2023-01-01
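
The second component, a diffusion model trained on the motion latent, can be illustrated with one standard DDPM training step (the schedule, the noise-prediction objective, and the stand-in denoiser are generic choices of mine; LaMD's actual design differs in detail).

```python
# One DDPM-style training step on a motion latent, conditioned on content.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

def diffusion_loss(denoiser, motion_latent, content_latent):
    t = torch.randint(0, T, (motion_latent.shape[0],))
    noise = torch.randn_like(motion_latent)
    ab = alpha_bar[t].view(-1, 1)
    noisy = ab.sqrt() * motion_latent + (1 - ab).sqrt() * noise  # q(x_t | x_0)
    pred = denoiser(noisy, t, content_latent)   # predict the added noise
    return torch.nn.functional.mse_loss(pred, noise)

# Stand-in denoiser (the real one would be a U-Net or transformer).
denoiser = lambda x, t, c: torch.zeros_like(x)
loss = diffusion_loss(denoiser, torch.randn(4, 64), torch.randn(4, 64))
print(loss.item())
```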

10.1016/j.jvcir.2020.102751 article EN Journal of Visual Communication and Image Representation 2020-01-27

In this paper, we propose a two-stream refinement network for RGB-D saliency detection. A fusion module is designed to fuse output features from different resolutions and modals. The structure information in depth helps distinguish between foreground and background, while lower-level features with higher resolution can be adopted to refine the boundary of detected targets. The proposed model predicts a high-resolution saliency map, and we then use a propagation-based method to further refine the object boundary. Experimental results demonstrate that our method performs well...

10.1109/icip.2019.8803653 article EN 2019 IEEE International Conference on Image Processing (ICIP) 2019-08-26
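
A minimal two-stream shape sketch may help: separate RGB and depth encoders, a concat-plus-1x1 fusion of the low-resolution features, and an upsampling head that reuses high-resolution low-level features to sharpen the boundary. All layers are illustrative stand-ins for the paper's modules.

```python
# Two-stream RGB-D saliency sketch: encode, fuse across modalities, refine
# with high-resolution features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSketch(nn.Module):
    def __init__(self):
        super().__init__()
        conv = lambda i, o, s: nn.Sequential(nn.Conv2d(i, o, 3, s, 1), nn.ReLU())
        self.rgb_low, self.rgb_high = conv(3, 16, 1), conv(16, 32, 2)
        self.dep_low, self.dep_high = conv(1, 16, 1), conv(16, 32, 2)
        self.fuse = nn.Conv2d(64, 32, 1)              # cross-modal concat -> 1x1
        self.refine = nn.Conv2d(32 + 16, 1, 3, padding=1)

    def forward(self, rgb, depth):
        rl, dl = self.rgb_low(rgb), self.dep_low(depth)
        fused = self.fuse(torch.cat([self.rgb_high(rl), self.dep_high(dl)], 1))
        up = F.interpolate(fused, scale_factor=2, mode='bilinear',
                           align_corners=False)
        # High-resolution low-level features sharpen the object boundary.
        return torch.sigmoid(self.refine(torch.cat([up, rl + dl], 1)))

sal = TwoStreamSketch()(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
print(sal.shape)  # torch.Size([1, 1, 64, 64])
```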

Automatic video generation is a challenging research topic, attracting interest from different perspectives, including Image-to-Video (I2V), Video-to-Video (V2V), and Text-to-Video (T2V). To pursue more controllable and fine-grained video generation, a novel task, named Text-Image-to-Video (TI2V), and a corresponding baseline solution, the Motion Anchor-based video GEnerator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of...

10.1109/tmm.2023.3284989 article EN IEEE Transactions on Multimedia 2023-06-12

Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed...

10.48550/arxiv.2403.19438 preprint EN arXiv (Cornell University) 2024-03-28

10.1016/j.jvcir.2024.104185 article EN Journal of Visual Communication and Image Representation 2024-05-01

Text-driven Image to Video Generation (TI2V) aims to generate a controllable video given the first frame and a corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure consistency between the movement trajectory and the textual description; (ii) how to improve the subjective quality of the generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise...

10.48550/arxiv.2412.10275 preprint EN arXiv (Cornell University) 2024-12-13

Video-telephony applications have been widely used in people's daily life, such as online conferences, education, and socialization. Especially during the COVID-19 pandemic, the business volume of video-telephony services has generally increased rapidly. This leads to a growing need for service quality assessment and monitoring. This paper presents subjective tests conducted for the 'Computational model for QoE/QoS monitoring to assess video-telephony services' (G.CMVTQS) project, which is under study in ITU-T SG12 Q.15. Two types of...

10.1109/bmsb55706.2022.9828647 article EN 2022 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB) 2022-06-15

The past few years have witnessed the surprising popularization of micro-videos. Various micro-video applications have occupied a dominant portion of the mobile application market. To enhance user experience, it is crucial to explore the perceptual quality of micro-videos. In this paper, we establish a new subjective quality assessment database for micro-videos. The database consists of 121 user-captured videos and mean opinion scores (MOS) generated from 2541 ratings given by 21 naive subjects. The videos are chosen to be representative of micro-videos, including different capture...

10.1109/mipr49039.2020.00054 article EN 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) 2020-08-01
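
For readers unfamiliar with MOS: a mean opinion score is simply the average of the raw ratings a video receives, often reported with a confidence interval to convey rating spread. A tiny worked sketch with made-up ratings on the standard 1-5 absolute category rating scale:

```python
# MOS = mean of per-subject ratings; 95% CI via normal approximation.
import numpy as np

def mos(ratings):
    r = np.asarray(ratings, dtype=float)
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), ci95

print(mos([4, 5, 3, 4, 4, 5, 2, 4]))  # (3.875, ~0.69)
```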

Video captioning is considered to be challenging due to the combination of video understanding and text generation. Recent progress in video captioning has been made mainly using methods of visual feature extraction and sequential learning. However, the syntax structure and semantic consistency of generated captions are not fully explored. Thus, in our work, we propose a novel multimodal attention based framework with Part-of-Speech (POS) sequence guidance to generate more accurate captions. In general, the word generation and POS prediction...

10.1109/vcip53242.2021.9675348 article EN 2021 International Conference on Visual Communications and Image Processing (VCIP) 2021-12-05
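
One plausible reading of joint word/POS decoding (my illustration, not the exact published model) is a shared decoder state feeding two heads, with the word head conditioned on the predicted POS distribution so syntax guides word choice.

```python
# Joint word + POS prediction heads over a shared decoder state.
import torch
import torch.nn as nn

class JointHead(nn.Module):
    def __init__(self, hid=256, vocab=5000, n_pos=17):
        super().__init__()
        self.pos_head = nn.Linear(hid, n_pos)           # predict the POS tag
        self.word_head = nn.Linear(hid + n_pos, vocab)  # word conditioned on POS

    def forward(self, state):
        pos_logits = self.pos_head(state)
        pos_dist = pos_logits.softmax(-1)
        word_logits = self.word_head(torch.cat([state, pos_dist], -1))
        return word_logits, pos_logits  # train both with cross-entropy

w, p = JointHead()(torch.randn(2, 256))
```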

Recently, quality assessment for user-generated content (UGC) videos has become a challenging task due to the absence of reference videos and the presence of complex distortions. Prior methods have highlighted the effectiveness of semantic features in quality assessment. However, these models are incapable of real-time prediction and efficient computation in practical applications. In this paper, we design a lightweight no-reference video quality assessment model, leveraging a pretrained network for semantic understanding and utilizing low-level CNN features to capture distortion. The...

10.1109/vcip59821.2023.10402738 article EN 2023 International Conference on Visual Communications and Image Processing (VCIP) 2023-12-04
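
The two-branch recipe described above can be sketched as: a frozen pretrained embedding for semantics, a tiny CNN for low-level distortion cues, and a light regressor fusing both into one score. Branch sizes and the fusion are illustrative assumptions.

```python
# Lightweight NR-VQA sketch: frozen semantic features + small distortion CNN.
import torch
import torch.nn as nn

class LightVQASketch(nn.Module):
    def __init__(self, sem_dim=768):
        super().__init__()
        # Tiny distortion branch: cheap enough for real-time use.
        self.low = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(),
                                 nn.Conv2d(8, 16, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(sem_dim + 16, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, frames, sem_feat):
        # frames: (B, 3, H, W) sampled frames; sem_feat: precomputed embedding
        # from a frozen pretrained backbone.
        return self.head(torch.cat([sem_feat, self.low(frames)], dim=-1))

score = LightVQASketch()(torch.randn(2, 3, 224, 224), torch.randn(2, 768))
```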

In this paper, we present the first video decomposition framework, named SyCoMo, that factorizes a video into style, content, and motion. Such fine-grained decomposition enables flexible video editing, and for the first time allows tripartite video synthesis. SyCoMo is a unified, domain-agnostic learning framework which can process videos of various object categories without domain-specific design or supervision. Different from other motion decomposition work, SyCoMo derives style-free content by isolating style in place. Content is organized into subchannels, each...

10.2139/ssrn.4177879 article EN SSRN Electronic Journal 2022-01-01
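
Tripartite synthesis follows directly from the factorization: once style, content, and motion codes are separated, they can be mixed across videos. The sketch below shows only that recombination interface, with random stand-in encoders in place of SyCoMo's learned networks.

```python
# Mix-and-match sketch: style from A, content from B, motion from C.
import torch
import torch.nn as nn

enc_style = nn.Linear(512, 64)    # stand-in style encoder
enc_content = nn.Linear(512, 64)  # stand-in style-free content encoder
enc_motion = nn.Linear(512, 64)   # stand-in motion encoder
decode = nn.Linear(64 * 3, 512)   # stand-in video decoder

vid_a, vid_b, vid_c = (torch.randn(1, 512) for _ in range(3))
mixed = decode(torch.cat([enc_style(vid_a), enc_content(vid_b),
                          enc_motion(vid_c)], dim=-1))
print(mixed.shape)  # torch.Size([1, 512])
```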