- Image and Video Quality Assessment
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Video Coding and Compression Technologies
- Multimedia Communication and Technology
- Advanced Vision and Imaging
- Human Motion and Animation
- Visual Attention and Saliency Detection
- Image Retrieval and Classification Techniques
- Advanced Image and Video Retrieval Techniques
- Neural Dynamics and Brain Function
- Advanced Optical Imaging Technologies
- Domain Adaptation and Few-Shot Learning
- Image and Object Detection Techniques
- Anomaly Detection Techniques and Applications
- Artificial Immune Systems Applications
- Digital Marketing and Social Media
- Advanced Image Processing Techniques
- Virtual Reality Applications and Impacts
- Geographic Information Systems Studies
- Advanced Computing and Algorithms
- Face Recognition and Analysis
- Advanced Neural Network Applications
Wuhan University
2012-2024
Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video (TI2V), is proposed. With controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of this task lie in aligning appearance and motion across different modalities and handling the uncertainty of text descriptions. To address these challenges, we propose a Motion...
In this paper, a Hierarchical Temporal Model (HTM) is proposed for the video captioning task, based on exploring global and local temporal structure to better recognize fine-grained objects and actions. In our HTM, the encoder and decoder are hierarchically aligned according to different levels of features. The encoder applies two LSTM layers to construct temporal structures at both the frame level and the object level, where an attention mechanism is applied to locate the content of interest, and the decoder uses the corresponding levels to extract pivotal features through multi-level...
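The abstract above is truncated, but the encoder it describes (two LSTM layers over frame-level and object-level features, with attention) can be illustrated by a minimal PyTorch sketch. All module names, dimensions, and the attention form below are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-level temporal encoder with attention, loosely
# following the HTM description above. Shapes and attention form are assumed.
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):
    def __init__(self, frame_dim=2048, obj_dim=1024, hidden=512):
        super().__init__()
        self.frame_lstm = nn.LSTM(frame_dim, hidden, batch_first=True)  # frame-level temporal structure
        self.obj_lstm = nn.LSTM(obj_dim, hidden, batch_first=True)      # object-level temporal structure
        self.attn = nn.Linear(hidden, 1)                                # soft attention over time steps

    def attend(self, states):
        # states: (B, T, H) -> attention-weighted sum over time
        weights = torch.softmax(self.attn(states), dim=1)               # (B, T, 1)
        return (weights * states).sum(dim=1)                            # (B, H)

    def forward(self, frame_feats, obj_feats):
        frame_states, _ = self.frame_lstm(frame_feats)                  # (B, T, H)
        obj_states, _ = self.obj_lstm(obj_feats)                        # (B, T, H)
        return self.attend(frame_states), self.attend(obj_states)


# Toy usage: 8 frames of CNN features plus pooled object features per frame.
enc = HierarchicalEncoder()
frame_ctx, obj_ctx = enc(torch.randn(2, 8, 2048), torch.randn(2, 8, 1024))
print(frame_ctx.shape, obj_ctx.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```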
Recently, an impressive development in immersive technologies, such as Augmented Reality (AR), Virtual Reality (VR), and 360° video, has been witnessed. However, methods for quality assessment have not been keeping up. This paper studies the quality assessment of 360° video from the cross-lab tests (involving ten laboratories and more than 300 participants) carried out by the Immersive Media Group (IMG) of the Video Quality Experts Group (VQEG). These tests were addressed to...
The complexity of scenes and variations in image quality result in significant variability in the performance of supervised semantic segmentation methods for remote sensing imagery (RSI) in real-world scenarios. This makes the evaluation of such methods in these scenarios an issue to be resolved. However, most existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such settings. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on a vision language model (VLM). The framework leverages...
Text-driven Image to Video Generation (TI2V) aims to generate a controllable video given the first frame and a corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure consistency between their movement trajectory and the textual description; (ii) how to improve the subjective quality of the generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise...
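The notion of object-centric textual-visual alignment can be pictured as cross-attention from object (visual) tokens to text tokens, so that each object is tied to the phrase describing its motion. The sketch below is only a generic illustration under that assumption; it is not TIV-Diffusion's actual architecture.

```python
# Generic cross-attention between object tokens and text tokens. Token counts
# and dimensions are made up for illustration.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

obj_tokens = torch.randn(1, 5, 256)    # e.g. 5 object-centric visual tokens
text_tokens = torch.randn(1, 12, 256)  # e.g. 12 encoded words of the description

aligned, attn_weights = cross_attn(query=obj_tokens, key=text_tokens, value=text_tokens)
print(aligned.shape)       # torch.Size([1, 5, 256]) -- text-conditioned object tokens
print(attn_weights.shape)  # torch.Size([1, 5, 12]) -- which words each object attends to
```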
Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generated data on the performance of downstream perception tasks and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed...
Predicting the popularity of a micro-video is a challenging task, due to the number of factors impacting its popularity distribution, such as the diversity of video content and user interests, complex online interactions, etc. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework that considers the uncertain randomness in the mapping from multimodal features to popularity. Specifically, MMVED first encodes the multiple modalities in the observation space into latent representations and learns their probability distributions...
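The variational core of such a framework can be sketched as follows: each modality is encoded into a Gaussian in latent space and sampled with the reparameterization trick before decoding to a popularity score. Fusing by averaging the per-modality samples, and the specific dimensions, are assumptions made for brevity, not MMVED's exact design.

```python
# Sketch of a multimodal variational encoder with reparameterized sampling.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar


visual_enc = ModalityEncoder(in_dim=2048)   # e.g. visual features
text_enc = ModalityEncoder(in_dim=300)      # e.g. textual features
decoder = nn.Linear(64, 1)                  # fused latent code -> popularity score

zv, *_ = visual_enc(torch.randn(4, 2048))
zt, *_ = text_enc(torch.randn(4, 300))
popularity = decoder((zv + zt) / 2)         # simple average fusion of modality samples
print(popularity.shape)                     # torch.Size([4, 1])
```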
For a typical Scene Graph Generation (SGG) method in image understanding, there usually exists a large gap between the performance of the predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, Predicate Correlation Learning (PCL) for SGG is proposed to address the above problems by taking predicate correlation into consideration. To measure highly correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined...
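A generic predicate correlation matrix of this kind can be built from confusion counts between predicate classes, with each row normalized to sum to one; high off-diagonal entries then flag semantically overlapping predicates. The toy NumPy sketch below shows only this generic construction, not PCL's exact definition or how the matrix is used in training.

```python
# Row-normalized confusion counts as a stand-in for a predicate correlation matrix.
import numpy as np

confusion = np.array([                 # made-up (true, predicted) counts over a validation set
    [90,  8,  1,  1],
    [12, 70,  9,  9],
    [ 2, 10, 30,  8],
    [ 1,  6,  7, 26],
], dtype=float)                        # tiny toy predicate set, e.g. on/near/riding/holding

pcm = confusion / confusion.sum(axis=1, keepdims=True)   # each row sums to 1
print(np.round(pcm, 2))

# Large off-diagonal entries indicate overlapping predicate pairs that could be
# down-weighted or treated jointly during training.
overlap = pcm - np.diag(np.diag(pcm))
i, j = np.unravel_index(overlap.argmax(), overlap.shape)
print("most confused pair:", i, j)
```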
This study proposes a multiresolution Markov random field model with a fuzzy constraint in the wavelet domain (MRMRF-F). In this model, the fuzzy constraint is introduced into the parameter estimation, by which the spatial relationship between neighbouring features can be reflected. There are three subfields at each resolution of the MRMRF-F model: one feature field, one label field, and one fuzzy field. Among these fields, a three-step iteration scheme is designed to realise image segmentation. Namely, the scheme first renews the fuzzy field; it then estimates the parameters of the renewed field; the obtained...
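As a rough, single-resolution illustration of such a three-step loop, the toy NumPy sketch below alternates (1) renewing a fuzzy membership field, (2) re-estimating class parameters from it, and (3) updating a label field with a simple neighbourhood vote. The wavelet-domain and multiresolution aspects of MRMRF-F are omitted entirely, and every quantity here is an assumption for illustration.

```python
# Toy three-step iteration on a two-region synthetic "feature field".
import numpy as np

rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(0.2, 0.05, (16, 32)),
                      rng.normal(0.8, 0.05, (16, 32))], axis=0)   # 32x32 image, two regions
beta = 1.0
means = np.array([0.0, 1.0])                                      # initial class parameters

for _ in range(10):
    # Step 1: renew the fuzzy field (soft memberships from feature likelihoods).
    dist = (img[..., None] - means) ** 2                          # (H, W, K)
    fuzzy = np.exp(-dist / 0.01)
    fuzzy /= fuzzy.sum(-1, keepdims=True)
    # Step 2: estimate parameters of the renewed field (membership-weighted means).
    means = (fuzzy * img[..., None]).sum((0, 1)) / fuzzy.sum((0, 1))
    # Step 3: update the label field, encouraging agreement with 4-neighbours.
    smooth = sum(np.roll(fuzzy, s, axis=a) for a in (0, 1) for s in (-1, 1))
    labels = (fuzzy + beta * smooth / 4).argmax(-1)

print(means.round(2), labels[0, 0], labels[-1, -1])  # means near 0.2 / 0.8, labels 0 and 1
```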
Generating coherent and natural movement is the key challenge in video generation. This research proposes to condense video generation into a problem of motion generation, to improve expressiveness and make the task more manageable. This can be achieved by breaking the generation process down into latent motion generation and video reconstruction. We present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the autoencoder compresses motion patterns into a concise latent representation. Meanwhile, the generator...
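The two-stage idea can be reduced to a very small sketch: an autoencoder compresses a crude motion signal (here, a frame difference) into a compact latent, which a separate generative model would later sample; only a one-step forward noising is shown in place of the diffusion generator. Everything below is an illustrative assumption, not LaMD's architecture.

```python
# Two-stage sketch: motion autoencoding followed by a single noising step.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # motion -> compact latent
dec = nn.Linear(128, 3 * 64 * 64)                                # latent -> motion residual

frames = torch.randn(1, 2, 3, 64, 64)                            # two consecutive frames
motion = frames[:, 1] - frames[:, 0]                             # crude motion signal
z = enc(motion)                                                   # (1, 128) latent motion code
recon = frames[:, 0] + dec(z).view(1, 3, 64, 64)                  # reconstruct the next frame

# A diffusion-based generator would be trained to produce such latents from
# noise; the forward noising at one timestep looks like:
alpha_bar = 0.5
z_noisy = alpha_bar ** 0.5 * z + (1 - alpha_bar) ** 0.5 * torch.randn_like(z)
print(z.shape, recon.shape, z_noisy.shape)
```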
In this paper, we propose a two-stream refinement network for RGB-D saliency detection. A fusion module is designed to fuse output features from different resolutions and modalities. The structure information of depth helps distinguish between foreground and background at the lower levels, while features with higher resolution can be adopted to refine the boundary of the detected targets. The proposed model predicts a high-resolution saliency map and then uses a propagation-based method to further refine the object boundary. Experimental results demonstrate that the method performs well...
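A cross-modal, cross-resolution fusion module of the kind described can be sketched as bringing the depth features to the RGB resolution and fusing them with a 1x1 convolution. The layer choices below are assumptions for illustration, not the paper's design.

```python
# Minimal RGB-D feature fusion: upsample depth features, concatenate, 1x1 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionModule(nn.Module):
    def __init__(self, rgb_ch=256, depth_ch=128, out_ch=128):
        super().__init__()
        self.fuse = nn.Conv2d(rgb_ch + depth_ch, out_ch, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        # Upsample the lower-resolution depth features to match the RGB stream.
        depth_feat = F.interpolate(depth_feat, size=rgb_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        return torch.relu(self.fuse(torch.cat([rgb_feat, depth_feat], dim=1)))


fused = FusionModule()(torch.randn(1, 256, 64, 64), torch.randn(1, 128, 32, 32))
print(fused.shape)  # torch.Size([1, 128, 64, 64])
```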
Automatic video generation is a challenging research topic, attracting interest from different perspectives, including Image-to-Video (I2V), Video-to-Video (V2V), and Text-to-Video (T2V). To pursue more controllable and fine-grained generation, a novel task, named Text-Image-to-Video (TI2V), and a corresponding baseline solution, the Motion Anchor-based Generator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of...
Video-telephony applications have been widely used in people's daily life, such as for online conferences, education, and socialization. Especially during the COVID-19 pandemic, the business volume of video-telephony services has generally increased rapidly. This leads to a growing need for service quality assessment and monitoring. This paper presents the subjective tests conducted for the 'Computational model for QoE/QoS monitoring to assess video telephony services' (G.CMVTQS) project, which is under study in ITU-T SG12 Q.15. Two types of tests are...
The past few years have witnessed the surprising popularization of micro-videos. Various micro-video applications have occupied a dominant portion of the mobile application market. To enhance user experience, it is crucial to explore the perceptual quality of micro-videos. In this paper, we establish a new subjective quality assessment database for micro-videos. The database consists of 121 user-captured videos and mean opinion scores (MOS) generated from 2541 ratings by 21 naive subjects. The videos are chosen to be representative of micro-videos, including different capture...
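For reference, a mean opinion score is simply the average of a video's individual subjective ratings, often reported with a confidence interval. The snippet below uses made-up ratings on a 1-5 scale, not data from the database.

```python
# MOS and its 95% confidence interval for one hypothetical video.
import numpy as np

ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4])               # 8 hypothetical raters
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))  # normal-approximation CI
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```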
Video captioning is considered to be challenging due to the combination of video understanding and text generation. Recent progress in video captioning has been made mainly using methods of visual feature extraction and sequential learning. However, the syntax structure and semantic consistency of generated captions are not fully explored. Thus, in our work, we propose a novel multimodal attention based framework with Part-of-Speech (POS) sequence guidance to generate more accurate captions. In general, word generation and POS prediction...
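Coupling word generation with POS prediction can be illustrated by a shared decoder state feeding two output heads, one over the vocabulary and one over POS tags; the actual guidance and attention mechanism of the proposed framework is not reproduced here, and all sizes are assumptions.

```python
# One decoding step with shared hidden state and two prediction heads.
import torch
import torch.nn as nn

hidden, vocab, num_pos = 512, 10000, 15
decoder = nn.LSTMCell(300 + 2048, hidden)           # input: word embedding + video context
word_head = nn.Linear(hidden, vocab)
pos_head = nn.Linear(hidden, num_pos)

h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
step_input = torch.randn(1, 300 + 2048)             # [previous word embedding; attended video feature]
h, c = decoder(step_input, (h, c))
word_logits, pos_logits = word_head(h), pos_head(h)
print(word_logits.shape, pos_logits.shape)          # torch.Size([1, 10000]) torch.Size([1, 15])
```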
Recently, quality assessment for user-generated content (UGC) videos has become a challenging task due to the absence of reference videos and the presence of complex distortions. Prior methods have highlighted the effectiveness of semantic features for quality assessment. However, these models are incapable of real-time prediction and efficient computation in practical applications. In this paper, we design a lightweight no-reference video quality assessment model by leveraging a pretrained network for semantic understanding and utilizing low-level CNN distortion features. The...
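The overall recipe (frozen pretrained semantic features plus cheap low-level CNN features, concatenated and mapped to a quality score by a small regressor) can be sketched as below. The backbone choice, dimensions, and pooling are illustrative assumptions, not the paper's exact design.

```python
# Lightweight NR-VQA sketch: frozen semantic backbone + shallow distortion CNN + regressor.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.mobilenet_v3_small(weights=None)  # stand-in semantic extractor; weights=None keeps the example offline
backbone.classifier = nn.Identity()                 # expose the 576-d pooled features
for p in backbone.parameters():
    p.requires_grad = False                         # keep the backbone frozen

lowlevel = nn.Sequential(                           # shallow CNN for low-level distortion cues
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
regressor = nn.Sequential(nn.Linear(576 + 16, 64), nn.ReLU(), nn.Linear(64, 1))

frames = torch.randn(4, 3, 224, 224)                # a few sampled frames from one video
feats = torch.cat([backbone(frames), lowlevel(frames)], dim=1)
score = regressor(feats).mean()                     # pool per-frame predictions into a video score
print(score.item())
```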
In this paper, we present the first video decomposition framework, named SyCoMo, that factorizes a video into style, content, and motion. Such fine-grained decomposition enables flexible video editing and, for the first time, allows tripartite video synthesis. SyCoMo is a unified, domain-agnostic learning framework which can process videos of various object categories without domain-specific design or supervision. Different from other content and motion decomposition work, SyCoMo derives style-free content by isolating style in place. Content is organized into subchannels, each...