VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

DOI: 10.1609/aaai.v39i10.33114 Publication Date: 2025-04-11T11:56:03Z
ABSTRACT
Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using the superior capabilities of T2I. Different from conventional T2V sampling (i.e., entangled temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses an encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses an inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Please watch all videos in the supplementary materials for a better view.
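The decomposed sampling step described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function names (`t2v_step`, `invert`, `t2i_step`) are hypothetical stand-ins for the real T2V denoiser, DDIM-style inversion, and inflated T2I denoiser, and the arithmetic below only mimics their roles.

```python
import numpy as np

def videoelevator_sample(latents, timesteps, t2v_step, invert, t2i_step):
    """Hypothetical rendering of VideoElevator's decomposed sampling loop.

    latents:  (frames, h, w) noisy video latents
    t2v_step: T2V denoiser applied across frames (temporal motion refining)
    invert:   maps the refined latent back to the noise level T2I expects
    t2i_step: inflated T2I denoiser predicting a less noisy latent per frame
    """
    x = latents
    for t in timesteps:
        # 1) Temporal motion refining: enhance cross-frame consistency with T2V.
        x = t2v_step(x, t)
        # 2) Invert to the noise distribution required by T2I.
        x = invert(x, t)
        # 3) Spatial quality elevating: T2I adds photo-realistic detail per frame.
        x = np.stack([t2i_step(frame, t) for frame in x])
    return x

# Toy usage with identity-like stand-ins; real steps are diffusion-model calls.
frames = np.random.default_rng(0).normal(size=(4, 8, 8))
out = videoelevator_sample(
    frames,
    timesteps=[3, 2, 1],
    t2v_step=lambda x, t: 0.5 * (x + x.mean(axis=0)),  # toy temporal smoothing
    invert=lambda x, t: x,                             # placeholder inversion
    t2i_step=lambda f, t: 0.9 * f,                     # toy per-frame denoise
)
print(out.shape)  # (4, 8, 8)
```

The key design point the sketch captures is that temporal and spatial modeling are handled by two different models within a single sampling step, connected by an inversion that re-aligns noise levels, which is what makes the method plug-and-play across T2V/T2I pairs.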