FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

DOI: 10.48550/arxiv.2403.12962 Publication Date: 2024-03-19
ABSTRACT
The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without necessitating model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, which incorporates intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.
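The abstract mentions an explicit update of features so that content at corresponding positions stays consistent across frames. As a toy illustration of that general idea only (not FRESCO's actual losses, attention modification, or implementation), the sketch below assumes features are plain Python vectors, one per spatial position, and that frame-to-frame correspondences are already given as index pairs (for example, derived from optical flow). It takes one gradient-descent step on a squared-distance temporal loss. The names `temporal_loss` and `update_step` are hypothetical, and intra-frame correspondence is not modeled here.

```python
from typing import List, Tuple

Vector = List[float]
Frame = List[Vector]      # hypothetical: one feature vector per spatial position
Match = Tuple[int, int]   # hypothetical: (position in frame A, position in frame B)


def temporal_loss(frame_a: Frame, frame_b: Frame, matches: List[Match]) -> float:
    """Sum of squared distances between features at corresponding positions."""
    total = 0.0
    for i, j in matches:
        total += sum((x - y) ** 2 for x, y in zip(frame_a[i], frame_b[j]))
    return total


def update_step(
    frame_a: Frame, frame_b: Frame, matches: List[Match], lr: float = 0.1
) -> Tuple[Frame, Frame]:
    """One gradient-descent step on temporal_loss; returns updated copies.

    Gradients are computed from the original (pre-step) features and
    accumulated, so a position that appears in several matches receives
    the sum of its per-match gradients.
    """
    new_a = [list(v) for v in frame_a]
    new_b = [list(v) for v in frame_b]
    for i, j in matches:
        for k, (x, y) in enumerate(zip(frame_a[i], frame_b[j])):
            grad = 2.0 * (x - y)     # d/dx of (x - y)^2; d/dy is -grad
            new_a[i][k] -= lr * grad
            new_b[j][k] += lr * grad
    return new_a, new_b
```

For a single match whose positions appear in no other match, one step shrinks the feature gap from `d` to `d * (1 - 4 * lr)`. With `lr = 0.1` the gap therefore scales by 0.6 and the squared loss by 0.36, so repeated steps pull matched features together while leaving unmatched positions untouched.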