V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
DOI:
10.1609/aaai.v38i5.28299
Publication Date:
2024-03-25T09:40:36Z
AUTHORS (11)
ABSTRACT
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of globally aligned signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
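To make the conditioning scheme in the abstract concrete, below is a minimal sketch of a single autoregressive stage: per-frame visual features act as a prefix that conditions next-token prediction over discrete audio-codec tokens, which a separate neural codec would later decode into a waveform. All class names, dimensions, and the greedy decoding loop are illustrative assumptions for exposition, not the authors' implementation, and the second (coarse-to-fine) stage and text-prompt conditioning are only noted in comments.

# Sketch only: one stage of a visual-feature-conditioned autoregressive
# model over audio tokens. Module names and shapes are assumptions.
import torch
import torch.nn as nn

class VisualToAudioTokenLM(nn.Module):
    """Decoder-only language model over audio-codec tokens, conditioned on
    pre-trained per-frame visual features used as a prefix."""

    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8,
                 visual_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)  # project frame features
        self.token_emb = nn.Embedding(vocab_size, d_model)  # audio-token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, T_v, visual_dim) pre-trained frame features
        # audio_tokens: (B, T_a) discrete codec tokens generated so far
        prefix = self.visual_proj(visual_feats)
        tokens = self.token_emb(audio_tokens)
        x = torch.cat([prefix, tokens], dim=1)
        # Causal mask: each position attends only to itself and earlier positions.
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.backbone(x, mask=mask)
        # Predict the next audio token from the final position.
        return self.head(h[:, -1])

# Hypothetical usage: greedy decoding of audio tokens from frame features.
model = VisualToAudioTokenLM()
frames = torch.randn(1, 30, 768)              # e.g. 30 frames of generic visual features
tokens = torch.zeros(1, 1, dtype=torch.long)  # assumed start token
for _ in range(50):
    logits = model(frames, tokens)
    next_tok = logits.argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
# In a multi-stage system, a later stage would refine these coarse tokens into
# fine codec tokens before a neural audio codec decodes them to a waveform;
# an optional text prompt could be embedded and appended to the prefix.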