Speech driven video editing via an audio-conditioned diffusion model
Video editing
DOI:
10.1016/j.imavis.2024.104911
Publication Date:
2024-01-21T05:40:07Z
AUTHORS (7)
ABSTRACT
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate auditory speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated for both single-speaker and multi-speaker video editing, providing a baseline on the CREMA-D audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying denoising diffusion models to the task of audio-driven video editing. All code, datasets, and models used as part of this work are made publicly available here: https://danbigioi.github.io/DiffusionVideoEditing/.
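The sketch below is a minimal, illustrative example (not the authors' published code, which is available at the link above) of the core idea described in the abstract: conditioning a denoising diffusion model on audio mel spectral features. The toy denoiser architecture, the 80-mel-bin spectrogram settings, and the DDPM-style noise schedule are assumptions made purely for illustration.

```python
# Toy example: one audio-conditioned denoising-diffusion training step.
# All architectural and hyper-parameter choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio


class TinyAudioConditionedDenoiser(nn.Module):
    """Toy noise-prediction network: noisy frame + per-frame audio embedding -> noise estimate."""

    def __init__(self, n_mels=80, img_channels=3):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, 64)  # embed mel features for conditioning
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + 64, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )

    def forward(self, noisy_frame, mel_frame):
        # noisy_frame: (B, 3, H, W); mel_frame: (B, n_mels)
        B, _, H, W = noisy_frame.shape
        a = self.audio_proj(mel_frame)                # (B, 64)
        a = a[:, :, None, None].expand(B, 64, H, W)   # broadcast over spatial dims
        return self.net(torch.cat([noisy_frame, a], dim=1))


# Audio -> mel spectral features (80 mel bins is a common choice, assumed here).
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
waveform = torch.randn(1, 16000)         # 1 s of dummy audio
mel = mel_transform(waveform)            # (1, 80, T)
mel_frame = mel.mean(dim=-1)             # crude pooling to one vector per video frame: (1, 80)

# DDPM-style forward process: noise a ground-truth frame at a random timestep,
# then train the denoiser to predict that noise, conditioned on the audio.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

frame = torch.rand(1, 3, 64, 64)         # dummy ground-truth video frame
t = torch.randint(0, T, (1,))
noise = torch.randn_like(frame)
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
noisy_frame = a_bar.sqrt() * frame + (1 - a_bar).sqrt() * noise

model = TinyAudioConditionedDenoiser()
pred_noise = model(noisy_frame, mel_frame)
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
print(f"toy denoising loss: {loss.item():.4f}")
```

At inference time the same audio conditioning would be applied at every reverse-diffusion step, so the generated frames carry mouth motion synchronised to the new speech recording.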