Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings

Speaker diarisation
DOI: 10.1121/10.0002924
Publication Date: 2020-12-15T14:03:17Z
ABSTRACT
In this paper, an audio-driven, multimodal approach for speaker diarization in multimedia content is introduced and evaluated. The proposed algorithm is based on semi-supervised clustering of audio-visual embeddings generated using deep learning techniques. The two modes, audio and video, are separately addressed: a long short-term memory Siamese neural network is employed to produce embeddings from audio, whereas a pre-trained convolutional neural network is deployed to generate embeddings from two-dimensional blocks representing the faces of speakers detected in video frames. In both cases, the models are trained with cost functions that favor smaller spatial distances between samples from the same speaker and greater distances between samples from different speakers. In a fusion stage, hypotheses derived from established practices in television production are exploited on top of the unimodal sub-components to improve performance. The methodology is evaluated against VoxCeleb, a large-scale dataset with hundreds of available speakers, and AVL-SD, a newly developed, publicly available dataset aiming at capturing the peculiarities of TV news content under various scenarios. In order to promote reproducible research and collaboration in the field, the implemented algorithm is provided as an open-source software package.
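The abstract outlines the core audio technique: an LSTM-based Siamese network maps audio segments to fixed-size embeddings, trained so that same-speaker pairs lie closer together than different-speaker pairs, and the resulting embeddings are then clustered into speaker groups. The following is a minimal, illustrative sketch of that idea in PyTorch; the feature choice (MFCC-like frames), layer sizes, contrastive margin, and the plain agglomerative clustering baseline are assumptions for illustration and do not reproduce the paper's exact configuration (the authors' own implementation is released as an open-source package).

# Illustrative sketch only: a Siamese LSTM that maps variable-length audio
# feature sequences (e.g., MFCC frames) to fixed-size speaker embeddings,
# trained with a contrastive loss that pulls same-speaker pairs together
# and pushes different-speaker pairs apart. Sizes and the clustering step
# are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import AgglomerativeClustering


class AudioEmbeddingNet(nn.Module):
    """LSTM encoder shared by both branches of the Siamese network."""

    def __init__(self, n_features=20, hidden_size=128, embedding_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, embedding_dim)

    def forward(self, x):
        # x: (batch, time, n_features)
        _, (h_n, _) = self.lstm(x)          # last hidden state summarises the sequence
        emb = self.fc(h_n[-1])              # (batch, embedding_dim)
        return F.normalize(emb, dim=1)      # unit-length embeddings


def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Small distance for same-speaker pairs, at least `margin` apart otherwise."""
    dist = F.pairwise_distance(emb_a, emb_b)
    loss_same = same_speaker * dist.pow(2)
    loss_diff = (1 - same_speaker) * F.relu(margin - dist).pow(2)
    return (loss_same + loss_diff).mean()


# Toy training step on random "feature" pairs.
net = AudioEmbeddingNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

feats_a = torch.randn(8, 100, 20)           # 8 clips, 100 frames, 20 features each
feats_b = torch.randn(8, 100, 20)
labels = torch.randint(0, 2, (8,)).float()  # 1 = same speaker, 0 = different

loss = contrastive_loss(net(feats_a), net(feats_b), labels)
opt.zero_grad()
loss.backward()
opt.step()

# At inference time the per-segment embeddings can be grouped by a clustering
# stage; a plain (unconstrained) agglomerative baseline is shown here.
with torch.no_grad():
    segment_embs = net(torch.randn(30, 100, 20)).numpy()
cluster_ids = AgglomerativeClustering(n_clusters=3).fit_predict(segment_embs)

In a semi-supervised variant, as described in the abstract, additional hypotheses (for example, production conventions of TV news) can be injected as constraints or as seed labels when merging clusters; that step is omitted here because its exact form is specific to the paper's pipeline.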