Multimodal active speaker detection and virtual cinematography for video conferencing

Panning (audio); Cinematography; Frame rate
DOI: 10.48550/arxiv.2002.03977 Publication Date: 2020-01-01
ABSTRACT
Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on a dataset of N=100 meetings, each 2-5 minutes in length.
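The abstract's "no moving parts" design amounts to choosing a crop rectangle inside the fixed 4K frame instead of steering a camera. The following is a minimal sketch of that geometric step only, assuming the active speaker is already available as a bounding box. The names `Rect` and `crop_window`, the `margin` parameter, and the 16:9 output aspect ratio are hypothetical choices for illustration; the paper's VC is trained with machine learning and is not reproduced here.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int  # left edge, pixels
    y: int  # top edge, pixels
    w: int  # width, pixels
    h: int  # height, pixels


def crop_window(speaker: Rect, frame_w: int = 3840, frame_h: int = 2160,
                aspect: float = 16 / 9, margin: float = 2.0) -> Rect:
    """Return a crop of roughly the given aspect ratio centred on the speaker.

    Hypothetical helper: `margin` and `aspect` are illustrative assumptions,
    not parameters taken from the paper.
    """
    # Grow the speaker box by `margin`, then widen it to match the aspect ratio.
    h = speaker.h * margin
    w = max(speaker.w * margin, h * aspect)
    h = w / aspect
    # Shrink uniformly if the crop would be larger than the frame.
    scale = min(1.0, frame_w / w, frame_h / h)
    w_px, h_px = int(w * scale), int(h * scale)
    # Centre on the speaker, then clamp so the crop stays inside the frame.
    cx = speaker.x + speaker.w / 2
    cy = speaker.y + speaker.h / 2
    x = min(max(int(cx - w_px / 2), 0), frame_w - w_px)
    y = min(max(int(cy - h_px / 2), 0), frame_h - h_px)
    return Rect(x, y, w_px, h_px)


# A speaker near the right edge: the crop is pushed back inside the frame.
print(crop_window(Rect(3700, 100, 100, 200)))
```

Because the crop is computed per frame from a fixed video stream, a cut to a new speaker is just a change of rectangle, which is consistent with the abstract's point about reduced switching latency.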