An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings

Speaker diarisation
DOI: 10.1016/j.csl.2023.101534 Publication Date: 2023-05-30T00:40:10Z
ABSTRACT
We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied their inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off in terms of computing requirements and performance. More generally, EEND models were found to be lighter and faster in inference than clustering-based methods. However, they also require a large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and VBx in particular, are instead more consistent than SA-EEND but are outperformed by EEND-VC. The gap with respect to the latter is reduced when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it could be convenient if speech recognition has to be performed. Its performance is close to that of SA-EEND but degrades significantly when the training and inference data are less matched.
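The abstract compares systems by "diarization accuracy" without naming the metric. A common choice for such comparisons is the diarization error rate (DER), and the sketch below illustrates it under stated assumptions: frame-level scoring, no forgiveness collar, and a brute-force search over speaker mappings, which is only practical for a handful of speakers such as the two-speaker CTS case. The function name `frame_der` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level diarization error rate (illustrative sketch, no collar).

    reference, hypothesis: equal-length lists of sets, where each set holds
    the speaker labels active in that frame (an empty set means silence).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    ref_labels = sorted(set().union(*reference))
    hyp_labels = sorted(set().union(*hypothesis))
    total_ref = sum(len(ref) for ref in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Pad with None so surplus hypothesis speakers can stay unmatched.
    slots = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))

    best_errors = None
    for perm in permutations(slots, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        errors = 0
        for ref, hyp in zip(reference, hypothesis):
            mapped = {mapping[h] for h in hyp} - {None}
            correct = len(ref & mapped)
            # Counts missed speech, false alarms, and speaker confusion together.
            errors += max(len(ref), len(hyp)) - correct
        if best_errors is None or errors < best_errors:
            best_errors = errors

    return best_errors / total_ref


# Toy two-speaker example: frame 3 is overlapped speech, frame 5 is silence.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
print(frame_der(reference, hypothesis))
```

In this toy case a system that outputs only one speaker per frame is penalized on the overlapped frame, which is one way to see why overlap-aware methods can narrow the gap mentioned in the abstract.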