Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation

Separation (statistics) Source Separation
DOI: 10.21437/interspeech.2021-430 Publication Date: 2021-08-27T05:59:39Z
ABSTRACT
Although the conventional mask-based minimum variance distortionless response (MVDR) beamformer can reduce non-linear distortion, the residual noise level of MVDR-separated speech is still high. In this paper, we propose a spatio-temporal recurrent neural network based beamformer (RNN-BF) for target speech separation. This new beamforming framework directly learns the beamforming weights from the estimated speech and noise spatial covariance matrices. Leveraging the temporal modeling capability of RNNs, the RNN-BF can automatically accumulate the statistics of the speech and noise covariance matrices to learn the frame-level beamforming weights in a recursive way. An RNN-based generalized eigenvalue (RNN-GEV) beamformer and a more generalized RNN beamformer (GRNN-BF) are proposed. We further improve the RNN-GEV and the GRNN-BF by using layer normalization to replace the commonly used mask normalization on the covariance matrices. The proposed GRNN-BF obtains better performance than prior art in terms of speech quality (PESQ), speech-to-noise ratio (SNR), and word error rate (WER).
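For orientation, the conventional mask-based MVDR baseline that the abstract contrasts against can be sketched in NumPy as follows. This is a minimal illustration, not the proposed RNN-BF: it assumes the widely used reference-channel MVDR solution w = (Phi_N^-1 Phi_S / trace(Phi_N^-1 Phi_S)) u, which the abstract itself does not spell out. The function names, array shapes, and the `eps` diagonal loading are hypothetical choices made for this example.

```python
import numpy as np

def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT, shape (F, T, M) = (freq, frames, mics)  [assumed layout]
    mask: real-valued time-frequency mask in [0, 1], shape (F, T)
    Returns an array of shape (F, M, M).
    """
    weighted = Y * mask[..., None]                          # (F, T, M)
    num = np.einsum("ftm,ftn->fmn", weighted, Y.conj())     # sum_t m(t) y y^H
    return num / (mask.sum(axis=1)[:, None, None] + eps)    # mask normalization

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-8):
    """Reference-channel MVDR weights, shape (F, M).

    Assumes w = Phi_N^{-1} Phi_S u_ref / trace(Phi_N^{-1} Phi_S).
    """
    M = phi_n.shape[-1]
    phi_n = phi_n + eps * np.eye(M)                         # diagonal loading
    num = np.linalg.solve(phi_n, phi_s)                     # Phi_N^{-1} Phi_S
    tr = np.trace(num, axis1=-2, axis2=-1)[:, None]         # (F, 1)
    return num[..., ref] / (tr + eps)                       # select reference column

def apply_beamformer(w, Y):
    """Apply w^H y for every time-frequency bin; returns shape (F, T)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In this baseline the weights are computed once per frequency bin from utterance-level statistics. The abstract's RNN-BF instead feeds the estimated speech and noise covariance matrices to an RNN that predicts frame-level weights, replacing the closed-form solve above.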