- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Advanced Adaptive Filtering Techniques
- Phonetics and Phonology Research
- Speech and Dialogue Systems
- Authorship Attribution and Profiling
- Topic Modeling
- Hearing Loss and Rehabilitation
- Blind Source Separation Techniques
- Linguistics and Cultural Studies
The University of Texas at Dallas
2017-2021
Robust Chip (United States)
2019
Sharif University of Technology
2012-2014
In this study, we present the systems submitted by the Center for Robust Speech Systems (CRSS) from UTDallas to NIST SRE 2018 (SRE18). Three alternative front-end speaker embedding frameworks are investigated: (i) i-vector, (ii) x-vector, and (iii) a modified triplet system (t-vector). As in previous SREs, language mismatch between training and enrollment/test data, the so-called domain mismatch, remains a major challenge in this evaluation. In addition, SRE18 also introduces a small portion of...
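The t-vector front-end mentioned above trains embeddings with a triplet objective. As a hedged illustration only (not the authors' exact model; the cosine distance, 0.3 margin, and 512-dim embeddings are assumptions), a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull same-speaker (anchor/positive) embeddings together and push
    different-speaker (anchor/negative) ones apart by at least `margin`.
    Cosine distance and margin=0.3 are illustrative assumptions."""
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)  # anchor-positive distance
    d_an = 1.0 - F.cosine_similarity(anchor, negative)  # anchor-negative distance
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

# usage with any front-end producing (batch, dim) embeddings
loss = triplet_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```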
Speech separation has been studied widely for single-channel close-talk microphone recordings over the past few years; the solutions developed are mostly in the frequency domain. Recently, a raw-audio-waveform network (TasNet) was introduced, achieving high Si-SNR (scale-invariant source-to-noise ratio) and SDR (source-to-distortion ratio) compared against state-of-the-art frequency-domain solutions. In this study, we incorporate effective components of TasNet into a frequency-domain method. We compare both...
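Si-SNR, the metric named above, projects the estimate onto the target so that rescaling the output cannot inflate the score. A minimal NumPy sketch of the standard definition (variable names are mine):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # optimal projection of the estimate onto the target removes scale
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```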
The i-vector feature representation with probabilistic linear discriminant analysis (PLDA) scoring in speaker recognition systems has recently achieved effective performance even under channel mismatch conditions. In general, experiments carried out using this combined strategy employ linear discriminant analysis (LDA) after the extraction phase to suppress irrelevant directions, such as those introduced by noise or distortions. However, the speaker-related and non-speaker-related variability present in the data may prevent LDA from finding the best...
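To make the LDA step concrete, a hedged sketch with scikit-learn; the i-vector dimensionality (400), speaker count (50), and the cosine stand-in for PLDA scoring are all assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# hypothetical i-vectors (n_utterances, 400) with speaker labels
ivectors = np.random.randn(1000, 400)
speakers = np.random.randint(0, 50, size=1000)

# LDA keeps at most n_speakers - 1 discriminative directions,
# suppressing directions dominated by channel/noise variability
lda = LinearDiscriminantAnalysis(n_components=49)
projected = lda.fit_transform(ivectors, speakers)

# a full system would score with PLDA; cosine is a simplified stand-in
def cosine_score(enroll, test):
    return enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test))
```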
The I4U consortium was established to facilitate a joint entry to the NIST speaker recognition evaluations (SRE). The latest edition of such a submission was in SRE 2018, which was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U's participation in the evaluation series. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is our intention to present a shared view of the advancements, progress, and major paradigm shifts that we have witnessed...
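Fusing many sub-systems of this kind is commonly done with linear logistic regression trained on a development set; a hedged sketch under that assumption (the abstract does not specify the fusion recipe, and the trial counts here are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical scores: one column per sub-system, one row per trial
scores = np.random.randn(5000, 12)
labels = np.random.randint(0, 2, size=5000)  # 1 = target, 0 = non-target

# learns one weight per sub-system plus a bias
fuser = LogisticRegression().fit(scores, labels)
fused = fuser.decision_function(scores)  # fused scores for evaluation
```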
This document briefly describes the systems submitted by the Center for Robust Speech Systems (CRSS) from The University of Texas at Dallas (UTD) to the 2016 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE). We developed several UBM- and DNN-based i-vector speaker recognition systems with different data sets and feature representations. Given that the emphasis of this NIST SRE is on language mismatch between training and enrollment/test data, the so-called domain mismatch, in our system...
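For readers unfamiliar with the UBM component: it is a large GMM trained on pooled features from many speakers, whose frame posteriors supply the sufficient statistics for i-vector extraction. A toy-scale sketch (real UBMs use on the order of 2048 components; 8 here only keeps the demo fast):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical pooled features, e.g. 39-dim MFCC + deltas
features = np.random.randn(10000, 39)

# diagonal-covariance GMM as the universal background model (UBM)
ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(features)

# frame posteriors; summed per utterance they give the zeroth-order
# Baum-Welch statistics used by the i-vector extractor
posteriors = ubm.predict_proba(features)
```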
This study presents the systems submitted by the University of Texas at Dallas, Center for Robust Speech Systems (UTD-CRSS) to the MGB-3 Arabic Dialect Identification (ADI) subtask. The task is defined as discriminating between five dialects of Arabic: Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic. We develop multiple single systems with different front-end representations and back-end classifiers. At the front-end level, feature extraction methods such as Mel-frequency cepstral coefficients (MFCCs) and two...
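A typical MFCC front-end of the kind named above can be reproduced with librosa; the 20-coefficient, 25 ms/10 ms configuration and the file path are illustrative assumptions:

```python
import librosa

# load an utterance at 16 kHz (hypothetical path)
audio, sr = librosa.load("utterance.wav", sr=16000)

# 20 MFCCs over 25 ms windows with a 10 ms hop
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=20,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (20, n_frames)
```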
Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from it exhibit sudden transitions between adjacent states. In order...
Speech separation refers to extracting each individual speech source from a given mixed signal. Recent advancements and ongoing research in this area have made these approaches promising techniques for pre-processing of naturalistic audio streams. Since deep learning was incorporated into separation, system performance has been improving faster. The initial solutions analyzed the signals in the time-frequency domain with the STFT; the encoded features were then fed to a neural network separator. Most...
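The STFT-domain pipeline this abstract describes (encode, mask with a learned separator, decode) looks roughly like the following; the mask here is a random placeholder for the network output, and the 8 kHz/256-point settings are assumptions:

```python
import numpy as np
import librosa

# hypothetical two-speaker mixture at 8 kHz
mix, sr = librosa.load("mixture.wav", sr=8000)

# encode: complex spectrogram via STFT
spec = librosa.stft(mix, n_fft=256, hop_length=64)
mag, phase = np.abs(spec), np.angle(spec)

# a trained separator would predict one mask per source from `mag`
mask = np.random.rand(*mag.shape)  # placeholder for the network output
est_mag = mask * mag               # masked mixture magnitude

# decode: reuse the mixture phase and invert the STFT
estimate = librosa.istft(est_mag * np.exp(1j * phase), hop_length=64)
```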
Speech synthesis systems provided for the Persian language so far require various large-scale speech corpora to synthesize several target speakers' voices. Accordingly, synthesizing with a small amount of data seems to be essential for Persian. Taking advantage of speaker adaptation makes it possible to generate speech of remarkable quality when data are limited. Here we apply this method for the first time to Persian. This paper describes a system based on Hidden Markov Models (HMMs) built on the FARsi DATabase (FARSDAT). In this regard, we prepared the whole...
This article proposes a method to improve the performance of deterministic plus stochastic model (DSM)-based feature extraction by integrating contextual information. One precious advantage of speech synthesis over recognition is that in both the training and testing phases of synthesis, contextual information is available. However, as in recognition, this invaluable knowledge has been ignored during acoustic feature extraction for synthesis. The DSM expresses the residual of Mel-cepstral analysis through the summation of two components, namely...
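The DSM decomposition referenced here splits the excitation residual into a quasi-periodic deterministic part below a maximum voiced frequency and a noise-like stochastic part above it. A heavily simplified sketch of that split (the published method operates pitch-synchronously; the 4 kHz cutoff and filter order are assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def dsm_split(residual, sr, fm=4000.0, order=4):
    """Split a residual signal at maximum voiced frequency `fm`:
    low band ~ deterministic component, high band ~ stochastic component.
    A simplification; the real DSM works on pitch-synchronous frames."""
    b, a = butter(order, fm / (sr / 2), btype="low")
    deterministic = filtfilt(b, a, residual)  # quasi-periodic low band
    stochastic = residual - deterministic     # noise-like high band
    return deterministic, stochastic
```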
We introduce a bilingual solution to support English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) a pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently a bilingual streaming transformer model, (c) a parallel encoder structure with a language identification (LID) loss, and (d) an auxiliary loss for monolingual projections. We conclude that, in comparison with LID, our proposed approach is superior...
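The auxiliary LID loss in (c) pairs a language classifier with the main ASR objective during training. A hedged multi-task sketch (the loss types and the 0.1 weight are assumptions, not the paper's recipe):

```python
import torch.nn as nn

asr_criterion = nn.CrossEntropyLoss()  # e.g. frame-level senone targets in hybrid ASR
lid_criterion = nn.CrossEntropyLoss()  # language labels for the LID head

def joint_loss(asr_logits, asr_targets, lid_logits, lid_targets, alpha=0.1):
    # total objective = ASR loss + alpha * auxiliary LID loss
    return (asr_criterion(asr_logits, asr_targets)
            + alpha * lid_criterion(lid_logits, lid_targets))
```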