- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Linguistics and Cultural Studies
- Hate Speech and Cyberbullying Detection
- Infant Health and Development
- Smart Cities and Technologies
- Vehicular Ad Hoc Networks (VANETs)
- Indoor and Outdoor Localization Technologies
- Freedom of Expression and Defamation
- Linguistic research and analysis
- Traffic Prediction and Management Techniques
- Privacy-Preserving Technologies in Data
- Cryptography and Data Security
- Human Mobility and Location-Based Analysis
- Linguistic Variation and Morphology
- Corporate Taxation and Avoidance
- Innovation Policy and R&D
- Language, Discourse, Communication Strategies
- Multilingual Education and Policy
Institute for Infocomm Research
2023-2024
Agency for Science, Technology and Research
2023-2024
NEC (Japan)
2016-2021
Shanghai University
2011
State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which we trained the system. To close the gap due to the domain mismatch, we propose an unsupervised PLDA adaptation algorithm to learn...
Residual neural networks (ResNets) demonstrate impressive performance in automatic speaker verification (ASV). They treat the time and frequency dimensions equally, following the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. We address this issue and postulate...
Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rate (EER) of 0.65%, 3.50%, 3.19%...
We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector, taking into account the uncertainty estimate. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector. It consists of an auxiliary neural net predicting the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates...
The emergence of large-margin softmax cross-entropy losses in training deep speaker embedding neural networks has triggered a gradual shift from parametric back-ends to a simpler cosine similarity measure for speaker verification. Popular parametric back-ends include the probabilistic linear discriminant analysis (PLDA) and its variants. This paper investigates the properties of margin-based cross-entropy losses leading to such a shift and aims to find the scoring back-ends best suited for speaker verification. In addition, we revisit the pre-processing techniques which have been widely used in the past and assess their...
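As an illustration of the cosine similarity measure mentioned above, here is a minimal sketch in plain Python. The function name `cosine_score` and the toy embeddings are assumptions made for this example; they do not come from the paper, and real systems would score high-dimensional embeddings produced by a trained network.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding.

    Both arguments are equal-length lists of floats (hypothetical
    speaker embeddings). A higher score suggests the same speaker.
    """
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy example: identical directions score 1, orthogonal directions score 0.
same = cosine_score([1.0, 0.0], [2.0, 0.0])
different = cosine_score([1.0, 0.0], [0.0, 3.0])
```

Unlike a PLDA back-end, this score has no parameters to train, which is the sense in which the abstract calls it simpler.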
State-of-the-art speaker recognition systems comprise a speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) back-end. The effectiveness of these components relies on the availability of a large amount of labeled training data. In practice, it is common for the domains (e.g., language, channel, demographic) in which a system is deployed to differ from that in which the system has been trained. To close the resulting gap, domain adaptation is often essential for PLDA models. Among PLDA models, two of its variants are...
This paper presents an experimental study on deep speaker embedding with an attention mechanism that has been found to be a powerful representation learning technique in speaker recognition. In this framework, an attention model works as a frame selector that computes an attention weight for each frame-level feature vector, according to which an utterance-level representation is produced at the pooling layer in a speaker embedding network. In general, an attention model is trained together with the speaker embedding network on a single objective function, and thus those two components are tightly bound to one another. In this paper, we consider...
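The frame-selector idea described above can be sketched as a softmax-weighted mean over frames. This is a generic attentive pooling illustration, not the paper's model: the scoring vector `attn_vector` stands in for a learned attention model and is an assumption of this example, as are the function name and the toy inputs.

```python
import math


def attentive_pooling(frames, attn_vector):
    """Pool frame-level vectors into one utterance-level vector.

    frames: list of T frame-level feature vectors (lists of D floats).
    attn_vector: hypothetical learned scoring vector of length D.
    Returns (pooled_vector, attention_weights).
    """
    # One scalar score per frame: dot product with the scoring vector.
    scores = [sum(f * a for f, a in zip(frame, attn_vector)) for frame in frames]
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted mean of the frames is the utterance-level representation.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# With a zero scoring vector every frame gets the same weight,
# so the result reduces to plain average pooling.
pooled, weights = attentive_pooling([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Frames with larger scores contribute more to the pooled vector, which is how the attention model acts as a soft frame selector.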
The I4U consortium was established to facilitate a joint entry to the NIST speaker recognition evaluations (SRE). The latest edition of such a joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium into the NIST SRE series of evaluation. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progresses, and major paradigm shifts that we have witnessed...
While i-vector-PLDA frameworks employing huge amounts of development data have achieved significant success in speaker recognition, it is infeasible to collect a sufficiently large amount of data for every real application. This paper proposes a method to perform supervised domain adaptation of PLDA in i-vector-based speaker recognition systems with available resource-rich mismatched data and small matched data, under two assumptions: (1) between-speaker and within-speaker covariances depend on domains; (2) features in one domain can be...
This paper proposes a generalized framework for domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA) in speaker recognition. It not only includes several existing supervised and unsupervised domain adaptation methods but also makes possible more flexible usage of available data in different domains. In particular, we introduce here the two new techniques described below: (1) correlation-alignment-based interpolation and (2) covariance regularization. The proposed...
Speech utterances recorded under differing conditions exhibit varying degrees of confidence in their embedding estimates, i.e., uncertainty, even if they are extracted using the same neural network. This paper aims to incorporate the uncertainty estimate produced in the xi-vector network front-end with a probabilistic linear discriminant analysis (PLDA) back-end scoring for speaker verification. To achieve this, we derive a posterior covariance matrix, which measures the uncertainty, from the frame-wise precisions to the embedding space. We...
For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle...
Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While conventional cosine scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation....
The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which...
As a highly urbanized nation, Singapore faces unique urban planning challenges due to its geographical attributes and demographics. These challenges include optimizing land transportation, enhancing quality of life, and preparing for pandemics. Quick responses and an understanding of region-specific social voices are essential for effective policy-making and real-time insights into local dynamics. This work delves into analyzing social media data sourced from Twitter within the context of Singapore, forming a crucial component of a broader...
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice...
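For readers unfamiliar with the equal error rate (EER) quoted above, the following is a rough sketch of how it can be estimated from two score lists. It is a coarse threshold sweep without interpolation, written for illustration only; it is not the challenge's official scoring tool, and the function name and toy scores are assumptions of this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over the observed scores.

    target_scores: scores of trials that should be accepted.
    nontarget_scores: scores of trials that should be rejected.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance rate: non-targets wrongly accepted.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: targets wrongly rejected.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Perfectly separated toy scores give an EER of zero.
eer_separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

The EER is the error rate at the threshold where false acceptances and false rejections are equally frequent, so lower values indicate a better detector.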
This technical report describes the MERaLiON Speech Encoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON Speech Encoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON Speech Encoder was...