- Speech and Audio Processing
- Speech Recognition and Synthesis
- Music and Audio Processing
- Advanced Data Compression Techniques
- Digital Media Forensic Detection
- Adversarial Robustness in Machine Learning
- Advanced Malware Detection Techniques
- Face Recognition and Analysis
- Image and Signal Denoising Methods
- Anomaly Detection Techniques and Applications
Nanyang Technological University
2022-2025
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for verification. In this paper, we first demonstrate that leveraging a larger number of training speakers improves models....
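As a rough illustration of what label-level KD means here, the sketch below computes the generic temperature-scaled KL divergence between teacher and student class posteriors over the training speakers. It is not the method proposed in the paper above. The function names, the temperature value, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over speaker-class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def label_level_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic label-level KD loss: T^2 * KL(teacher || student).

    Hypothetical helper for illustration only. Every speaker class
    contributes in proportion to the teacher's probability for it.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy example with four training speakers; speaker 0 is the target.
teacher = [6.0, 2.0, 1.0, 0.5]
student = [4.0, 0.5, 2.5, 0.0]
loss = label_level_kd_loss(teacher, student)
```

Raising the temperature flattens the teacher posterior, so the probabilities assigned to non-target speakers carry more weight in the loss.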
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel...
The human brain has the capability to associate an unknown person's voice and face by leveraging their general relationship, referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the issue of noisy speech inputs. To balance...
The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF subset. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness...
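For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it with a simple threshold sweep. It assumes that higher scores mean "bona fide"; the function name and the toy scores are hypothetical, and official evaluation toolkits may interpolate between thresholds, so their values can differ slightly.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted as bona fide when its score >= threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bona fide trial and one spoofed trial overlap.
bonafide = [0.9, 0.6, 0.4, 0.8]
spoof = [0.1, 0.5, 0.7, 0.2]
eer = equal_error_rate(bonafide, spoof)  # 0.25, i.e. 25% EER
```

With these toy scores, the threshold 0.6 rejects one of four bona fide trials and accepts one of four spoofed trials, so both error rates are 25%.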
Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict the speaker's age and gender using speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec, in the joint age estimation and gender classification task on the TIMIT corpus. Additionally, we also study the effect of different hidden encoder layers within these models on the result. Furthermore, we evaluate how...
The estimation of speaker characteristics such as age and height is a challenging task, having numerous applications in voice forensic analysis. In this work, we propose a bi-encoder transformer mixture model for speaker age and height estimation. Considering the wide differences between male and female formant and fundamental frequencies, we use two separate encoders for the extraction of gender-specific features, using wav2vec 2.0 as a common-level feature extractor. This architecture reduces interference effects during backpropagation and improves...