Duc-Tuan Truong

ORCID: 0009-0002-1767-7598
Research Areas
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Advanced Data Compression Techniques
  • Digital Media Forensic Detection
  • Adversarial Robustness in Machine Learning
  • Advanced Malware Detection Techniques
  • Face Recognition and Analysis
  • Image and Signal Denoising Methods
  • Anomaly Detection Techniques and Applications

Nanyang Technological University
2022-2025

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for verification. In this paper, we first demonstrate that leveraging a larger number of training speakers improves models...

10.1109/icassp48485.2024.10447160 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
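
The label-level distillation idea above can be illustrated with a short sketch: a softened KL term transfers the teacher's full posterior, including its non-target speaker probabilities, alongside the usual hard-label loss. This is a minimal illustration, not the paper's exact formulation; the temperature and loss weighting below are placeholder choices.

```python
import torch
import torch.nn.functional as F

def label_level_kd_loss(student_logits, teacher_logits, targets,
                        temperature=4.0, alpha=0.5):
    """Distill the teacher's full posterior (target and non-target
    speakers) into the student, combined with hard-label CE."""
    # Softened teacher/student distributions over all training speakers
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term preserves the teacher's non-target probabilities,
    # which plain hard-label training would discard
    kd = F.kl_div(s_log_prob, t_prob, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```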

10.1109/icassp49660.2025.10889972 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationship of each input token. However, artifacts can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel...

10.21437/interspeech.2024-659 article EN Interspeech 2024 2024-09-01
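
As a rough illustration of modeling both dependencies, the sketch below pairs standard temporal MHSA with a squeeze-and-excitation style channel gate. This module is a plausible stand-in, not the paper's actual Temporal-Channel architecture; all dimensions and the gating design are assumptions.

```python
import torch
import torch.nn as nn

class TemporalChannelBlock(nn.Module):
    """Hypothetical block: temporal MHSA followed by a squeeze-and-
    excitation style channel gate, so both temporal and channel
    dependencies shape the output."""
    def __init__(self, dim, num_heads=4, reduction=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, time, dim)
        attn_out, _ = self.mhsa(x, x, x)       # temporal dependency
        x = self.norm(x + attn_out)
        # Channel dependency: gate each feature channel by a weight
        # computed from the time-pooled representation
        gate = self.channel_gate(x.mean(dim=1, keepdim=True))
        return x * gate
```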

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationship of each input token. However, artifacts can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel...

10.48550/arxiv.2406.17376 preprint EN arXiv (Cornell University) 2024-06-25

The human brain has the capability to associate an unknown person's voice and face by leveraging their general relationship, a task referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the two modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the noisy speech input issue. To balance...

10.48550/arxiv.2407.17902 preprint EN arXiv (Cornell University) 2024-07-25
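
The core verification step in face-voice association can be sketched as projecting both modalities into a shared embedding space and scoring pairs by cosine similarity. This is only the matching back-end; the diarization front-end and multi-stage training described above are omitted, and all feature dimensions are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceVoiceMatcher(nn.Module):
    """Illustrative cross-modal matcher: map face and voice features
    into one shared space and score pairs by cosine similarity."""
    def __init__(self, face_dim=512, voice_dim=192, shared_dim=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, shared_dim)
        self.voice_proj = nn.Linear(voice_dim, shared_dim)

    def forward(self, face_feat, voice_feat):
        # L2-normalize so the dot product equals cosine similarity
        f = F.normalize(self.face_proj(face_feat), dim=-1)
        v = F.normalize(self.voice_proj(voice_feat), dim=-1)
        return (f * v).sum(dim=-1)  # one similarity score per pair
```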

The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness...

10.48550/arxiv.2409.14712 preprint EN arXiv (Cornell University) 2024-09-23
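
Since the EER is the headline metric of this benchmark, here is a standard way to compute it from detector scores. This follows the common ROC-based approximation, not any code released with the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false acceptance rate
    equals the false rejection rate.
    labels: 1 for the positive class, 0 otherwise; scores: higher
    means more likely positive."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    frr = 1.0 - tpr                           # false rejection rate
    idx = np.nanargmin(np.abs(fpr - frr))     # closest FAR/FRR crossing
    return (fpr[idx] + frr[idx]) / 2.0, thresholds[idx]
```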

10.1109/slt61566.2024.10832177 article EN 2024 IEEE Spoken Language Technology Workshop (SLT) 2024-12-02

Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict a speaker's age and gender from speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec, on a joint age estimation and gender classification task using the TIMIT corpus. Additionally, we also study the effect of different hidden encoder layers within each model on the result. Furthermore, we evaluate how...

10.23919/apsipaasc55919.2022.9979878 article EN 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2022-11-07
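
Layer-wise analysis of the kind described above is typically done by exposing each encoder layer's hidden states and pooling them into utterance-level vectors, then training a small probe on each. The sketch below does this for wav2vec 2.0 via the Hugging Face transformers API; the checkpoint name and mean pooling are illustrative choices, not necessarily the paper's setup.

```python
import torch
from transformers import Wav2Vec2Model

# Pretrained wav2vec 2.0 encoder; HuBERT, WavLM, and data2vec expose
# hidden states through the same interface in this library.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # dummy 1 s of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# One mean-pooled utterance vector per hidden encoder layer; training
# a small classifier/regressor on each reveals which layer best
# encodes age and gender information.
layer_embeddings = [h.mean(dim=1) for h in out.hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)
```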

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for verification. In this paper, we first demonstrate that leveraging a larger number of training speakers improves models...

10.48550/arxiv.2309.14838 preprint EN other-oa arXiv (Cornell University) 2023-01-01

The estimation of speaker characteristics such as age and height is a challenging task, with numerous applications in voice forensic analysis. In this work, we propose a bi-encoder transformer mixture model for speaker characteristic estimation. Considering the wide differences between male and female formant and fundamental frequencies, we use two separate encoders for the extraction of gender-specific features, with wav2vec 2.0 serving as a common-level feature extractor. This architecture reduces the interference effects during backpropagation and improves...

10.48550/arxiv.2203.11774 preprint EN cc-by arXiv (Cornell University) 2022-01-01
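
The bi-encoder mixture can be sketched as two gender-specific encoders over shared low-level features, with a gender head producing mixture weights for the final characteristic prediction. This is a schematic reading of the abstract; layer counts, dimensions, heads, and the mixing rule are all placeholders.

```python
import torch
import torch.nn as nn

class BiEncoderMixture(nn.Module):
    """Illustrative bi-encoder mixture: frame features from a shared
    extractor (e.g. wav2vec 2.0) pass through two gender-specific
    encoders; a gender head mixes their age predictions."""
    def __init__(self, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)
        self.enc_m, self.enc_f = make_encoder(), make_encoder()
        self.gender_head = nn.Linear(dim, 2)   # P(male), P(female)
        self.age_head_m = nn.Linear(dim, 1)
        self.age_head_f = nn.Linear(dim, 1)

    def forward(self, feats):                  # feats: (batch, time, dim)
        pooled = feats.mean(dim=1)
        w = torch.softmax(self.gender_head(pooled), dim=-1)  # mixture weights
        age_m = self.age_head_m(self.enc_m(feats).mean(dim=1))
        age_f = self.age_head_f(self.enc_f(feats).mean(dim=1))
        # Gender-weighted combination of the two encoder paths
        age = w[:, :1] * age_m + w[:, 1:] * age_f
        return age, w
```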