Duc-Tuan Truong

ORCID: 0009-0002-1767-7598
Research Areas
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Advanced Data Compression Techniques
  • Digital Media Forensic Detection
  • Adversarial Robustness in Machine Learning
  • Advanced Malware Detection Techniques
  • Face Recognition and Analysis
  • Image and Signal Denoising Methods
  • Anomaly Detection Techniques and Applications

Nanyang Technological University
2022-2025

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for verification. In this paper, we first demonstrate that leveraging a larger number of training speakers improves models...

10.1109/icassp48485.2024.10447160 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
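
The label-level distillation idea above can be illustrated with a short sketch: a softened KL term transfers the teacher's full posterior, including its non-target speaker probabilities, alongside the usual hard-label loss. This is a minimal illustration, not the paper's exact formulation; the temperature and loss weighting below are placeholder choices.

```python
import torch
import torch.nn.functional as F

def label_level_kd_loss(student_logits, teacher_logits, targets,
                        temperature=4.0, alpha=0.5):
    """Distill the teacher's full posterior (target and non-target
    speakers) into the student, combined with hard-label CE."""
    # Softened teacher/student distributions over all training speakers
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term preserves the teacher's non-target probabilities,
    # which plain hard-label training would discard
    kd = F.kl_div(s_log_prob, t_prob, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```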

10.1109/icassp49660.2025.10889972 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationship of each input token. However, artifacts can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel...

10.21437/interspeech.2024-659 article EN Interspeech 2024 2024-09-01
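
As a rough illustration of modeling both dependencies, the sketch below pairs standard temporal MHSA with a squeeze-and-excitation style channel gate. This module is a plausible stand-in, not the paper's actual Temporal-Channel architecture; all dimensions and the gating design are assumptions.

```python
import torch
import torch.nn as nn

class TemporalChannelBlock(nn.Module):
    """Hypothetical block: temporal MHSA followed by a squeeze-and-
    excitation style channel gate, so both temporal and channel
    dependencies shape the output."""
    def __init__(self, dim, num_heads=4, reduction=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, time, dim)
        attn_out, _ = self.mhsa(x, x, x)       # temporal dependency
        x = self.norm(x + attn_out)
        # Channel dependency: gate each feature channel by a weight
        # computed from the time-pooled representation
        gate = self.channel_gate(x.mean(dim=1, keepdim=True))
        return x * gate
```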

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationship of each input token. However, artifacts can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel...

10.48550/arxiv.2406.17376 preprint EN arXiv (Cornell University) 2024-06-25

The human brain has the capability to associate an unknown person's voice and face by leveraging their general relationship, a task referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the two modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the noisy speech input issue. To balance...

10.48550/arxiv.2407.17902 preprint EN arXiv (Cornell University) 2024-07-25
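
The core verification step in face-voice association can be sketched as projecting both modalities into a shared embedding space and scoring pairs by cosine similarity. This is only the matching back-end; the diarization front-end and multi-stage training described above are omitted, and all feature dimensions are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceVoiceMatcher(nn.Module):
    """Illustrative cross-modal matcher: map face and voice features
    into one shared space and score pairs by cosine similarity."""
    def __init__(self, face_dim=512, voice_dim=192, shared_dim=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, shared_dim)
        self.voice_proj = nn.Linear(voice_dim, shared_dim)

    def forward(self, face_feat, voice_feat):
        # L2-normalize so the dot product equals cosine similarity
        f = F.normalize(self.face_proj(face_feat), dim=-1)
        v = F.normalize(self.voice_proj(voice_feat), dim=-1)
        return (f * v).sum(dim=-1)  # one similarity score per pair
```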

The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness...

10.48550/arxiv.2409.14712 preprint EN arXiv (Cornell University) 2024-09-23
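
Since the EER is the headline metric of this benchmark, here is a standard way to compute it from detector scores. This follows the common ROC-based approximation, not any code released with the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false acceptance rate
    equals the false rejection rate.
    labels: 1 for the positive class, 0 otherwise; scores: higher
    means more likely positive."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    frr = 1.0 - tpr                           # false rejection rate
    idx = np.nanargmin(np.abs(fpr - frr))     # closest FAR/FRR crossing
    return (fpr[idx] + frr[idx]) / 2.0, thresholds[idx]
```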

10.1109/slt61566.2024.10832177 article EN 2024 IEEE Spoken Language Technology Workshop (SLT) 2024-12-02

Self-supervised learning (SSL) has played an important role in various tasks in the field of speech and audio processing. However, there is limited research on adapting these SSL models to predict a speaker's age and gender from speech signals. In this paper, we investigate seven SSL models, namely PASE+, NPC, wav2vec 2.0, XLSR, HuBERT, WavLM, and data2vec, on a joint age estimation and gender classification task using the TIMIT corpus. Additionally, we also study the effect of different hidden encoder layers within each model on the result. Furthermore, we evaluate how...

10.23919/apsipaasc55919.2022.9979878 article EN 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2022-11-07
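
Layer-wise analysis of the kind described above is typically done by exposing each encoder layer's hidden states and pooling them into utterance-level vectors, then training a small probe on each. The sketch below does this for wav2vec 2.0 via the Hugging Face transformers API; the checkpoint name and mean pooling are illustrative choices, not necessarily the paper's setup.

```python
import torch
from transformers import Wav2Vec2Model

# Pretrained wav2vec 2.0 encoder; HuBERT, WavLM, and data2vec expose
# hidden states through the same interface in this library.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # dummy 1 s of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# One mean-pooled utterance vector per hidden encoder layer; training
# a small classifier/regressor on each reveals which layer best
# encodes age and gender information.
layer_embeddings = [h.mean(dim=1) for h in out.hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)
```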

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for verification. In this paper, we first demonstrate that leveraging a larger number of training speakers improves models...

10.48550/arxiv.2309.14838 preprint EN other-oa arXiv (Cornell University) 2023-01-01

The estimation of speaker characteristics such as age and height is a challenging task, with numerous applications in voice forensic analysis. In this work, we propose a bi-encoder transformer mixture model for speaker characteristic estimation. Considering the wide differences between male and female formant and fundamental frequencies, we use two separate encoders for the extraction of gender-specific features, with wav2vec 2.0 serving as a common-level feature extractor. This architecture reduces the interference effects during backpropagation and improves...

10.48550/arxiv.2203.11774 preprint EN cc-by arXiv (Cornell University) 2022-01-01
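
The bi-encoder mixture can be sketched as two gender-specific encoders over shared low-level features, with a gender head producing mixture weights for the final characteristic prediction. This is a schematic reading of the abstract; layer counts, dimensions, heads, and the mixing rule are all placeholders.

```python
import torch
import torch.nn as nn

class BiEncoderMixture(nn.Module):
    """Illustrative bi-encoder mixture: frame features from a shared
    extractor (e.g. wav2vec 2.0) pass through two gender-specific
    encoders; a gender head mixes their age predictions."""
    def __init__(self, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)
        self.enc_m, self.enc_f = make_encoder(), make_encoder()
        self.gender_head = nn.Linear(dim, 2)   # P(male), P(female)
        self.age_head_m = nn.Linear(dim, 1)
        self.age_head_f = nn.Linear(dim, 1)

    def forward(self, feats):                  # feats: (batch, time, dim)
        pooled = feats.mean(dim=1)
        w = torch.softmax(self.gender_head(pooled), dim=-1)  # mixture weights
        age_m = self.age_head_m(self.enc_m(feats).mean(dim=1))
        age_f = self.age_head_f(self.enc_f(feats).mean(dim=1))
        # Gender-weighted combination of the two encoder paths
        age = w[:, :1] * age_m + w[:, 1:] * age_f
        return age, w
```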