- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Speech and dialogue systems
- Topic Modeling
- Music Technology and Sound Studies
- Hate Speech and Cyberbullying Detection
- Advanced Data Compression Techniques
- Authorship Attribution and Profiling
- Adversarial Robustness in Machine Learning
- Law, AI, and Intellectual Property
- Voice and Speech Disorders
- Sensor Technology and Measurement Systems
- Generative Adversarial Networks and Image Synthesis
- Video Analysis and Summarization
- Linguistics and Cultural Studies
- Particle physics theoretical and experimental studies
- Advanced Electrical Measurement Techniques
- Digital Media Forensic Detection
- Advanced Adaptive Filtering Techniques
- Bioethics and Human Rights Issues
- Marine and Coastal Research
- Advancements in PLL and VCO Technologies
Chinese University of Hong Kong, Shenzhen
2022-2025
Shanghai Artificial Intelligence Laboratory
2025
Shenzhen Research Institute of Big Data
2023-2025
Jingdong (China)
2019
University of Edinburgh
2014-2017
Apple (United States)
2017
Nanyang Technological University
2010-2015
Shanghai University
2015
Edinburgh College
2015
National Institute of Informatics
2014
An increasing number of independent studies have confirmed the vulnerability of automatic speaker verification (ASV) technology to spoofing. However, in comparison to that involving other biometric modalities, spoofing and countermeasure research for ASV is still in its infancy. A current barrier to progress is the lack of standards, which impedes the comparison of results generated by different researchers. The ASVspoof initiative aims to overcome this bottleneck through the provision of standard corpora, protocols and metrics to support a common...
We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), and long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and extensible. This paper briefly describes the system, and provides some...
Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking...
Concerns regarding the vulnerability of automatic speaker verification (ASV) technology against spoofing can undermine confidence in its reliability and form a barrier to exploitation. The absence of competitive evaluations and the lack of common datasets have hampered progress in developing effective spoofing countermeasures. This paper describes the ASV Spoofing and Countermeasures (ASVspoof) initiative, which aims to fill this void. Through the provision of a common dataset, protocols, and metrics, ASVspoof promotes a sound research methodology...
This paper describes the Voice Conversion Challenge 2016, devised by the authors to better understand different voice conversion (VC) techniques by comparing their performance on a common dataset. The task of the challenge was speaker conversion, i.e., to transform the identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25...
Voice conversion, the methodology of automatically converting one's utterances to sound as if spoken by another speaker, presents a threat for applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented voice conversion systems with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset...
We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly without dimensionality reduction to maintain spectral details. In addition, a spectral compression...
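The sparse, weighted combination of exemplars described in this abstract can be illustrated with a small sketch. It assumes non-negative spectral vectors, a squared-error objective with an L1 sparsity penalty, and multiplicative updates; these choices, along with the names `sparse_weights`, `reconstruct`, and the `sparsity` parameter, are illustrative assumptions and not the authors' implementation. Each exemplar here is a flat list standing in for a multi-frame segment.

```python
def reconstruct(weights, exemplars):
    """Return the weighted linear combination of the exemplars."""
    dim = len(exemplars[0])
    return [
        sum(w * e[i] for w, e in zip(weights, exemplars))
        for i in range(dim)
    ]


def sparse_weights(x, exemplars, sparsity=0.1, iterations=200, eps=1e-12):
    """Estimate non-negative, sparse weights h so that sum_j h[j] * exemplars[j] ~ x.

    Assumption: minimises 0.5 * ||x - A h||^2 + sparsity * sum(h) with h >= 0
    using multiplicative updates, which keep the weights non-negative.
    """
    weights = [1.0] * len(exemplars)
    # Correlation of each exemplar with the observation (A^T x); fixed across iterations.
    atx = [sum(e_i * x_i for e_i, x_i in zip(e, x)) for e in exemplars]
    for _ in range(iterations):
        recon = reconstruct(weights, exemplars)
        for j, e in enumerate(exemplars):
            # Correlation of exemplar j with the current reconstruction (A^T A h).
            denom = sum(e_i * r_i for e_i, r_i in zip(e, recon)) + sparsity + eps
            weights[j] *= atx[j] / denom
    return weights


# Hypothetical toy data: three one-hot "exemplars" and one observed vector.
toy_exemplars = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_weights = sparse_weights([2.0, 0.0, 1.0], toy_exemplars)
```

In this toy case the penalty shrinks each active weight slightly and drives the unused exemplar's weight to zero, which is the sparsity effect the abstract relies on to avoid over-smoothing.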
Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech and synthetic/converted speech. Motivated by the research on phase spectrum in speech perception, in this study, we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three different training situations of the converted speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for...
Replay, which is to play back a pre-recorded speech sample, presents a genuine risk to automatic speaker verification technology. In this study, we evaluate the vulnerability of text-dependent speaker verification systems under the replay attack using a standard benchmarking database, and also propose an anti-spoofing technique to safeguard the speaker verification systems. The key idea of the spoofing detection technique is to decide whether the presented sample is matched to any previous stored speech samples based on a similarity score. The experiments conducted on the RSR2015 database showed that...
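The matching idea in this abstract, flagging a sample that is too similar to a previously stored one, might be sketched as follows. The cosine similarity over fixed-length feature vectors, the `threshold` value, and the function names are all assumptions made for illustration; the abstract does not specify the features or the similarity score used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_replay(presented, stored_samples, threshold=0.99):
    """Flag the presented sample if it nearly matches any stored sample.

    Assumption: a very high similarity to an earlier recording suggests the
    same recording is being played back rather than spoken again.
    """
    best = max(
        (cosine_similarity(presented, s) for s in stored_samples),
        default=0.0,
    )
    return best >= threshold


# Hypothetical stored feature vectors from earlier access attempts.
stored = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
replayed = is_replay([1.0, 0.0, 2.0], stored)   # identical to a stored sample
fresh = is_replay([0.0, 1.0, 0.0], stored)      # unlike any stored sample
```

The threshold trades false alarms against missed replays, so in practice it would be tuned on held-out data rather than fixed as here.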
Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feedforward neural networks, little is known about why....
Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the facts that current analysis-synthesis techniques operate on the frame level and make the frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation...
A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels. In particular, we augment a low-dimensional...
In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark against...
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion...
The conventional statistical-based transformation functions for voice conversion have been shown to suffer from over-smoothing and over-fitting problems. The over-smoothing problem arises because of the statistical averaging involved in estimating the model parameters for the transformation function. In addition, the large number of parameters in the statistical model cannot be well estimated from the limited parallel training data, which results in the over-fitting problem. In this work, we investigate a robust transformation function for voice conversion using a conditional restricted Boltzmann machine. The conditional restricted Boltzmann machine, which performs linear...
Deep neural networks (DNNs) have recently been the focus of much text-to-speech research as a replacement for decision trees and hidden Markov models (HMMs) in statistical parametric synthesis systems. Performance improvements have been reported; however, the configuration of the systems evaluated makes it impossible to judge how much of the improvement is due to the new machine learning methods, and how much is due to other novel aspects of the systems. Specifically, whereas the decision trees in HMM-based systems typically operate at the state-level, and separate trees are used to handle separate acoustic streams, most...
Any biometric recognizer is vulnerable to spoofing attacks, and hence voice biometric, also called automatic speaker verification (ASV), is no exception; replay, synthesis, and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks, considered as one of the most challenging for modern recognition systems. To detect spoofing, most existing countermeasures assume explicit or implicit knowledge of a particular VC system and focus on designing discriminative features. In this paper, we explore back-end...
This paper presents the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus. The corpus includes nine spoofing techniques, two of which are speech synthesis, and seven of which are voice conversion. We design two protocols, one for standard speaker verification evaluation, and the other for producing spoofing materials. Hence, they allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without...
The Voice Conversion Challenge 2016 is the first in which different voice conversion systems and approaches using the same data were compared. This paper describes the design of the evaluation, and it presents the results and statistical analyses of the results.