Zhizheng Wu

ORCID: 0009-0001-1192-9857
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Phonetics and Phonology Research
  • Speech and dialogue systems
  • Topic Modeling
  • Music Technology and Sound Studies
  • Hate Speech and Cyberbullying Detection
  • Advanced Data Compression Techniques
  • Authorship Attribution and Profiling
  • Adversarial Robustness in Machine Learning
  • Law, AI, and Intellectual Property
  • Voice and Speech Disorders
  • Sensor Technology and Measurement Systems
  • Generative Adversarial Networks and Image Synthesis
  • Video Analysis and Summarization
  • Linguistics and Cultural Studies
  • Particle physics theoretical and experimental studies
  • Advanced Electrical Measurement Techniques
  • Digital Media Forensic Detection
  • Advanced Adaptive Filtering Techniques
  • Bioethics and Human Rights Issues
  • Marine and Coastal Research
  • Advancements in PLL and VCO Technologies

Chinese University of Hong Kong, Shenzhen
2022-2025

Shanghai Artificial Intelligence Laboratory
2025

Shenzhen Research Institute of Big Data
2023-2025

Jingdong (China)
2019

University of Edinburgh
2014-2017

Apple (United States)
2017

Nanyang Technological University
2010-2015

Shanghai University
2015

Edinburgh College
2015

National Institute of Informatics
2014

An increasing number of independent studies have confirmed the vulnerability of automatic speaker verification (ASV) technology to spoofing. However, in comparison to that involving other biometric modalities, spoofing and countermeasure research for ASV is still in its infancy. A current barrier to progress is the lack of standards, which impedes the comparison of results generated by different researchers. The ASVspoof initiative aims to overcome this bottleneck through the provision of standard corpora, protocols and metrics to support a common...

10.21437/interspeech.2015-462 article EN Interspeech 2022 2015-09-06

We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), and long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and extensible. This paper briefly describes the system, and provides some...

10.21437/ssw.2016-33 article EN 2016-09-13

Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking...

10.1109/icassp.2015.7178814 article EN 2015-04-01

Concerns regarding the vulnerability of automatic speaker verification (ASV) technology against spoofing can undermine confidence in its reliability and form a barrier to exploitation. The absence of competitive evaluations and the lack of common datasets have hampered progress in developing effective spoofing countermeasures. This paper describes the ASV Spoofing and Countermeasures (ASVspoof) initiative, which aims to fill this void. Through the provision of a common dataset, protocols, and metrics, ASVspoof promotes a sound research methodology...

10.1109/jstsp.2017.2671435 article EN IEEE Journal of Selected Topics in Signal Processing 2017-02-17

This paper describes the Voice Conversion Challenge 2016 devised by the authors to better understand different voice conversion (VC) techniques by comparing their performance on a common dataset. The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25...

10.21437/interspeech.2016-1066 article EN Interspeech 2022 2016-08-29

Voice conversion, the methodology of automatically converting one's utterances to sound as if spoken by another speaker, presents a threat for applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented voice conversion systems with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset...

10.1109/icassp.2012.6288895 article EN 2012-03-01

We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly without dimensionality reduction to maintain spectral details. In addition, a spectral compression...

10.1109/taslp.2014.2333242 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2014-06-25

Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech and synthetic/converted speech. Motivated by the research on phase spectrum in speech perception, in this study, we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three different training situations of the converted speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for...

10.21437/interspeech.2012-465 article EN Interspeech 2022 2012-09-09

Replay, which is to play back a pre-recorded speech sample, presents a genuine risk to automatic speaker verification technology. In this study, we evaluate the vulnerability of text-dependent speaker verification systems under the replay attack using a standard benchmarking database, and also propose an anti-spoofing technique to safeguard the speaker verification systems. The key idea of the spoofing detection technique is to decide whether the presented sample is matched to any previous stored speech samples based on a similarity score. The experiments conducted on the RSR2015 database showed that...

10.1109/apsipa.2014.7041636 article EN 2014-12-01

Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feedforward neural networks, little is known about why....

10.1109/icassp.2016.7472657 article EN 2016-03-01

Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the facts that current analysis-synthesis techniques operate on the frame level and make the frame-by-frame independence assumption, we proposed to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation...

10.1109/icassp.2013.6639067 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01

A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels. In particular, we augment a low-dimensional...

10.7488/ds/259 article EN 2015-09-06

In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark against...

10.1109/taslp.2016.2526653 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-02-08

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion...

10.48550/arxiv.2403.03100 preprint EN arXiv (Cornell University) 2024-03-05

The conventional statistical-based transformation functions for voice conversion have been shown to suffer over-smoothing and over-fitting problems. The over-smoothing problem arises because of the statistical average during estimating the model parameters for the transformation function. In addition, the large number of parameters in the statistical model cannot be well estimated from the limited parallel training data, which will result in the over-fitting problem. In this work, we investigate a robust transformation function for voice conversion using a conditional restricted Boltzmann machine. Conditional restricted Boltzmann machine, which performs linear...

10.1109/chinasip.2013.6625307 article EN 2013-07-01

Deep neural networks (DNNs) have recently been the focus of much text-to-speech research as a replacement for decision trees and hidden Markov models (HMMs) in statistical parametric synthesis systems. Performance improvements have been reported; however, the configuration of systems evaluated makes it impossible to judge how much of the improvement is due to the new machine learning methods, and how much is due to other novel aspects of the systems. Specifically, whereas the decision trees in HMM-based systems typically operate at the state-level, and separate trees are used to handle separate acoustic streams, most...

10.1109/icassp.2016.7472730 article EN 2016-03-01

Any biometric recognizer is vulnerable to spoofing attacks and hence voice biometric, also called automatic speaker verification (ASV), is no exception; replay, synthesis, and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks, considered as one of the most challenging for modern recognition systems. To detect spoofing, most existing countermeasures assume explicit or implicit knowledge of a particular VC system and focus on designing discriminative features. In this paper, we explore back-end...

10.1109/tifs.2015.2407362 article EN IEEE Transactions on Information Forensics and Security 2015-02-26

This paper presents the first version of a speaker verification spoofing and anti-spoofing database, named SAS corpus. The corpus includes nine spoofing techniques, two of which are speech synthesis, and seven are voice conversion. We design two protocols, one for standard speaker verification evaluation, and the other for producing spoofing materials. Hence, they allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without...

10.1109/icassp.2015.7178810 article EN 2015-04-01

The Voice Conversion Challenge 2016 is the first Voice Conversion Challenge in which different voice conversion systems and approaches using the same voice data were compared. This paper describes the design of the evaluation, and it presents the results and statistical analyses of the results.

10.21437/interspeech.2016-1331 article EN Interspeech 2022 2016-08-28

A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels. In particular, we augment a low-dimensional...

10.21437/interspeech.2015-270 article EN Interspeech 2022 2015-09-06