- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Advanced Adaptive Filtering Techniques
- Music Technology and Sound Studies
- Advanced Data Compression Techniques
- Image and Signal Denoising Methods
- Neural Networks and Applications
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Speech and Dialogue Systems
- Voice and Speech Disorders
- Spectroscopy and Chemometric Analyses
- Face and Expression Recognition
- Bayesian Methods and Mixture Models
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Underwater Acoustics Research
- Infant Health and Development
- Time Series Analysis and Forecasting
- Hearing Loss and Rehabilitation
- Optical measurement and interference techniques
- Digital Media Forensic Detection
- Neuroscience and Music Perception
- NTT (Japan): 2016-2025
- NTT Basic Research Laboratories: 2014-2022
- Ritsumeikan University: 2021
- The University of Tokyo: 2008-2019
- Amazon (United States): 2018-2019
- University of Science and Technology of China: 2019
- University of Udine: 2018-2019
- The Ohio State University: 2018-2019
- University of Cambridge: 2018-2019
- Middle East Technical University: 2018-2019
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns mappings across different attribute domains using a single generator network, and (3) is able to generate converted speech signals quickly enough to allow real-time...
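A minimal sketch, assuming PyTorch and hypothetical placeholder networks G (generator), D (discriminator), and C (domain classifier), of the kind of loss terms a StarGAN-style many-to-many VC generator combines: adversarial, domain-classification, and cycle-consistency terms. This is illustrative only, not the paper's implementation.

```python
# Illustrative only: loss terms for a StarGAN-style many-to-many VC generator.
# G, D, C and the feature/label shapes are hypothetical placeholders.
import torch
import torch.nn.functional as F

def stargan_vc_generator_loss(G, D, C, x_src, c_src, c_trg,
                              lambda_cls=1.0, lambda_cyc=10.0):
    """x_src: source acoustic features; c_src / c_trg: one-hot attribute codes."""
    x_fake = G(x_src, c_trg)                                # source -> target domain
    adv = -D(x_fake, c_trg).mean()                          # fool the discriminator
    cls = F.cross_entropy(C(x_fake), c_trg.argmax(dim=1))   # land in the target domain
    x_cyc = G(x_fake, c_src)                                # map back to the source domain
    cyc = F.l1_loss(x_cyc, x_src)                           # cycle-consistency
    return adv + lambda_cls * cls + lambda_cyc * cyc
```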
This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since its source model is based on a spherical multivariate distribution, it cannot utilize specific spectral structures such as the harmonic structures of pitched...
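As a rough sketch (notation mine, not taken from the paper), the unification can be read as keeping IVA's demixing-matrix estimation while replacing its frequency-uniform source variance with a low-rank NMF variance model, giving a negative log-likelihood of roughly the following form.

```latex
% Sketch only: w_{i,n} is the n-th row of the demixing matrix W_i at frequency i,
% x_{ij} the observed mixture at frequency i and frame j, t and v the NMF factors,
% and J the number of frames.
\begin{align}
  y_{ij,n} &= \mathbf{w}_{i,n}^{\mathsf H}\,\mathbf{x}_{ij}, \qquad
  r_{ij,n} = \sum_{k} t_{ik,n}\, v_{kj,n},\\
  \mathcal{J} &= \sum_{i,j,n}\left(\frac{|y_{ij,n}|^{2}}{r_{ij,n}} + \log r_{ij,n}\right)
               \;-\; 2J \sum_{i} \log\bigl|\det \mathbf{W}_{i}\bigr| .
\end{align}
```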
We propose a non-parallel voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is particularly noteworthy in that it is general purpose and high quality and works without any extra data, modules, or alignment procedure. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss. A CycleGAN learns forward and inverse mappings simultaneously...
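A minimal sketch, assuming PyTorch and hypothetical generators G_xy/G_yx and discriminators D_x/D_y, of how the adversarial, cycle-consistency, and identity-mapping terms named above can be combined on the generator side; a simplified illustration, not the authors' code.

```python
# Illustrative only: generator-side objective of a CycleGAN-style VC setup.
# G_xy, G_yx, D_x, D_y and the feature shapes are hypothetical placeholders.
import torch.nn.functional as F

def cyclegan_vc_generator_loss(G_xy, G_yx, D_x, D_y, x, y,
                               lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    adv = -D_y(fake_y).mean() - D_x(fake_x).mean()                 # fool both discriminators
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)  # forward/inverse cycles
    idt = F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)            # identity-mapping loss
    return adv + lambda_cyc * cyc + lambda_id * idt
```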
This paper presents new formulations and algorithms for multichannel extensions of non-negative matrix factorization (NMF). The formulations employ Hermitian positive semidefinite matrices to represent a multichannel version of non-negative elements. Multichannel Euclidean distance and multichannel Itakura-Saito (IS) divergence are defined based on appropriate statistical models utilizing multivariate complex Gaussian distributions. To minimize this distance/divergence, efficient optimization in the form of multiplicative updates is derived by using...
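To make the IS-divergence and multiplicative-update idea concrete, here is a compact NumPy sketch of the single-channel building block (plain IS-NMF with heuristic multiplicative updates). The paper's multichannel formulation, which additionally estimates Hermitian positive semidefinite spatial matrices, is not reproduced here.

```python
# Single-channel IS-NMF with heuristic multiplicative updates (sketch only);
# X is a power spectrogram (frequencies x frames), K the number of bases.
import numpy as np

def is_nmf(X, K, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F_, N = X.shape
    T = rng.random((F_, K)) + eps   # spectral bases
    V = rng.random((K, N)) + eps    # temporal activations
    for _ in range(n_iter):
        R = T @ V + eps
        T *= ((X / R**2) @ V.T) / ((1.0 / R) @ V.T)
        R = T @ V + eps
        V *= (T.T @ (X / R**2)) / (T.T @ (1.0 / R))
    return T, V
```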
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC provided a breakthrough and performed comparably to a parallel VC method without any extra data, modules, or time alignment procedures. However, there is still a large gap between real and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose...
We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free, and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural...
This paper proposes a multipitch analyzer called the harmonic temporal structured clustering (HTC) method, which jointly estimates the pitch, intensity, onset, duration, etc., of each underlying source in an audio signal. HTC decomposes the energy patterns diffused in time-frequency space, i.e., the power spectrum time series, into distinct clusters such that each has originated from a single source. The problem is equivalent to approximating the observed power spectrum time series by superimposed source models, whose parameters are associated with...
This paper presents a new sparse representation for acoustic signals which is based on a mixing model defined in the complex-spectrum domain (where additivity holds), and allows us to extract recurrent patterns of magnitude spectra that underlie observed complex spectra and the phase estimates of constituent signals. An efficient iterative algorithm is derived, which reduces to the multiplicative update algorithm for non-negative matrix factorization developed by Lee under a particular condition.
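A sketch of the mixing model referred to above (notation mine, not the paper's): additivity is assumed for the complex spectra, with each basis k carrying a recurrent magnitude pattern and a phase of its own that varies over time and frequency.

```latex
% Sketch only: Y(\omega, t) is the observed complex spectrum; a_k and u_k are
% non-negative magnitude/activation factors and \phi_k a basis-specific phase.
\begin{equation}
  Y(\omega, t) \;\approx\; \sum_{k} a_k(\omega)\, u_k(t)\,
  e^{\,\mathrm{j}\,\phi_k(\omega, t)},
  \qquad a_k(\omega) \ge 0,\; u_k(t) \ge 0 .
\end{equation}
```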
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving...
We propose a postfilter based on a generative adversarial network (GAN) to compensate for the differences between natural speech and speech synthesized by statistical parametric speech synthesis. In particular, we focus on the differences caused by over-smoothing, which makes the synthesized speech sound muffled. Over-smoothing occurs in both the time and frequency directions and is highly correlated in both directions, but conventional methods based on heuristics are too limited to cover all the factors (e.g., global variance was designed only to recover the dynamic range). To solve this...
This paper proposes a non-parallel voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE. The proposed method has two key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that they can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses information-theoretic regularization for the model training to ensure that the information in the attribute class...
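A rough PyTorch sketch, with hypothetical encoder/decoder/classifier networks Enc, Dec, and Cls, of the flavor of criterion described: a conditional-VAE term plus an auxiliary-classifier term that keeps the attribute class information in the output. This is a simplification, not the paper's exact regularizer.

```python
# Illustrative only: conditional VAE with an auxiliary classifier term.
# Enc, Dec, Cls and the tensor shapes are hypothetical placeholders.
import torch
import torch.nn.functional as F

def acvae_loss(Enc, Dec, Cls, x, c, lambda_cls=1.0):
    mu, logvar = Enc(x, c)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    x_rec = Dec(z, c)
    rec = F.l1_loss(x_rec, x)                                  # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    cls = F.cross_entropy(Cls(x_rec), c.argmax(dim=1))         # keep class info in the output
    return rec + kl + lambda_cls * cls
```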
This paper describes a method based on sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanisms for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and the proposed context preservation losses, and 2) allows not only spectral envelopes but also...
This letter proposes a multichannel source separation technique, the multichannel variational autoencoder (MVAE) method, which uses a conditional VAE (CVAE) to model and estimate the power spectrograms of the sources in a mixture. By training the CVAE using examples with source-class labels, we can use the trained decoder distribution as a universal generative model capable of generating spectrograms conditioned on a specified class index. By treating the latent space variables and the class index as the unknown parameters of this model, we develop a convergence-guaranteed algorithm...
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speech without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion, they are typically applied to mel-cepstrum conversion even when comparative methods employ mel-spectrograms as the conversion target. To address this, we...
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, its data-hungry property and the mispronunciation...
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. Such a vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach...
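To make the three inverse problems concrete, here is a classical (non-neural) reference pipeline using librosa: pseudo-inverting the mel filterbank recovers an original-scale magnitude spectrogram, Griffin-Lim reconstructs the phase, and the inverse STFT inside it performs the frequency-to-time conversion. This is only a baseline sketch; the vocoder described in the abstract addresses the same three problems with a neural network.

```python
# Classical baseline for the three inverse problems (sketch, not the paper's model).
import librosa

def mel_to_waveform(mel_power, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    # 1) recover an original-scale linear magnitude spectrogram from the mel power spectrogram
    S = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft, power=2.0)
    # 2) reconstruct phase and 3) convert frequency to time (inverse STFT) via Griffin-Lim
    return librosa.griffinlim(S, n_iter=n_iter, hop_length=hop_length)
```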
We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, voice conversion, and speech enhancement, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict these representations from the inputs; however, the generated spectra typically lack the fine structures that are close to those of the true data. To overcome these limitations...
This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to...
Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in vocoders. To overcome this, we propose CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module...