Hirokazu Kameoka

ORCID: 0000-0003-3102-0162
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Blind Source Separation Techniques
  • Advanced Adaptive Filtering Techniques
  • Music Technology and Sound Studies
  • Advanced Data Compression Techniques
  • Image and Signal Denoising Methods
  • Neural Networks and Applications
  • Natural Language Processing Techniques
  • Phonetics and Phonology Research
  • Speech and dialogue systems
  • Voice and Speech Disorders
  • Spectroscopy and Chemometric Analyses
  • Face and Expression Recognition
  • Bayesian Methods and Mixture Models
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Underwater Acoustics Research
  • Infant Health and Development
  • Time Series Analysis and Forecasting
  • Hearing Loss and Rehabilitation
  • Optical measurement and interference techniques
  • Digital Media Forensic Detection
  • Neuroscience and Music Perception

NTT (Japan)
2016-2025

NTT Basic Research Laboratories
2014-2022

Ritsumeikan University
2021

The University of Tokyo
2008-2019

Amazon (United States)
2018-2019

University of Science and Technology of China
2019

University of Udine
2018-2019

The Ohio State University
2018-2019

University of Cambridge
2018-2019

Middle East Technical University
2018-2019

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, and (3) is able to generate converted speech signals quickly enough to allow real-time...

10.1109/slt.2018.8639535 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for it. However, since the source model in IVA is based on a spherical multivariate distribution, it cannot utilize specific spectral structures such as the harmonic structures of pitched...

10.1109/taslp.2016.2577880 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-06-07

We propose a non-parallel voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is particularly noteworthy in that it is general purpose and high quality and works without any extra data, modules, or alignment procedure. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss. A CycleGAN learns forward and inverse mappings simultaneously...

10.23919/eusipco.2018.8553236 article EN 2018 26th European Signal Processing Conference (EUSIPCO) 2018-09-01
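The two losses named in the abstract above can be illustrated in a toy sketch. Here the "generators" are plain linear maps chosen purely for illustration (the actual CycleGAN-VC generators are gated CNNs over acoustic feature sequences); the function names `g_xy`/`g_yx` are hypothetical.

```python
import numpy as np

# Toy sketch of the CycleGAN cycle-consistency and identity-mapping losses.
# The real method uses gated CNN generators on acoustic features; here the
# "generators" are invertible linear maps purely for illustration.

rng = np.random.default_rng(0)

A = rng.standard_normal((4, 4))  # forward "generator" G_XY
A_inv = np.linalg.inv(A)         # inverse "generator" G_YX

def g_xy(x):  # source domain X -> target domain Y
    return A @ x

def g_yx(y):  # target domain Y -> source domain X
    return A_inv @ y

def cycle_consistency_loss(x):
    # L_cyc = || G_YX(G_XY(x)) - x ||_1 : forces the round trip to
    # preserve content, which is what removes the need for parallel data.
    return np.abs(g_yx(g_xy(x)) - x).sum()

def identity_mapping_loss(y):
    # L_id = || G_XY(y) - y ||_1 : a target-domain input fed to the forward
    # generator should pass through unchanged, preserving composition.
    return np.abs(g_xy(y) - y).sum()

x = rng.standard_normal(4)
print(cycle_consistency_loss(x))  # ~0: the toy generators are exact inverses
print(identity_mapping_loss(x))   # > 0: A is not the identity map
```

With exact inverses the cycle loss vanishes; in training, both losses are minimized jointly with the adversarial losses of the two discriminators.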

This paper presents new formulations and algorithms for multichannel extensions of non-negative matrix factorization (NMF). The formulations employ Hermitian positive semidefinite matrices to represent a multichannel version of non-negative elements. A multichannel Euclidean distance and a multichannel Itakura-Saito (IS) divergence are defined based on appropriate statistical models utilizing multivariate complex Gaussian distributions. To minimize this distance/divergence, efficient optimization algorithms in the form of multiplicative updates are derived by using...

10.1109/tasl.2013.2239990 article EN IEEE Transactions on Audio Speech and Language Processing 2013-01-14
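The multichannel formulation above builds on the Itakura-Saito divergence. As a minimal sketch (single-channel, scalar case only, not the paper's Hermitian-matrix extension), the IS divergence and its scale invariance look like this:

```python
import numpy as np

# Single-channel Itakura-Saito (IS) divergence between two non-negative
# spectra. Its scale invariance (d_IS(c*p, c*q) = d_IS(p, q)) makes it a
# natural fit for power spectrograms with large dynamic range.

def is_divergence(p, q, eps=1e-12):
    # d_IS(p | q) = sum(p/q - log(p/q) - 1); zero iff p == q elementwise.
    r = (p + eps) / (q + eps)
    return np.sum(r - np.log(r) - 1.0)

p = np.array([1.0, 2.0, 4.0])
print(is_divergence(p, p))  # → 0.0 for identical inputs
# Scale invariance: scaling both arguments leaves the divergence unchanged.
print(np.isclose(is_divergence(p, 2 * p), is_divergence(3 * p, 6 * p)))  # → True
```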

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose...

10.1109/icassp.2019.8682897 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-16

We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free, and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural...

10.48550/arxiv.1711.11293 preprint EN other-oa arXiv (Cornell University) 2017-01-01

This paper proposes a multipitch analyzer called the harmonic temporal structured clustering (HTC) method, which jointly estimates the pitch, intensity, onset, duration, etc., of each underlying source in an audio signal. HTC decomposes the energy patterns diffused in time-frequency space, i.e., the power spectrum time series, into distinct clusters such that each has originated from a single source. The problem is equivalent to approximating the observed power spectrum time series by superimposed source models, whose parameters are associated with...

10.1109/tasl.2006.885248 article EN IEEE Transactions on Audio Speech and Language Processing 2007-03-01

This paper presents a new sparse representation for acoustic signals which is based on a mixing model defined in the complex-spectrum domain (where additivity holds), and allows us to extract recurrent patterns of magnitude spectra that underlie observed complex spectra and the phase estimates of the constituent signals. An efficient iterative algorithm is derived, which reduces to the multiplicative update for non-negative matrix factorization developed by Lee under a particular condition.

10.1109/icassp.2009.4960364 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2009-04-01
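For reference, the multiplicative-update NMF that the algorithm above reduces to can be sketched in a few lines. This is the standard Lee-Seung Euclidean-distance variant on a synthetic non-negative "spectrogram", not the paper's complex-spectrum model; all names here are illustrative.

```python
import numpy as np

# Minimal NMF with Lee-Seung multiplicative updates (Euclidean distance):
# approximate a non-negative matrix V by W @ H with non-negative factors.
# The update rule keeps factors non-negative because it only multiplies
# by ratios of non-negative quantities.

def nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # basis spectra (F x rank)
    H = rng.random((rank, T)) + eps   # activations (rank x T)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update basis spectra
    return W, H

# Synthetic non-negative target with exact rank 2, so a rank-2
# factorization can fit it closely.
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 30))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # small relative reconstruction error
```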

Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving...

10.21437/interspeech.2019-2236 article EN Interspeech 2019 2019-09-13

We propose a postfilter based on a generative adversarial network (GAN) to compensate for the differences between natural speech and speech synthesized by statistical parametric speech synthesis. In particular, we focus on the differences caused by over-smoothing, which makes the synthesized speech sound muffled. Over-smoothing occurs in both the time and frequency directions and is highly correlated in both directions, so conventional methods based on heuristics are too limited to cover all the factors (e.g., global variance was designed only to recover the dynamic range). To solve this...

10.1109/icassp.2017.7953090 article EN 2017-03-01

This paper proposes a non-parallel voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE. The proposed method has two key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that they can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses an information-theoretic regularization for model training to ensure that the information in the attribute class...

10.1109/taslp.2019.2917232 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2019-05-25

This paper describes a method based on sequence-to-sequence learning (Seq2Seq) with attention and a context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and proposed context preservation losses, and 2) allows not only spectral envelopes but also...

10.1109/icassp.2019.8683282 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-16

This letter proposes a multichannel source separation technique, the multichannel variational autoencoder (MVAE) method, which uses a conditional VAE (CVAE) to model and estimate the power spectrograms of the sources in a mixture. By training the CVAE using examples with source-class labels, we can use the trained decoder distribution as a universal generative model capable of generating spectrograms conditioned on a specified class index. By treating the latent space variables and the class index as the unknown parameters of this model, we develop a convergence-guaranteed algorithm...

10.1162/neco_a_01217 article EN Neural Computation 2019-07-23

Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speech without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrograms as the conversion target. To address this, we...

10.21437/interspeech.2020-2280 article EN Interspeech 2020 2020-10-25

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, the data-hungry property and the mispronunciation...

10.21437/interspeech.2020-1066 article EN Interspeech 2020 2020-10-25

In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach...

10.1109/icassp43922.2022.9746713 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
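Of the three inverse problems listed in the abstract above, phase reconstruction and frequency-to-time conversion have a classical non-neural baseline: the Griffin-Lim algorithm. The sketch below (numpy only, not the paper's method; `stft`/`istft`/`griffin_lim` are illustrative helpers) iterates between the time and time-frequency domains, keeping the estimated phase while resetting the magnitude to the target at each step.

```python
import numpy as np

# Griffin-Lim phase reconstruction: given only a magnitude STFT, recover a
# waveform whose STFT magnitude matches it, by alternating projections.

def stft(x, win=256, hop=128):
    w = np.hanning(win)
    n = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def istft(S, win=256, hop=128):
    w = np.hanning(win)
    frames = np.fft.irfft(S, n=win, axis=1)
    x = np.zeros(win + hop * (len(S) - 1))
    norm = np.zeros_like(x)
    for i, f in enumerate(frames):
        x[i * hop:i * hop + win] += f * w       # weighted overlap-add
        norm[i * hop:i * hop + win] += w ** 2
    return x / np.maximum(norm, 1e-8)           # window-sum normalization

def griffin_lim(mag, n_iter=60, seed=0):
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random init phase
    for _ in range(n_iter):
        x = istft(mag * phase)                   # frequency-to-time
        phase = np.exp(1j * np.angle(stft(x)))   # keep phase, reset magnitude
    return istft(mag * phase)

# Reconstruct a 440 Hz tone from its magnitude spectrogram alone.
t = np.arange(4096) / 16000.0
target = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(stft(target))
recon = griffin_lim(mag)
err = np.linalg.norm(np.abs(stft(recon)) - mag) / np.linalg.norm(mag)
print(err)  # spectral-convergence error, small after 60 iterations
```

Neural vocoders replace this slow iterative scheme with a single learned forward pass, which is what makes the implicit-versus-explicit design question in the abstract relevant.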

We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, voice conversion, and speech enhancement, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict the representations from inputs; however, the generated spectra typically lack fine structures that are close to those of the true data. To overcome these limitations...

10.21437/interspeech.2017-962 article EN Interspeech 2017 2017-08-16

This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computation using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to...

10.1109/taslp.2020.3001456 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2020-01-01

Non-parallel voice conversion (VC) is a technique for training converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module...

10.1109/icassp39728.2021.9414851 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13