- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Advanced Adaptive Filtering Techniques
- Music Technology and Sound Studies
- Advanced Data Compression Techniques
- Image and Signal Denoising Methods
- Neural Networks and Applications
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Speech and Dialogue Systems
- Voice and Speech Disorders
- Spectroscopy and Chemometric Analyses
- Face and Expression Recognition
- Bayesian Methods and Mixture Models
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Underwater Acoustics Research
- Infant Health and Development
- Time Series Analysis and Forecasting
- Hearing Loss and Rehabilitation
- Optical measurement and interference techniques
- Digital Media Forensic Detection
- Neuroscience and Music Perception
- NTT (Japan): 2016-2025
- NTT Basic Research Laboratories: 2014-2022
- Ritsumeikan University: 2021
- The University of Tokyo: 2008-2019
- Amazon (United States): 2018-2019
- University of Science and Technology of China: 2019
- University of Udine: 2018-2019
- The Ohio State University: 2018-2019
- University of Cambridge: 2018-2019
- Middle East Technical University: 2018-2019
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns mappings across different attribute domains using a single generator network, and (3) is able to generate converted speech signals quickly enough to allow real-time...
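A minimal sketch, assuming PyTorch and hypothetical placeholder networks G (generator), D (discriminator), and C (domain classifier), of the kind of loss terms a StarGAN-style many-to-many VC generator combines: adversarial, domain-classification, and cycle-consistency terms. This is illustrative only, not the paper's implementation.

```python
# Illustrative only: loss terms for a StarGAN-style many-to-many VC generator.
# G, D, C and the feature/label shapes are hypothetical placeholders.
import torch
import torch.nn.functional as F

def stargan_vc_generator_loss(G, D, C, x_src, c_src, c_trg,
                              lambda_cls=1.0, lambda_cyc=10.0):
    """x_src: source acoustic features; c_src / c_trg: one-hot attribute codes."""
    x_fake = G(x_src, c_trg)                                # source -> target domain
    adv = -D(x_fake, c_trg).mean()                          # fool the discriminator
    cls = F.cross_entropy(C(x_fake), c_trg.argmax(dim=1))   # land in the target domain
    x_cyc = G(x_fake, c_src)                                # map back to the source domain
    cyc = F.l1_loss(x_cyc, x_src)                           # cycle-consistency
    return adv + lambda_cls * cls + lambda_cyc * cyc
```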
This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since its source model is based on a spherical multivariate distribution, it cannot utilize specific spectral structures such as the harmonic structures of pitched...
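As a rough sketch (notation mine, not taken from the paper), the unification can be read as keeping IVA's demixing-matrix estimation while replacing its frequency-uniform source variance with a low-rank NMF variance model, giving a negative log-likelihood of roughly the following form.

```latex
% Sketch only: w_{i,n} is the n-th row of the demixing matrix W_i at frequency i,
% x_{ij} the observed mixture at frequency i and frame j, t and v the NMF factors,
% and J the number of frames.
\begin{align}
  y_{ij,n} &= \mathbf{w}_{i,n}^{\mathsf H}\,\mathbf{x}_{ij}, \qquad
  r_{ij,n} = \sum_{k} t_{ik,n}\, v_{kj,n},\\
  \mathcal{J} &= \sum_{i,j,n}\left(\frac{|y_{ij,n}|^{2}}{r_{ij,n}} + \log r_{ij,n}\right)
               \;-\; 2J \sum_{i} \log\bigl|\det \mathbf{W}_{i}\bigr| .
\end{align}
```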
We propose a non-parallel voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is particularly noteworthy in that it is general purpose and high quality and works without any extra data, modules, or alignment procedure. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss. A CycleGAN learns forward and inverse mappings simultaneously...
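A minimal sketch, assuming PyTorch and hypothetical generators G_xy/G_yx and discriminators D_x/D_y, of how the adversarial, cycle-consistency, and identity-mapping terms named above can be combined on the generator side; a simplified illustration, not the authors' code.

```python
# Illustrative only: generator-side objective of a CycleGAN-style VC setup.
# G_xy, G_yx, D_x, D_y and the feature shapes are hypothetical placeholders.
import torch.nn.functional as F

def cyclegan_vc_generator_loss(G_xy, G_yx, D_x, D_y, x, y,
                               lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    adv = -D_y(fake_y).mean() - D_x(fake_x).mean()                 # fool both discriminators
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)  # forward/inverse cycles
    idt = F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)            # identity-mapping loss
    return adv + lambda_cyc * cyc + lambda_id * idt
```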
This paper presents new formulations and algorithms for multichannel extensions of non-negative matrix factorization (NMF). The formulations employ Hermitian positive semidefinite matrices to represent a multichannel version of non-negative elements. Multichannel Euclidean distance and multichannel Itakura-Saito (IS) divergence are defined based on appropriate statistical models utilizing multivariate complex Gaussian distributions. To minimize this distance/divergence, efficient optimization in the form of multiplicative updates is derived by using...
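To make the IS-divergence and multiplicative-update idea concrete, here is a compact NumPy sketch of the single-channel building block (plain IS-NMF with heuristic multiplicative updates). The paper's multichannel formulation, which additionally estimates Hermitian positive semidefinite spatial matrices, is not reproduced here.

```python
# Single-channel IS-NMF with heuristic multiplicative updates (sketch only);
# X is a power spectrogram (frequencies x frames), K the number of bases.
import numpy as np

def is_nmf(X, K, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F_, N = X.shape
    T = rng.random((F_, K)) + eps   # spectral bases
    V = rng.random((K, N)) + eps    # temporal activations
    for _ in range(n_iter):
        R = T @ V + eps
        T *= ((X / R**2) @ V.T) / ((1.0 / R) @ V.T)
        R = T @ V + eps
        V *= (T.T @ (X / R**2)) / (T.T @ (1.0 / R))
    return T, V
```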
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC provided a breakthrough and performed comparably to a parallel VC method without any extra data, modules, or time alignment procedures. However, there is still a large gap between real and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose...
We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free, and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural...
This paper proposes a multipitch analyzer called the harmonic temporal structured clustering (HTC) method, which jointly estimates the pitch, intensity, onset, duration, etc., of each underlying source in an audio signal. HTC decomposes the energy patterns diffused in time-frequency space, i.e., the power spectrum time series, into distinct clusters such that each has originated from a single source. The problem is equivalent to approximating the observed power spectrum time series by superimposed source models, whose parameters are associated with...
This paper presents a new sparse representation for acoustic signals which is based on a mixing model defined in the complex-spectrum domain (where additivity holds), and allows us to extract recurrent patterns of magnitude spectra that underlie observed complex spectra and the phase estimates of constituent signals. An efficient iterative algorithm is derived, which reduces to the multiplicative update algorithm for non-negative matrix factorization developed by Lee under a particular condition.
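A sketch of the mixing model referred to above (notation mine, not the paper's): additivity is assumed for the complex spectra, with each basis k carrying a recurrent magnitude pattern and a phase of its own that varies over time and frequency.

```latex
% Sketch only: Y(\omega, t) is the observed complex spectrum; a_k and u_k are
% non-negative magnitude/activation factors and \phi_k a basis-specific phase.
\begin{equation}
  Y(\omega, t) \;\approx\; \sum_{k} a_k(\omega)\, u_k(t)\,
  e^{\,\mathrm{j}\,\phi_k(\omega, t)},
  \qquad a_k(\omega) \ge 0,\; u_k(t) \ge 0 .
\end{equation}
```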
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving...
We propose a postfilter based on a generative adversarial network (GAN) to compensate for the differences between natural speech and speech synthesized by statistical parametric speech synthesis. In particular, we focus on the differences caused by over-smoothing, which makes the synthesized speech sound muffled. Over-smoothing occurs in both the time and frequency directions and is highly correlated in both directions, but conventional methods based on heuristics are too limited to cover all the factors (e.g., global variance was designed only to recover the dynamic range). To solve this...
This paper proposes a non-parallel voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE. The proposed method has two key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that they can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses information-theoretic regularization for the model training to ensure that the information in the attribute class...
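A rough PyTorch sketch, with hypothetical encoder/decoder/classifier networks Enc, Dec, and Cls, of the flavor of criterion described: a conditional-VAE term plus an auxiliary-classifier term that keeps the attribute class information in the output. This is a simplification, not the paper's exact regularizer.

```python
# Illustrative only: conditional VAE with an auxiliary classifier term.
# Enc, Dec, Cls and the tensor shapes are hypothetical placeholders.
import torch
import torch.nn.functional as F

def acvae_loss(Enc, Dec, Cls, x, c, lambda_cls=1.0):
    mu, logvar = Enc(x, c)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    x_rec = Dec(z, c)
    rec = F.l1_loss(x_rec, x)                                  # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    cls = F.cross_entropy(Cls(x_rec), c.argmax(dim=1))         # keep class info in the output
    return rec + kl + lambda_cls * cls
```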
This paper describes a method based on sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanisms for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and the proposed context preservation losses, and 2) allows not only spectral envelopes but also...
This letter proposes a multichannel source separation technique, the multichannel variational autoencoder (MVAE) method, which uses a conditional VAE (CVAE) to model and estimate the power spectrograms of the sources in a mixture. By training the CVAE using examples with source-class labels, we can use the trained decoder distribution as a universal generative model capable of generating spectrograms conditioned on a specified class index. By treating the latent space variables and the class index as the unknown parameters of this model, we develop a convergence-guaranteed algorithm...
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speech without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion, they are typically applied to mel-cepstrum conversion even when comparative methods employ mel-spectrograms as the conversion target. To address this, we...
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, its data-hungry property and the mispronunciation...
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. Such a vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach...
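To make the three inverse problems concrete, here is a classical (non-neural) reference pipeline using librosa: pseudo-inverting the mel filterbank recovers an original-scale magnitude spectrogram, Griffin-Lim reconstructs the phase, and the inverse STFT inside it performs the frequency-to-time conversion. This is only a baseline sketch; the vocoder described in the abstract addresses the same three problems with a neural network.

```python
# Classical baseline for the three inverse problems (sketch, not the paper's model).
import librosa

def mel_to_waveform(mel_power, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    # 1) recover an original-scale linear magnitude spectrogram from the mel power spectrogram
    S = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft, power=2.0)
    # 2) reconstruct phase and 3) convert frequency to time (inverse STFT) via Griffin-Lim
    return librosa.griffinlim(S, n_iter=n_iter, hop_length=hop_length)
```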
We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, voice conversion, and speech enhancement, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict these representations from the inputs; however, the generated spectra typically lack the fine structures that are close to those of the true data. To overcome these limitations...
This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to...
Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in vocoders. To overcome this, we propose CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module...