Hiroshi Saruwatari

ORCID: 0000-0003-0876-5617
About
Contact & Profiles
Research Areas
  • Speech and Audio Processing
  • Blind Source Separation Techniques
  • Speech Recognition and Synthesis
  • Advanced Adaptive Filtering Techniques
  • Music and Audio Processing
  • Speech and dialogue systems
  • Natural Language Processing Techniques
  • Acoustic Wave Phenomena Research
  • Hearing Loss and Rehabilitation
  • Image and Signal Denoising Methods
  • Topic Modeling
  • Advanced Algorithms and Applications
  • Neural Networks and Applications
  • Underwater Acoustics Research
  • Advanced Data Compression Techniques
  • Phonetics and Phonology Research
  • Music Technology and Sound Studies
  • Aerodynamics and Acoustics in Jet Flows
  • Robotics and Automated Systems
  • Structural Health Monitoring Techniques
  • Ultrasonics and Acoustic Wave Propagation
  • Sparse and Compressive Sensing Techniques
  • Spectroscopy and Chemometric Analyses
  • Direction-of-Arrival Estimation Techniques
  • Vehicle Noise and Vibration Control

The University of Tokyo
2016-2025

The Graduate University for Advanced Studies, SOKENDAI
2017

Nara Institute of Science and Technology
2005-2014

Kagoshima University
2009

Nagoya University
1999-2002

Secom (Japan)
1999

Kyushu University
1995

This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since the source model is based on a spherical multivariate distribution, IVA cannot utilize specific spectral structures such as the harmonic structures of pitched...

10.1109/taslp.2016.2577880 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-06-07
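The NMF side of the unification above can be illustrated in isolation. This is a minimal sketch (not the paper's joint algorithm): a nonnegative power spectrogram is approximated by a low-rank product of spectral bases and temporal activations via standard Euclidean multiplicative updates; all sizes and data here are synthetic.

```python
import numpy as np

# Low-rank NMF source model: approximate a power spectrogram X (freq x time)
# by T @ V with T, V nonnegative, via Euclidean multiplicative updates.
rng = np.random.default_rng(0)
F, N, K = 64, 100, 4                 # frequency bins, time frames, bases
X = rng.random((F, N)) + 1e-3        # toy nonnegative "spectrogram"

T = rng.random((F, K)) + 1e-3        # spectral basis vectors
V = rng.random((K, N)) + 1e-3        # temporal activations
for _ in range(200):
    T *= (X @ V.T) / (T @ V @ V.T + 1e-12)   # update bases
    V *= (T.T @ X) / (T.T @ T @ V + 1e-12)   # update activations

err = np.linalg.norm(X - T @ V) / np.linalg.norm(X)
```

The multiplicative form guarantees that T and V stay nonnegative while the reconstruction error decreases monotonically.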

A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the degradation is an oversmoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In...

10.1109/taslp.2017.2761547 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-10-09

Despite several recent proposals to achieve blind source separation (BSS) for realistic acoustic signals, the separation performance is still not good enough. In particular, when the impulse responses are long, the performance is highly limited. In this paper, we consider a two-input, two-output convolutive BSS problem. First, we show that it must be constrained by the condition T > P, where T is the frame length of the DFT and P is the length of the room impulse responses. We show that there is an optimum frame size determined by the trade-off between maintaining the number of samples in each frequency bin...

10.1109/tsa.2003.809193 article EN IEEE Transactions on Speech and Audio Processing 2003-03-01
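The T > P condition can be illustrated numerically. Frequency-domain BSS models mixing as per-bin multiplication, i.e., circular convolution, which approximates true linear convolution only when the DFT frame length T is long relative to the impulse-response length P. The sketch below (illustrative, not the paper's experiment) measures that wrap-around error for a random filter:

```python
import numpy as np

rng = np.random.default_rng(1)
P = 32                               # impulse-response length
h = rng.standard_normal(P)

def circ_error(T):
    """Relative error of circular vs. linear convolution at frame length T."""
    s = rng.standard_normal(T)
    lin = np.convolve(s, h)[:T]                          # true linear convolution
    circ = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h, T)).real  # per-bin product
    return np.linalg.norm(circ - lin) / np.linalg.norm(lin)

short = circ_error(T=P)        # T == P: severe wrap-around error
long_ = circ_error(T=16 * P)   # T >> P: wrap-around is negligible
```

The error shrinks roughly as sqrt(P/T), which is why a sufficiently long frame is necessary for the per-bin instantaneous model to hold.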

We propose a new algorithm for blind source separation (BSS), in which independent component analysis (ICA) and beamforming are combined to resolve the slow-convergence problem in the optimization of ICA. The proposed method consists of the following three parts: (a) frequency-domain ICA with direction-of-arrival (DOA) estimation, (b) null beamforming based on the estimated DOA, and (c) integration of (a) and (b) based on algorithm diversity in both the iteration and the frequency domain. The unmixing matrix obtained by null beamforming is temporarily substituted into the iterative optimization,...

10.1109/tsa.2005.855832 article EN IEEE Transactions on Audio Speech and Language Processing 2006-02-21
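The null-beamforming component used in the two ICA-plus-beamforming papers above can be sketched with a generic narrowband model (this is a textbook construction, not the papers' full system): with a two-element array, choose weights w that pass the target direction at unit gain and place a spatial null in the estimated interferer direction.

```python
import numpy as np

c, f, d = 343.0, 1000.0, 0.1          # speed of sound [m/s], frequency [Hz], mic spacing [m]

def steering(theta):
    """Far-field plane-wave steering vector for a 2-mic array."""
    tau = d * np.sin(theta) / c       # inter-mic delay for arrival angle theta
    return np.array([1.0, np.exp(-2j * np.pi * f * tau)])

# Constraints: w^H a(target) = 1 and w^H a(null) = 0.
A = np.stack([steering(0.0), steering(np.pi / 4)])    # target at 0 rad, null at 45 deg
w = np.linalg.solve(A.conj(), np.array([1.0, 0.0]))   # solve the two constraints for w

gain_target = abs(w.conj() @ steering(0.0))
gain_null = abs(w.conj() @ steering(np.pi / 4))
```

With the DOA estimated by ICA, such a deterministic null beamformer gives a reasonable unmixing matrix without iteration, which is what makes it useful as a substitute during slow ICA convergence.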

We describe a new method of blind source separation (BSS) on a microphone array, combining subband independent component analysis (ICA) and beamforming. The proposed system consists of the following three sections: (1) an ICA-based BSS section with estimation of the direction of arrival (DOA) of the sound source, (2) null beamforming based on the estimated DOA, and (3) an integration algorithm based on diversity. Using this technique, we can resolve the low-convergence problem in the optimization of ICA. To evaluate its effectiveness,...

10.1155/s1110865703305104 article EN cc-by EURASIP Journal on Advances in Signal Processing 2003-10-05

We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to the VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there are less labeled data from different listening tests. Our system is based on ensemble learning of strong and weak learners. The strong learners incorporate several improvements to the fine-tuning of self-supervised learning (SSL) models,...

10.21437/interspeech.2022-439 article EN Interspeech 2022 2022-09-16

This paper describes a new blind signal separation method using the directivity patterns of a microphone array. In this method, to deal with the arrival lags among the microphones, the inverses of the mixing matrices are calculated in the frequency domain so that the separated signals become mutually independent. Since these calculations are carried out in each frequency bin independently, the following problems arise: (1) permutation of the sound sources, and (2) arbitrariness of each source gain. In this paper, we propose a solution in which the directivity patterns are explicitly used to estimate the source direction. As results...

10.1109/icassp.2000.861203 article EN 2002-11-07

In the voice conversion algorithm based on the Gaussian mixture model (GMM) applied to STRAIGHT, the quality of the converted speech is degraded because the converted spectrum is excessively smooth. We propose a GMM-based algorithm with dynamic frequency warping to avoid the over-smoothing. We also propose an algorithm with the addition of a weighted residual spectrum, which is the difference between the converted and frequency-warped spectra, to avoid deterioration of the conversion accuracy of speaker individuality. Results of evaluation experiments clarify that the proposed algorithm is better than the conventional algorithm, with speaker individuality the same as in the proposed...

10.1109/icassp.2001.941046 article EN 2002-11-13

We propose a new blind spatial subtraction array (BSSA), consisting of a noise estimator based on independent component analysis (ICA), for efficient speech enhancement. In this paper, first, we theoretically and experimentally point out that ICA is proficient in noise estimation under a non-point-source noise condition rather than in target-speech estimation. Therefore, the proposed BSSA utilizes ICA as a noise estimator. In BSSA, speech extraction is achieved by subtracting the power spectrum of the noise signals estimated using ICA from the power spectrum of the partly enhanced target signal with...

10.1109/tasl.2008.2011517 article EN IEEE Transactions on Audio Speech and Language Processing 2009-03-24
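The power-domain subtraction step at the heart of BSSA can be sketched as follows. This is a generic spectral-subtraction routine with flooring (parameter names are illustrative, not the paper's notation); the noise power here is assumed to be supplied by some estimator, which in BSSA is the ICA stage.

```python
import numpy as np

def spectral_subtraction(power_obs, power_noise, beta=1.0, floor=0.01):
    """Subtract an estimated noise power spectrum, flooring negative results."""
    diff = power_obs - beta * power_noise
    return np.where(diff > floor * power_obs, diff, floor * power_obs)

rng = np.random.default_rng(2)
speech = rng.random(128) * 4.0            # toy clean power spectrum
noise = rng.random(128)                   # toy noise power spectrum
obs = speech + noise                      # simplified additive power model
enhanced = spectral_subtraction(obs, noise)
```

The flooring keeps the output strictly positive; without it, over-subtraction produces the isolated spectral peaks perceived as musical noise.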

Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies has an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcription, and covers all of the main pronunciations of daily-use Japanese characters. We describe...

10.48550/arxiv.1711.00354 preprint EN cc-by-sa arXiv (Cornell University) 2017-01-01

A sound field recording method based on spherical or circular harmonic analysis for arbitrary array geometry and microphone directivity is proposed. In current methods of harmonic analysis, a sound field is decomposed into harmonic functions with the expansion center given in advance, which is called the global origin, and their coefficients are obtained up to a certain truncation order using the microphone measurements. However, the accuracy of the reconstructed field depends on the predefined position of the global origin and the truncation order, which makes it difficult to apply this technique to an asymmetric...

10.1109/lsp.2017.2775242 article EN IEEE Signal Processing Letters 2017-11-21

In this paper, we propose a new framework called independent deeply learned matrix analysis (IDLMA), which unifies deep neural network (DNN) source modeling and independence-based multichannel audio source separation. IDLMA utilizes both pretrained DNN source models and the statistical independence between sources for the separation, where the time-frequency structures of each source are iteratively optimized by the DNNs while enhancing the estimation accuracy of the spatial demixing filters. As the source generative model, we introduce a complex heavy-tailed...

10.1109/taslp.2019.2925450 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2019-06-27

INTERSPEECH2006: the 9th International Conference on Spoken Language Processing (ICSLP), September 17-21, 2006, Pittsburgh, Pennsylvania, USA.

10.21437/interspeech.2006-582 article EN Interspeech 2006 2006-09-17

In this paper, we provide a theoretical analysis of the amount of musical noise generated in iterative spectral subtraction, and an optimization method for the least musical noise generation. To achieve high-quality noise reduction with low musical noise, iterative spectral subtraction, i.e., iteratively applied weak nonlinear signal processing, has been proposed. Although its effectiveness has been reported experimentally, there have been no theoretical studies. Therefore, we formulate the musical noise generation process by tracing the change in the kurtosis of the spectra, and conduct a comparison between different parameter settings with the same...

10.1109/tasl.2012.2196513 article EN IEEE Transactions on Audio Speech and Language Processing 2012-04-27
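The kurtosis-tracing idea above can be demonstrated numerically. In this minimal sketch (illustrative, not the paper's derivation), the power spectrum of Gaussian noise is exponentially distributed; a crude spectral-subtraction step concentrates probability mass at zero while leaving a heavy tail, which raises the kurtosis — the statistic the paper links to perceived musical noise.

```python
import numpy as np

def kurtosis(x):
    """Standard fourth central moment over squared variance."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2

rng = np.random.default_rng(3)
noise_power = rng.exponential(size=100_000)      # power spectrum of Gaussian noise
processed = np.maximum(noise_power - 1.0, 0.0)   # crude subtraction of the mean noise power

k_before = kurtosis(noise_power)                 # ~9 for an exponential distribution
k_after = kurtosis(processed)                    # larger: the processing is super-Gaussianizing
```

A stronger subtraction coefficient raises the post-processing kurtosis further, which is exactly the trade-off the paper optimizes over iterations.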

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts the target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although the conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality, such as phonetic property and speaking rate, contained in the posterior probabilities, because the source speech parameters are directly used for predicting the target speech parameters. In this work, we assume that the training data partly include parallel data and propose sequence-to-sequence learning of the context posterior probabilities between...

10.21437/interspeech.2017-247 preprint EN Interspeech 2017 2017-08-16

In this paper, statistical-model generalizations of independent low-rank matrix analysis (ILRMA) are proposed for achieving high-quality blind source separation (BSS). BSS is a crucial problem in realizing many audio applications, where the sources must be separated using only the observed mixture signal. Many algorithms for solving the BSS problem have been proposed, especially in the history of independent component analysis and nonnegative matrix factorization. In particular, ILRMA, which can achieve the highest separation performance for music or speech mixtures, assumes both...

10.1186/s13634-018-0549-5 article EN cc-by EURASIP Journal on Advances in Signal Processing 2018-05-02

EUROSPEECH2003: 8th European Conference on Speech Communication and Technology, September 1-4, 2003, Geneva, Switzerland.

10.21437/eurospeech.2003-661 article EN 2003-09-01

Frequency-domain blind source separation (BSS) is shown to be equivalent to two sets of frequency-domain adaptive beamformers (ABFs) under certain conditions. The zero search of the off-diagonal components in the BSS update equation can be viewed as the minimization of the mean square error in the ABFs. The unmixing matrix of BSS and the filter coefficients of the ABFs converge to the same solution if the source signals are ideally independent. If they are dependent, this dependence results in a bias for the correct filter coefficients. Therefore, the BSS performance is limited to that of the ABF, which can use the exact...

10.1155/s1110865703305074 article EN cc-by EURASIP Journal on Advances in Signal Processing 2003-10-05

In this paper, we present novel speaking-aid systems based on one-to-many eigenvoice conversion (EVC) to enhance three types of alaryngeal speech: esophageal speech, electrolaryngeal speech, and body-conducted silent speech. Although alaryngeal speech allows laryngectomees to utter sounds, it suffers from a lack of quality and speaker individuality. To improve alaryngeal speech, alaryngeal-speech-to-speech (AL-to-Speech) methods based on statistical voice conversion have been proposed. EVC is capable of flexibly controlling the converted voice by adapting the conversion model given...

10.1109/taslp.2013.2286917 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2013-10-23

This paper proposes a new efficient multichannel nonnegative matrix factorization (NMF) method. Recently, multichannel NMF (MNMF) has been proposed as a means of solving the blind source separation problem. This method estimates the mixing system of the sources and attempts to separate them in an unsupervised fashion. However, the method is strongly dependent on its initial values because there are no constraints on the spatial models. To solve this problem, we introduce a rank-1 spatial model into MNMF. Under this model, the demixing matrix can be optimized while the sources are represented using NMF bases...

10.1109/icassp.2015.7177975 article EN 2015-04-01

This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for the processing, and the corresponding phase is reconstructed on the basis of the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. Addressing this problem, we introduce a von-Mises-distribution DNN for phase reconstruction. The DNN is a generative model having the von Mises distribution, which can model distributions of a periodic variable such as phase,...

10.1109/iwaenc.2018.8521313 preprint EN 2018-09-01
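The Griffin-Lim baseline that the paper improves on can be sketched in a few lines: alternately enforce the given amplitude and STFT consistency until the randomly initialized phase settles. This is the standard textbook iteration (not the paper's DNN), using `scipy.signal.stft`/`istft` on a synthetic sinusoid.

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(4)
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 2048))   # toy signal
_, _, S = stft(x, nperseg=256)
amp = np.abs(S)                                       # given amplitude spectrogram

# Griffin-Lim: start from random phase, iterate resynthesis/re-analysis.
phase = np.exp(1j * rng.uniform(0, 2 * np.pi, amp.shape))
for _ in range(50):
    _, y = istft(amp * phase, nperseg=256)            # back to time domain
    _, _, S_y = stft(y, nperseg=256)                  # re-analyze
    phase = np.exp(1j * np.angle(S_y))                # keep phase, reimpose amplitude

_, x_rec = istft(amp * phase, nperseg=256)
_, _, S_rec = stft(x_rec, nperseg=256)
consistency = np.linalg.norm(np.abs(S_rec) - amp) / np.linalg.norm(amp)
```

The residual inconsistency, and the audible artifacts it causes in speech, is what motivates replacing this iteration with a learned phase model.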

A sound field reproduction method based on the spherical wavefunction expansion of sound fields is proposed, which can be flexibly applied to various array geometries and directivities. First, we formulate the synthesis as a minimization problem of some norm of the difference between the desired and synthesized fields; then, the optimal driving signals are derived by using the expansion of the fields. This formulation is closely related to the mode-matching method; a major advantage of the proposed method is that the weight of each mode is determined according to the norm to be minimized instead of empirical...

10.1109/taslp.2019.2934834 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2019-08-14
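A discrete least-squares analogue of the norm-minimization formulation above can be sketched as pressure matching (all quantities here are synthetic, and the paper works with wavefunction expansions rather than discrete control points): given transfer functions G from L loudspeakers to M control points and a desired field b, solve min_d ||G d - b||_2 for the driving signals d.

```python
import numpy as np

rng = np.random.default_rng(5)
M, L = 40, 12                                   # control points, loudspeakers
G = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))  # toy transfer matrix
d_true = rng.standard_normal(L) + 1j * rng.standard_normal(L)
b = G @ d_true                                  # desired field (consistent by construction)

# Least-squares driving signals.
d, *_ = np.linalg.lstsq(G, b, rcond=None)
residual = np.linalg.norm(G @ d - b) / np.linalg.norm(b)
```

In practice b is not exactly reachable and G is ill-conditioned, so the choice of norm (and regularization) governs how the residual is distributed across modes, which is the design freedom the paper exploits.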