- Speech and Audio Processing
- Blind Source Separation Techniques
- Speech Recognition and Synthesis
- Advanced Adaptive Filtering Techniques
- Music and Audio Processing
- Speech and Dialogue Systems
- Natural Language Processing Techniques
- Acoustic Wave Phenomena Research
- Hearing Loss and Rehabilitation
- Image and Signal Denoising Methods
- Topic Modeling
- Advanced Algorithms and Applications
- Neural Networks and Applications
- Underwater Acoustics Research
- Advanced Data Compression Techniques
- Phonetics and Phonology Research
- Music Technology and Sound Studies
- Aerodynamics and Acoustics in Jet Flows
- Robotics and Automated Systems
- Structural Health Monitoring Techniques
- Ultrasonics and Acoustic Wave Propagation
- Sparse and Compressive Sensing Techniques
- Spectroscopy and Chemometric Analyses
- Direction-of-Arrival Estimation Techniques
- Vehicle Noise and Vibration Control
The University of Tokyo (2016-2025)
The Graduate University for Advanced Studies, SOKENDAI (2017)
Nara Institute of Science and Technology (2005-2014)
Kagoshima University (2009)
Nagoya University (1999-2002)
Secom (Japan) (1999)
Kyushu University (1995)
This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since its source model is based on a spherical multivariate distribution, IVA cannot utilize specific spectral structures such as the harmonic structures of pitched...
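A minimal sketch of the idea described above, combining IVA-style spatial updates with an NMF source model in the determined case. The update rules below follow the commonly published simplified form (multiplicative NMF updates plus iterative-projection demixing updates); variable names, iteration counts, and the omission of projection-back scaling are illustrative assumptions, not the paper's reference implementation.

```python
# Simplified determined-BSS loop unifying an NMF source model with IVA-style
# demixing updates. Illustrative sketch only; not the authors' code.
import numpy as np

def ilrma_like(X, n_bases=4, n_iter=50, eps=1e-9):
    """X: complex STFT, shape (n_freq, n_frames, n_ch); determined case (n_src == n_ch)."""
    n_freq, n_frames, n_ch = X.shape
    n_src = n_ch
    W = np.tile(np.eye(n_ch, dtype=complex), (n_freq, 1, 1))   # demixing matrices
    T = np.random.rand(n_src, n_freq, n_bases) + eps           # NMF basis spectra
    V = np.random.rand(n_src, n_bases, n_frames) + eps         # NMF activations

    for _ in range(n_iter):
        Y = np.einsum('fnm,ftm->ftn', W, X)                    # separated estimates
        P = np.abs(Y.transpose(2, 0, 1)) ** 2                  # power, (n_src, n_freq, n_frames)
        R = T @ V + eps                                        # low-rank variance model

        # Multiplicative NMF updates of the source model
        T *= np.sqrt(np.einsum('nft,nkt->nfk', P / R**2, V)
                     / (np.einsum('nft,nkt->nfk', 1.0 / R, V) + eps))
        R = T @ V + eps
        V *= np.sqrt(np.einsum('nft,nfk->nkt', P / R**2, T)
                     / (np.einsum('nft,nfk->nkt', 1.0 / R, T) + eps))
        R = T @ V + eps

        # Iterative-projection update of the demixing filters
        for n in range(n_src):
            U = np.einsum('ftm,ftl,ft->fml', X, X.conj(), 1.0 / R[n]) / n_frames
            WU = W @ U
            e_n = np.tile(np.eye(n_ch)[:, n], (n_freq, 1))[..., None]
            w = np.linalg.solve(WU, e_n)[..., 0]
            norm = np.sqrt(np.einsum('fm,fml,fl->f', w.conj(), U, w).real) + eps
            W[:, n, :] = (w / norm[:, None]).conj()
        # (projection-back scaling omitted for brevity)
    return np.einsum('fnm,ftm->ftn', W, X), W
```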
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize a speech waveform, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the degradation is an over-smoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In...
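A hedged sketch of the training setup this describes: an acoustic model (generator) fitted with a conventional regression loss plus an adversarial term from a discriminator that separates natural from generated speech parameters. The network sizes, feature dimensions, and the weight `adv_weight` are placeholders, not the paper's configuration.

```python
# Adversarial training for a parametric acoustic model (illustrative sketch).
import torch
import torch.nn as nn

lin_dim, ac_dim = 300, 60   # linguistic / acoustic feature dimensions (assumed)
G = nn.Sequential(nn.Linear(lin_dim, 256), nn.ReLU(), nn.Linear(256, ac_dim))
D = nn.Sequential(nn.Linear(ac_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce, mse = nn.BCELoss(), nn.MSELoss()
adv_weight = 1.0            # balances regression and adversarial terms (assumed)

def train_step(lin_feats, nat_params):
    # Discriminator: natural parameters -> 1, generated parameters -> 0
    gen_params = G(lin_feats).detach()
    d_loss = bce(D(nat_params), torch.ones(nat_params.size(0), 1)) + \
             bce(D(gen_params), torch.zeros(gen_params.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fit natural parameters while deceiving the discriminator
    gen_params = G(lin_feats)
    g_loss = mse(gen_params, nat_params) + \
             adv_weight * bce(D(gen_params), torch.ones(gen_params.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```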
Despite several recent proposals to achieve blind source separation (BSS) for realistic acoustic signals, the performance is still not good enough. In particular, when the impulse responses are long, the performance is highly limited. In this paper, we consider a two-input, two-output convolutive BSS problem. First, we show that it is not good to be constrained by the condition T>P, where T is the frame length of the DFT and P is the length of the room impulse responses. We then show that there is an optimum frame size determined by a trade-off between maintaining the number of samples in each frequency bin...
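A small numerical illustration of the trade-off mentioned above: for a fixed observation length, enlarging the DFT frame T so that it covers a long room impulse response P leaves fewer frames per frequency bin for estimating the statistics that frequency-domain separation relies on. All numbers are arbitrary example values.

```python
# Frame length T vs. samples available per frequency bin (toy illustration).
import numpy as np

fs = 16000                 # sampling rate [Hz]
signal_len = 8 * fs        # 8 s observation
rt60 = 0.3                 # reverberation time [s]
P = int(rt60 * fs)         # rough room impulse response length in samples

for T in [512, 1024, 2048, 4096, 8192]:
    hop = T // 2
    n_frames = 1 + (signal_len - T) // hop
    print(f"T={T:5d}  T>P ({P}): {str(T > P):5}  frames per bin: {n_frames}")
```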
We propose a new algorithm for blind source separation (BSS), in which independent component analysis (ICA) and beamforming are combined to resolve the slow-convergence problem through optimization in ICA. The proposed method consists of the following three parts: (a) frequency-domain ICA with direction-of-arrival (DOA) estimation, (b) null beamforming based on the estimated DOA, and (c) integration of (a) and (b) based on algorithm diversity in both the iteration and the frequency domain. The unmixing matrix obtained by null beamforming is temporarily substituted for that obtained by ICA during the iterative optimization,...
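A sketch of part (b) of this pipeline: given the DOAs estimated in part (a), frequency-domain null beamformers are built so that each output passes one direction and places a spatial null toward the other. A two-microphone free-field model and the specific DOA values are assumptions for illustration only.

```python
# Null beamformer construction from estimated DOAs (illustrative, 2-mic case).
import numpy as np

def steering_vector(theta, freq, mic_pos, c=343.0):
    """Far-field steering vector for a linear array; theta measured from broadside."""
    delays = mic_pos * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def null_beamformer_rows(doas, freq, mic_pos):
    """One output per source: unit gain toward its DOA, null toward the other DOA."""
    A = np.stack([steering_vector(th, freq, mic_pos) for th in doas], axis=1)  # mixing model
    return np.linalg.inv(A)   # each row of A^-1 passes one source direction, nulls the other

mic_pos = np.array([0.0, 0.04])        # 4 cm spacing (assumed)
doas = np.deg2rad([-30.0, 40.0])       # DOAs estimated by the ICA stage (example values)
W_nbf = null_beamformer_rows(doas, freq=1000.0, mic_pos=mic_pos)
# In the combined method, such a matrix can temporarily replace the ICA unmixing
# matrix at frequencies/iterations where ICA has not yet converged.
```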
We describe a new method of blind source separation (BSS) on a microphone array, combining subband independent component analysis (ICA) and beamforming. The proposed system consists of the following three sections: (1) an ICA-based BSS section with estimation of the direction of arrival (DOA) of the sound source, (2) null beamforming based on the estimated DOA, and (3) an integration algorithm based on algorithm diversity. Using this technique, we can resolve the low-convergence problem through optimization in ICA. To evaluate its effectiveness,...
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to the VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track with in-domain data and an out-of-domain (OOD) track for which there is less labeled data from different listening tests. Our system is based on ensemble learning of strong and weak learners. Strong learners incorporate several improvements to the finetuning of self-supervised learning (SSL) models,...
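A hedged sketch of the ensembling idea described here: out-of-fold predictions from "strong" learners (e.g., fine-tuned SSL models, trained elsewhere) and "weak" learners (simple regressors on fixed embeddings) are combined by a meta-regressor. The choice of weak models, the meta-learner, and the feature extraction are assumptions, not the submitted system.

```python
# Stacking-style ensemble for MOS prediction (illustrative sketch).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

def stack_mos_predictors(features, mos, strong_preds):
    """features: (n, d) fixed embeddings; mos: (n,) labels;
    strong_preds: (n, k) out-of-fold predictions of already-trained strong learners."""
    weak_models = [GradientBoostingRegressor(), SVR()]
    weak_preds = np.column_stack([
        cross_val_predict(m, features, mos, cv=5) for m in weak_models
    ])
    meta_inputs = np.hstack([strong_preds, weak_preds])
    meta = Ridge(alpha=1.0).fit(meta_inputs, mos)   # meta-learner over all predictions
    return meta

# At test time, stack the test-set predictions of every learner in the same
# column order and call meta.predict(...) to obtain the final MOS estimate.
```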
This paper describes a new blind signal separation method using the directivity patterns of a microphone array. In this method, to deal with the arrival lags among the microphones, the inverses of the mixing matrices are calculated in the frequency domain so that the separated signals become mutually independent. Since the calculations are carried out in each frequency independently, the following problems arise: (1) permutation of the sound sources, (2) arbitrariness of each source gain. In this paper, we propose a solution in which the directivity patterns are explicitly used to estimate each source direction. As the results...
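A sketch of how a directivity pattern can be computed from one row of the frequency-domain unmixing matrix: the direction of its null indicates where the rejected source lies, which can then be used to align permutations across frequencies. A two-microphone free-field model and the example row are assumptions for illustration.

```python
# Directivity pattern of an unmixing row and its null direction (illustrative).
import numpy as np

def directivity_pattern(w_row, freq, mic_pos, angles, c=343.0):
    """|response| of unmixing row w_row toward each candidate angle (radians)."""
    delays = np.outer(np.sin(angles), mic_pos) / c            # (n_angles, n_mics)
    steering = np.exp(-2j * np.pi * freq * delays)
    return np.abs(steering @ w_row.conj())

angles = np.deg2rad(np.linspace(-90, 90, 181))
mic_pos = np.array([0.0, 0.04])
# Example row constructed to cancel a source at +30 degrees
w_row = np.array([1.0,
                  -np.exp(-2j * np.pi * 1000.0 * 0.04 * np.sin(np.deg2rad(30)) / 343.0)])
pattern = directivity_pattern(w_row, 1000.0, mic_pos, angles)
null_direction = np.rad2deg(angles[np.argmin(pattern)])      # ~30 deg: the rejected source
```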
In the voice conversion algorithm based on the Gaussian mixture model (GMM) applied to STRAIGHT, the quality of the converted speech is degraded because the converted spectrum is excessively smoothed. We propose a GMM-based algorithm with dynamic frequency warping to avoid this over-smoothing. We also propose the addition of a weighted residual spectrum, which is the difference between the converted spectrum and the frequency-warped spectrum, to avoid deterioration of the conversion accuracy of speaker individuality. Results of evaluation experiments clarify that the proposed algorithm is better than the conventional algorithm in quality, while the speaker individuality is the same as in the proposed...
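A minimal sketch of the frequency-warping operation described here: the converted spectrum is obtained by resampling the source spectral envelope along a warping function. The warping function used below is a toy linear one; the paper derives the warping path dynamically per frame, which is omitted.

```python
# Frequency warping of a spectral envelope by resampling (illustrative sketch).
import numpy as np

def warp_spectrum(src_env, warp_fn):
    """src_env: (n_bins,) spectral envelope; warp_fn maps target bin -> source bin."""
    n_bins = len(src_env)
    target_bins = np.arange(n_bins)
    source_bins = np.clip(warp_fn(target_bins), 0, n_bins - 1)
    return np.interp(source_bins, target_bins, src_env)   # resample along the warp

n_bins = 513
src_env = np.abs(np.random.randn(n_bins))   # stand-in for a STRAIGHT spectral envelope
alpha = 1.1                                  # toy linear warping factor (assumed)
warped = warp_spectrum(src_env, lambda k: alpha * k)

# The weighted-residual idea adds back part of the difference between the
# GMM-converted spectrum and this frequency-warped spectrum to restore detail.
```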
We propose a new blind spatial subtraction array (BSSA) consisting of a noise estimator based on independent component analysis (ICA) for efficient speech enhancement. In this paper, we first point out, theoretically and experimentally, that ICA is proficient in noise estimation under a non-point-source noise condition rather than in target-speech estimation. Therefore, BSSA utilizes ICA as a noise estimator. In BSSA, speech extraction is achieved by subtracting the power spectrum of the noise signals estimated using ICA from the power spectrum of the partly enhanced target signal obtained with...
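A sketch of the subtraction step described above: the power spectrum of the ICA-estimated noise is subtracted from the power spectrum of the partly enhanced target channel, with flooring to avoid negative power. The beamforming and ICA stages themselves are assumed to be given; the over-subtraction and flooring constants are illustrative.

```python
# Power spectral subtraction using an ICA-based noise estimate (sketch).
import numpy as np

def power_spectral_subtraction(target_stft, noise_stft, over_subtraction=1.4, floor=0.01):
    """Both inputs are complex STFTs of identical shape (n_freq, n_frames)."""
    target_pow = np.abs(target_stft) ** 2
    noise_pow = np.abs(noise_stft) ** 2
    enhanced_pow = target_pow - over_subtraction * noise_pow
    enhanced_pow = np.maximum(enhanced_pow, floor * target_pow)   # spectral flooring
    # Reuse the phase of the target channel for resynthesis
    return np.sqrt(enhanced_pow) * np.exp(1j * np.angle(target_stft))
```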
Thanks to improvements in machine learning techniques including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcription, and covers all of the main pronunciations of daily-use Japanese characters. We describe...
A sound field recording method based on spherical or circular harmonic analysis for arbitrary array geometry and microphone directivity is proposed. In current methods of harmonic analysis, a sound field is decomposed into basis functions around an expansion center given in advance, which is called the global origin, and their coefficients are obtained up to a certain truncation order using the microphone measurements. However, the accuracy of the reconstructed field depends on the predefined position of the origin and the truncation order, which makes it difficult to apply this technique to an asymmetric...
In this paper, we propose a new framework called independent deeply learned matrix analysis (IDLMA), which unifies deep neural network (DNN)-based source modeling and independence-based multichannel audio source separation. IDLMA utilizes both pretrained DNN source models and the statistical independence between sources for the separation, where the time-frequency structures of each source are iteratively optimized by the DNN while enhancing the estimation accuracy of the spatial demixing filters. As the generative source model, we introduce a complex heavy-tailed...
INTERSPEECH2006: the 9th International Conference on Spoken Language Processing (ICSLP), September 17-21, 2006, Pittsburgh, Pennsylvania, USA.
In this paper, we provide a theoretical analysis of the amount of musical noise generated in iterative spectral subtraction, and propose an optimization method for achieving the least musical-noise generation. To achieve high-quality noise reduction with low musical noise, iterative spectral subtraction, i.e., iteratively applied weak nonlinear signal processing, has been proposed. Although its effectiveness has been reported experimentally, there have been no theoretical studies. Therefore, we formulate the musical-noise generation process by tracing the change in the kurtosis of the spectra, and conduct a comparison between different parameter settings under the same...
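A sketch of the bookkeeping used in this line of analysis: the change in kurtosis of the power spectra before and after (iterated) spectral subtraction serves as a proxy for the amount of musical noise, and the "weak but repeated" subtraction is a small over-subtraction applied several times. The closed-form analysis in the paper is not reproduced; constants are illustrative.

```python
# Kurtosis ratio as a musical-noise proxy, with iterative weak subtraction (sketch).
import numpy as np

def power_kurtosis(stft):
    p = np.abs(stft).ravel() ** 2
    return np.mean((p - p.mean()) ** 4) / (np.var(p) ** 2 + 1e-12)

def kurtosis_ratio(before_stft, after_stft):
    """>1 means the processing made the spectra spikier (more musical noise)."""
    return power_kurtosis(after_stft) / power_kurtosis(before_stft)

def iterative_spectral_subtraction(noisy_stft, noise_pow, beta=0.2, n_iter=5, floor=0.05):
    """Weak subtraction (small beta) applied repeatedly, as described above."""
    out = noisy_stft.copy()
    for _ in range(n_iter):
        pow_out = np.abs(out) ** 2
        pow_out = np.maximum(pow_out - beta * noise_pow, floor * pow_out)
        out = np.sqrt(pow_out) * np.exp(1j * np.angle(out))
    return out
```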
Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts the target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because they are directly used for predicting the target parameters. In this work, we assume that the training data partly include parallel data and propose sequence-to-sequence learning between...
In this paper, statistical-model generalizations of independent low-rank matrix analysis (ILRMA) are proposed for achieving high-quality blind source separation (BSS). BSS is a crucial problem in realizing many audio applications, where the sources must be separated using only the observed mixture signal. Many algorithms for solving the BSS problem have been proposed, especially in the history of independent component analysis and nonnegative matrix factorization. In particular, ILRMA can achieve the highest separation performance for music or speech mixtures; it assumes both...
EUROSPEECH2003: 8th European Conference on Speech Communication and Technology, September 1-4, 2003, Geneva, Switzerland.
Frequency-domain blind source separation (BSS) is shown to be equivalent to two sets of frequency-domain adaptive beamformers (ABFs) under certain conditions. The zero search of the off-diagonal components in the BSS update equation can be viewed as the minimization of the mean square error in the ABFs. The unmixing matrix of BSS and the filter coefficients of the ABFs converge to the same solution if the source signals are ideally independent. If they are dependent, this results in a bias from the correct filter coefficients. Therefore, the BSS performance is limited to that of the ABF that can use the exact...
In this paper, we present novel speaking-aid systems based on one-to-many eigenvoice conversion (EVC) to enhance three types of alaryngeal speech: esophageal speech, electrolaryngeal speech, and body-conducted silent speech. Although alaryngeal speech allows laryngectomees to utter sounds, it suffers from a lack of quality and speaker individuality. To improve alaryngeal speech, alaryngeal-speech-to-speech (AL-to-Speech) conversion methods based on statistical voice conversion have been proposed. EVC is capable of flexibly controlling the converted voice by adapting the conversion model to given...
This paper proposes a new efficient multichannel nonnegative matrix factorization (NMF) method. Recently, multichannel NMF (MNMF) has been proposed as a means of solving the blind source separation problem. This method estimates the mixing system of the sources and attempts to separate them in a blind fashion. However, it is strongly dependent on its initial values because there are no constraints on the spatial models. To solve this problem, we introduce a rank-1 spatial model into MNMF. The demixing matrix, while the sources are represented using NMF bases, can be optimized...
This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for the processing, and the corresponding phase is reconstructed on the basis of the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. Addressing this problem, we introduce a von-Mises-distribution DNN for phase reconstruction. The DNN is a generative model having the von Mises distribution, which can model distributions of a periodic variable such as phase,...
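A sketch of a von-Mises-based training objective for phase prediction: with a fixed concentration parameter, the negative log-likelihood reduces (up to constants) to a cosine distance between predicted and true phases, which respects the 2*pi periodicity. The actual network architecture and the value of the concentration are not specified here.

```python
# Cosine (von-Mises-style) loss for phase regression (illustrative sketch).
import math
import torch

def von_mises_phase_loss(predicted_phase, true_phase, kappa=1.0):
    """Both tensors in radians; minimized when phases agree modulo 2*pi."""
    return torch.mean(kappa * (1.0 - torch.cos(predicted_phase - true_phase)))

# Toy usage with random phases in [-pi, pi)
pred = (torch.rand(4, 513, 100) * 2 - 1) * math.pi
true = (torch.rand(4, 513, 100) * 2 - 1) * math.pi
print(von_mises_phase_loss(pred, true))
```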
A sound field reproduction method based on the spherical wavefunction expansion of sound fields is proposed, which can be flexibly applied to various array geometries and directivities. First, we formulate the synthesis as a minimization problem of some norm of the difference between the desired and synthesized fields; the optimal driving signals are then derived by using the expansions of the fields. This formulation is closely related to the mode-matching method; a major advantage of the proposed method is that the weight of each mode is determined according to the norm to be minimized instead of empirical...
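A sketch of the optimization view described above: per frequency, the driving signals minimize a (regularized) norm of the difference between the desired and synthesized expansion coefficients, which reduces to a least-squares problem. The transfer matrix G and the desired coefficients b are assumed to be given; plain Tikhonov regularization is used here, whereas the paper's mode weighting is omitted.

```python
# Least-squares driving signals from field expansion coefficients (sketch).
import numpy as np

def driving_signals(G, b, reg=1e-3):
    """G: (n_modes, n_loudspeakers) expansion coefficients of each loudspeaker's field;
    b: (n_modes,) expansion coefficients of the desired field."""
    n_ls = G.shape[1]
    # Tikhonov-regularized least squares: argmin_d ||b - G d||^2 + reg * ||d||^2
    return np.linalg.solve(G.conj().T @ G + reg * np.eye(n_ls), G.conj().T @ b)

# Toy example with random complex coefficients for one frequency bin
rng = np.random.default_rng(0)
G = rng.standard_normal((25, 16)) + 1j * rng.standard_normal((25, 16))
b = rng.standard_normal(25) + 1j * rng.standard_normal(25)
d = driving_signals(G, b)
residual = np.linalg.norm(b - G @ d)
```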