- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Topic Modeling
- Natural Language Processing Techniques
- Speech and dialogue systems
- Blind Source Separation Techniques
- Advanced Data Compression Techniques
- Phonetics and Phonology Research
- EEG and Brain-Computer Interfaces
- Bone Tumor Diagnosis and Treatments
- Emotion and Mood Recognition
- Head and Neck Surgical Oncology
- Renal Diseases and Glomerulopathies
- Face recognition and analysis
- Computational and Text Analysis Methods
- Oral and Maxillofacial Pathology
- Renal and Vascular Pathologies
- Systemic Lupus Erythematosus Research
- Inertial Sensor and Navigation
- Video Analysis and Summarization
- Sarcoma Diagnosis and Treatment
- Vasculitis and related conditions
- Steroid Chemistry and Biochemistry
- Social Robot Interaction and HRI
The University of Tokyo
1981-2025
University of Electro-Communications
2024
Osaka University
2023
Tohoku University
2015
Yamagata University Hospital
2014
Yamagata University
2008-2011
Tokyo University of Agriculture and Technology
2008-2009
Tokyo University of Agriculture
2008-2009
St. Marianna University School of Medicine
1999
High Energy Accelerator Research Organization
1990
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the degradation is an over-smoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In...
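The loss structure described above can be sketched in a few lines: the discriminator minimizes a binary cross-entropy over natural versus generated samples, while the generator minimizes a reconstruction error plus a weighted adversarial term that pushes generated parameters toward the natural distribution. This is a minimal illustrative sketch, not the paper's exact formulation; all function names and the `weight` hyperparameter are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(d_natural, d_generated):
    """Binary cross-entropy: push discriminator outputs toward 1 for
    natural samples and toward 0 for generated samples."""
    eps = 1e-12
    return (-np.mean(np.log(d_natural + eps))
            - np.mean(np.log(1.0 - d_generated + eps)))

def generator_adv_loss(d_generated):
    """Adversarial term: the generator tries to make the discriminator
    output 1 (i.e., 'natural') for its generated samples."""
    eps = 1e-12
    return -np.mean(np.log(d_generated + eps))

def generator_total_loss(y_target, y_generated, d_generated, weight=1.0):
    """Reconstruction error plus a weighted adversarial term; the
    adversarial term counteracts over-smoothing of generated parameters."""
    mse = np.mean((y_target - y_generated) ** 2)
    return mse + weight * generator_adv_loss(d_generated)
```

With `weight=0` the training reduces to plain minimum-mean-squared-error training; increasing `weight` trades reconstruction accuracy for distributional realism.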
This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained on speech corpora with given speaker representations, the phonetic contents of the converted speech tend to vanish because of an over-regularization issue often observed in the latent variables of VAEs. To overcome this issue, this paper proposes a VAE conditioned by not only speaker representations but also phonetic contents represented as phonetic posteriorgrams (PPGs). Since the phonetic contents are given during training, we...
Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities, because the source posterior probabilities are directly used for predicting the target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between...
This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for the corresponding processing, and the phase is reconstructed on the basis of the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. Addressing this problem, we introduce a von-Mises-distribution DNN for phase reconstruction. The DNN is a generative model having the von Mises distribution that can model distributions of a periodic variable such as a phase,...
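For context, the classical Griffin-Lim baseline mentioned above alternates between enforcing the given amplitude spectrogram and the phase of a consistent STFT. A minimal numpy sketch (the framing, window, and function names are illustrative, not the paper's implementation):

```python
import numpy as np

def stft(x, win, hop):
    """Framed rFFT with an analysis window."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, win, hop, length):
    """Inverse STFT by windowed overlap-add with window-sum normalization."""
    n = len(win)
    x = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(np.fft.irfft(spec, n=n, axis=1)):
        x[k * hop:k * hop + n] += frame * win
        norm[k * hop:k * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(amplitude, win, hop, length, n_iter=50, seed=0):
    """Reconstruct a signal from an amplitude spectrogram by alternating
    projections: keep the target amplitude, update the phase from the
    STFT of the current time-domain estimate."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(-np.pi, np.pi, amplitude.shape))
    for _ in range(n_iter):
        x = istft(amplitude * phase, win, hop, length)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(amplitude * phase, win, hop, length)
```

The iteration reduces the inconsistency between the fixed amplitude and the phase, but it does not use any learned model of phase structure, which is what the DNN-based method replaces.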
Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use, e.g., voice conversion and multi-speaker modeling, in this paper,...
In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate research, we aim at developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. We construct the JSUT and JVS corpora, designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30...
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential expressions in spoken language for conveying emotions. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidates with the assistance of emotion confidence...
This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input speech features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output parameters from the input parameters. Given that input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes highway networks connected from input to output. The proposed networks predict weighted spectral differentials between the input and output parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality,...
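The input-to-output highway idea above can be sketched as "output = input + gated differential": a network predicts a spectral differential, a sigmoid gate weights it, and the weighted differential is added to the input features. The single-layer forms and parameter names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_vc(x, W_diff, b_diff, W_gate, b_gate):
    """Input-to-output highway for VC: predict a spectral differential d
    and a gate g in (0, 1), then add the weighted differential to the
    input feature vector x (one dense layer each, for brevity)."""
    d = np.tanh(W_diff @ x + b_diff)   # predicted spectral differential
    g = sigmoid(W_gate @ x + b_gate)   # transform gate
    return x + g * d                   # output = input + weighted differential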
This paper proposes novel training algorithms for vocoder-free statistical parametric speech synthesis (SPSS) using short-term Fourier transform (STFT) spectra. Recently, text-to-speech synthesis using STFT spectra has been investigated since it can avoid the quality degradation caused by the vocoder-based parameterization of conventional SPSS using a vocoder. For vocoder-based SPSS, we previously proposed a training algorithm integrating generative adversarial network (GAN)-based distribution compensation. To extend it to vocoder-free SPSS, we propose low-...
This paper presents a deep neural network (DNN)-based phase reconstruction method from amplitude spectrograms. In speech processing, an amplitude spectrogram is often used for processing, and the corresponding phases are reconstructed by using the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. To solve this problem, we propose directional-statistics DNNs for predicting phases. We first propose the von Mises distribution DNN, which is a generative model having the von Mises distribution that models histograms of a periodic variable. We extend it...
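The von Mises distribution named above is the natural likelihood for a periodic target such as a phase: p(θ) = exp(κ·cos(θ − μ)) / (2π·I0(κ)), where μ is the predicted mean phase, κ the concentration, and I0 the modified Bessel function of order zero (numpy's `np.i0`). A minimal sketch of the corresponding negative log-likelihood loss (the function name is an assumption):

```python
import numpy as np

def von_mises_nll(phase, mu, kappa):
    """Negative log-likelihood of a von Mises distribution
    p(theta) = exp(kappa * cos(theta - mu)) / (2 * pi * I0(kappa)).
    Unlike squared error on raw angles, it respects 2*pi periodicity."""
    return -kappa * np.cos(phase - mu) + np.log(2.0 * np.pi * np.i0(kappa))
```

A DNN trained with this loss outputs μ (and possibly κ) per time-frequency bin; the loss is smallest when the predicted mean matches the true phase modulo 2π.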
This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since the acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech as natural, the synthetic speech parameters are distributed in the same manner as natural speech parameters....
Abstract. Background: Hypopharyngeal cancer, constituting 3%–5% of head and neck cancers, predominantly presents as squamous cell carcinoma, with a 5-year overall survival rate of approximately 40%. Treatment modalities for locally advanced cases include chemoradiotherapy; however, the role of upfront neck dissection (UND) remains controversial. This study aimed to investigate the effect of UND on definitive radiotherapy in hypopharyngeal carcinoma. Methods: A retrospective analysis included consecutive patients...
A practical method for extracting and enhancing a rhythmic waveform appearing in multi-channel electroencephalogram (EEG) data is proposed. In order to facilitate clinical diagnosis and/or implement a so-called brain computer interface (BCI), detecting the rhythmic activity from EEG recorded in a noisy environment is crucial; however, classical signal processing techniques such as linear filtering or the Fourier transform cannot detect such activity if the power of the noise is too large. This paper presents a simple but effective method by fully...
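To make the baseline concrete, the classical linear-filtering approach the abstract contrasts against can be sketched as an FFT band mask per channel followed by channel averaging. This is only the conventional baseline said to fail under strong noise, not the paper's proposed method; the function name and band edges are assumptions.

```python
import numpy as np

def fft_bandpass(eeg, fs, f_lo, f_hi):
    """Classical baseline: zero out FFT bins outside [f_lo, f_hi] Hz for
    each EEG channel, transform back, and average across channels.

    eeg: array of shape (channels, samples); fs: sampling rate in Hz."""
    n = eeg.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(eeg, axis=1)
    spectra[:, (freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectra, n=n, axis=1).mean(axis=0)
```

With moderate noise this recovers, e.g., an alpha-band (8–12 Hz) rhythm; when the in-band noise power rivals the rhythm itself, a fixed linear filter can no longer separate them, which motivates the proposed method.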
Although there are various operative approaches for clival tumors, a transsphenoidal approach is one of the choices when the main tumor extension is in an anterior-posterior direction with slight lateral extension. However, this approach sometimes provides only a narrow and deep operative field. Recently, the endoscopic transnasal approach has become quite effective for these tumors because of improvements in surgical instruments, image guidance systems, and techniques and materials for wound closure. In this paper, we describe the effectiveness, technical problems, and solutions...
This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work proposed a method incorporating GAN-based distribution compensation into acoustic...
We propose novel deep speaker representation learning that considers perceptual similarity among speakers for multi-speaker generative modeling. Following its success in accurate discriminative modeling of speaker individuality, deep speaker representation learning (i.e., speaker representation learning using deep neural networks) has been introduced to multi-speaker generative modeling. However, the conventional algorithm does not necessarily learn embeddings suitable for such generative modeling, which may result in lower quality and less controllability of synthetic speech. We propose three algorithms that utilize a perceptual speaker-similarity matrix...
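One simple way to encode the idea above is to train speaker embeddings so that their pairwise cosine similarities match a perceptually scored similarity matrix. The loss below is an illustrative sketch of that objective, not one of the paper's three algorithms; names and the MSE form are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarities of row-wise speaker embeddings E
    (shape: num_speakers x embedding_dim)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return En @ En.T

def perceptual_similarity_loss(E, S):
    """Mean squared error between embedding cosine similarities and a
    perceptual speaker-similarity matrix S (entries in [-1, 1])."""
    return np.mean((cosine_similarity_matrix(E) - S) ** 2)
```

Minimizing such a loss pulls perceptually similar speakers close in embedding space, which is what a multi-speaker generative model needs for smooth interpolation and control, as opposed to purely discriminative embeddings that only separate speakers.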