Yuki Saito

ORCID: 0000-0002-7967-2613
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Topic Modeling
  • Natural Language Processing Techniques
  • Speech and dialogue systems
  • Blind Source Separation Techniques
  • Advanced Data Compression Techniques
  • Phonetics and Phonology Research
  • EEG and Brain-Computer Interfaces
  • Bone Tumor Diagnosis and Treatments
  • Emotion and Mood Recognition
  • Head and Neck Surgical Oncology
  • Renal Diseases and Glomerulopathies
  • Face recognition and analysis
  • Computational and Text Analysis Methods
  • Oral and Maxillofacial Pathology
  • Renal and Vascular Pathologies
  • Systemic Lupus Erythematosus Research
  • Inertial Sensor and Navigation
  • Video Analysis and Summarization
  • Sarcoma Diagnosis and Treatment
  • Vasculitis and related conditions
  • Steroid Chemistry and Biochemistry
  • Social Robot Interaction and HRI

The University of Tokyo
1981-2025

University of Electro-Communications
2024

Osaka University
2023

Tohoku University
2015

Yamagata University Hospital
2014

Yamagata University
2008-2011

Tokyo University of Agriculture and Technology
2008-2009

Tokyo University of Agriculture
2008-2009

St. Marianna University School of Medicine
1999

High Energy Accelerator Research Organization
1990

A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the degradation is an over-smoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In...

10.1109/taslp.2017.2761547 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-10-09
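
The adversarial objective described above is compact: a conventional generation-error loss plus a term that rewards fooling the discriminator, pulling generated parameters toward the natural distribution. The PyTorch sketch below is a minimal illustration under assumed toy dimensions (linguistic features 300, acoustic features 60) and an assumed adversarial weight w_adv; it is not the paper's exact configuration.

```python
# Minimal sketch of GAN-regularized acoustic model training.
# All sizes and the weight w_adv are illustrative assumptions.
import torch
import torch.nn as nn

LING_DIM, FEAT_DIM = 300, 60   # assumed linguistic / acoustic dims

generator = nn.Sequential(      # acoustic model: linguistic -> acoustic
    nn.Linear(LING_DIM, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
discriminator = nn.Sequential(  # natural vs. generated parameters
    nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()
w_adv = 1.0                     # assumed weight of the adversarial term

def generator_loss(linguistic, natural):
    generated = generator(linguistic)
    # Generation-error term plus a term that deceives the discriminator,
    # alleviating over-smoothing of the generated parameters.
    adv = bce(discriminator(generated), torch.ones(len(generated), 1))
    return mse(generated, natural) + w_adv * adv

def discriminator_loss(linguistic, natural):
    generated = generator(linguistic).detach()
    real = bce(discriminator(natural), torch.ones(len(natural), 1))
    fake = bce(discriminator(generated), torch.zeros(len(generated), 1))
    return real + fake
```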

This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained using non-parallel speech corpora with given speaker representations, the phonetic contents of the converted speech tend to vanish because of an over-regularization issue often observed in the latent variables of VAEs. To overcome this issue, this paper proposes a VAE conditioned by not only speaker representations but also phonetic contents represented as phonetic posteriorgrams (PPGs). Since the PPGs are given during training, we...

10.1109/icassp.2018.8461384 article EN 2018-04-01
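
A minimal PyTorch sketch of the conditioned VAE described above: both encoder and decoder see the speaker code and the PPGs, so the latent variables are free to drop speaker and phonetic information without losing the phonetic content. The feature, speaker-code, PPG, and latent dimensions are illustrative assumptions, as is the simple fully connected architecture.

```python
# Sketch of a VAE conditioned on a speaker code and PPGs.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

FEAT, SPK, PPG, LATENT = 60, 8, 40, 16

encoder = nn.Sequential(nn.Linear(FEAT + SPK + PPG, 128), nn.ReLU(),
                        nn.Linear(128, 2 * LATENT))   # -> (mu, logvar)
decoder = nn.Sequential(nn.Linear(LATENT + SPK + PPG, 128), nn.ReLU(),
                        nn.Linear(128, FEAT))

def vae_step(x, spk, ppg):
    h = encoder(torch.cat([x, spk, ppg], dim=-1))
    mu, logvar = h.chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    x_hat = decoder(torch.cat([z, spk, ppg], dim=-1))
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1).mean()
    return recon + kl

# At conversion time, decode with the target speaker's code instead of
# the source speaker's; conditioning on PPGs keeps the phonetic content
# from vanishing in the over-regularized latent variables.
```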

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because the source speech parameters are directly used for predicting the target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between...

10.21437/interspeech.2017-247 preprint EN Interspeech 2017 2017-08-16
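
The pipeline implied by the abstract can be sketched as three stages: estimate context posterior probabilities from source speech, convert them with a sequence-to-sequence model, and predict target speech parameters from the converted posteriors. The sketch below uses a single GRU as a stand-in for a full encoder-decoder; all module sizes are assumptions.

```python
# Sketch of VC via context posterior probabilities. Sizes are assumed.
import torch
import torch.nn as nn

FEAT, CTX = 60, 48   # acoustic feature dim, number of context classes

recognize = nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(),
                          nn.Linear(128, CTX), nn.Softmax(dim=-1))
seq2seq = nn.GRU(CTX, CTX, batch_first=True)   # stand-in for a full
                                               # encoder-decoder model
synthesize = nn.Sequential(nn.Linear(CTX, 128), nn.ReLU(),
                           nn.Linear(128, FEAT))

def convert(src_params):                 # (batch, time, FEAT)
    posteriors = recognize(src_params)   # context posterior probabilities
    converted, _ = seq2seq(posteriors)   # learned mapping of posteriors,
                                         # so speaker individuality such
                                         # as speaking rate can change
    return synthesize(converted)         # target speech parameters
```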

This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for the processing, and the corresponding phase is reconstructed on the basis of the Griffin-Lim method. However, the method causes unnatural artifacts in synthetic speech. Addressing this problem, we introduce a von-Mises-distribution DNN for phase reconstruction. The DNN is a generative model having the von Mises distribution that can model distributions of a periodic variable such as the phase,...

10.1109/iwaenc.2018.8521313 preprint EN 2018-09-01
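
The von Mises loss at the heart of this approach is easy to state: with a fixed concentration kappa, the negative log-likelihood reduces (up to a kappa-dependent constant) to a cosine distance between true and predicted phases, which respects the 2π periodicity of phase. A minimal sketch, with an assumed single-layer network and an assumed kappa:

```python
# Sketch of the von-Mises negative log-likelihood for phase prediction.
# The network and kappa are illustrative assumptions.
import torch
import torch.nn as nn

FREQ_BINS = 513                               # e.g., 1024-point FFT
phase_net = nn.Linear(FREQ_BINS, FREQ_BINS)   # amplitude -> phase mean

def von_mises_nll(amplitude, true_phase, kappa=1.0):
    mu = phase_net(amplitude)   # predicted mean direction per bin
    # von Mises pdf: exp(kappa*cos(x - mu)) / (2*pi*I0(kappa)); up to the
    # constant log(2*pi*I0(kappa)), the NLL is -kappa * cos(x - mu).
    return (-kappa * torch.cos(true_phase - mu)).mean()
```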

Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use, e.g., voice conversion and multi-speaker modeling, in this paper,...

10.48550/arxiv.1908.06248 preprint EN cc-by-sa arXiv (Cornell University) 2019-01-01

In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we aim at developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. We construct the JSUT and JVS corpora. They are designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30...

10.1250/ast.41.761 article EN Acoustical Science and Technology 2020-08-31

We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to produce emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the candidate scripts with the assistance of emotion confidence...

10.1109/access.2024.3360885 article EN cc-by IEEE Access 2024-01-01
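
As an illustration of the prompt-engineering step, the sketch below builds the kind of request described in the abstract: seed words with a sentiment polarity plus NV phrases, to be sent as a user message to a chat model. The prompt wording and the example lists are assumptions, not the JVNV paper's actual prompts.

```python
# Hypothetical prompt builder for emotional-script generation with NVs.
# Wording and example inputs are illustrative assumptions.
def build_script_prompt(emotion, seed_words, nv_phrases, n_sentences=5):
    return (
        f"Write {n_sentences} short Japanese sentences expressing "
        f"'{emotion}'. Each sentence must use one of the seed words "
        f"{seed_words} and begin or end with one of the nonverbal "
        f"vocalizations {nv_phrases}."
    )

prompt = build_script_prompt(
    emotion="surprise",
    seed_words=["typhoon", "lottery"],
    nv_phrases=["Eh!?", "Wow"],
)
print(prompt)  # send this as the user message to the chat model
```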

This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output parameters from the input parameters. Given that the input and output features are often in the same domain (e.g., cepstrum) in VC, this paper proposes highway networks connected from input to output. The networks predict weighted spectral differentials between the input and output features. The architecture not only alleviates the over-smoothing effects that degrade speech quality,...

10.1587/transinf.2017edl8034 article EN IEICE Transactions on Information and Systems 2017-01-01
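
A minimal sketch of one plausible reading of the input-to-output highway connection: a gate decides, per dimension, how much of a predicted spectral differential to add to the input features, which lie in the same cepstral domain as the output. Layer sizes are assumptions.

```python
# Sketch of an input-to-output highway connection predicting weighted
# spectral differentials. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

FEAT = 60
differential = nn.Sequential(nn.Linear(FEAT, 128), nn.Tanh(),
                             nn.Linear(128, FEAT))   # predicted diff
gate = nn.Sequential(nn.Linear(FEAT, FEAT), nn.Sigmoid())  # weights

def convert(x):
    t = gate(x)                       # per-dimension gate in [0, 1]
    return x + t * differential(x)    # carry the input, add the
                                      # weighted spectral differential
```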

This paper proposes novel training algorithms for vocoder-free statistical parametric speech synthesis (SPSS) using short-term Fourier transform (STFT) spectra. Recently, text-to-speech synthesis using STFT spectra has been investigated since it can avoid the quality degradation caused by the vocoder-based parameterization used in conventional SPSS with a vocoder. For vocoder-based SPSS, we previously proposed a training algorithm integrating generative adversarial network (GAN)-based distribution compensation. To extend it to vocoder-free SPSS, we propose low-...

10.1109/icassp.2018.8461714 article EN 2018-04-01

This paper presents a deep neural network (DNN)-based phase reconstruction method from amplitude spectrograms. In speech processing, an amplitude spectrogram is often used for the processing, and the corresponding phases are reconstructed by using the Griffin-Lim method. However, the method causes unnatural artifacts in synthetic speech. To solve this problem, we propose directional-statistics DNNs for predicting phases. We first propose a von Mises distribution DNN, which is a generative model having the von Mises distribution that models histograms of a periodic variable. We extend it...

10.1016/j.sigpro.2019.107368 article EN cc-by Signal Processing 2019-11-11

This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in the synthetic speech. The proposed algorithm takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since the acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech as natural, the synthetic speech parameters are distributed in the same manner as the natural parameters...

10.1109/icassp.2017.7953088 article EN 2017-03-01

10.5220/0012359500003636 article EN cc-by-nc-nd Proceedings of the 14th International Conference on Agents and Artificial Intelligence 2024-01-01

Background: Hypopharyngeal cancer, constituting 3%–5% of head and neck cancers, predominantly presents as squamous cell carcinoma, with a 5-year overall survival rate of approximately 40%. Treatment modalities for locally advanced cases include chemoradiotherapy; however, the role of upfront neck dissection (UND) remains controversial. This study aimed to investigate the effect of UND on definitive radiotherapy in hypopharyngeal squamous cell carcinoma. Methods: A retrospective analysis included consecutive patients...

10.1002/hed.27839 article EN cc-by-nc Head & Neck 2024-06-06

A practical method for extracting and enhancing a rhythmic waveform appearing in multi-channel electroencephalogram (EEG) data is proposed. In order to facilitate clinical diagnosis and/or implement a so-called brain computer interface (BCI), detecting the rhythmic activity from EEG recorded in a noisy environment is crucial; however, classical signal processing techniques like linear filtering or the Fourier transform cannot detect such a waveform if the power of the noise is so large. This paper presents a simple but practical method by fully...

10.1109/icassp.2008.4517637 article EN Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing 2008-03-01
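
The abstract does not spell out the method, so the sketch below shows only a generic multi-channel baseline for the same problem, not the paper's algorithm: band-pass each channel around the target rhythm, then combine channels along the dominant eigenvector of the band-limited covariance, which maximizes in-band power. Sampling rate and band edges are assumptions.

```python
# Generic spatial-filtering baseline for enhancing a rhythmic EEG
# waveform (not the paper's method). Parameters are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def enhance_rhythm(eeg, fs=256.0, band=(8.0, 12.0)):
    """eeg: (channels, samples) array; returns one enhanced waveform."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)],
                  btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)      # per-channel band-pass
    cov = filtered @ filtered.T / filtered.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs[:, -1]            # direction of maximal in-band power
    return w @ filtered           # spatially combined rhythmic waveform
```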

Although there are various operative approaches for clival tumors, a transsphenoidal approach is one of the choices when the main tumor extension is in an anterior-posterior direction with slight lateral extension. However, this approach sometimes provides only a narrow and deep field. Recently, the endoscopic transnasal approach is quite effective for clival tumors because of the improvement of surgical instruments, image guidance systems, and techniques and materials for wound closure. In this paper, we describe the effectiveness, technical problems, and solutions...

10.1155/2011/953047 article EN cc-by Sarcoma 2011-01-01

This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work proposed a method incorporating GAN-based distribution compensation into acoustic...

10.1016/j.csl.2019.05.008 article EN cc-by Computer Speech & Language 2019-06-01
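
One way to realize multi-resolution handling of STFT amplitude spectra is to compute amplitudes at several FFT sizes and attach a discriminator to each resolution. The PyTorch sketch below illustrates this under assumed FFT sizes and toy discriminators; it is not the paper's exact architecture.

```python
# Sketch of multi-resolution STFT amplitude discrimination.
# FFT sizes and network sizes are illustrative assumptions.
import torch
import torch.nn as nn

FFT_SIZES = (128, 512, 2048)   # low to high frequency resolution
discriminators = nn.ModuleList(
    nn.Sequential(nn.Linear(n // 2 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
    for n in FFT_SIZES)

def multi_res_scores(waveform):          # (batch, samples)
    scores = []
    for d, n in zip(discriminators, FFT_SIZES):
        spec = torch.stft(waveform, n_fft=n, hop_length=n // 4,
                          window=torch.hann_window(n),
                          return_complex=True)
        amp = spec.abs().transpose(1, 2)   # (batch, frames, bins)
        scores.append(d(amp).mean())       # realness per resolution
    return scores
```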

We propose novel deep speaker representation learning that considers perceptual similarity among speakers for multi-speaker generative modeling. Following its success in accurate discriminative modeling of speaker individuality, knowledge of deep speaker representation learning (i.e., speaker representation learning using deep neural networks) has been introduced to multi-speaker generative modeling. However, the conventional learning algorithm does not necessarily learn speaker embeddings suitable for such generative modeling, which may result in lower quality and less controllability of synthetic speech. We propose three learning algorithms that utilize a perceptual speaker similarity matrix...

10.1109/taslp.2021.3059114 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01
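
One concrete way to use a perceptual speaker-similarity matrix in embedding learning, in the spirit of the abstract, is to penalize the gap between embedding cosine similarities and perceptually scored similarities. The sketch below assumes a toy matrix and sizes; the paper's three algorithms are not reproduced here.

```python
# Sketch of matching embedding similarities to perceptual scores.
# The matrix values and dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F

N_SPK, DIM = 4, 16
embeddings = torch.nn.Parameter(torch.randn(N_SPK, DIM))
# Perceptual similarity scores in [0, 1], e.g., from listening tests.
S = torch.tensor([[1.0, 0.8, 0.2, 0.1],
                  [0.8, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.7],
                  [0.1, 0.2, 0.7, 1.0]])

def similarity_loss():
    e = F.normalize(embeddings, dim=-1)
    cos = e @ e.T                    # pairwise cosine similarities
    return ((cos - S) ** 2).mean()   # match the perceptual scores
```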