- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Topic Modeling
- Natural Language Processing Techniques
- Speech and dialogue systems
- Blind Source Separation Techniques
- Advanced Data Compression Techniques
- Phonetics and Phonology Research
- EEG and Brain-Computer Interfaces
- Bone Tumor Diagnosis and Treatments
- Emotion and Mood Recognition
- Head and Neck Surgical Oncology
- Renal Diseases and Glomerulopathies
- Face recognition and analysis
- Computational and Text Analysis Methods
- Oral and Maxillofacial Pathology
- Renal and Vascular Pathologies
- Systemic Lupus Erythematosus Research
- Inertial Sensor and Navigation
- Video Analysis and Summarization
- Sarcoma Diagnosis and Treatment
- Vasculitis and related conditions
- Steroid Chemistry and Biochemistry
- Social Robot Interaction and HRI
The University of Tokyo
1981-2025
University of Electro-Communications
2024
Osaka University
2023
Tohoku University
2015
Yamagata University Hospital
2014
Yamagata University
2008-2011
Tokyo University of Agriculture and Technology
2008-2009
Tokyo University of Agriculture
2008-2009
St. Marianna University School of Medicine
1999
High Energy Accelerator Research Organization
1990
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the degradation is an over-smoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In...
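The loss structure described above can be sketched in a few lines: the discriminator minimizes a binary cross-entropy over natural versus generated samples, while the generator minimizes a reconstruction error plus a weighted adversarial term that pushes generated parameters toward the natural distribution. This is a minimal illustrative sketch, not the paper's exact formulation; all function names and the `weight` hyperparameter are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(d_natural, d_generated):
    """Binary cross-entropy: push discriminator outputs toward 1 for
    natural samples and toward 0 for generated samples."""
    eps = 1e-12
    return (-np.mean(np.log(d_natural + eps))
            - np.mean(np.log(1.0 - d_generated + eps)))

def generator_adv_loss(d_generated):
    """Adversarial term: the generator tries to make the discriminator
    output 1 (i.e., 'natural') for its generated samples."""
    eps = 1e-12
    return -np.mean(np.log(d_generated + eps))

def generator_total_loss(y_target, y_generated, d_generated, weight=1.0):
    """Reconstruction error plus a weighted adversarial term; the
    adversarial term counteracts over-smoothing of generated parameters."""
    mse = np.mean((y_target - y_generated) ** 2)
    return mse + weight * generator_adv_loss(d_generated)
```

With `weight=0` the training reduces to plain minimum-mean-squared-error training; increasing `weight` trades reconstruction accuracy for distributional realism.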
This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained on speech corpora with given speaker representations, the phonetic contents of the converted speech tend to vanish because of an over-regularization issue often observed in the latent variables of VAEs. To overcome this issue, this paper proposes a VAE conditioned by not only speaker representations but also phonetic contents represented as phonetic posteriorgrams (PPGs). Since the phonetic contents are given during training, we...
Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities, because the source posterior probabilities are directly used for predicting the target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between...
This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for the corresponding processing, and the phase is reconstructed on the basis of the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. Addressing this problem, we introduce a von-Mises-distribution DNN for phase reconstruction. The DNN is a generative model having the von Mises distribution that can model distributions of a periodic variable such as a phase,...
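For context, the classical Griffin-Lim baseline mentioned above alternates between enforcing the given amplitude spectrogram and the phase of a consistent STFT. A minimal numpy sketch (the framing, window, and function names are illustrative, not the paper's implementation):

```python
import numpy as np

def stft(x, win, hop):
    """Framed rFFT with an analysis window."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, win, hop, length):
    """Inverse STFT by windowed overlap-add with window-sum normalization."""
    n = len(win)
    x = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(np.fft.irfft(spec, n=n, axis=1)):
        x[k * hop:k * hop + n] += frame * win
        norm[k * hop:k * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(amplitude, win, hop, length, n_iter=50, seed=0):
    """Reconstruct a signal from an amplitude spectrogram by alternating
    projections: keep the target amplitude, update the phase from the
    STFT of the current time-domain estimate."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(-np.pi, np.pi, amplitude.shape))
    for _ in range(n_iter):
        x = istft(amplitude * phase, win, hop, length)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(amplitude * phase, win, hop, length)
```

The iteration reduces the inconsistency between the fixed amplitude and the phase, but it does not use any learned model of phase structure, which is what the DNN-based method replaces.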
Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use, e.g., voice conversion and multi-speaker modeling, in this paper,...
In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate research, we aim at developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. We construct the JSUT and JVS corpora, designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30...
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential expressions in spoken language for conveying emotions. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidates with the assistance of emotion confidence...
This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input speech features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output parameters from the input parameters. Given that input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes highway networks connected from input to output. The proposed networks predict weighted spectral differentials between the input and output parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality,...
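The input-to-output highway idea above can be sketched as "output = input + gated differential": a network predicts a spectral differential, a sigmoid gate weights it, and the weighted differential is added to the input features. The single-layer forms and parameter names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_vc(x, W_diff, b_diff, W_gate, b_gate):
    """Input-to-output highway for VC: predict a spectral differential d
    and a gate g in (0, 1), then add the weighted differential to the
    input feature vector x (one dense layer each, for brevity)."""
    d = np.tanh(W_diff @ x + b_diff)   # predicted spectral differential
    g = sigmoid(W_gate @ x + b_gate)   # transform gate
    return x + g * d                   # output = input + weighted differential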
This paper proposes novel training algorithms for vocoder-free statistical parametric speech synthesis (SPSS) using short-term Fourier transform (STFT) spectra. Recently, text-to-speech synthesis using STFT spectra has been investigated since it can avoid the quality degradation caused by the vocoder-based parameterization of conventional SPSS using a vocoder. For vocoder-based SPSS, we previously proposed a training algorithm integrating generative adversarial network (GAN)-based distribution compensation. To extend it to vocoder-free SPSS, we propose low-...
This paper presents a deep neural network (DNN)-based phase reconstruction method from amplitude spectrograms. In speech processing, an amplitude spectrogram is often used for processing, and the corresponding phases are reconstructed by using the Griffin-Lim method. However, this method causes unnatural artifacts in synthetic speech. To solve this problem, we propose directional-statistics DNNs for predicting phases. We first propose the von Mises distribution DNN, which is a generative model having the von Mises distribution that models histograms of a periodic variable. We extend it...
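The von Mises distribution named above is the natural likelihood for a periodic target such as a phase: p(θ) = exp(κ·cos(θ − μ)) / (2π·I0(κ)), where μ is the predicted mean phase, κ the concentration, and I0 the modified Bessel function of order zero (numpy's `np.i0`). A minimal sketch of the corresponding negative log-likelihood loss (the function name is an assumption):

```python
import numpy as np

def von_mises_nll(phase, mu, kappa):
    """Negative log-likelihood of a von Mises distribution
    p(theta) = exp(kappa * cos(theta - mu)) / (2 * pi * I0(kappa)).
    Unlike squared error on raw angles, it respects 2*pi periodicity."""
    return -kappa * np.cos(phase - mu) + np.log(2.0 * np.pi * np.i0(kappa))
```

A DNN trained with this loss outputs μ (and possibly κ) per time-frequency bin; the loss is smallest when the predicted mean matches the true phase modulo 2π.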
This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since the acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech as natural, the synthetic speech parameters are distributed in the same manner as natural speech parameters....
Abstract. Background: Hypopharyngeal cancer, constituting 3%–5% of head and neck cancers, predominantly presents as squamous cell carcinoma, with a 5-year overall survival rate of approximately 40%. Treatment modalities for locally advanced cases include chemoradiotherapy; however, the role of upfront neck dissection (UND) remains controversial. This study aimed to investigate the effect of UND on definitive radiotherapy in hypopharyngeal carcinoma. Methods: A retrospective analysis included consecutive patients...
A practical method for extracting and enhancing a rhythmic waveform appearing in multi-channel electroencephalogram (EEG) data is proposed. In order to facilitate clinical diagnosis and/or implement a so-called brain computer interface (BCI), detecting the rhythmic activity from EEG recorded in a noisy environment is crucial; however, classical signal processing techniques such as linear filtering or the Fourier transform cannot detect such activity if the power of the noise is too large. This paper presents a simple but effective method by fully...
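To make the baseline concrete, the classical linear-filtering approach the abstract contrasts against can be sketched as an FFT band mask per channel followed by channel averaging. This is only the conventional baseline said to fail under strong noise, not the paper's proposed method; the function name and band edges are assumptions.

```python
import numpy as np

def fft_bandpass(eeg, fs, f_lo, f_hi):
    """Classical baseline: zero out FFT bins outside [f_lo, f_hi] Hz for
    each EEG channel, transform back, and average across channels.

    eeg: array of shape (channels, samples); fs: sampling rate in Hz."""
    n = eeg.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(eeg, axis=1)
    spectra[:, (freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectra, n=n, axis=1).mean(axis=0)
```

With moderate noise this recovers, e.g., an alpha-band (8–12 Hz) rhythm; when the in-band noise power rivals the rhythm itself, a fixed linear filter can no longer separate them, which motivates the proposed method.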
Although there are various operative approaches for clival tumors, a transsphenoidal approach is one of the choices when the main tumor extension is in an anterior-posterior direction with slight lateral extension. However, this approach sometimes provides only a narrow and deep operative field. Recently, the endoscopic transnasal approach has become quite effective for these tumors because of improvements in surgical instruments, image guidance systems, and techniques and materials for wound closure. In this paper, we describe the effectiveness, technical problems, and solutions...
This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work proposed a method incorporating GAN-based distribution compensation into acoustic...
We propose novel deep speaker representation learning that considers perceptual similarity among speakers for multi-speaker generative modeling. Following its success in accurate discriminative modeling of speaker individuality, deep speaker representation learning (i.e., speaker representation learning using deep neural networks) has been introduced to multi-speaker generative modeling. However, the conventional algorithm does not necessarily learn embeddings suitable for such generative modeling, which may result in lower quality and less controllability of synthetic speech. We propose three algorithms that utilize a perceptual speaker-similarity matrix...
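One simple way to encode the idea above is to train speaker embeddings so that their pairwise cosine similarities match a perceptually scored similarity matrix. The loss below is an illustrative sketch of that objective, not one of the paper's three algorithms; names and the MSE form are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarities of row-wise speaker embeddings E
    (shape: num_speakers x embedding_dim)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return En @ En.T

def perceptual_similarity_loss(E, S):
    """Mean squared error between embedding cosine similarities and a
    perceptual speaker-similarity matrix S (entries in [-1, 1])."""
    return np.mean((cosine_similarity_matrix(E) - S) ** 2)
```

Minimizing such a loss pulls perceptually similar speakers close in embedding space, which is what a multi-speaker generative model needs for smooth interpolation and control, as opposed to purely discriminative embeddings that only separate speakers.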