- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Voice and Speech Disorders
- Topic Modeling
- Phonetics and Phonology Research
- Speech and dialogue systems
- Cancer Immunotherapy and Biomarkers
- Advanced Data Compression Techniques
- Dysphagia Assessment and Management
- CAR-T cell therapy research
- COVID-19 diagnosis using AI
- Infant Health and Development
- Immunotherapy and Immune Responses
- Software Reliability and Analysis Research
- Business Process Modeling and Analysis
- Superconducting Materials and Applications
- BIM and Construction Integration
- Neural Networks and Applications
- Data Mining Algorithms and Applications
- Construction Project Management and Performance
- Machine Learning and Data Classification
- Asian Culture and Media Studies
- Cancer Research and Treatments
Nagoya University
2019-2025
Academia Sinica
2019-2021
Institute of Information Science, Academia Sinica
2018-2021
Google (United States)
2021
Nagoya City University
2020
Nanjing University of Science and Technology
2020
Nanjing University
2020
Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt the convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, termed as MOSNet. The proposed models are tested on large-scale listening test results of the Voice Conversion...
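As a rough illustration of the kind of convolutional-plus-recurrent MOS predictor described here, the sketch below builds a small spectrogram-to-score regressor in PyTorch. The layer sizes, frequency pooling, and the SimpleMOSPredictor name are illustrative assumptions, not the published MOSNet configuration.

```python
import torch
import torch.nn as nn

class SimpleMOSPredictor(nn.Module):
    """CNN + BLSTM regressor mapping a magnitude spectrogram to an
    utterance-level MOS estimate. Layer sizes are illustrative only."""

    def __init__(self):
        super().__init__()
        # 2-D convolutions over (time, frequency); stride 3 on the
        # frequency axis gradually reduces the spectral resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=(1, 3), padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=(1, 3), padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(128, 1)   # frame-level quality score

    def forward(self, spec):                  # spec: (B, T, F)
        x = self.cnn(spec.unsqueeze(1))       # (B, 32, T, F')
        x = x.mean(dim=-1).transpose(1, 2)    # pool frequency -> (B, T, 32)
        x, _ = self.blstm(x)                  # (B, T, 128)
        frame_scores = self.frame_head(x)     # (B, T, 1)
        return frame_scores.mean(dim=1)       # utterance-level MOS (B, 1)

model = SimpleMOSPredictor()
fake_spec = torch.randn(2, 200, 257)          # batch of 2 dummy utterances
print(model(fake_spec).shape)                 # torch.Size([2, 1])
```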
Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction, including MOSNet and self-supervised speech models...
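This entry mentions MOS predictors built on self-supervised speech models; below is a minimal sketch of that idea, assuming torchaudio's WAV2VEC2_BASE bundle and a simple mean-pool-plus-linear head. These are stand-in choices for illustration, not the specific systems compared in the paper.

```python
import torch
import torch.nn as nn
import torchaudio

class SSLMOSPredictor(nn.Module):
    """Mean-pool wav2vec 2.0 features and regress a single MOS value."""

    def __init__(self):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE
        self.ssl = bundle.get_model()          # pretrained, fine-tuned end to end
        self.head = nn.Linear(768, 1)          # 768 = base model hidden size

    def forward(self, wav):                    # wav: (B, samples) at 16 kHz
        layers, _ = self.ssl.extract_features(wav)
        pooled = layers[-1].mean(dim=1)        # (B, 768) utterance embedding
        return self.head(pooled)               # (B, 1) predicted MOS

model = SSLMOSPredictor()
score = model(torch.randn(1, 16000))           # 1 s of dummy audio
```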
Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy Liu, Cheng-I Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
The clinical success of the immune checkpoint inhibitor (ICI) targeting programmed cell death protein 1 (PD-1) has revolutionized cancer treatment. However, the full potential of PD-1 blockade therapy remains unrealized, as response rates are still low across many cancer types. Interleukin-2 (IL-2)-based immunotherapies hold promise, as they can stimulate robust T cell expansion and enhance effector function - activities that could synergize potently with PD-1 blockade. Yet, IL-2 therapies also carry a significant...
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, making them far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized...
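To make the transfer idea concrete, here is a hedged sketch of initializing a seq2seq VC model from pretrained ASR/TTS modules. The Seq2SeqVC class, its module layout, and the stand-in "pretrained" networks are assumptions for illustration only, not the authors' recipe.

```python
import torch.nn as nn

def transformer_encoder(d_model=256, layers=4):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), layers)

def transformer_decoder(d_model=256, layers=4):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), layers)

# Stand-ins for models pretrained on large ASR / TTS corpora. In practice
# these would be loaded from checkpoints; here they are freshly built so
# the sketch runs end to end.
pretrained_asr_encoder = transformer_encoder()
pretrained_tts_decoder = transformer_decoder()

class Seq2SeqVC(nn.Module):
    """Hypothetical seq2seq VC model whose encoder/decoder share their
    architecture with the pretrained ASR encoder and TTS decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = transformer_encoder()
        self.decoder = transformer_decoder()
        self.out = nn.Linear(256, 80)          # 80-dim mel frames

    def forward(self, src, tgt):
        return self.out(self.decoder(tgt, self.encoder(src)))

vc = Seq2SeqVC()
# Knowledge transfer: copy the pretrained parameters in, then fine-tune the
# whole model on the (much smaller) parallel VC corpus.
vc.encoder.load_state_dict(pretrained_asr_encoder.state_dict())
vc.decoder.load_state_dict(pretrained_tts_decoder.state_dict())
```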
We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), thus naming the event the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge ran for two months, and in total we received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test,...
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that this success came from more disentangled latent representations. In this article, we extend...
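A bare-bones sketch of the underlying VAE-VC idea (speaker-conditioned decoding of a hopefully speaker-independent latent) is given below. The frame-wise encoder/decoder, dimensions, and unweighted loss are illustrative assumptions; the cross-domain extension with multiple feature types is only noted in the comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEVC(nn.Module):
    """Frame-wise VAE-VC sketch: the encoder compresses a spectral frame into
    a (hopefully speaker-independent) latent, and the decoder reconstructs it
    conditioned on a speaker embedding. The CDVAE extension would add another
    encoder/decoder pair for a second feature type sharing the latent space."""

    def __init__(self, feat_dim=40, latent_dim=16, n_speakers=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.spk_emb = nn.Embedding(n_speakers, 16)
        self.dec = nn.Sequential(nn.Linear(latent_dim + 16, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, x, spk_id):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        x_hat = self.dec(torch.cat([z, self.spk_emb(spk_id)], dim=-1))
        recon = F.l1_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return x_hat, recon + kl

model = VAEVC()
frames = torch.randn(8, 40)              # 8 dummy spectral frames
spk = torch.randint(0, 4, (8,))
converted, loss = model(frames, spk)     # at conversion time, pass the
                                         # target speaker's id instead
```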
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim is to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker...
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. It now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are...
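For context, a minimal usage sketch of ESPnet2's pretrained-model interface follows. It assumes the Text2Speech API, the .fs attribute, and the example model tag shown here; the current ESPnet documentation should be treated as authoritative for the exact interface and available models.

```python
# Download a community-provided pretrained TTS model and synthesize a sentence.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")  # example model tag
out = tts("End-to-end speech processing with ESPnet.")
sf.write("sample.wav", out["wav"].numpy(), tts.fs)              # save the waveform
```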
This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representations adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) as well as any-to-any...
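The recognition-synthesis recipe can be sketched as a frozen S3R upstream feeding a small trainable decoder. The HuBERT bundle, decoder shape, and training setup below are illustrative assumptions, not the S3PRL-VC implementation.

```python
import torch
import torch.nn as nn
import torchaudio

class S3RToMelDecoder(nn.Module):
    """Small trainable downstream: S3R features -> target-speaker mel frames."""
    def __init__(self, in_dim=768, hidden=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, feats):                    # feats: (B, T, 768)
        x, _ = self.lstm(torch.relu(self.proj(feats)))
        return self.mel_head(x)                  # (B, T, 80)

s3r = torchaudio.pipelines.HUBERT_BASE.get_model().eval()   # frozen upstream (S3R)
decoder = S3RToMelDecoder()                                  # trained on the target
                                                             # speaker only (A2O setting)
wav = torch.randn(1, 16000)                                  # 1 s of dummy 16 kHz audio
with torch.no_grad():
    feats = s3r.extract_features(wav)[0][-1]                 # last-layer features
mels = decoder(feats)                                        # target-speaker mels
```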
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, the data-hungry property of such models and the mispronunciation...
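A toy mel-to-mel Transformer in the spirit of this model is sketched below. Prenets, positional encodings, the stop-token predictor, the postnet, and the TTS pretraining step are all omitted, and the TinyVoiceTransformer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyVoiceTransformer(nn.Module):
    """Bare-bones mel-to-mel Transformer sketch (positional encodings and the
    usual prenet/postnet/stop-token machinery are omitted for brevity)."""

    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.src_prenet = nn.Linear(n_mels, d_model)
        self.tgt_prenet = nn.Linear(n_mels, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            dim_feedforward=512, batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, src_mel, tgt_mel):
        # Causal mask: each decoder step only attends to past target frames.
        causal = self.transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        out = self.transformer(self.src_prenet(src_mel),
                               self.tgt_prenet(tgt_mel),
                               tgt_mask=causal)
        return self.mel_out(out)

model = TinyVoiceTransformer()
src = torch.randn(2, 120, 80)     # source-speaker mel frames
tgt = torch.randn(2, 100, 80)     # teacher-forced target-speaker mels
pred = model(src, tgt)            # (2, 100, 80)
```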
The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods....
An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We reflect recent advances in LD modeling, including design choices of...
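The listener-dependent idea can be sketched as conditioning the predictor on a listener embedding and averaging over listeners at inference. The encoder, sizes, and the simple averaging strategy below are illustrative assumptions rather than LDNet's exact design.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """Predicts a listener-wise score from (speech, listener id) pairs; trained
    against each listener's individual rating instead of the mean score."""

    def __init__(self, n_mels=80, n_listeners=300, d=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d, batch_first=True)
        self.listener_emb = nn.Embedding(n_listeners, d)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, mel, listener_id):              # mel: (B, T, 80)
        h, _ = self.encoder(mel)
        utt = h.mean(dim=1)                           # (B, d) utterance embedding
        cond = self.listener_emb(listener_id)         # (B, d) listener embedding
        return self.head(torch.cat([utt, cond], -1))  # (B, 1) listener-wise score

model = ListenerDependentMOS()
mel = torch.randn(1, 200, 80)
# Inference: average predictions over all known listener identities to obtain
# a MOS estimate (one of several possible inference strategies).
all_ids = torch.arange(300)
mos = model(mel.expand(300, -1, -1), all_ids).mean()
```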
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of VAEs using the STRAIGHT spectra for VC. However, other types of spectral features such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of one specific...
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a mapping of speech feature sequences from one speaker to another. Here, the main idea we propose is an extension of the VTN that can simultaneously learn mappings among multiple speakers....
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed by using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a seq2seq framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation...
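The cascade itself reduces to "ASR, then TTS". The sketch below shows only that control flow, with recognize() and synthesize() as hypothetical placeholders standing in for pretrained ASR and target-speaker TTS models (for example, ESPnet recipes).

```python
from pathlib import Path

def recognize(wav_path: Path) -> str:
    """Placeholder: run a pretrained ASR model and return the transcription."""
    return "this is a dummy transcription"

def synthesize(text: str, out_path: Path) -> None:
    """Placeholder: run a TTS model trained on the target speaker."""
    out_path.write_bytes(b"")          # a real system would write a waveform here

def convert(src_wav: Path, out_wav: Path) -> None:
    # The source speaker identity is discarded at the text bottleneck; the TTS
    # model re-renders the content entirely in the target speaker's voice.
    text = recognize(src_wav)
    synthesize(text, out_wav)

convert(Path("source_utterance.wav"), Path("converted.wav"))
```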
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train an acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness...
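A hedged sketch of the "fine-tune BERT with an acoustic clue" idea follows, using Hugging Face's BertModel. Fusing the clue by adding a projected acoustic feature to the history representation is an illustrative simplification, not necessarily the paper's exact mechanism.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertASRSketch(nn.Module):
    """Predict the next token from the BERT-encoded text history plus a
    projected acoustic feature acting as a clue. Fusion scheme and sizes
    are illustrative assumptions."""

    def __init__(self, acoustic_dim=80):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.acoustic_proj = nn.Linear(acoustic_dim, self.bert.config.hidden_size)
        self.classifier = nn.Linear(self.bert.config.hidden_size,
                                    self.bert.config.vocab_size)

    def forward(self, input_ids, attention_mask, acoustic_clue):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        # Final position's state summarizes the history; add the acoustic clue
        # for the token to be predicted.
        fused = h[:, -1, :] + self.acoustic_proj(acoustic_clue)
        return self.classifier(fused)            # next-token logits

tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["speech recognition is"], return_tensors="pt")
model = BertASRSketch()
logits = model(batch["input_ids"], batch["attention_mask"], torch.randn(1, 80))
```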
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features...
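The A2O recipe, mapping discrete self-supervised units to the fixed target speaker's acoustic features, can be sketched as below. The codebook size, GRU encoder-decoder, and dimensions are placeholder assumptions rather than the actual seq2seq model.

```python
import torch
import torch.nn as nn

class UnitToMelSeq2Seq(nn.Module):
    """Embed discrete self-supervised units (e.g. vq-wav2vec codebook indices)
    and decode them into the fixed target speaker's mel frames."""

    def __init__(self, n_units=320, d=256, n_mels=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d)
        self.encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * d, d, batch_first=True)
        self.mel_head = nn.Linear(d, n_mels)

    def forward(self, units):                     # units: (B, T) integer indices
        enc, _ = self.encoder(self.unit_emb(units))
        dec, _ = self.decoder(enc)
        return self.mel_head(dec)                 # (B, T, 80) target-speaker mels

model = UnitToMelSeq2Seq()
units = torch.randint(0, 320, (2, 150))           # units from any source speaker
target_mels = model(units)                        # rendered in the target voice
```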
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for different voice evaluation scenarios. Ten teams from industry and academia in seven countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability,...