Wen-Chin Huang

ORCID: 0000-0003-3172-3335
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Voice and Speech Disorders
  • Topic Modeling
  • Phonetics and Phonology Research
  • Speech and dialogue systems
  • Cancer Immunotherapy and Biomarkers
  • Advanced Data Compression Techniques
  • Dysphagia Assessment and Management
  • CAR-T cell therapy research
  • COVID-19 diagnosis using AI
  • Infant Health and Development
  • Immunotherapy and Immune Responses
  • Software Reliability and Analysis Research
  • Business Process Modeling and Analysis
  • Superconducting Materials and Applications
  • BIM and Construction Integration
  • Neural Networks and Applications
  • Data Mining Algorithms and Applications
  • Construction Project Management and Performance
  • Machine Learning and Data Classification
  • Asian Culture and Media Studies
  • Cancer Research and Treatments

Nagoya University
2019-2025

Academia Sinica
2019-2021

Institute of Information Science, Academia Sinica
2018-2021

Google (United States)
2021

Nagoya City University
2020

Nanjing University of Science and Technology
2020

Nanjing University
2020

Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt convolutional and recurrent neural networks to build a mean opinion score (MOS) predictor, termed MOSNet. The proposed models are tested on large-scale listening test results of the Voice Conversion...

10.21437/interspeech.2019-2003 preprint EN Interspeech 2019 2019-09-13
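The pooling idea behind MOSNet (frame-level scores averaged into an utterance-level MOS) can be illustrated with a minimal NumPy sketch; the linear scorer, random features, and dimensions below are illustrative stand-ins for the actual CNN-BLSTM model and spectral inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_scores(feats, W, b):
    """Toy frame-level regressor: one linear layer plus a sigmoid, scaled into
    the MOS range [1, 5]. In MOSNet this role is played by CNN-BLSTM layers."""
    z = feats @ W + b                       # (T,) raw frame activations
    return 1.0 + 4.0 / (1.0 + np.exp(-z))  # squash each frame score into [1, 5]

def utterance_mos(feats, W, b):
    """MOSNet-style aggregation: the utterance score is the mean of frame
    scores, allowing both utterance- and frame-level training losses."""
    return float(frame_scores(feats, W, b).mean())

# 120 frames of 32-dim spectral features (random stand-ins for an utterance)
feats = rng.normal(size=(120, 32))
W = rng.normal(scale=0.1, size=32)
b = 0.0
mos = utterance_mos(feats, W, b)
assert 1.0 <= mos <= 5.0
```

Averaging frame scores, rather than predicting a single utterance score directly, is what lets the model supervise every frame during training.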

Automatic methods to predict listener opinions of synthesized speech remain elusive since the listeners, the systems being evaluated, the characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction, including MOSNet and self-supervised models...

10.1109/icassp43922.2022.9746395 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods. In...

10.21437/vcc_bc.2020-14 preprint EN 2020-10-16

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq VC models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, the data-hungry property and the mispronunciation...

10.21437/interspeech.2020-1066 article EN Interspeech 2020 2020-10-25

Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy Liu, Cheng-I Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.580 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

The clinical success of the immune checkpoint inhibitor (ICI) targeting programmed cell death protein 1 (PD-1) has revolutionized cancer treatment. However, the full potential of PD-1 blockade therapy remains unrealized, as response rates are still low across many cancer types. Interleukin-2 (IL-2)-based immunotherapies hold promise, as they can stimulate robust T cell expansion and enhance effector function - activities that could synergize potently with PD-1 blockade. Yet, IL-2 therapies also carry a significant...

10.3389/fimmu.2025.1537466 article EN cc-by Frontiers in Immunology 2025-02-18

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, and are thus far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized...

10.1109/taslp.2021.3049336 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), and thus named it the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge ran for two months and in total received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test,...

10.1109/asru57964.2023.10389671 article EN 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023-12-16

An effective approach for voice conversion (VC) is to disentangle the linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this article, we extend...

10.1109/tetci.2020.2977678 article EN cc-by IEEE Transactions on Emerging Topics in Computational Intelligence 2020-04-07
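The disentanglement-based conversion rule that VAE-VC relies on (encode to a speaker-independent latent, decode with the target speaker's code) can be sketched with untrained linear maps standing in for the encoder and decoder networks; all weights, dimensions, and the two-speaker setup below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, Z = 24, 8                                  # feature and latent dims (toy sizes)
We = rng.normal(scale=0.1, size=(D, Z))       # "encoder" weights (random stand-ins)
Wd = rng.normal(scale=0.1, size=(Z + 2, D))   # "decoder" weights; +2 for a one-hot speaker code

def encode(x):
    """Encoder: maps frames to a latent meant to hold only linguistic content."""
    return x @ We

def decode(z, spk):
    """Decoder: reconstructs frames from the latent plus a one-hot speaker code."""
    code = np.eye(2)[spk]
    zc = np.concatenate([z, np.tile(code, (z.shape[0], 1))], axis=1)
    return zc @ Wd

def convert(x, tgt_spk):
    """VAE-VC conversion rule: encode with the shared encoder, decode with the
    target speaker's code."""
    return decode(encode(x), tgt_spk)

x = rng.normal(size=(50, D))   # 50 source-speaker frames (random stand-ins)
y = convert(x, tgt_spk=1)
assert y.shape == x.shape
```

Conversion quality then hinges on how speaker-free the latent really is, which is why the paper focuses on making the latent representations more disentangled.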

The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semi-parallel and cross-lingual voice conversion (VC). While the primary evaluation of challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim was to provide a complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker...

10.21437/vcc_bc.2020-15 preprint EN 2020-10-16
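One standard objective assessment in such analyses, mel-cepstral distortion (MCD), is straightforward to compute. The sketch below assumes already-aligned mel-cepstrum sequences (real evaluations first align reference and converted frames, e.g. with DTW) and uses random arrays as stand-ins:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Mel-cepstral distortion (MCD) in dB between two time-aligned
    mel-cepstrum sequences of shape (T, D). The 0th (energy) coefficient
    is excluded, as is conventional."""
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())

rng = np.random.default_rng(2)
ref = rng.normal(size=(100, 25))               # 100 frames, 25 coefficients
assert mel_cepstral_distortion(ref, ref) == 0.0  # identical sequences give 0 dB
assert mel_cepstral_distortion(ref, ref + 0.1) > 0.0
```

Lower MCD is better, though, as the MOSNet line of work above notes, such objective scores do not always track human perception.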

This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. The project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. It now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are...

10.1109/dslw51110.2021.9523402 article EN 2021-06-05

This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representations adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) as well as any-to-any...

10.1109/icassp43922.2022.9746430 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

10.1109/icassp49660.2025.10889744 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq VC models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, the data-hungry property and the mispronunciation...

10.48550/arxiv.1912.06813 preprint EN other-oa arXiv (Cornell University) 2019-01-01

The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods....

10.48550/arxiv.2008.12527 preprint EN other-oa arXiv (Cornell University) 2020-01-01

An effective approach to automatically predict the subjective rating of synthetic speech is to train on a listening test dataset with human-annotated scores. Although each sample in such a dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and a listener identity. We reflect recent advances in LD modeling, including design choices of...

10.1109/icassp43922.2022.9747222 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
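The listener-dependent formulation can be sketched as follows; the embedding table, linear head, and simple averaging below are simplified stand-ins for LDNet's actual encoder and inference schemes:

```python
import numpy as np

rng = np.random.default_rng(3)
N_LISTENERS, D, E = 5, 16, 4
listener_emb = rng.normal(scale=0.1, size=(N_LISTENERS, E))  # one embedding per listener
W = rng.normal(scale=0.1, size=D + E)                        # toy linear head

def predict_ld(feats, listener_id):
    """Listener-dependent score: utterance features concatenated with a
    listener embedding, then squashed into the MOS range [1, 5]."""
    h = np.concatenate([feats, listener_emb[listener_id]])
    z = h @ W
    return 1.0 + 4.0 / (1.0 + np.exp(-z))

def predict_mos(feats):
    """Inference by averaging listener-wise predictions to estimate the MOS
    (LDNet also explores a dedicated mean-listener scheme)."""
    return float(np.mean([predict_ld(feats, i) for i in range(N_LISTENERS)]))

feats = rng.normal(size=D)   # pooled utterance features (random stand-in)
mos = predict_mos(feats)
assert 1.0 <= mos <= 5.0
```

Conditioning on the listener identity lets every individual rating supervise the model instead of only the per-sample mean.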

An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of VAE using the STRAIGHT spectra for VC. However, other types of spectral features such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of using one specific...

10.1109/iscslp.2018.8706604 article EN 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2018-11-01

This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a mapping of speech feature sequences from one speaker to another. Here, the main idea we propose is an extension of the VTN that can simultaneously learn mappings among multiple speakers....

10.1109/taslp.2020.3047262 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2020-12-24

This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach to voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, and then use the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a seq2seq framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation...

10.21437/vcc_bc.2020-24 preprint EN 2020-10-16

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train an acoustic model (AM) from scratch, we believe ASR can be realized by simply fine-tuning BERT. As an initial study, we demonstrate the effectiveness...

10.1109/icassp39728.2021.9413668 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
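The underlying intuition (a strong LM narrows the candidate set, and the acoustics only need to disambiguate among those candidates) can be caricatured in a few lines of Python; the bigram table and character-overlap "acoustic" score below are hand-made stand-ins, not the paper's method:

```python
# Toy bigram "LM": P(next | prev). Fine-tuned BERT plays this role in the
# paper; this hand-made table is only a stand-in.
LM = {
    "the": {"cat": 0.40, "mat": 0.35, "bat": 0.25},
    "cat": {"sat": 0.70, "ran": 0.30},
}

def acoustic_score(candidate, observed):
    """Toy acoustic clue: character-overlap similarity between a candidate
    word and the (noisy) word heard in the audio. A real AM scores frames."""
    a, b = set(candidate), set(observed)
    return len(a & b) / len(a | b)

def decode_next(prev_word, observed):
    """Pick the next word by combining the LM prior with the acoustic clue:
    the LM restricts the candidates, the acoustics pick among them."""
    cands = LM[prev_word]
    return max(cands, key=lambda w: cands[w] * acoustic_score(w, observed))

assert decode_next("the", "bat") == "bat"  # acoustics override the LM prior
assert decode_next("the", "mat") == "mat"
```

When the LM prior is sharp, even a weak acoustic signal suffices, which is the core bet the paper makes.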

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V and acoustic features...

10.1109/icassp39728.2021.9415079 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
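The discretization step at the heart of this approach, mapping each continuous frame to its nearest codebook entry, can be sketched as follows; the random codebook and dimensions are illustrative stand-ins for the learned vq-wav2vec quantizer:

```python
import numpy as np

rng = np.random.default_rng(4)
codebook = rng.normal(size=(8, 6))   # 8 "learned" codes of dim 6 (random stand-ins)

def quantize(frames):
    """Map each continuous frame (T, 6) to the index of its nearest codebook
    entry. The resulting discrete IDs are assumed to drop speaker identity
    while keeping linguistic content."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)   # (T,) discrete unit IDs

frames = rng.normal(size=(30, 6))
ids = quantize(frames)
assert ids.shape == (30,)
assert ids.min() >= 0 and ids.max() < 8
```

Because any speaker's frames collapse onto the same small inventory of units, a decoder trained only on the target speaker can resynthesize speech from an arbitrary source speaker's unit sequence.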

We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction, with three tracks for different voice evaluation scenarios. Ten teams from industry and academia in seven countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability,...

10.1109/asru57964.2023.10389763 article EN 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023-12-16