- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Voice and Speech Disorders
- Topic Modeling
- Phonetics and Phonology Research
- Speech and dialogue systems
- Cancer Immunotherapy and Biomarkers
- Advanced Data Compression Techniques
- Dysphagia Assessment and Management
- CAR-T cell therapy research
- COVID-19 diagnosis using AI
- Infant Health and Development
- Immunotherapy and Immune Responses
- Software Reliability and Analysis Research
- Business Process Modeling and Analysis
- Superconducting Materials and Applications
- BIM and Construction Integration
- Neural Networks and Applications
- Data Mining Algorithms and Applications
- Construction Project Management and Performance
- Machine Learning and Data Classification
- Asian Culture and Media Studies
- Cancer Research and Treatments
Nagoya University
2019-2025
Academia Sinica
2019-2021
Institute of Information Science, Academia Sinica
2018-2021
Google (United States)
2021
Nagoya City University
2020
Nanjing University of Science and Technology
2020
Nanjing University
2020
Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt the convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, termed as MOSNet. The proposed models are tested on large-scale listening test results of the Voice Conversion...
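As a rough illustration of the kind of convolutional-plus-recurrent MOS predictor described here, the sketch below builds a small spectrogram-to-score regressor in PyTorch. The layer sizes, frequency pooling, and the SimpleMOSPredictor name are illustrative assumptions, not the published MOSNet configuration.

```python
import torch
import torch.nn as nn

class SimpleMOSPredictor(nn.Module):
    """CNN + BLSTM regressor mapping a magnitude spectrogram to an
    utterance-level MOS estimate. Layer sizes are illustrative only."""

    def __init__(self):
        super().__init__()
        # 2-D convolutions over (time, frequency); stride 3 on the
        # frequency axis gradually reduces the spectral resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=(1, 3), padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=(1, 3), padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(128, 1)   # frame-level quality score

    def forward(self, spec):                  # spec: (B, T, F)
        x = self.cnn(spec.unsqueeze(1))       # (B, 32, T, F')
        x = x.mean(dim=-1).transpose(1, 2)    # pool frequency -> (B, T, 32)
        x, _ = self.blstm(x)                  # (B, T, 128)
        frame_scores = self.frame_head(x)     # (B, T, 1)
        return frame_scores.mean(dim=1)       # utterance-level MOS (B, 1)

model = SimpleMOSPredictor()
fake_spec = torch.randn(2, 200, 257)          # batch of 2 dummy utterances
print(model(fake_spec).shape)                 # torch.Size([2, 1])
```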
Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction, including MOSNet and self-supervised speech models...
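This entry mentions MOS predictors built on self-supervised speech models; below is a minimal sketch of that idea, assuming torchaudio's WAV2VEC2_BASE bundle and a simple mean-pool-plus-linear head. These are stand-in choices for illustration, not the specific systems compared in the paper.

```python
import torch
import torch.nn as nn
import torchaudio

class SSLMOSPredictor(nn.Module):
    """Mean-pool wav2vec 2.0 features and regress a single MOS value."""

    def __init__(self):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE
        self.ssl = bundle.get_model()          # pretrained, fine-tuned end to end
        self.head = nn.Linear(768, 1)          # 768 = base model hidden size

    def forward(self, wav):                    # wav: (B, samples) at 16 kHz
        layers, _ = self.ssl.extract_features(wav)
        pooled = layers[-1].mean(dim=1)        # (B, 768) utterance embedding
        return self.head(pooled)               # (B, 1) predicted MOS

model = SSLMOSPredictor()
score = model(torch.randn(1, 16000))           # 1 s of dummy audio
```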
Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy Liu, Cheng-I Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
The clinical success of the immune checkpoint inhibitor (ICI) targeting programmed cell death protein 1 (PD-1) has revolutionized cancer treatment. However, the full potential of PD-1 blockade therapy remains unrealized, as response rates are still low across many cancer types. Interleukin-2 (IL-2)-based immunotherapies hold promise, as they can stimulate robust T cell expansion and enhance effector function - activities that could synergize potently with PD-1 blockade. Yet, IL-2 therapies also carry a significant...
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, making them far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized...
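To make the transfer idea concrete, here is a hedged sketch of initializing a seq2seq VC model from pretrained ASR/TTS modules. The Seq2SeqVC class, its module layout, and the stand-in "pretrained" networks are assumptions for illustration only, not the authors' recipe.

```python
import torch.nn as nn

def transformer_encoder(d_model=256, layers=4):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), layers)

def transformer_decoder(d_model=256, layers=4):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), layers)

# Stand-ins for models pretrained on large ASR / TTS corpora. In practice
# these would be loaded from checkpoints; here they are freshly built so
# the sketch runs end to end.
pretrained_asr_encoder = transformer_encoder()
pretrained_tts_decoder = transformer_decoder()

class Seq2SeqVC(nn.Module):
    """Hypothetical seq2seq VC model whose encoder/decoder share their
    architecture with the pretrained ASR encoder and TTS decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = transformer_encoder()
        self.decoder = transformer_decoder()
        self.out = nn.Linear(256, 80)          # 80-dim mel frames

    def forward(self, src, tgt):
        return self.out(self.decoder(tgt, self.encoder(src)))

vc = Seq2SeqVC()
# Knowledge transfer: copy the pretrained parameters in, then fine-tune the
# whole model on the (much smaller) parallel VC corpus.
vc.encoder.load_state_dict(pretrained_asr_encoder.state_dict())
vc.decoder.load_state_dict(pretrained_tts_decoder.state_dict())
```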
We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), thus naming the event the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge ran for two months, and in total we received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test,...
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that this success came from more disentangled latent representations. In this article, we extend...
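A bare-bones sketch of the underlying VAE-VC idea (speaker-conditioned decoding of a hopefully speaker-independent latent) is given below. The frame-wise encoder/decoder, dimensions, and unweighted loss are illustrative assumptions; the cross-domain extension with multiple feature types is only noted in the comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEVC(nn.Module):
    """Frame-wise VAE-VC sketch: the encoder compresses a spectral frame into
    a (hopefully speaker-independent) latent, and the decoder reconstructs it
    conditioned on a speaker embedding. The CDVAE extension would add another
    encoder/decoder pair for a second feature type sharing the latent space."""

    def __init__(self, feat_dim=40, latent_dim=16, n_speakers=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.spk_emb = nn.Embedding(n_speakers, 16)
        self.dec = nn.Sequential(nn.Linear(latent_dim + 16, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, x, spk_id):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        x_hat = self.dec(torch.cat([z, self.spk_emb(spk_id)], dim=-1))
        recon = F.l1_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return x_hat, recon + kl

model = VAEVC()
frames = torch.randn(8, 40)              # 8 dummy spectral frames
spk = torch.randint(0, 4, (8,))
converted, loss = model(frames, spk)     # at conversion time, pass the
                                         # target speaker's id instead
```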
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim is to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker...
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. It now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are...
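For context, a minimal usage sketch of ESPnet2's pretrained-model interface follows. It assumes the Text2Speech API, the .fs attribute, and the example model tag shown here; the current ESPnet documentation should be treated as authoritative for the exact interface and available models.

```python
# Download a community-provided pretrained TTS model and synthesize a sentence.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")  # example model tag
out = tts("End-to-end speech processing with ESPnet.")
sf.write("sample.wav", out["wav"].numpy(), tts.fs)              # save the waveform
```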
This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representations adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) as well as any-to-any...
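The recognition-synthesis recipe can be sketched as a frozen S3R upstream feeding a small trainable decoder. The HuBERT bundle, decoder shape, and training setup below are illustrative assumptions, not the S3PRL-VC implementation.

```python
import torch
import torch.nn as nn
import torchaudio

class S3RToMelDecoder(nn.Module):
    """Small trainable downstream: S3R features -> target-speaker mel frames."""
    def __init__(self, in_dim=768, hidden=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, feats):                    # feats: (B, T, 768)
        x, _ = self.lstm(torch.relu(self.proj(feats)))
        return self.mel_head(x)                  # (B, T, 80)

s3r = torchaudio.pipelines.HUBERT_BASE.get_model().eval()   # frozen upstream (S3R)
decoder = S3RToMelDecoder()                                  # trained on the target
                                                             # speaker only (A2O setting)
wav = torch.randn(1, 16000)                                  # 1 s of dummy 16 kHz audio
with torch.no_grad():
    feats = s3r.extract_features(wav)[0][-1]                 # last-layer features
mels = decoder(feats)                                        # target-speaker mels
```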
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, the data-hungry property of such models and the mispronunciation...
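A toy mel-to-mel Transformer in the spirit of this model is sketched below. Prenets, positional encodings, the stop-token predictor, the postnet, and the TTS pretraining step are all omitted, and the TinyVoiceTransformer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyVoiceTransformer(nn.Module):
    """Bare-bones mel-to-mel Transformer sketch (positional encodings and the
    usual prenet/postnet/stop-token machinery are omitted for brevity)."""

    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.src_prenet = nn.Linear(n_mels, d_model)
        self.tgt_prenet = nn.Linear(n_mels, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            dim_feedforward=512, batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, src_mel, tgt_mel):
        # Causal mask: each decoder step only attends to past target frames.
        causal = self.transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        out = self.transformer(self.src_prenet(src_mel),
                               self.tgt_prenet(tgt_mel),
                               tgt_mask=causal)
        return self.mel_out(out)

model = TinyVoiceTransformer()
src = torch.randn(2, 120, 80)     # source-speaker mel frames
tgt = torch.randn(2, 100, 80)     # teacher-forced target-speaker mels
pred = model(src, tgt)            # (2, 100, 80)
```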
The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods....
An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We reflect recent advances in LD modeling, including design choices of...
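The listener-dependent idea can be sketched as conditioning the predictor on a listener embedding and averaging over listeners at inference. The encoder, sizes, and the simple averaging strategy below are illustrative assumptions rather than LDNet's exact design.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """Predicts a listener-wise score from (speech, listener id) pairs; trained
    against each listener's individual rating instead of the mean score."""

    def __init__(self, n_mels=80, n_listeners=300, d=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d, batch_first=True)
        self.listener_emb = nn.Embedding(n_listeners, d)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, mel, listener_id):              # mel: (B, T, 80)
        h, _ = self.encoder(mel)
        utt = h.mean(dim=1)                           # (B, d) utterance embedding
        cond = self.listener_emb(listener_id)         # (B, d) listener embedding
        return self.head(torch.cat([utt, cond], -1))  # (B, 1) listener-wise score

model = ListenerDependentMOS()
mel = torch.randn(1, 200, 80)
# Inference: average predictions over all known listener identities to obtain
# a MOS estimate (one of several possible inference strategies).
all_ids = torch.arange(300)
mos = model(mel.expand(300, -1, -1), all_ids).mean()
```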
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of VAEs using the STRAIGHT spectra for VC. However, other types of spectral features such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of one specific...
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a mapping of speech feature sequences from one speaker to another. Here, the main idea we propose is an extension of the VTN that can simultaneously learn mappings among multiple speakers....
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed by using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a seq2seq framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation...
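The cascade itself reduces to "ASR, then TTS". The sketch below shows only that control flow, with recognize() and synthesize() as hypothetical placeholders standing in for pretrained ASR and target-speaker TTS models (for example, ESPnet recipes).

```python
from pathlib import Path

def recognize(wav_path: Path) -> str:
    """Placeholder: run a pretrained ASR model and return the transcription."""
    return "this is a dummy transcription"

def synthesize(text: str, out_path: Path) -> None:
    """Placeholder: run a TTS model trained on the target speaker."""
    out_path.write_bytes(b"")          # a real system would write a waveform here

def convert(src_wav: Path, out_wav: Path) -> None:
    # The source speaker identity is discarded at the text bottleneck; the TTS
    # model re-renders the content entirely in the target speaker's voice.
    text = recognize(src_wav)
    synthesize(text, out_wav)

convert(Path("source_utterance.wav"), Path("converted.wav"))
```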
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train an acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness...
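A hedged sketch of the "fine-tune BERT with an acoustic clue" idea follows, using Hugging Face's BertModel. Fusing the clue by adding a projected acoustic feature to the history representation is an illustrative simplification, not necessarily the paper's exact mechanism.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertASRSketch(nn.Module):
    """Predict the next token from the BERT-encoded text history plus a
    projected acoustic feature acting as a clue. Fusion scheme and sizes
    are illustrative assumptions."""

    def __init__(self, acoustic_dim=80):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.acoustic_proj = nn.Linear(acoustic_dim, self.bert.config.hidden_size)
        self.classifier = nn.Linear(self.bert.config.hidden_size,
                                    self.bert.config.vocab_size)

    def forward(self, input_ids, attention_mask, acoustic_clue):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        # Final position's state summarizes the history; add the acoustic clue
        # for the token to be predicted.
        fused = h[:, -1, :] + self.acoustic_proj(acoustic_clue)
        return self.classifier(fused)            # next-token logits

tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["speech recognition is"], return_tensors="pt")
model = BertASRSketch()
logits = model(batch["input_ids"], batch["attention_mask"], torch.randn(1, 80))
```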
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features...
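The A2O recipe, mapping discrete self-supervised units to the fixed target speaker's acoustic features, can be sketched as below. The codebook size, GRU encoder-decoder, and dimensions are placeholder assumptions rather than the actual seq2seq model.

```python
import torch
import torch.nn as nn

class UnitToMelSeq2Seq(nn.Module):
    """Embed discrete self-supervised units (e.g. vq-wav2vec codebook indices)
    and decode them into the fixed target speaker's mel frames."""

    def __init__(self, n_units=320, d=256, n_mels=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d)
        self.encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * d, d, batch_first=True)
        self.mel_head = nn.Linear(d, n_mels)

    def forward(self, units):                     # units: (B, T) integer indices
        enc, _ = self.encoder(self.unit_emb(units))
        dec, _ = self.decoder(enc)
        return self.mel_head(dec)                 # (B, T, 80) target-speaker mels

model = UnitToMelSeq2Seq()
units = torch.randint(0, 320, (2, 150))           # units from any source speaker
target_mels = model(units)                        # rendered in the target voice
```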
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for different voice evaluation scenarios. Ten teams from industry and academia in seven countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability,...