Shan Yang

ORCID: 0000-0003-4464-146X
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Video Surveillance and Tracking Methods
  • Human Pose and Action Recognition
  • Speech and Dialogue Systems
  • Face Recognition and Analysis
  • Machine Learning in Healthcare
  • Advanced Neural Network Applications
  • Artificial Intelligence in Healthcare
  • Traditional Chinese Medicine Studies
  • Remote Sensing and LiDAR Applications
  • Music Technology and Sound Studies
  • Anomaly Detection Techniques and Applications
  • Advanced Image and Video Retrieval Techniques
  • Health Disparities and Outcomes
  • Autonomous Vehicle Technology and Safety
  • Gait Recognition and Analysis
  • Voice and Speech Disorders
  • Domain Adaptation and Few-Shot Learning
  • Human Motion and Animation
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Text Analysis Techniques

Tencent (China)
2021-2025

China United Network Communications Group (China)
2022-2024

Chinese Academy of Medical Sciences & Peking Union Medical College
2023-2024

Beihang University
2023-2024

China Telecom (China)
2023

Sichuan University
2023

West China Hospital of Sichuan University
2023

China Telecom
2023

Google (United States)
2021-2023

Central University of Finance and Economics
2023

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' at...

10.48550/arxiv.2107.00135 preprint EN other-oa arXiv (Cornell University) 2021-01-01
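
As a rough illustration of the fusion-bottleneck idea described above, the PyTorch sketch below restricts cross-modal attention to a small set of shared bottleneck tokens. The class name, layer sizes and the simple averaging of the bottleneck updates are assumptions for illustration, not the released model.

# Minimal sketch: cross-modal exchange happens only through shared bottleneck tokens.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4, n_bottlenecks=4):
        super().__init__()
        # One transformer encoder layer per modality (video / audio).
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottlenecks, dim))

    def forward(self, video_tokens, audio_tokens, z=None):
        b = video_tokens.size(0)
        if z is None:
            z = self.bottleneck.expand(b, -1, -1)      # shared bottleneck tokens
        nb = z.size(1)
        # Each modality attends to its own tokens plus the bottleneck tokens.
        v = self.video_layer(torch.cat([video_tokens, z], dim=1))
        a = self.audio_layer(torch.cat([audio_tokens, z], dim=1))
        video_tokens, z_v = v[:, :-nb], v[:, -nb:]
        audio_tokens, z_a = a[:, :-nb], a[:, -nb:]
        # Fuse: average the bottleneck updates from both modalities.
        z = 0.5 * (z_v + z_a)
        return video_tokens, audio_tokens, z

# Toy usage with random video and audio token sequences.
layer = BottleneckFusionLayer()
v, a, z = layer(torch.randn(2, 32, 256), torch.randn(2, 48, 256))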

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with a multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More...

10.1109/slt48900.2021.9383551 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19
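
The multi-resolution STFT loss mentioned above can be sketched as follows; the FFT, hop and window sizes are common choices rather than the exact configuration from the paper.

# Hedged sketch of a multi-resolution STFT loss (spectral convergence + log-magnitude terms).
import torch

def stft_mag(x, fft_size, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, fft_size, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    loss = 0.0
    for fft_size, hop, win in resolutions:
        f = stft_mag(fake, fft_size, hop, win)
        r = stft_mag(real, fft_size, hop, win)
        sc = torch.norm(r - f, p="fro") / torch.norm(r, p="fro")       # spectral convergence
        mag = torch.nn.functional.l1_loss(torch.log(f), torch.log(r))  # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(resolutions)

# Toy usage on random 1-second waveforms at 16 kHz.
loss = multi_resolution_stft_loss(torch.randn(2, 16000), torch.randn(2, 16000))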

Abstract Background The key to modern drug discovery is to find, identify and prepare molecular targets. However, due to the influence of throughput, precision and cost, traditional experimental methods are difficult to apply widely to infer these potential Drug-Target Interactions (DTIs). Therefore, it is urgent to develop effective computational methods to validate the interaction between drugs and targets. Methods We developed a deep learning-based model for DTI prediction. The proteins' evolutionary features are extracted via Position...

10.1186/s12911-020-1052-0 article EN cc-by BMC Medical Informatics and Decision Making 2020-03-01

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels....

10.1109/taslp.2022.3145293 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

An emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough, with category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers - one after the encoder output and one after the decoder output - to enhance...

10.1109/iscslp49672.2021.9362069 article EN 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021-01-24
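
A minimal sketch of the auxiliary-classifier idea, assuming mean-pooled encoder and decoder outputs and hypothetical module names; it only illustrates how the two classification losses would be added to the TTS training objective.

# Illustrative sketch: emotion classifiers after the encoder and decoder outputs.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, dim, n_emotions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_emotions))

    def forward(self, seq):                 # seq: (B, T, dim)
        pooled = seq.mean(dim=1)            # average over time
        return self.net(pooled)             # emotion logits

def auxiliary_emotion_loss(encoder_out, decoder_out, emotion_id,
                           clf_enc, clf_dec, weight=1.0):
    ce = nn.functional.cross_entropy
    return weight * (ce(clf_enc(encoder_out), emotion_id) +
                     ce(clf_dec(decoder_out), emotion_id))

# Toy usage: 4 emotion classes, batch of 2 utterances.
clf_enc, clf_dec = EmotionClassifier(256, 4), EmotionClassifier(80, 4)
loss = auxiliary_emotion_loss(torch.randn(2, 50, 256), torch.randn(2, 200, 80),
                              torch.tensor([0, 3]), clf_enc, clf_dec)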

Shan Yang, Yongfei Zhang, Guanglin Niu, Qinghua Zhao, Shiliang Pu. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021.

10.18653/v1/2021.acl-short.124 article EN cc-by 2021-01-01

This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of synthesized speech. Such coarse labels cannot control the details of emotion, resulting in an averaged expression delivery, and it is also hard to choose a suitable reference during inference. To conduct fine-grained generation, we introduce phoneme-level emotion strength representations through a learned...

10.1109/slt48900.2021.9383524 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem,...

10.1109/asru.2017.8269003 preprint EN 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017-12-01
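
The MTL combination of regression and adversarial objectives can be sketched as below; the discriminator architecture, feature shapes and loss weight are illustrative assumptions, not the paper's exact setup.

# Hedged sketch: MSE acoustic loss plus a weighted adversarial term.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def generator_loss(pred_feats, target_feats, disc, adv_weight=0.1):
    mse = nn.functional.mse_loss(pred_feats, target_feats)
    # Fool the discriminator: generated features should be scored as "real".
    adv = bce(disc(pred_feats), torch.ones(pred_feats.size(0), 1))
    return mse + adv_weight * adv

def discriminator_loss(pred_feats, target_feats, disc):
    real = bce(disc(target_feats), torch.ones(target_feats.size(0), 1))
    fake = bce(disc(pred_feats.detach()), torch.zeros(pred_feats.size(0), 1))
    return real + fake

# Toy discriminator over flattened acoustic feature trajectories (100 frames x 60 dims).
disc = nn.Sequential(nn.Flatten(), nn.Linear(100 * 60, 256), nn.LeakyReLU(0.2),
                     nn.Linear(256, 1))
g_loss = generator_loss(torch.randn(4, 100, 60), torch.randn(4, 100, 60), disc)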

Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional models, and several approaches like global style tokens have been proposed to explore the controllability of the model. Although existing methods show good performance in style disentanglement and transfer, they are still unable to control the explicit emotion of the generated speech. In this paper, we mainly focus on subtle and expressive emotional speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector...

10.1109/asru46091.2019.9003829 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01
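
One way to realise the discrete emotional vector described above is to scale a one-hot emotion code by a strength value and concatenate it to the encoder outputs at every step; the sketch below is an assumption about the interface, not the paper's exact conditioning scheme.

# Sketch of a scaled one-hot emotion condition for a seq2seq TTS encoder.
import torch

def emotion_condition(encoder_out, emotion_id, strength, n_emotions=4):
    # encoder_out: (B, T, D); emotion_id: (B,); strength in [0, 1]: (B,)
    one_hot = torch.nn.functional.one_hot(emotion_id, n_emotions).float()
    emo_vec = one_hot * strength.unsqueeze(-1)           # scaled discrete emotion vector
    emo_vec = emo_vec.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, emo_vec], dim=-1)     # (B, T, D + n_emotions)

# Emotion id 1 at half strength for one utterance, full strength for another.
cond = emotion_condition(torch.randn(2, 50, 256),
                         torch.tensor([1, 1]), torch.tensor([0.5, 1.0]))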

Abstract Investing in human capital can assist in achieving technological innovations, while the spatial spillover effects of human capital on urban innovation agglomeration are largely ignored. Using panel data of 108 cities in China’s Yangtze River Economic Belt (YREB) during 2011–2020, this paper explores the interactions between human capital and urban innovation agglomeration with a two-way fixed-effects Spatial Durbin Model framework, which incorporates the interpretation of spatial spillover effects. The results show that human capital in the YREB has a heterogeneous structure, which is reflected in its diffusion from...

10.1057/s41599-023-01809-5 article EN cc-by Humanities and Social Sciences Communications 2023-06-29

Controllable speech generation methods typically rely on a single or fixed prompt, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by...

10.48550/arxiv.2501.04644 preprint EN arXiv (Cornell University) 2025-01-08

10.1109/icassp49660.2025.10889767 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model and audio synthesis module. The key of such systems lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model...

10.1109/access.2019.2914149 article EN cc-by-nc-nd IEEE Access 2019-01-01
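
Because the soft alignment acts as an implicit duration model, per-phoneme durations can be read off it by counting, for each encoder position, how many decoder frames attend to it most strongly. The snippet below is a generic illustration, not tied to a specific Tacotron2 implementation.

# Extract frame counts per phoneme from a soft attention alignment.
import torch

def durations_from_attention(attn):
    # attn: (decoder_steps, encoder_steps) soft alignment matrix
    assignment = attn.argmax(dim=1)                          # best phoneme per frame
    n_phones = attn.size(1)
    return torch.bincount(assignment, minlength=n_phones)    # frames per phoneme

# Toy alignment: 10 decoder frames over 4 phonemes.
attn = torch.softmax(torch.randn(10, 4), dim=1)
print(durations_from_attention(attn))                        # e.g. tensor([3, 2, 4, 1])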

Data efficient voice cloning aims at synthesizing a target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to the target speaker through direct model update, while in the latter, only a few seconds of target speaker audio directly goes through an extra speaker encoding model along with the base model to synthesize the target speaker's voice without model update. Nevertheless, both methods need clean target speaker data. However, the data provided by the user may inevitably contain acoustic noise...

10.21437/interspeech.2020-2530 article EN Interspeech 2020 2020-10-25
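
The speaker-encoding route can be sketched as below: each enrollment clip is embedded by a speaker encoder (here a toy, randomly initialised one) and the embeddings are averaged into a single conditioning vector, with no model update. This also shows why noisy enrollment audio directly pollutes the condition.

# Minimal speaker-encoding sketch: average enrollment embeddings into one speaker vector.
import torch
import torch.nn as nn

class TinySpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels):                      # mels: (B, T, n_mels)
        _, h = self.rnn(mels)
        return nn.functional.normalize(h[-1], dim=-1)   # unit-norm embedding

def enrollment_embedding(encoder, clips):
    # clips: list of (T_i, n_mels) mel-spectrograms from the target speaker
    embs = [encoder(c.unsqueeze(0)) for c in clips]
    return torch.stack(embs).mean(dim=0)          # (1, dim) speaker vector

enc = TinySpeakerEncoder()
spk = enrollment_embedding(enc, [torch.randn(120, 80), torch.randn(90, 80)])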

This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker's voice to a target speaker with a non-native accent. This problem is challenging as each speaker only has training data in their native accent, and we need to disentangle the accent and speaker information in the conversion model and re-combine them in the synthesis stage. In our recognition-synthesis conversion framework, we manage to solve this problem with two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for differently accented speakers. This aims to wipe...

10.1109/iscslp49672.2021.9362120 article EN 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021-01-24

Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make evaluations conditioned on individual tokens from different GST layers. As the number...

10.1109/asru46091.2019.9003859 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01
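
A hedged sketch of a residual, multi-layer style-token module follows: each layer attends over its own bank of learnable tokens and passes the residual of the reference embedding to the next layer, so different layers can capture different granularities. Token counts and dimensions are illustrative, not the paper's configuration.

# Residual hierarchy of style-token layers over a reference embedding.
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    def __init__(self, dim=256, n_tokens=10, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query):                       # query: (B, 1, dim)
        tokens = self.tokens.unsqueeze(0).expand(query.size(0), -1, -1)
        style, _ = self.attn(query, tokens, tokens)
        return style                                # (B, 1, dim)

class HierarchicalGST(nn.Module):
    def __init__(self, dim=256, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(StyleTokenLayer(dim) for _ in range(n_layers))

    def forward(self, ref_embedding):               # (B, 1, dim) from a reference encoder
        residual, styles = ref_embedding, []
        for layer in self.layers:
            style = layer(residual)
            styles.append(style)
            residual = residual - style             # pass the residual to the next layer
        return torch.cat(styles, dim=-1)            # concatenated multi-level style

gst = HierarchicalGST()
style = gst(torch.randn(2, 1, 256))                 # (2, 1, 768)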

The current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the former predicts a low-resolution intermediate representation such as the Mel-spectrum, while the latter generates the waveform from that representation. Although the intermediate representation serves as a bridge, there still exists a critical mismatch between the acoustic model and the vocoder, as they are commonly learned separately and work on different distributions of the intermediate representation, leading to inevitable artifacts in the synthesized speech. In this work, different from using a pre-designed intermediate representation as in most previous studies, we...

10.21437/interspeech.2021-414 article EN Interspeech 2021 2021-08-27

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic...

10.1109/icassp43922.2022.9747427 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
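
The vector-quantization step that yields discrete, phoneme-like content units can be sketched as below (the CPC training objective itself is omitted); codebook size, feature dimension and the commitment weight are assumptions rather than the paper's configuration.

# Sketch of a vector quantizer with a straight-through gradient estimator.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                       # z: (B, T, dim) continuous content features
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, book)                       # (B, T, n_codes)
        idx = dist.argmin(dim=-1)                         # discrete unit ids
        z_q = self.codebook(idx)
        # Codebook + commitment losses; straight-through gradient for the encoder.
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z - z_q.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
z_q, units, vq_loss = vq(torch.randn(2, 100, 64))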

Chronic disease multimorbidity is prevalent among older Chinese people, seriously affecting their well-being and quality of life.

10.46234/ccdcw2024.156 article EN China CDC Weekly 2024-01-01

By borrowing emotional expressions from an emotional source speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers without emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to trade off between speaker similarity and emotion expression in the synthetic speech of the target speaker. In this letter, we propose to disentangle emotion and timbre through information perturbation and so conduct emotion transfer, which effectively learns the emotion while maintaining the target speaker's timbre. Specifically, we separately perturb...

10.1109/lsp.2022.3203888 article EN IEEE Signal Processing Letters 2022-01-01
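
The information-perturbation idea can be illustrated with simple signal-level operators such as random resampling and a random spectral tilt applied to the source audio before it supervises emotion modelling; these are common choices for such perturbation, not necessarily the exact operators used in the letter.

# Illustrative perturbations that corrupt speaker-related cues in the source audio.
import numpy as np
from scipy.signal import resample, lfilter

def random_resample(wav, low=0.85, high=1.15, rng=np.random):
    # Uniformly rescale the time axis, shifting pitch and formants together.
    ratio = rng.uniform(low, high)
    return resample(wav, int(len(wav) * ratio))

def random_tilt(wav, rng=np.random):
    # Random one-pole spectral tilt as a crude "random EQ".
    a = rng.uniform(-0.4, 0.4)
    return lfilter([1.0], [1.0, a], wav)

rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)                 # stand-in for 1 s of 16 kHz speech
perturbed = random_tilt(random_resample(wav, rng=rng), rng=rng)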