- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Speech and Dialogue Systems
- Face Recognition and Analysis
- Machine Learning in Healthcare
- Advanced Neural Network Applications
- Artificial Intelligence in Healthcare
- Traditional Chinese Medicine Studies
- Remote Sensing and LiDAR Applications
- Music Technology and Sound Studies
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Health Disparities and Outcomes
- Autonomous Vehicle Technology and Safety
- Gait Recognition and Analysis
- Voice and Speech Disorders
- Domain Adaptation and Few-Shot Learning
- Human Motion and Animation
- Generative Adversarial Networks and Image Synthesis
- Advanced Text Analysis Techniques
Tencent (China)
2021-2025
China United Network Communications Group (China)
2022-2024
Chinese Academy of Medical Sciences & Peking Union Medical College
2023-2024
Beihang University
2023-2024
China Telecom (China)
2023
Sichuan University
2023
West China Hospital of Sichuan University
2023
Google (United States)
2021-2023
Central University of Finance and Economics
2023
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' at...
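A minimal sketch of the bottleneck-fusion idea (module and tensor names are my own, and the token counts are illustrative, not the paper's configuration): each modality's transformer block attends only to its own tokens plus a small shared set of bottleneck tokens, so cross-modal information must pass through that narrow interface.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends to [its own tokens + shared bottleneck tokens]."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, video, audio, bottleneck):
        # Video stream updates its own tokens and a copy of the bottleneck
        v = self.video_block(torch.cat([video, bottleneck], dim=1))
        video, b_v = v[:, :video.size(1)], v[:, video.size(1):]
        # Audio stream does the same with the shared bottleneck
        a = self.audio_block(torch.cat([audio, bottleneck], dim=1))
        audio, b_a = a[:, :audio.size(1)], a[:, audio.size(1):]
        # Average the two bottleneck updates so information flows across modalities
        return video, audio, (b_v + b_a) / 2

# Usage: 2 clips, 196 video patch tokens, 98 audio spectrogram tokens, 4 bottleneck tokens
video, audio = torch.randn(2, 196, 256), torch.randn(2, 98, 256)
bottleneck = torch.randn(2, 4, 256)
video, audio, bottleneck = BottleneckFusionLayer()(video, audio, bottleneck)
```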
In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, these improvements lead to both quality and training stability gains. More...
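The multi-resolution STFT loss compares magnitude spectrograms under several analysis settings and averages the results. Below is a hedged sketch: the spectral-convergence plus log-magnitude decomposition and the resolution triplets are a common formulation, not necessarily this paper's exact one.

```python
import torch

def stft_loss(fake, real, fft_size, hop, win_len):
    """Spectral convergence + log-magnitude L1 at one STFT resolution."""
    window = torch.hann_window(win_len)
    spec_f = torch.stft(fake, fft_size, hop, win_len, window, return_complex=True).abs()
    spec_r = torch.stft(real, fft_size, hop, win_len, window, return_complex=True).abs()
    sc = torch.norm(spec_r - spec_f, p="fro") / torch.norm(spec_r, p="fro")
    mag = torch.nn.functional.l1_loss(torch.log(spec_f + 1e-7), torch.log(spec_r + 1e-7))
    return sc + mag

def multi_resolution_stft_loss(fake, real):
    # Illustrative (fft_size, hop, win_len) triplets covering coarse-to-fine resolutions
    resolutions = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]
    return sum(stft_loss(fake, real, *r) for r in resolutions) / len(resolutions)

# Usage: batched waveforms of shape (batch, samples)
loss = multi_resolution_stft_loss(torch.randn(2, 24000), torch.randn(2, 24000))
```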
Abstract Background The key to modern drug discovery is to find, identify and prepare molecular targets. However, due to the influence of throughput, precision and cost, traditional experimental methods are difficult to apply widely to infer these potential Drug-Target Interactions (DTIs). Therefore, it is urgent to develop effective computational methods to validate the interaction between drugs and targets. Methods We developed a deep learning-based model for DTIs prediction. The proteins' evolutionary features are extracted via Position...
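As a rough illustration only (the abstract is truncated before the architecture details, so every layer, name, and dimension below is an assumption): a deep DTI predictor is often a two-branch network that embeds a drug descriptor and a protein feature vector and classifies the pair.

```python
import torch
import torch.nn as nn

class DTIModel(nn.Module):
    """Hypothetical two-branch DTI classifier: one branch for drug descriptors
    (e.g., a fingerprint), one for protein evolutionary features, fused into
    a single interaction logit."""
    def __init__(self, drug_dim=1024, protein_dim=400, hidden=256):
        super().__init__()
        self.drug_net = nn.Sequential(nn.Linear(drug_dim, hidden), nn.ReLU())
        self.prot_net = nn.Sequential(nn.Linear(protein_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))  # interaction logit

    def forward(self, drug_fp, protein_feat):
        fused = torch.cat([self.drug_net(drug_fp), self.prot_net(protein_feat)], dim=-1)
        return self.head(fused)
```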
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and ignore the multi-scale nature of prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels...
Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough, with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers - one after the reference encoder, one after the decoder output - to enhance...
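A hedged sketch of the two-classifier idea (module names and the loss weighting are placeholders, not the paper's exact design): auxiliary emotion classifiers on the reference embedding and on the decoder output add cross-entropy terms to the reconstruction loss, sharpening the emotion-discriminative structure of the embedding space.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, in_dim, n_emotions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, n_emotions))

    def forward(self, x):
        # Pool over time if the input is a sequence (e.g., predicted mel frames)
        if x.dim() == 3:
            x = x.mean(dim=1)
        return self.net(x)

def total_loss(mel_pred, mel_target, ref_embedding, emotion_id,
               clf_ref, clf_dec, alpha=1.0):
    recon = nn.functional.l1_loss(mel_pred, mel_target)
    ce = nn.functional.cross_entropy
    # Classify emotion both from the reference embedding and the decoder output
    aux = ce(clf_ref(ref_embedding), emotion_id) + ce(clf_dec(mel_pred), emotion_id)
    return recon + alpha * aux
```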
Shan Yang, Yongfei Zhang, Guanglin Niu, Qinghua Zhao, Shiliang Pu. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021.
This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of the synthesized speech. Such coarse labels cannot control the details of emotion, resulting in an averaged expression delivery, and it is also hard to choose a suitable reference during inference. To conduct fine-grained emotion generation, we introduce phoneme-level strength representations through a learned...
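Phoneme-level strength values can come from a learned ranking function; the pairwise hinge formulation below is one standard way to train such a ranker and is an assumption, not the paper's exact method.

```python
import torch
import torch.nn as nn

class StrengthRanker(nn.Module):
    """Projects an acoustic feature vector to a scalar emotion-strength score."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.w = nn.Linear(feat_dim, 1, bias=False)

    def forward(self, feats):
        return self.w(feats).squeeze(-1)  # one strength value per phoneme segment

def ranking_loss(ranker, stronger, weaker, margin=1.0):
    # Hinge loss: score(stronger) should exceed score(weaker) by at least `margin`
    return torch.relu(margin - (ranker(stronger) - ranker(weaker))).mean()
```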
In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem,...
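The MTL combination reduces to adding an adversarial term to the usual MSE acoustic loss. A minimal sketch follows; the non-saturating GAN form and the weight `lambda_adv` are illustrative choices, not the paper's stated ones.

```python
import torch
import torch.nn as nn

def generator_loss(acoustic_pred, acoustic_target, disc, lambda_adv=0.25):
    """MSE acoustic loss plus an adversarial term from discriminator `disc`."""
    mse = nn.functional.mse_loss(acoustic_pred, acoustic_target)
    # Non-saturating GAN term: push the discriminator's output on
    # generated features toward the "real" label (1)
    d_out = disc(acoustic_pred)
    adv = nn.functional.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return mse + lambda_adv * adv
```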
Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional models, and several approaches like global style tokens are proposed to explore the controllability of the model. Although existing methods show good performance in style disentanglement and transfer, they are still unable to control the explicit emotion of the generated speech. In this paper, we mainly focus on subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector...
Abstract Investing in human capital can assist in achieving technological innovations, while the spatial spillover effects of human capital on urban innovation agglomeration are largely ignored. Using panel data of 108 cities in China’s Yangtze River Economic Belt (YREB) during 2011–2020, this paper explores the interactions between human capital and urban innovation with a two-way fixed-effects Spatial Durbin Model framework, which incorporates the interpretation of spillover effects. The results show that the YREB has a heterogeneous structure, which is reflected in its diffusion from...
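For context, a two-way fixed-effects Spatial Durbin Model is conventionally written as below (standard textbook notation; the paper's exact specification may differ), where ρ captures the endogenous spatial lag, W = (w_ij) is the spatial weight matrix, θ the spillover coefficients on neighbors' regressors, and μ_i, λ_t the city and year fixed effects:

```latex
y_{it} = \rho \sum_{j} w_{ij} y_{jt} + x_{it}\beta + \sum_{j} w_{ij} x_{jt}\theta + \mu_i + \lambda_t + \varepsilon_{it}
```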
Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing to generate a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage framework that allows for more flexible manipulation of speech attributes by...
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key of the system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model...
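To make the "implicit duration model" point concrete, here is a toy sketch (shapes and names are mine, not from the paper): once the attention alignment is monotonic, per-input durations fall out of it by counting which encoder step each decoder frame attends to most strongly.

```python
import torch

def durations_from_alignment(attn):
    """attn: (T_dec, T_enc) attention matrix from a trained Tacotron2-style model.
    Returns the number of decoder frames assigned to each encoder position."""
    hard = attn.argmax(dim=-1)                            # winning encoder index per frame
    return torch.bincount(hard, minlength=attn.size(-1))  # frames per input symbol

# Usage with a random (120 decoder frames x 40 encoder steps) alignment
attn = torch.softmax(torch.randn(120, 40), dim=-1)
print(durations_from_alignment(attn))  # rough duration (in frames) of each of the 40 inputs
```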
Data-efficient voice cloning aims at synthesizing a target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of adaptation data to transfer the multi-speaker model to the target speaker through direct model update, while in the latter, a few seconds of audio directly goes through an extra speaker encoder along with the base model to synthesize the target speaker's voice without model update. Nevertheless, both methods need clean data. However, the data provided by the user may inevitably contain acoustic noise...
This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker's voice to a target speaker with a non-native accent. The problem is challenging as each target speaker only has training data in the native accent, and we need to disentangle accent and speaker information in the model and re-combine them at the conversion stage. In our recognition-synthesis framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe...
Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns disentangled representations to model and control different style granularities in synthesized speech. We make evaluations conditioned on individual style tokens from different layers. As the number...
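A sketch of one plausible reading of "hierarchical GST with residuals" (token counts, dimensions, and layer count are illustrative assumptions): each layer holds its own bank of learnable style tokens, attends from the running style vector into that bank, and adds the result back residually before the next layer.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """One GST layer: attend from the style query into a learnable token bank."""
    def __init__(self, dim=256, n_tokens=10, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query):  # query: (B, 1, dim) reference/style embedding
        bank = self.tokens.unsqueeze(0).expand(query.size(0), -1, -1)
        out, _ = self.attn(query, bank, bank)
        return query + out  # residual connection between layers

# Stack several layers to model coarse-to-fine style granularities
layers = nn.ModuleList([StyleTokenLayer() for _ in range(3)])
style = torch.randn(2, 1, 256)
for layer in layers:
    style = layer(style)
```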
The current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the acoustic model predicts a low-resolution intermediate representation such as a Mel-spectrum, while the vocoder generates the waveform from that representation. Although the intermediate representation serves as a bridge, there still exists a critical mismatch between the two: they are commonly learned separately and work on different distributions of the intermediate representation, leading to inevitable artifacts in the synthesized speech. In this work, instead of using a pre-designed intermediate representation as in most previous studies, we...
Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic...
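In the VQCPC content encoder, the vector-quantization step is what yields the discrete phoneme-like units. Below is a minimal sketch with a straight-through estimator; the codebook size and dimensions are illustrative, and the CPC objective is omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Replace each encoder frame with its nearest codebook entry."""
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, z):  # z: (B, T, dim) continuous encoder outputs
        dist = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dist.argmin(dim=-1)      # (B, T) discrete unit ids
        q = self.codebook[idx]         # quantized vectors
        q = z + (q - z).detach()       # straight-through: gradients flow to encoder
        return q, idx

quantized, units = VectorQuantizer()(torch.randn(2, 100, 64))
```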
Chronic disease multimorbidity is prevalent among older Chinese people, seriously affecting their well-being and quality of life.
By borrowing emotional expressions from an emotional speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers without emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to trade off between speaker similarity and emotion expression in the synthetic speech. In this letter, we propose to disentangle emotion and timbre through information perturbation to conduct cross-speaker emotion transfer, which effectively learns emotional expressions while maintaining the target speaker's timbre. Specifically, we separately perturb...
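Information perturbation here means corrupting speaker-identity cues in the source audio so that what remains for the model to learn is mostly emotional expression. The operations and ranges below are a rough librosa-based illustration in that spirit, not the letter's exact recipe.

```python
import numpy as np
import librosa

def perturb_speaker_info(wav, sr, rng=np.random.default_rng()):
    """Randomly degrade speaker-identity cues while keeping prosodic patterns."""
    # Random pitch shift (in semitones) removes absolute-pitch speaker cues
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=float(rng.uniform(-4, 4)))
    # Crude formant-like shift: resample, then interpret the result at the
    # original rate (scales the whole spectral envelope), then restore duration
    scale = float(rng.uniform(0.85, 1.15))
    shifted = librosa.resample(wav, orig_sr=sr, target_sr=int(sr * scale))
    return librosa.effects.time_stretch(shifted, rate=len(shifted) / len(wav))
```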