- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Speech and Dialogue Systems
- Face Recognition and Analysis
- Machine Learning in Healthcare
- Advanced Neural Network Applications
- Artificial Intelligence in Healthcare
- Traditional Chinese Medicine Studies
- Remote Sensing and LiDAR Applications
- Music Technology and Sound Studies
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Health Disparities and Outcomes
- Autonomous Vehicle Technology and Safety
- Gait Recognition and Analysis
- Voice and Speech Disorders
- Domain Adaptation and Few-Shot Learning
- Human Motion and Animation
- Generative Adversarial Networks and Image Synthesis
- Advanced Text Analysis Techniques
Tencent (China)
2021-2025
China United Network Communications Group (China)
2022-2024
Chinese Academy of Medical Sciences & Peking Union Medical College
2023-2024
Beihang University
2023-2024
China Telecom (China)
2023
Sichuan University
2023
West China Hospital of Sichuan University
2023
Google (United States)
2021-2023
Central University of Finance and Economics
2023
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' at...
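A minimal sketch of the bottleneck-fusion idea (module and tensor names are my own, and the token counts are illustrative, not the paper's configuration): each modality's transformer block attends only to its own tokens plus a small shared set of bottleneck tokens, so cross-modal information must pass through that narrow interface.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends to [its own tokens + shared bottleneck tokens]."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, video, audio, bottleneck):
        # Video stream updates its own tokens and a copy of the bottleneck
        v = self.video_block(torch.cat([video, bottleneck], dim=1))
        video, b_v = v[:, :video.size(1)], v[:, video.size(1):]
        # Audio stream does the same with the shared bottleneck
        a = self.audio_block(torch.cat([audio, bottleneck], dim=1))
        audio, b_a = a[:, :audio.size(1)], a[:, audio.size(1):]
        # Average the two bottleneck updates so information flows across modalities
        return video, audio, (b_v + b_a) / 2

# Usage: 2 clips, 196 video patch tokens, 98 audio spectrogram tokens, 4 bottleneck tokens
video, audio = torch.randn(2, 196, 256), torch.randn(2, 98, 256)
bottleneck = torch.randn(2, 4, 256)
video, audio, bottleneck = BottleneckFusionLayer()(video, audio, bottleneck)
```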
In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, these improvements lead to both quality and training stability gains. More...
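The multi-resolution STFT loss compares magnitude spectrograms under several analysis settings and averages the results. Below is a hedged sketch: the spectral-convergence plus log-magnitude decomposition and the resolution triplets are a common formulation, not necessarily this paper's exact one.

```python
import torch

def stft_loss(fake, real, fft_size, hop, win_len):
    """Spectral convergence + log-magnitude L1 at one STFT resolution."""
    window = torch.hann_window(win_len)
    spec_f = torch.stft(fake, fft_size, hop, win_len, window, return_complex=True).abs()
    spec_r = torch.stft(real, fft_size, hop, win_len, window, return_complex=True).abs()
    sc = torch.norm(spec_r - spec_f, p="fro") / torch.norm(spec_r, p="fro")
    mag = torch.nn.functional.l1_loss(torch.log(spec_f + 1e-7), torch.log(spec_r + 1e-7))
    return sc + mag

def multi_resolution_stft_loss(fake, real):
    # Illustrative (fft_size, hop, win_len) triplets covering coarse-to-fine resolutions
    resolutions = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]
    return sum(stft_loss(fake, real, *r) for r in resolutions) / len(resolutions)

# Usage: batched waveforms of shape (batch, samples)
loss = multi_resolution_stft_loss(torch.randn(2, 24000), torch.randn(2, 24000))
```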
Abstract Background The key to modern drug discovery is to find, identify and prepare molecular targets. However, due to the influence of throughput, precision and cost, traditional experimental methods are difficult to apply widely to infer these potential Drug-Target Interactions (DTIs). Therefore, it is urgent to develop effective computational methods to validate the interaction between drugs and targets. Methods We developed a deep learning-based model for DTIs prediction. The proteins' evolutionary features are extracted via Position...
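As a rough illustration only (the abstract is truncated before the architecture details, so every layer, name, and dimension below is an assumption): a deep DTI predictor is often a two-branch network that embeds a drug descriptor and a protein feature vector and classifies the pair.

```python
import torch
import torch.nn as nn

class DTIModel(nn.Module):
    """Hypothetical two-branch DTI classifier: one branch for drug descriptors
    (e.g., a fingerprint), one for protein evolutionary features, fused into
    a single interaction logit."""
    def __init__(self, drug_dim=1024, protein_dim=400, hidden=256):
        super().__init__()
        self.drug_net = nn.Sequential(nn.Linear(drug_dim, hidden), nn.ReLU())
        self.prot_net = nn.Sequential(nn.Linear(protein_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))  # interaction logit

    def forward(self, drug_fp, protein_feat):
        fused = torch.cat([self.drug_net(drug_fp), self.prot_net(protein_feat)], dim=-1)
        return self.head(fused)
```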
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and ignore the multi-scale nature of prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels...
Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough, with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers - one after the reference encoder, one after the decoder output - to enhance...
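A hedged sketch of the two-classifier idea (module names and the loss weighting are placeholders, not the paper's exact design): auxiliary emotion classifiers on the reference embedding and on the decoder output add cross-entropy terms to the reconstruction loss, sharpening the emotion-discriminative structure of the embedding space.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, in_dim, n_emotions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, n_emotions))

    def forward(self, x):
        # Pool over time if the input is a sequence (e.g., predicted mel frames)
        if x.dim() == 3:
            x = x.mean(dim=1)
        return self.net(x)

def total_loss(mel_pred, mel_target, ref_embedding, emotion_id,
               clf_ref, clf_dec, alpha=1.0):
    recon = nn.functional.l1_loss(mel_pred, mel_target)
    ce = nn.functional.cross_entropy
    # Classify emotion both from the reference embedding and the decoder output
    aux = ce(clf_ref(ref_embedding), emotion_id) + ce(clf_dec(mel_pred), emotion_id)
    return recon + alpha * aux
```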
Shan Yang, Yongfei Zhang, Guanglin Niu, Qinghua Zhao, Shiliang Pu. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021.
This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of the synthesized speech. Such coarse labels cannot control the details of emotion, resulting in an averaged expression delivery, and it is also hard to choose a suitable reference during inference. To conduct fine-grained emotion generation, we introduce phoneme-level strength representations through a learned...
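Phoneme-level strength values can come from a learned ranking function; the pairwise hinge formulation below is one standard way to train such a ranker and is an assumption, not the paper's exact method.

```python
import torch
import torch.nn as nn

class StrengthRanker(nn.Module):
    """Projects an acoustic feature vector to a scalar emotion-strength score."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.w = nn.Linear(feat_dim, 1, bias=False)

    def forward(self, feats):
        return self.w(feats).squeeze(-1)  # one strength value per phoneme segment

def ranking_loss(ranker, stronger, weaker, margin=1.0):
    # Hinge loss: score(stronger) should exceed score(weaker) by at least `margin`
    return torch.relu(margin - (ranker(stronger) - ranker(weaker))).mean()
```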
In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem,...
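The MTL combination reduces to adding an adversarial term to the usual MSE acoustic loss. A minimal sketch follows; the non-saturating GAN form and the weight `lambda_adv` are illustrative choices, not the paper's stated ones.

```python
import torch
import torch.nn as nn

def generator_loss(acoustic_pred, acoustic_target, disc, lambda_adv=0.25):
    """MSE acoustic loss plus an adversarial term from discriminator `disc`."""
    mse = nn.functional.mse_loss(acoustic_pred, acoustic_target)
    # Non-saturating GAN term: push the discriminator's output on
    # generated features toward the "real" label (1)
    d_out = disc(acoustic_pred)
    adv = nn.functional.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return mse + lambda_adv * adv
```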
Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional models, and several approaches like global style tokens are proposed to explore the controllability of the model. Although existing methods show good performance in style disentanglement and transfer, they are still unable to control the explicit emotion of the generated speech. In this paper, we mainly focus on subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector...
Abstract Investing in human capital can assist in achieving technological innovations, while the spatial spillover effects of human capital on urban innovation agglomeration are largely ignored. Using panel data of 108 cities in China’s Yangtze River Economic Belt (YREB) during 2011–2020, this paper explores the interactions between human capital and urban innovation with a two-way fixed-effects Spatial Durbin Model framework, which incorporates the interpretation of spillover effects. The results show that the YREB has a heterogeneous structure, which is reflected in its diffusion from...
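For context, a two-way fixed-effects Spatial Durbin Model is conventionally written as below (standard textbook notation; the paper's exact specification may differ), where ρ captures the endogenous spatial lag, W = (w_ij) is the spatial weight matrix, θ the spillover coefficients on neighbors' regressors, and μ_i, λ_t the city and year fixed effects:

```latex
y_{it} = \rho \sum_{j} w_{ij} y_{jt} + x_{it}\beta + \sum_{j} w_{ij} x_{jt}\theta + \mu_i + \lambda_t + \varepsilon_{it}
```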
Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing to generate a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage framework that allows for more flexible manipulation of speech attributes by...
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key of the system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model...
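To make the "implicit duration model" point concrete, here is a toy sketch (shapes and names are mine, not from the paper): once the attention alignment is monotonic, per-input durations fall out of it by counting which encoder step each decoder frame attends to most strongly.

```python
import torch

def durations_from_alignment(attn):
    """attn: (T_dec, T_enc) attention matrix from a trained Tacotron2-style model.
    Returns the number of decoder frames assigned to each encoder position."""
    hard = attn.argmax(dim=-1)                            # winning encoder index per frame
    return torch.bincount(hard, minlength=attn.size(-1))  # frames per input symbol

# Usage with a random (120 decoder frames x 40 encoder steps) alignment
attn = torch.softmax(torch.randn(120, 40), dim=-1)
print(durations_from_alignment(attn))  # rough duration (in frames) of each of the 40 inputs
```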
Data-efficient voice cloning aims at synthesizing a target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of adaptation data to transfer the multi-speaker model to the target speaker through direct model update, while in the latter, a few seconds of audio directly goes through an extra speaker encoder along with the base model to synthesize the target speaker's voice without model update. Nevertheless, both methods need clean data. However, the data provided by the user may inevitably contain acoustic noise...
This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker's voice to a target speaker with a non-native accent. The problem is challenging as each target speaker only has training data in the native accent, and we need to disentangle accent and speaker information in the model and re-combine them at the conversion stage. In our recognition-synthesis framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe...
Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns disentangled representations to model and control different style granularities in synthesized speech. We make evaluations conditioned on individual style tokens from different layers. As the number...
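A sketch of one plausible reading of "hierarchical GST with residuals" (token counts, dimensions, and layer count are illustrative assumptions): each layer holds its own bank of learnable style tokens, attends from the running style vector into that bank, and adds the result back residually before the next layer.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """One GST layer: attend from the style query into a learnable token bank."""
    def __init__(self, dim=256, n_tokens=10, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query):  # query: (B, 1, dim) reference/style embedding
        bank = self.tokens.unsqueeze(0).expand(query.size(0), -1, -1)
        out, _ = self.attn(query, bank, bank)
        return query + out  # residual connection between layers

# Stack several layers to model coarse-to-fine style granularities
layers = nn.ModuleList([StyleTokenLayer() for _ in range(3)])
style = torch.randn(2, 1, 256)
for layer in layers:
    style = layer(style)
```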
The current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the acoustic model predicts a low-resolution intermediate representation such as a Mel-spectrum, while the vocoder generates the waveform from that representation. Although the intermediate representation serves as a bridge, there still exists a critical mismatch between the two: they are commonly learned separately and work on different distributions of the intermediate representation, leading to inevitable artifacts in the synthesized speech. In this work, instead of using a pre-designed intermediate representation as in most previous studies, we...
Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic...
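In the VQCPC content encoder, the vector-quantization step is what yields the discrete phoneme-like units. Below is a minimal sketch with a straight-through estimator; the codebook size and dimensions are illustrative, and the CPC objective is omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Replace each encoder frame with its nearest codebook entry."""
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, z):  # z: (B, T, dim) continuous encoder outputs
        dist = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dist.argmin(dim=-1)      # (B, T) discrete unit ids
        q = self.codebook[idx]         # quantized vectors
        q = z + (q - z).detach()       # straight-through: gradients flow to encoder
        return q, idx

quantized, units = VectorQuantizer()(torch.randn(2, 100, 64))
```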
Chronic disease multimorbidity is prevalent among older Chinese people, seriously affecting their well-being and quality of life.
By borrowing emotional expressions from an emotional speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers without emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to trade off between speaker similarity and emotion expression in the synthetic speech. In this letter, we propose to disentangle emotion and timbre through information perturbation to conduct cross-speaker emotion transfer, which effectively learns emotional expressions while maintaining the target speaker's timbre. Specifically, we separately perturb...
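Information perturbation here means corrupting speaker-identity cues in the source audio so that what remains for the model to learn is mostly emotional expression. The operations and ranges below are a rough librosa-based illustration in that spirit, not the letter's exact recipe.

```python
import numpy as np
import librosa

def perturb_speaker_info(wav, sr, rng=np.random.default_rng()):
    """Randomly degrade speaker-identity cues while keeping prosodic patterns."""
    # Random pitch shift (in semitones) removes absolute-pitch speaker cues
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=float(rng.uniform(-4, 4)))
    # Crude formant-like shift: resample, then interpret the result at the
    # original rate (scales the whole spectral envelope), then restore duration
    scale = float(rng.uniform(0.85, 1.15))
    shifted = librosa.resample(wav, orig_sr=sr, target_sr=int(sr * scale))
    return librosa.effects.time_stretch(shifted, rate=len(shifted) / len(wav))
```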