- Speech Recognition and Synthesis
- Stability and Control of Uncertain Systems
- Speech and Audio Processing
- Natural Language Processing Techniques
- Control Systems and Identification
- Fault Detection and Control Systems
- Topic Modeling
- Cooperative Communication and Network Coding
- Music and Audio Processing
- Advanced MIMO Systems Optimization
- Advanced Control Systems Optimization
- Neural Networks Stability and Synchronization
- Vibration Control and Rheological Fluids
- Wireless Communication Security Techniques
- Speech and dialogue systems
- Cognitive Radio Networks and Spectrum Sensing
- Advanced Wireless Communication Techniques
- Structural Engineering and Vibration Analysis
- Energy Harvesting in Wireless Networks
- Vehicle Dynamics and Control Systems
- PAPR reduction in OFDM
- Domain Adaptation and Few-Shot Learning
- Advanced Graph Theory Research
- Plant and Fungal Interactions Research
- Error Correcting Code Techniques
Bank of China
2024
Microsoft Research Asia (China)
2022-2024
National Supercomputing Center in Wuxi
2023
Xidian University
2022-2023
Microsoft (United States)
2020-2023
Beijing Forestry University
2022-2023
State Administration of Traditional Chinese Medicine of the People's Republic of China
2023
Yangzhou University
2023
Tsinghua University
2022
Jiangnan University
2013-2022
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With help...
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize...
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define/judge that quality, and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for...
Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead. When trained with Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well trained hybrid model with both better recognition...
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual...
Non-Hermiticity has widespread applications in quantum physics. It brings about distinct topological phases without Hermitian counterparts, and gives rise to the fundamental challenge of phase classification. Here, we report an experimental demonstration of unsupervised learning of non-Hermitian topological phases with the nitrogen-vacancy center platform. In particular, we implement the non-Hermitian twister model, which hosts peculiar knotted topological phases, with a solid-state quantum simulator consisting of an electron spin and a nearby ¹³C nuclear spin in diamond. By...
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define/judge that quality, and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end...
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen...
In a MIMO cognitive radio network, multiple secondary users sense the spatial channels and share the spectrum use with incumbent primary users. Each secondary transmitter competes with the others to increase its own information rate while generating limited total interference to the primary receivers. In order to maximize the sum-rate of the secondary users, the problem of secondary user transmission is modeled as a cooperative game. The strategy of each secondary user is its transmit covariance matrix, and the utility is an approximation of its information rate. The secondary users negotiate over the allocation of the interference budget to reach a bargaining solution that...
This paper addresses the problem of event‐triggered stabilization for positive systems subject to input saturation, where the state variables are in the nonnegative orthant. An event‐triggered linear state feedback law is constructed. By expressing the saturated linear state feedback law on a convex hull of a group of auxiliary linear feedback laws, we establish conditions under which the closed‐loop system is asymptotically stable with a given set contained in the domain of attraction. On the basis of these conditions, the problem of designing the feedback gain and the event‐triggering strategy for attaining the largest...
Recently, neural network based speech synthesis has achieved outstanding results, by which the synthesized audios are of excellent quality and naturalness. However, current neural TTS models suffer from the robustness issue, which results in abnormal audios (bad cases), especially for unusual text (unseen context). To build a neural model which can synthesize both natural and stable audios, in this paper, we make a deep analysis of why the previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Comparing to...
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a...
Contextual biasing is an important and challenging task for end-to-end automatic speech recognition (ASR) systems, which aims to achieve better recognition performance by biasing the ASR system to particular context phrases such as person names, music list, proper nouns, etc. Existing methods mainly include contextual LM biasing and adding a bias encoder into end-to-end ASR models. In this work, we introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system. We incorporate contextual information into a sequence-to-sequence spelling correction model with a shared context encoder. The...
Most passive vibration isolation systems are composed of springs and dampers. Although it is possible to improve the performance by active control, the complexity, power requirements and cost of such a system have restricted its use. A system with variable damping is practical and has good performance in the high frequency region, but was found not to improve the responses in the low frequency region. On the basis of the on-off damping control method, a variable stiffness control method and a method combining variable damping and stiffness control were proposed. Comparison among the proposed methods and the conventional methods showed that had best properties whole new...
A vibration isolation system with variable damping and stiffness control is practical and has good performance. However, conventional devices for variable damping and stiffness are usually complicated. A magnetorheological (MR) fluid damper only needs a small electric current to provide the magnetic field. It is easy to implement an MR damper in vibration isolation systems. In this paper, two MR dampers in series were used for a variable damping and stiffness system. The passive, variable damping, variable stiffness, and variable damping and stiffness control systems were investigated by experiment and theoretical calculation. The time and frequency responses to sinusoidal, sweep...