- Advanced Photonic Communication Systems
- Optical Network Technologies
- Advanced Fiber Laser Technologies
- Speech Recognition and Synthesis
- Photonic and Optical Devices
- Natural Language Processing Techniques
- Speech and Audio Processing
- Music and Audio Processing
- Advanced Fiber Optic Sensors
- Topic Modeling
- Full-Duplex Wireless Communications
- Semiconductor Lasers and Optical Devices
- Speech and dialogue systems
- Wireless Signal Modulation Classification
- PAPR reduction in OFDM
- Radar Systems and Signal Processing
- Advanced Optical Network Technologies
- Optical Wireless Communication Technologies
- Advancements in Battery Materials
- Ion-surface interactions and analysis
- Quantum optics and atomic interactions
- Ear and Head Tumors
- Advanced Battery Technologies Research
- Photonic Crystal and Fiber Optics
- Indoor and Outdoor Localization Technologies
Southwest Jiaotong University
2015-2024
Sichuan University
2023-2024
Google (United States)
2018-2022
Wuhan National Laboratory for Optoelectronics
2020-2021
Huazhong University of Science and Technology
2010-2021
Applied Materials (United States)
2020-2021
The University of Melbourne
2012-2013
Wenzhou Medical University
2011
Tongji Hospital
2010
University of Kentucky
2010
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained...
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip...
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron...
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word...
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been...
End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly...
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating...
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed...
An all-fiber approach to generate triangular-shaped pulses based on frequency-to-time conversion is proposed and demonstrated. Two filter modules that have sinusoidal spectral responses are cascaded to create a triangular-shaped optical spectrum. Through the frequency-to-time conversion in a dispersive fiber, periodic triangular pulses with the same shape as the optical spectrum are obtained. The repetition rate and pulse width of the generated signals can be tuned by adjusting the modulation and the dispersion value, respectively.
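As a rough illustration of the frequency-to-time conversion mentioned in the abstract above: in the usual first-order picture, a dispersive element with total dispersion D (in ps/nm) maps a spectral feature of width Δλ (in nm) to a temporal feature of width Δt ≈ |D|·Δλ. The sketch below only evaluates that textbook relation. The function name and all numeric values are illustrative assumptions and are not taken from the paper.

```python
def mapped_width_ps(total_dispersion_ps_per_nm: float,
                    spectral_width_nm: float) -> float:
    """First-order frequency-to-time mapping: delta_t = |D_total| * delta_lambda.

    total_dispersion_ps_per_nm: accumulated dispersion of the fiber (ps/nm).
    spectral_width_nm: width of the spectral feature being mapped (nm).
    Returns the corresponding temporal width in picoseconds.
    """
    return abs(total_dispersion_ps_per_nm) * spectral_width_nm


# Hypothetical numbers: 10 km of fiber at 17 ps/(nm*km) gives 170 ps/nm.
total_dispersion = 17.0 * 10.0
# A triangular spectral feature assumed to be 0.8 nm wide.
width_ps = mapped_width_ps(total_dispersion, 0.8)
print(f"temporal width: {width_ps:.1f} ps")
```

Under this relation, increasing the dispersion value stretches the mapped pulse proportionally, which is consistent with the abstract's statement that the pulse width is tuned through the dispersion.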
We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent...
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition...
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves the robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide...
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further...
A novel millimeter-Wave (mm-Wave) joint radar and communication (JRC) system based on photonic spectrum-spreading phase-coding is proposed. The key to the proposed system is the convergence of microwave multiplexing techniques. Phase-coding enables the generation of a wideband mm-Wave JRC signal, leading to both high range resolution (< 3.5 cm) for radar detection and high capacity (>1 Gb/s) for wireless communication. orthogonalizes signal data, highly reducing mutual interferences. In addition, the proposed system is able to improve the peak sidelobe ratio (PSR)...
Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across...
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping; this model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi speaker evaluations.
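For readers unfamiliar with Soft Dynamic Time Warping, the following is a minimal sketch of the generic Soft-DTW recursion on two scalar sequences, where the hard minimum of classic DTW is replaced by a smooth minimum controlled by a temperature `gamma`. It is not the loss used in the paper: the absolute-difference cost, the function names, and the plain-Python implementation are assumptions chosen for illustration only.

```python
import math


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    m = min(finite)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two sequences of floats.

    Uses |x_i - y_j| as the local cost (an illustrative choice).
    As gamma approaches 0 the value approaches the classic DTW cost.
    """
    n, m = len(x), len(y)
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            r[i][j] = cost + softmin(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]


# Sequences of different lengths can be compared without a fixed alignment.
print(soft_dtw([0.0, 1.0, 2.0], [0.0, 2.0], gamma=0.1))
```

Because the smooth minimum is differentiable, a quantity of this kind can serve as a training loss between sequences of different lengths, which is the property the abstract relies on.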
This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations...
Space-division multiplexed (SDM) transmission based on multi-core (MCF) or multi-mode fiber (MMF) emerges as one of the most promising solutions for overcoming the capacity limit of standard single mode fiber (SSMF). This paper places a focus on MMF, in which a data rate of 100-Gb/s and beyond has been recently demonstrated through coherent-optical OFDM (CO-OFDM) or single-carrier (SC) superchannel transmission on specially-designed few-mode fibers (FMF). However, to unleash the full potential of high capacity SDM transmission requires brand-new research from...
A photonic method used to simultaneously measure the Doppler-frequency-shift (DFS) and angle-of-arrival (AOA) of microwave signals is proposed and experimentally demonstrated. At the remote antenna unit (RAU), a local oscillator (LO) signal and two echo signals are applied to a phase modulator (PM) and a polarization-division-multiplexed Mach-Zehnder modulator (PDM-MZM), respectively. After transmission over a fiber link, the DFS and AOA parameters can be obtained by processing the low-frequency electrical signals at the central office (CO). Experimental...