- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Speech and Dialogue Systems
- Advanced Data Compression Techniques
- Anomaly Detection Techniques and Applications
- Advanced Adaptive Filtering Techniques
- Blind Source Separation Techniques
- Topic Modeling
- Smart Grid and Power Systems
- Animal Vocal Communication and Behavior
- Infant Health and Development
- Direction-of-Arrival Estimation Techniques
- Emotion and Mood Recognition
- Indoor and Outdoor Localization Technologies
- Particle Accelerators and Beam Dynamics
- Magnetic Confinement Fusion Research
- Energy Load and Power Forecasting
- Fault Detection and Control Systems
- Sentiment Analysis and Opinion Mining
- Network Security and Intrusion Detection
- Artificial Intelligence in Healthcare
- Human Pose and Action Recognition
Shanghai Normal University
2016-2025
Xi'an Technological University
2024
Northwest Institute of Nuclear Technology
2024
Shandong University of Science and Technology
2024
University of Cambridge
2012-2013
University of Science and Technology of China
2008-2011
Microsoft Research Asia (China)
2011
Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech therefore has to deal with two or more languages at the same time. In this study, we propose a Transformer-based architecture with symmetric language-specific encoders to capture the individual language attributes, which improve the acoustic representation of each language. These representations are combined using a multi-head attention mechanism in the decoder module. Each encoder and its...
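The bi-encoder idea in this abstract can be sketched in a few lines (a minimal numpy illustration under my own assumptions, not the paper's implementation; the names `H_zh`, `H_en`, and `cross_attention` are hypothetical): two language-specific encoders each emit frame-level representations, and a decoder query attends over their concatenation with scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    # scaled dot-product attention: query (d,), keys/values (T, d)
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)            # attention over all T frames
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
H_zh = rng.normal(size=(12, d))   # Mandarin encoder output (12 frames)
H_en = rng.normal(size=(9, d))    # English encoder output (9 frames)
H = np.concatenate([H_zh, H_en], axis=0)  # joint memory for the decoder
q = rng.normal(size=(d,))         # one decoder query state
context, w = cross_attention(q, H, H)
```

The decoder thus weighs both languages' representations per output step; in the actual model this would be a learned multi-head module rather than a single head.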
With the rapid development of intelligent speech technologies, automatic speaker verification (ASV) has become one of the most natural and convenient biometric recognition approaches. However, state-of-the-art ASV systems are vulnerable to spoofing attack techniques such as speech synthesis, voice conversion, and replayed speech. Due to the symmetric distribution characteristic between genuine (true) and spoofed (fake) pairs, spoofing detection is challenging. Many recent research works have been focusing on anti-spoofing solutions...
Personalized speech enhancement (PSE) methods typically rely on pre-trained speaker verification models or self-designed speaker encoders to extract target speaker clues, guiding the PSE model in isolating the desired speech. However, these approaches suffer from significant complexity and often underutilize enrollment information, limiting the potential performance of the model. To address these limitations, we propose a novel Speaker Encoder-Free network, termed SEF-PNet, which fully exploits the information present in both the noisy...
This article, empowered by ChatGPT and drawing on relevant historical literature, explores how the translator Sun Yat-sen flexibly employed strategies such as domestication and foreignization, as well as methods of omission, addition, and modification, in his translation of Ambulance Lectures: First Aid to the Injured. Additionally, the research highlights the use of ChatGPT as a tool to assist translation study. While ChatGPT is able to quickly provide comprehensive background knowledge and proper translations, improvements are still needed in terms of image accuracy...
We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative reductions...
This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses automatically derived decoding hypotheses obtained with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of the standard hypotheses can be poor. To address this issue, word- and segment-level combination approaches are used between transcripts, which yield improved transcriptions. Experimental results show systems...
This study investigated large-scale semi-supervised training (SST) to improve acoustic models for automatic speech recognition. Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. SST was studied in deep neural network modeling with respect to transcription quality, the importance of data filtering, data quantity, and other attributes of a large multi-genre unsupervised live data set. We found that the behavior of SST on ASR tasks is very...
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and to wide-domain-coverage tasks. In this paper, from the time-frequency perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve separation robustness under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a new, specially designed speaker encoder. Moreover, we also investigate...
The end-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, studies on the multi-channel case are still relatively limited. In this work, we propose two methods for exploiting spatial information to extract the target speech. The first one uses a target speaker adaptation layer in a parallel encoder architecture. The second one designs a channel decorrelation mechanism with inter-channel differential information to enhance the representation. We compare the proposed methods with strong state-of-the-art baselines...
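The channel-decorrelation idea can be illustrated with a simple frame-wise cosine measure between two channel encodings (a sketch under my own assumptions; the paper's exact mechanism may differ, and `channel_decorrelation` is a hypothetical name):

```python
import numpy as np

def channel_decorrelation(e1, e2, eps=1e-8):
    """Frame-wise cosine decorrelation between two channel
    representations: 1 - cos(e1_t, e2_t) for each frame t."""
    num = (e1 * e2).sum(axis=-1)
    den = np.linalg.norm(e1, axis=-1) * np.linalg.norm(e2, axis=-1) + eps
    return 1.0 - num / den

rng = np.random.default_rng(1)
ch1 = rng.normal(size=(5, 16))        # channel-1 encoder output (5 frames)
ch2_same = ch1.copy()                 # identical channel: no decorrelation
ch2_rand = rng.normal(size=(5, 16))   # unrelated channel

d_same = channel_decorrelation(ch1, ch2_same)
d_rand = channel_decorrelation(ch1, ch2_rand)
```

Identical channels score near zero, while differing channels score higher, so such a term can push parallel encoders toward capturing complementary inter-channel (spatial) information.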
In this paper, we propose a new continual learning framework for few-shot bioacoustic event detection (BED). First, we modify the recently proposed dynamic few-shot learning (DFSL) approach and generalize it to the BED task. Then, we introduce a weight alignment loss to enhance the weight generator of the modified DFSL for detecting novel events. Furthermore, to augment the few positive samples of each target event, an enhancement approach is proposed to select high-confidence pseudo positives using the cumulative distribution of the initial posterior probabilities. All experiments are...
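The pseudo-positive selection step can be sketched as a simple quantile threshold on the posterior distribution (a minimal illustration under my own assumptions, not the authors' exact criterion; `select_pseudo_positives` is a hypothetical name):

```python
import numpy as np

def select_pseudo_positives(posteriors, quantile=0.9):
    """Keep unlabeled segments whose event posterior falls in the top
    (1 - quantile) tail of the empirical (cumulative) distribution."""
    threshold = np.quantile(posteriors, quantile)
    return np.where(posteriors >= threshold)[0], threshold

# toy posteriors for 8 unlabeled segments of one target event
post = np.array([0.05, 0.91, 0.40, 0.97, 0.12, 0.88, 0.99, 0.30])
idx, thr = select_pseudo_positives(post, quantile=0.75)
```

Only the most confident segments are promoted to pseudo positives, which limits the label noise fed back into continual training.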
Multi-accent speech recognition is a key challenge in current ASR research due to the pronunciation variations of different accents. In this study, we propose a Cross-modal Parallel Training (CPT) approach for improving the accent robustness of a state-of-the-art Conformer-Transducer (Conformer-T) ASR system. Specifically, in CPT, a novel cross-modal attention and fusion module is first designed as a frontend to align low-level acoustic representations with phonetic embeddings, thus normalizing them into a shared standard latent...
The state-of-the-art acoustic modeling for Keyword Spotting (KWS) systems is mainly based on the hybrid model of Hidden Markov Model (HMM) and Neural Network (NN). However, it is challenging to efficiently train such a system, given its dependence on an intermediate phonetic representation. Motivated by end-to-end speech recognition systems, we propose a Mandarin KWS system using an end-to-end method, which directly predicts the posteriors of the modeling units, built with a Connectionist Temporal Classification (CTC) objective on a Recurrent Neural Network (RNN). The main difference between...
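The downstream use of per-frame CTC posteriors for spotting can be sketched with best-path decoding (a toy numpy illustration, not the paper's system; the blank index, unit inventory, and function names are assumptions):

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank unit

def greedy_ctc_decode(frame_posteriors):
    """Best-path CTC decoding: argmax each frame, collapse repeated
    units, then drop blanks, yielding the predicted unit sequence."""
    path = frame_posteriors.argmax(axis=-1)
    units, prev = [], BLANK
    for p in path:
        if p != BLANK and p != prev:
            units.append(int(p))
        prev = p
    return units

def keyword_detected(frame_posteriors, keyword_units):
    """Fire if the decoded unit sequence contains the keyword's units."""
    dec = greedy_ctc_decode(frame_posteriors)
    n = len(keyword_units)
    return any(dec[i:i + n] == keyword_units for i in range(len(dec) - n + 1))

# toy example: 3 units (0 = blank; 1 and 2 stand for keyword syllables)
path = [0, 1, 1, 0, 2, 2, 0]
posteriors = np.eye(3)[path]   # (T, V) one-hot frame posteriors
```

Skipping the HMM/phonetic stage like this is what makes CTC-based KWS attractive: the network outputs map directly to the keyword's modeling units.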
Target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit discriminative speaker clues. We propose a special attention mechanism, without introducing any additional parameters, in a scaling adaptation layer to better adapt the network towards extracting the target speech. Furthermore, with a mixture embedding matrix pooling method, our proposed attention-based scaling adaptation (ASA) can exploit the speaker clues in a more efficient way...
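A parameter-free scaling adaptation of the kind this abstract describes can be sketched as dot-product attention between mixture frames and the enrolled speaker embedding (a sketch under my own assumptions, not the ASA layer itself; `attention_scaling` is a hypothetical name):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_scaling(mixture_feats, speaker_emb):
    """Parameter-free scaling adaptation: dot-product attention between
    each mixture frame and the speaker embedding yields one scale per
    frame, emphasising frames dominated by the target speaker."""
    scores = mixture_feats @ speaker_emb / np.sqrt(speaker_emb.shape[-1])
    scales = softmax(scores)                 # (T,), non-negative, sums to 1
    return mixture_feats * scales[:, None]   # rescaled mixture features

rng = np.random.default_rng(2)
mix = rng.normal(size=(10, 32))   # mixture encoder output (10 frames)
spk = rng.normal(size=(32,))      # target speaker embedding
adapted = attention_scaling(mix, spk)
```

Because the scales come from a similarity computation rather than learned weights, the adaptation adds no parameters, matching the abstract's stated constraint.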