- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Dialogue Systems
- Combustion and flame dynamics
- Turbomachinery Performance and Optimization
- Phonetics and Phonology Research
- Fluid Dynamics and Turbulent Flows
- Advanced Text Analysis Techniques
Northwestern Polytechnical University
2018-2025
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) method is introduced to generate audio/text segmentation candidates for the YouTube data on its corresponding video subtitles, while an ASR transcription system is used...
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open...
Data2vec is a self-supervised learning (SSL) approach that employs a teacher-student architecture for contextual representation learning via masked prediction, demonstrating remarkable performance in monolingual ASR. Previous studies have revealed that data2vec's shallow layers capture speaker and language information, middle layers encode phoneme and word features, while deep layers are responsible for reconstruction. Language features are crucial for multilingual ASR. However, the target generation relies on multi-layer averaging, inevitably coupling...
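The target-generation step described above can be made concrete with a short sketch. The following is a minimal illustration (not the authors' code; shapes, layer count, and the EMA decay are assumptions) of a data2vec-style teacher producing masked-prediction targets by averaging its top layers, which is exactly the averaging the abstract identifies as coupling information from different depths:

```python
# Minimal sketch of data2vec-style target generation (illustrative only).
import torch

def ema_update(student, teacher, decay=0.999):
    # Teacher-student coupling: the teacher is an exponential moving
    # average of the student's weights.
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

def data2vec_targets(layer_outputs, top_k=8):
    # layer_outputs: list of (batch, time, dim) tensors, one per teacher layer.
    # Averaging several top layers mixes speaker/language/phoneme information
    # from different depths -- the coupling the abstract points out.
    stacked = torch.stack(layer_outputs[-top_k:], dim=0)
    return stacked.mean(dim=0)

# Toy usage with random "layer outputs" from a 12-layer encoder.
layers = [torch.randn(2, 50, 768) for _ in range(12)]
targets = data2vec_targets(layers)        # (2, 50, 768)
mask = torch.rand(2, 50) < 0.5            # masked time steps
student_pred = torch.randn(2, 50, 768)    # student output on the masked input
loss = torch.nn.functional.mse_loss(student_pred[mask], targets[mask])
```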
Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between the two tasks. Fine-grained units capture pronunciation-related characteristics, while coarse-grained units are better suited to learning linguistic information. Moreover, an explicit interaction of the two tasks...
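For context, a minimal sketch of joint ASR-AR multi-task training at two granularities is given below (a generic illustration, not the paper's architecture; the GRU encoder, head sizes, and loss weight are assumptions): a shared encoder feeds a frame-level (fine-grained) CTC head for ASR and a pooled utterance-level (coarse-grained) head for accent recognition.

```python
import torch
import torch.nn as nn

class MultiTaskASRAR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=100, n_accents=8):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)     # fine-grained, per frame
        self.ar_head = nn.Linear(hidden, n_accents)  # coarse-grained, per utterance

    def forward(self, feats):                        # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)                 # (B, T, hidden)
        return self.asr_head(enc), self.ar_head(enc.mean(dim=1))

model = MultiTaskASRAR()
asr_logits, ar_logits = model(torch.randn(4, 120, 80))

# Joint loss: CTC on the fine-grained ASR branch plus cross-entropy on the
# coarse-grained accent branch (the 0.3 weight is an arbitrary assumption).
tokens = torch.randint(1, 100, (4, 20))
ctc = nn.functional.ctc_loss(asr_logits.log_softmax(-1).transpose(0, 1),
                             tokens, torch.full((4,), 120), torch.full((4,), 20))
ar = nn.functional.cross_entropy(ar_logits, torch.randint(0, 8, (4,)))
loss = ctc + 0.3 * ar
```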
General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfits on speakers or channels. Considering that accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of a native utterance as an anchor, estimating the accent shift is difficult. In this paper, we propose linguistic-acoustic similarity based accent shift (LASAS) estimation for AR tasks. For an accent speech utterance, after mapping...
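The shift-estimation idea can be sketched as follows (an illustration of the described mechanism; the dimensions, the use of cosine similarity, and the number of accent spaces are assumptions): the aligned text vector is mapped into several accent-associated anchor spaces, and the similarities between the acoustic embedding and those anchors serve as the accent-shift features.

```python
import torch
import torch.nn as nn

n_spaces, text_dim, acoustic_dim = 4, 128, 128
# One linear map per accent-associated anchor space.
anchors = nn.ModuleList(nn.Linear(text_dim, acoustic_dim) for _ in range(n_spaces))

def accent_shift(text_vec, acoustic_vec):
    # text_vec, acoustic_vec: (batch, time, dim), frame-aligned.
    sims = [nn.functional.cosine_similarity(a(text_vec), acoustic_vec, dim=-1)
            for a in anchors]                  # each: (batch, time)
    return torch.stack(sims, dim=-1)           # (batch, time, n_spaces)

text_vec = torch.randn(2, 60, text_dim)        # e.g. aligned text embeddings
acoustic_vec = torch.randn(2, 60, acoustic_dim)
shift = accent_shift(text_vec, acoustic_vec)   # similarity-based shift features
```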
This paper describes the system developed by the NPU team for the 2020 personalized voice trigger challenge. Our submitted system consists of two independently trained subsystems: a small footprint keyword spotting (KWS) system and a speaker verification (SV) system. For the KWS system, a multi-scale dilated temporal convolutional (MDTC) network is proposed to detect the wake-up word (WuW). The KWS system predicts the posterior probability of whether an audio utterance contains the WuW and estimates its location at the same time....
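A minimal sketch of a dilated temporal convolution KWS model in the spirit of MDTC is shown below (channel counts, dilation schedule, and the residual layout are assumptions, not the paper's exact network); it outputs a per-frame wake-word posterior, so the same forward pass both detects the WuW and indicates its location:

```python
import torch
import torch.nn as nn

class DilatedTCNBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        # Padding matched to the dilation keeps the output length unchanged.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x)) + x      # residual connection

class TinyKWS(nn.Module):
    def __init__(self, feat_dim=40, channels=64):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        # Stacked blocks with growing dilation give a multi-scale receptive field.
        self.blocks = nn.Sequential(*[DilatedTCNBlock(channels, d)
                                      for d in (1, 2, 4, 8)])
        self.head = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, feats):                   # feats: (B, T, feat_dim)
        x = self.blocks(self.proj(feats.transpose(1, 2)))
        # Per-frame WuW posterior; its peak over time locates the keyword.
        return torch.sigmoid(self.head(x)).squeeze(1)   # (B, T)

posteriors = TinyKWS()(torch.randn(2, 100, 40))
```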
The Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task. Compared with other tasks, it has the following three characteristics: 1) The challenge focuses on the problem of customized keyword spotting, where the target device can only be awakened by an enrolled speaker with his specified keyword. The speaker can use any language and accent to define the keyword. 2) All data is recorded in a realistic environment to simulate different user scenarios. 3) As a "code competition",...
Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and...
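The GER setup can be illustrated with a small prompt-construction sketch (the prompt wording and the `query_llm` placeholder are hypothetical; a real system would plug in an actual LLM backend):

```python
def build_ger_prompt(nbest):
    # Pack the N-best ASR hypotheses into one correction prompt.
    lines = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return ("Below are N-best hypotheses from a speech recognizer.\n"
            "Combine their evidence and output only the most likely "
            "true transcription.\n" + lines + "\nTranscription:")

def query_llm(prompt: str) -> str:
    # Placeholder: a real system would call a local or hosted LLM here.
    raise NotImplementedError

nbest = ["i sat the factory by the river",
         "i saw the factory by the river",
         "i saw a factory by the river"]
prompt = build_ger_prompt(nbest)
# corrected = query_llm(prompt)   # e.g. "i saw the factory by the river"
```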
Integrating audio encoders with LLMs through connectors has enabled these models to process and comprehend audio modalities, significantly enhancing speech-to-text tasks, including automatic speech recognition (ASR) and automatic speech translation (AST). However, these methods often overlook the critical aspect of language adaptation in multilingual settings, relying instead on multilingual data without adequately addressing language differences. To address this gap, we propose the Ideal-LLM model, which employs dual encoders to enrich language feature information...
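One plausible reading of the dual-encoder design is sketched below (the sigmoid language gate and all dimensions are assumptions for illustration, not necessarily Ideal-LLM's fusion scheme): two audio encoder outputs are mixed with a language-dependent weight before a connector projects them into the LLM's embedding space.

```python
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    def __init__(self, enc_dim=512, llm_dim=1024, n_langs=10):
        super().__init__()
        self.lang_gate = nn.Embedding(n_langs, 1)     # per-language mixing weight
        self.connector = nn.Linear(enc_dim, llm_dim)  # projects into LLM space

    def forward(self, enc_a, enc_b, lang_id):
        # enc_a, enc_b: (B, T, enc_dim) outputs of two pre-trained audio encoders.
        g = torch.sigmoid(self.lang_gate(lang_id)).unsqueeze(1)  # (B, 1, 1)
        fused = g * enc_a + (1 - g) * enc_b
        return self.connector(fused)                  # (B, T, llm_dim) for the LLM

mix = DualEncoderConnector()
out = mix(torch.randn(2, 80, 512), torch.randn(2, 80, 512),
          torch.tensor([0, 3]))
```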
The present paper performs a numerical study on a highly loaded, high-turning compressor cascade, where the unsteady boundary layer transition behavior of the cascade blade undergoing negative jet flow is revealed. The two-equation SST turbulence model coupled with the Langtry-Menter transition model is verified and applied to all computations in this study. Reynolds number and turbulence intensity are selected as the two dominant factors that can significantly influence the transition, and their effects were examined. Results show that under the tested case (i.e.,...
Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) has demonstrated its effectiveness in multilingual ASR, it is worth noting that the representations of the various layers of SSL models potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune multilingual ASR. We first analyze the different layers of the model...
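A common way to exploit layer-wise SSL representations, consistent with the motivation above, is a learnable softmax weighting over layers; the sketch below illustrates that scheme (the SSHR paper's exact mechanism may differ, and the layer count and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class LayerWeightedFeatures(nn.Module):
    def __init__(self, n_layers=24):
        super().__init__()
        # One learnable logit per SSL encoder layer.
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (B, T, D) hidden states, one per SSL layer.
        w = torch.softmax(self.logits, dim=0)          # (n_layers,)
        stacked = torch.stack(layer_outputs, dim=0)    # (n_layers, B, T, D)
        return (w.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)

combiner = LayerWeightedFeatures(n_layers=24)
hidden = [torch.randn(2, 50, 1024) for _ in range(24)]
features = combiner(hidden)   # fed to the downstream ASR head
```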
UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this paper, we propose TranUSR, a two-stage model comprising a pre-trained UniData2vec and a phoneme-to-word Transcoder. Different from UniSpeech,...
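A phoneme-to-word transcoding stage can be sketched as a small sequence-to-sequence model (the `nn.Transformer` backbone and all sizes here are assumptions for illustration, not TranUSR's actual Transcoder):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, n_phones=100, n_words=5000, d_model=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.word_emb = nn.Embedding(n_words, d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                                      num_encoder_layers=2,
                                      num_decoder_layers=2,
                                      batch_first=True)
        self.out = nn.Linear(d_model, n_words)

    def forward(self, phones, words_in):
        # phones: (B, T_phones) phoneme IDs; words_in: (B, T_words) word IDs.
        h = self.seq2seq(self.phone_emb(phones), self.word_emb(words_in))
        return self.out(h)   # (B, T_words, n_words) logits

logits = Transcoder()(torch.randint(0, 100, (2, 40)),
                      torch.randint(0, 5000, (2, 12)))
```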