- Speech Recognition and Synthesis
- Music and Audio Processing
- Natural Language Processing Techniques
- Speech and Audio Processing
- Topic Modeling
- Ferroelectric and Piezoelectric Materials
- Microwave Dielectric Ceramics Synthesis
- Advanced Sensor and Energy Harvesting Materials
- Advanced Image and Video Retrieval Techniques
- Retinal Imaging and Analysis
- Multimodal Machine Learning Applications
- Machine Learning in Healthcare
- Multiferroics and related materials
- Acoustic Wave Resonator Technologies
- Thermal Expansion and Ionic Conductivity
- Dielectric properties of ceramics
- Advanced Fiber Optic Sensors
- Surface Modification and Superhydrophobicity
- Electronic and Structural Properties of Oxides
- Complex Network Analysis Techniques
- Spectroscopy and Chemometric Analyses
- Glaucoma and retinal disorders
- Water Quality Monitoring and Analysis
- Advanced Text Analysis Techniques
- Web Data Mining and Analysis
Shenzhen Institutes of Advanced Technology
2021-2024
Chinese Academy of Sciences
2013-2024
Shandong Institute of Automation
2017-2024
Dalian Polytechnic University
2024
Shanghai University
2024
Shaanxi University of Science and Technology
2021-2022
Institute of Automation
2018-2021
University of Chinese Academy of Sciences
2017-2018
Abstract In this study, a high-entropy perovskite oxide Sr(Zr0.2Sn0.2Hf0.2Ti0.2Nb0.2)O3 (SZSHTN) was first introduced into Na0.5Bi0.5TiO3 (NBT) lead-free ferroelectric ceramics to boost both the high-temperature dielectric stability and the energy storage performance. Excellent comprehensive performance is simultaneously obtained in the 0.8NBT–0.2SZSHTN ceramic, with a high ε′ value (>2000), a wide ε′-temperature stable range (TCC < 5%, 52.4–362°C), a low tan δ (<0.01, 90–341°C), and energy storage performance (Wrec = 3.52 J/cm³, η varies...
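For context, the two energy-storage figures quoted above (Wrec and η) follow the standard definitions evaluated from a polarization-electric field (P–E) loop; a minimal statement of those definitions:

```latex
% Standard definitions of recoverable energy density and efficiency
% from a unipolar P--E loop (P_r: remnant, P_max: maximum polarization)
W_{\mathrm{rec}} = \int_{P_r}^{P_{\max}} E \,\mathrm{d}P, \qquad
\eta = \frac{W_{\mathrm{rec}}}{W_{\mathrm{rec}} + W_{\mathrm{loss}}} \times 100\%
```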
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low resource cases. In this work, we attempt to extend the self-supervised framework to speaker verification and language identification. First, we use some preliminary experiments to indicate that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate its effectiveness on the two tasks respectively. For speaker verification,...
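As a rough illustration of the probing setup described above (not the authors' exact configuration), one can mean-pool wav2vec 2.0 frame representations into an utterance embedding and train a light classifier on top for speaker or language labels; the Hugging Face checkpoint name below is an assumption:

```python
# Minimal sketch: probe wav2vec 2.0 representations for speaker/language info.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(16000)  # 1 s of 16 kHz audio; replace with real speech

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# Mean-pool frame features into one utterance embedding; a light classifier
# on top of this embedding is enough to probe speaker/language information.
utterance_embedding = hidden.mean(dim=1)  # (1, 768)
print(utterance_embedding.shape)
```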
Abstract In pulse power systems, multilayer ceramic capacitors (MLCCs) encounter significant challenges due to the heightened loading electric field (E), which can lead to fatigue damage and ultrasonic concussion caused by electrostrictive strain. To address these issues, an innovative strategy focused on achieving an ultra-weak polarization-strain coupling effect is proposed, which effectively reduces the strain in MLCCs. Remarkably, an ultra-low electrostrictive coefficient (Q33) of 0.012 m⁴ C⁻² is achieved in the composition...
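The quoted Q33 is the longitudinal electrostrictive coefficient in the standard quadratic coupling between strain and polarization, which is why an ultra-low Q33 corresponds to ultra-weak polarization-strain coupling:

```latex
% Quadratic electrostrictive coupling between strain S_3 and polarization P_3;
% Q_{33} therefore carries units of m^4 C^{-2}, matching the value above
S_3 = Q_{33} P_3^{2}
```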
Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, integrating the acoustic, pronunciation and language models into a single neural network. Among these models, the Transformer, a new sequence-to-sequence model relying entirely on self-attention without using RNNs or convolutions, achieves single-model state-of-the-art BLEU on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and concentrate on it as the basic...
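A minimal sketch of the kind of Transformer-based ASR model described above, built from stock PyTorch modules; all hyper-parameters are illustrative assumptions rather than the paper's configuration:

```python
# Toy Speech-Transformer-style model: filter-bank frames in, token logits out.
import torch
import torch.nn as nn

class TinySpeechTransformer(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=4000):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)   # acoustic front-end
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=3,
            batch_first=True,
        )
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats: (batch, frames, n_mels); tokens: (batch, text_len)
        src = self.input_proj(feats)
        tgt = self.token_emb(tokens)
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.output(out)  # (batch, text_len, vocab_size)

model = TinySpeechTransformer()
logits = model(torch.randn(2, 200, 80), torch.randint(0, 4000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 4000])
```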
Sequence-to-sequence attention-based models integrate an acoustic, pronunciation and language model into a single neural network, which makes them very suitable for multilingual automatic speech recognition (ASR). In this paper, we are concerned with multilingual ASR on low-resource languages by the Transformer, one of the sequence-to-sequence models. Sub-words are employed as the modeling unit without using any pronunciation lexicon. First, we show that the ASR Transformer performs well on low-resource languages despite some language confusion. We then look at incorporating...
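A small sketch of lexicon-free sub-word modeling units, using SentencePiece as one common choice (the file names and vocabulary size are assumptions):

```python
# Train and apply a sub-word unit model; assumes a transcripts.txt file
# with one utterance per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcripts.txt", model_prefix="subword",
    vocab_size=1000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="subword.model")
pieces = sp.encode("multilingual speech recognition", out_type=str)
print(pieces)  # e.g. ['▁multi', 'lingual', '▁speech', ...]
```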
There are several domains that own corresponding widely used feature extractors, such as ResNet, BERT, and GPT-x. These models are usually pre-trained on large amounts of unlabeled data by self-supervision and can be effectively applied to downstream tasks. In the speech domain, wav2vec2.0 starts to show its powerful representation ability and its feasibility for ultra-low resource speech recognition on the Librispeech corpus, which belongs to the audiobook domain. However, it has not been examined in real spoken scenarios and languages other...
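A hedged sketch of the CTC fine-tuning recipe implied above for a low-resource language, using the Hugging Face wav2vec2.0 interface; the checkpoint and target vocabulary size are assumptions:

```python
# One CTC fine-tuning step on top of a pre-trained wav2vec2.0 encoder.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base", vocab_size=50, ctc_loss_reduction="mean"
)
model.freeze_feature_encoder()  # keep the convolutional front-end fixed

waveform = torch.randn(1, 16000)        # stand-in for real 16 kHz speech
labels = torch.randint(1, 50, (1, 12))  # stand-in target token ids

loss = model(input_values=waveform, labels=labels).loss
loss.backward()  # an optimizer step would follow in real training
print(float(loss))
```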
End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown its amazing ASR performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during...
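A minimal sketch of the fusion idea: a pre-trained acoustic encoder and a pre-trained linguistic encoder joined by a small trainable adapter, so that only the speech-to-language transfer is learned. The adapter design is an illustrative assumption, not the paper's exact module:

```python
# Fuse wav2vec2.0 (speech) and BERT (language) with a learned adapter.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusedASR(nn.Module):
    def __init__(self, vocab_size=30522):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Only this transfer module is learned from scratch.
        self.adapter = nn.Linear(
            self.acoustic.config.hidden_size, self.linguistic.config.hidden_size
        )
        self.head = nn.Linear(self.linguistic.config.hidden_size, vocab_size)

    def forward(self, waveform):
        speech = self.acoustic(waveform).last_hidden_state
        # Feed adapted speech representations into BERT's encoder stack.
        text_space = self.adapter(speech)
        out = self.linguistic(inputs_embeds=text_space).last_hidden_state
        return self.head(out)

model = FusedASR()
logits = model(torch.randn(1, 16000))
print(logits.shape)
```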
Nowadays, most methods for end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may suffer from confusion between similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating the confusion problems with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase...
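An illustrative sketch of attention-based relevance between a decoder state and encoded context phrases, with a top-k phrase selection step narrowing the candidates before any finer-grained modeling (the dimensions and k are assumptions):

```python
# Score context phrases against the current decoder state, then keep top-k.
import torch
import torch.nn.functional as F

d = 256
decoder_state = torch.randn(1, d)       # current token-level query
phrase_embeddings = torch.randn(10, d)  # encoded context phrases

scores = decoder_state @ phrase_embeddings.T / d ** 0.5  # (1, 10)
probs = F.softmax(scores, dim=-1)

# Phrase selection: restrict later token-level attention to the most
# relevant phrases, reducing confusion between similar phrases.
topk = probs.topk(k=3, dim=-1)
print(topk.indices)
```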
Recently, end-to-end (E2E) models have become a competitive alternative to conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch between training and testing conditions. In this paper, we use the Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each...
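A sketch of what a speaker attention module over a static i-vector knowledge block might look like; dimensions and block size are illustrative assumptions, not the paper's values:

```python
# Attend over a static block of i-vectors to form a soft speaker embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerAttentionModule(nn.Module):
    def __init__(self, d_model=256, ivector_dim=100, n_speakers=500):
        super().__init__()
        # Static knowledge block (SKB): one i-vector per training speaker,
        # kept fixed during training.
        self.skb = nn.Parameter(
            torch.randn(n_speakers, ivector_dim), requires_grad=False
        )
        self.query = nn.Linear(d_model, ivector_dim)

    def forward(self, encoder_state):
        q = self.query(encoder_state)             # (batch, ivector_dim)
        attn = F.softmax(q @ self.skb.T, dim=-1)  # weights over speakers
        return attn @ self.skb                    # soft speaker vector

sam = SpeakerAttentionModule()
speaker_vec = sam(torch.randn(2, 256))
print(speaker_vec.shape)  # torch.Size([2, 100])
```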
End-to-end models have been showing superiority in Automatic Speech Recognition (ASR). At the same time, the capacity for streaming recognition has become a growing requirement for end-to-end models. Following these trends, an encoder-decoder recurrent neural network called the Recurrent Neural Aligner (RNA) has been freshly proposed and has shown its competitiveness on two English ASR tasks. However, it is not clear whether RNA can be further improved and applied to other spoken languages. In this work, we explore...
This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts, respectively. Our pre-training method is divided into two stages, named acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training integrates rich acoustic and linguistic knowledge into the seq2seq...
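A toy sketch of the acoustic pre-training stage as described: mask a chunk of speech features and train an encoder to reconstruct it from context. The chunking, encoder, and loss choices here are assumptions for illustration:

```python
# Masked-chunk reconstruction as a simple acoustic pre-training objective.
import torch
import torch.nn as nn

n_mels, chunk = 80, 10
feats = torch.randn(4, 200, n_mels)  # a batch of feature sequences

masked = feats.clone()
start = torch.randint(0, 200 - chunk, (1,)).item()
masked[:, start:start + chunk] = 0.0  # zero out one chunk per sequence

encoder = nn.LSTM(n_mels, 256, num_layers=2, batch_first=True)
proj = nn.Linear(256, n_mels)

hidden, _ = encoder(masked)
pred = proj(hidden)

# Reconstruction loss only on the masked region, so the prediction must
# rely on the surrounding context.
loss = nn.functional.l1_loss(
    pred[:, start:start + chunk], feats[:, start:start + chunk]
)
loss.backward()
print(float(loss))
```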
End-to-end models are gaining wider attention in the field of automatic speech recognition (ASR). One of their advantages is the simplicity of building a system that directly recognizes the speech frame sequence into the text label sequence with neural networks. According to the driving end in the recognition process, end-to-end ASR models could be categorized into two types: label-synchronous and frame-synchronous, each of which has its unique model behaviour and characteristics. In this work, we make a detailed comparison between a representative label-synchronous model (transformer) and a soft frame-synchronous...
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the modalities, and two cross-modal decoders to generate text and image respectively. For OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from different data...
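A compact sketch of the encoder side of such a model: three single-modal encoders produce token embeddings, and a cross-modal encoder fuses the concatenated sequence. All components are stand-ins with assumed sizes, not the OPT architecture itself:

```python
# Three single-modal token encoders feeding one cross-modal encoder.
import torch
import torch.nn as nn

d = 256
vision_enc = nn.Linear(2048, d)    # stand-in for a visual token encoder
text_enc = nn.Embedding(30000, d)  # stand-in for a text token encoder
audio_enc = nn.Linear(80, d)       # stand-in for an audio token encoder
cross = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2
)

v = vision_enc(torch.randn(1, 36, 2048))        # image region features
t = text_enc(torch.randint(0, 30000, (1, 16)))  # text token ids
a = audio_enc(torch.randn(1, 100, 80))          # audio filter-bank frames

fused = cross(torch.cat([v, t, a], dim=1))      # joint multi-modal sequence
print(fused.shape)  # torch.Size([1, 152, 256])
```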
The shared-hidden-layer multilingual deep neural network (SHL-MDNN), in which the hidden layers of a feed-forward deep neural network (DNN) are shared across multiple languages while the softmax layers are language dependent, has been shown to be effective for acoustic modeling in low-resource speech recognition. In this paper, we propose that the SHL-MDNN structure with Long Short-Term Memory (LSTM) recurrent neural networks can achieve further performance improvement, considering that LSTM has outperformed DNN as the acoustic model for automatic speech recognition (ASR). Moreover, we reveal...
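A minimal sketch of the shared-hidden-layer idea with LSTM layers: the recurrent stack is shared across languages and each language owns its own output layer. Sizes and the language set are assumptions:

```python
# Shared LSTM stack with language-dependent softmax output layers.
import torch
import torch.nn as nn

class SharedLSTMAcousticModel(nn.Module):
    def __init__(self, n_feats=40, hidden=320, n_states=None):
        super().__init__()
        n_states = n_states or {"zh": 3000, "en": 2500}
        self.shared = nn.LSTM(n_feats, hidden, num_layers=3, batch_first=True)
        # One language-dependent output layer per language.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden, n) for lang, n in n_states.items()}
        )

    def forward(self, feats, lang):
        out, _ = self.shared(feats)
        return self.heads[lang](out)

model = SharedLSTMAcousticModel()
logits = model(torch.randn(2, 100, 40), lang="zh")
print(logits.shape)  # torch.Size([2, 100, 3000])
```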
Recently, there are several domains that have their own feature extractors, such as ResNet, BERT, and GPT-x, which are widely used for various down-stream tasks. These models are pre-trained on large amounts of unlabeled data by self-supervision. In the speech domain, wav2vec2.0 starts to show its powerful representation ability and feasibility for ultra-low resource speech recognition. This extractor is pre-trained on a monolingual audiobook corpus, whereas it has not been thoroughly examined in real spoken scenarios and languages other...
End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and have shown the potential to become the mainstream. However, the unified structure and the E2E training hamper injecting context information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing on the given context is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable...
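For reference, a toy version of the integrate-and-fire scheduling at the heart of CIF: per-frame weights are accumulated and a token boundary fires whenever the accumulator crosses 1.0 (the weight values below are stand-ins for learned per-frame weights):

```python
# Continuous integrate-and-fire (CIF): accumulate per-frame weights and
# emit a token boundary each time the accumulator crosses 1.0.
import torch

weights = torch.tensor([0.3, 0.4, 0.5, 0.2, 0.7, 0.4])  # per-frame alpha
acc, fired_at = 0.0, []
for t, a in enumerate(weights.tolist()):
    acc += a
    if acc >= 1.0:          # integrate-and-fire: emit one label here
        fired_at.append(t)
        acc -= 1.0          # carry the remainder into the next integration
print(fired_at)  # frame indices where tokens are emitted: [2, 4]
```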
The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this choice has been challenged by sequence-to-sequence attention-based models, which integrate an acoustic, pronunciation and language model into a single neural network. On English ASR tasks, previous attempts have already shown that the modeling unit of graphemes can outperform that of phonemes in such models. In this paper,...