Shiyu Zhou

ORCID: 0000-0002-6889-0316
Research Areas
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Speech and Audio Processing
  • Topic Modeling
  • Ferroelectric and Piezoelectric Materials
  • Microwave Dielectric Ceramics Synthesis
  • Advanced Sensor and Energy Harvesting Materials
  • Advanced Image and Video Retrieval Techniques
  • Retinal Imaging and Analysis
  • Multimodal Machine Learning Applications
  • Machine Learning in Healthcare
  • Multiferroics and related materials
  • Acoustic Wave Resonator Technologies
  • Thermal Expansion and Ionic Conductivity
  • Dielectric properties of ceramics
  • Advanced Fiber Optic Sensors
  • Surface Modification and Superhydrophobicity
  • Electronic and Structural Properties of Oxides
  • Complex Network Analysis Techniques
  • Spectroscopy and Chemometric Analyses
  • Glaucoma and retinal disorders
  • Water Quality Monitoring and Analysis
  • Advanced Text Analysis Techniques
  • Web Data Mining and Analysis

Shenzhen Institutes of Advanced Technology
2021-2024

Chinese Academy of Sciences
2013-2024

Shandong Institute of Automation
2017-2024

Dalian Polytechnic University
2024

Shanghai University
2024

Shaanxi University of Science and Technology
2021-2022

Institute of Automation
2018-2021

University of Chinese Academy of Sciences
2017-2018

Abstract In this study, a high-entropy perovskite oxide Sr(Zr0.2Sn0.2Hf0.2Ti0.2Nb0.2)O3 (SZSHTN) was first introduced into Na0.5Bi0.5TiO3 (NBT) lead-free ferroelectric ceramics to boost both the high-temperature dielectric stability and the energy storage performance. Excellent comprehensive performance was obtained simultaneously in the 0.8NBT–0.2SZSHTN ceramic, with a high ε′ value (> 2000), a wide ε′-temperature stable range (TCC < 5%, 52.4–362 °C), a low tan δ (< 0.01, 90–341 °C), and good energy storage performance (W_rec = 3.52 J/cm³, η varies...

10.1111/jace.18455 article EN Journal of the American Ceramic Society 2022-04-01
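
For reference, the recoverable energy density W_rec and the efficiency η quoted above follow the standard definitions used for dielectric energy storage; this is the usual convention, not notation taken from the paper:

W_{\mathrm{rec}} = \int_{P_r}^{P_{\max}} E \, dP, \qquad \eta = \frac{W_{\mathrm{rec}}}{W_{\mathrm{rec}} + W_{\mathrm{loss}}}

where P_max is the maximum polarization under the applied field E, P_r is the remnant polarization, and W_loss is the hysteresis loss given by the area enclosed by the P–E loop.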

Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we attempt to extend it to speaker verification and language identification. First, we use some preliminary experiments to indicate that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate its effectiveness on the two tasks respectively. For verification,...

10.21437/interspeech.2021-1280 article EN Interspeech 2021 2021-08-27
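
A minimal sketch of how wav2vec 2.0 can be reused as a frame-level feature extractor for utterance-level classification tasks such as language identification: the pre-trained encoder is mean-pooled and topped with a linear classifier. The checkpoint name, pooling and head below are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Wav2Vec2Classifier(nn.Module):
    """wav2vec 2.0 encoder + mean pooling + linear classifier (illustrative)."""
    def __init__(self, num_classes: int, pretrained: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw 16 kHz audio
        hidden = self.encoder(waveform).last_hidden_state   # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                         # utterance-level embedding
        return self.head(pooled)                            # class logits

# Example: 10-way language identification on 3 s of dummy audio
model = Wav2Vec2Classifier(num_classes=10)
logits = model(torch.randn(2, 48000))
print(logits.shape)  # torch.Size([2, 10])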

Abstract In pulse power systems, multilayer ceramic capacitors (MLCCs) encounter significant challenges due to the heightened loading electric field (E), which can lead to fatigue damage and ultrasonic concussion caused by electrostrictive strain. To address these issues, an innovative strategy focused on achieving an ultra-weak polarization–strain coupling effect is proposed, which effectively reduces the strain in MLCCs. Remarkably, an ultra-low electrostrictive coefficient (Q33) of 0.012 m⁴ C⁻² is achieved in the composition...

10.1002/adma.202406219 article EN Advanced Materials 2024-08-12
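
The electrostrictive coefficient Q33 quoted above relates the longitudinal strain S3 to the polarization P3 through the standard quadratic electrostriction law (standard notation, not taken from the paper):

S_3 = Q_{33} \, P_3^{2}

so lowering Q33 directly weakens the polarization–strain coupling and, with it, the electrostrictive strain that drives fatigue in MLCCs.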

Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, which integrate an acoustic, pronunciation and language model into a single neural network. Among these models, the Transformer, a new sequence-to-sequence model relying entirely on self-attention without using RNNs or convolutions, achieves single-model state-of-the-art BLEU on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and concentrate on it as the basic...

10.21437/interspeech.2018-1107 article EN Interspeech 2018 2018-08-28
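
The self-attention the abstract refers to is the scaled dot-product attention of the original Transformer; a minimal sketch of the standard formulation (not code from the paper):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5       # (..., q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Self-attention: queries, keys and values all come from the same sequence
x = torch.randn(2, 50, 64)                               # (batch, frames, model_dim)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                                         # torch.Size([2, 50, 64])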

Sequence-to-sequence attention-based models integrate an acoustic, pronunciation and language model into a single neural network, which makes them very suitable for multilingual automatic speech recognition (ASR). In this paper, we are concerned with multilingual ASR on low-resource languages by the Transformer, one of the sequence-to-sequence models. Sub-words are employed as the modeling unit without using any lexicon. First, we show that the multilingual ASR Transformer performs well despite some language confusion. We then look at incorporating...

10.48550/arxiv.1806.05059 preprint EN other-oa arXiv (Cornell University) 2018-01-01

There are several domains that own corresponding widely used feature extractors, such as ResNet, BERT, and GPT-x. These models are usually pre-trained on large amounts of unlabeled data by self-supervision and can be effectively applied to downstream tasks. In the speech domain, wav2vec2.0 starts to show its powerful representation ability and feasibility of ultra-low-resource speech recognition on the Librispeech corpus, which belongs to the audiobook domain. However, it has not been examined in real spoken scenarios and languages other...

10.48550/arxiv.2012.12121 preprint EN other-oa arXiv (Cornell University) 2020-01-01

End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown its amazing performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end model. The fused model only needs to learn the transfer from speech to language during...

10.1109/lsp.2021.3071668 article EN IEEE Signal Processing Letters 2021-01-01
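
A rough sketch of the kind of fusion the abstract describes: a pre-trained acoustic encoder feeds a pre-trained linguistic encoder through a small trainable adapter, so that mainly the transfer from speech to language has to be learned during fine-tuning. The checkpoint names, adapter design and per-frame output below are assumptions for illustration, not the published architecture.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusedASRSketch(nn.Module):
    """Pre-trained acoustic encoder + adapter + pre-trained linguistic encoder (illustrative only)."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Trainable adapter mapping acoustic frames into BERT's embedding space
        self.adapter = nn.Linear(self.acoustic.config.hidden_size,
                                 self.linguistic.config.hidden_size)
        self.output = nn.Linear(self.linguistic.config.hidden_size, vocab_size)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        frames = self.acoustic(waveform).last_hidden_state          # (B, T, 768)
        adapted = self.adapter(frames)                              # (B, T, 768)
        # Feed adapted acoustic frames to BERT as pre-computed embeddings
        fused = self.linguistic(inputs_embeds=adapted).last_hidden_state
        return self.output(fused)                                   # per-frame token logits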

Nowadays, most methods for end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may suffer from confusion between similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating these problems with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase...

10.1109/icassp43922.2022.9747101 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, which integrate an acoustic, pronunciation and language model into a single neural network. Among these models, the Transformer, a new sequence-to-sequence model relying entirely on self-attention without using RNNs or convolutions, achieves single-model state-of-the-art BLEU on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and concentrate on it as...

10.48550/arxiv.1804.10752 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Recently, end-to-end (E2E) models have become a competitive alternative to the conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch between training and testing conditions. In this paper, we use the Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each...

10.1109/asru46091.2019.9003844 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01
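
A minimal sketch of the kind of speaker attention module described above: each encoder frame attends over a fixed bank of i-vectors (the static speaker knowledge block) and the attended speaker vector is added back to the frame. Dimensions, bank size and the way the speaker vector is combined are illustrative assumptions, not the published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerAttentionModule(nn.Module):
    """Attend over a static bank of i-vectors to build a per-frame speaker embedding (illustrative)."""
    def __init__(self, model_dim: int, ivector_dim: int, num_speakers: int):
        super().__init__()
        # Static knowledge block: one i-vector per training speaker (kept frozen in this sketch)
        self.skb = nn.Parameter(torch.randn(num_speakers, ivector_dim), requires_grad=False)
        self.query = nn.Linear(model_dim, ivector_dim)
        self.out = nn.Linear(ivector_dim, model_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, model_dim) encoder states
        q = self.query(frames)                                   # (B, T, d_iv)
        scores = q @ self.skb.t() / self.skb.size(-1) ** 0.5     # (B, T, num_speakers)
        weights = F.softmax(scores, dim=-1)
        speaker_vec = weights @ self.skb                         # (B, T, d_iv)
        return frames + self.out(speaker_vec)                    # speaker-aware frames

sam = SpeakerAttentionModule(model_dim=256, ivector_dim=100, num_speakers=500)
print(sam(torch.randn(2, 80, 256)).shape)  # torch.Size([2, 80, 256])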

End-to-end models have been showing superiority in Automatic Speech Recognition (ASR). At the same time, the capacity of streaming recognition has become a growing requirement for end-to-end models. Following these trends, an encoder-decoder recurrent neural network called the Recurrent Neural Aligner (RNA) has been freshly proposed and has shown its competitiveness on two English ASR tasks. However, it is not clear whether RNA can be further improved and applied to other spoken languages. In this work, we explore...

10.21437/interspeech.2018-1086 article EN Interspeech 2018 2018-08-28

This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts respectively. Our pre-training method is divided into two stages, named acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked feature chunks given their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. The two-stage pre-training integrates rich acoustic and linguistic knowledge into the seq2seq...

10.48550/arxiv.1910.12418 preprint EN other-oa arXiv (Cornell University) 2019-01-01
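
A toy illustration of the acoustic pre-training objective described above: random contiguous chunks of input features are masked and the encoder is trained to reconstruct them from the surrounding context. The chunk size, masking rate, stand-in encoder and loss below are illustrative assumptions, not the paper's setup.

import torch
import torch.nn as nn

def mask_feature_chunks(feats: torch.Tensor, chunk: int = 10, p: float = 0.15):
    """Zero out random contiguous chunks of frames; return masked features and the mask."""
    feats = feats.clone()
    mask = torch.zeros(feats.shape[:2], dtype=torch.bool)
    for b in range(feats.size(0)):
        for t in range(0, feats.size(1) - chunk, chunk):
            if torch.rand(1).item() < p:
                feats[b, t:t + chunk] = 0.0
                mask[b, t:t + chunk] = True
    return feats, mask

# One acoustic pre-training step: the encoder reconstructs the masked chunks from context
encoder = nn.LSTM(80, 80, num_layers=2, batch_first=True)   # stand-in for the real encoder
feats = torch.randn(4, 300, 80)                             # (batch, frames, feature_dim)
masked, mask = mask_feature_chunks(feats)
pred, _ = encoder(masked)
loss = nn.functional.l1_loss(pred[mask], feats[mask])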

End-to-end models are gaining wider attention in the field of automatic speech recognition (ASR). One of their advantages is the simplicity of building a system that directly recognizes the speech frame sequence into the text label sequence by neural networks. According to the driving end of the recognition process, end-to-end ASR models could be categorized into two types: label-synchronous and frame-synchronous, each of which has its unique model behaviour and characteristic. In this work, we make a detailed comparison between a representative label-synchronous model (Transformer) and a soft frame-synchronous...

10.48550/arxiv.2005.10113 preprint EN other-oa arXiv (Cornell University) 2020-01-01

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the three modalities, and two cross-modal decoders to generate text and image respectively. For OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from different data...

10.48550/arxiv.2107.00249 preprint EN cc-by arXiv (Cornell University) 2021-01-01

The shared-hidden-layer multilingual deep neural network (SHL-MDNN), in which the hidden layers of a feed-forward deep neural network (DNN) are shared across multiple languages while the softmax layers are language dependent, has been shown to be effective on acoustic modeling for low-resource speech recognition. In this paper, we propose that the same scheme with Long Short-Term Memory (LSTM) recurrent neural networks can achieve further performance improvement, considering that LSTM has outperformed DNN as the acoustic model for automatic speech recognition (ASR). Moreover, we reveal...

10.21437/interspeech.2017-111 article EN Interspeech 2017 2017-08-16
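
A compact sketch of the shared-hidden-layer idea with an LSTM: the recurrent layers are shared across languages while each language keeps its own softmax output layer, so low-resource languages benefit from the shared representation. Layer sizes and the per-language head dictionary below are illustrative assumptions.

import torch
import torch.nn as nn

class SharedHiddenLayerLSTM(nn.Module):
    """Shared LSTM acoustic model with language-dependent softmax layers (illustrative)."""
    def __init__(self, feat_dim: int, hidden: int, senones_per_lang: dict):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        # One output (softmax) layer per language
        self.heads = nn.ModuleDict({lang: nn.Linear(hidden, n)
                                    for lang, n in senones_per_lang.items()})

    def forward(self, feats: torch.Tensor, lang: str) -> torch.Tensor:
        hidden, _ = self.shared(feats)          # shared across all languages
        return self.heads[lang](hidden)         # language-dependent logits

model = SharedHiddenLayerLSTM(feat_dim=40, hidden=512,
                              senones_per_lang={"sw": 3000, "vi": 3500})
logits = model(torch.randn(4, 200, 40), lang="sw")
print(logits.shape)  # torch.Size([4, 200, 3000])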

Recently, there are several domains that have their own feature extractors, such as ResNet, BERT, and GPT-x, which are widely used for various down-stream tasks. These models are pre-trained on large amounts of unlabeled data by self-supervision. In the speech domain, wav2vec2.0 starts to show its powerful representation ability and feasibility of ultra-low-resource speech recognition. This extractor is pre-trained on a monolingual audiobook corpus, whereas it has not been thoroughly examined in real spoken scenarios and languages other...

10.1109/ijcnn52387.2021.9533587 article EN 2021 International Joint Conference on Neural Networks (IJCNN) 2021-07-18

End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and have shown the potential to become mainstream. However, the unified structure and the E2E training hamper injecting context information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing on a given context phrase is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable...

10.1109/icassp39728.2021.9415054 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
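
For reference, the continuous integrate-and-fire (CIF) mechanism mentioned above accumulates a per-frame weight until it crosses a threshold of 1.0, at which point it "fires" and emits one token-level acoustic embedding. A simplified, non-batched sketch follows; the weight predictor and the exact firing/carry-over details are simplified here and not taken from the paper.

import torch

def cif(frames: torch.Tensor, weights: torch.Tensor, threshold: float = 1.0):
    """Integrate per-frame weights; fire one label-level embedding each time the threshold is reached.

    frames: (T, D) encoder outputs; weights: (T,) non-negative weights in [0, 1].
    """
    fired, acc_w, acc_h = [], 0.0, torch.zeros(frames.size(1))
    for h, a in zip(frames, weights.tolist()):
        if acc_w + a < threshold:            # keep integrating
            acc_w += a
            acc_h = acc_h + a * h
        else:                                # fire: spend part of the weight, carry the rest over
            spend = threshold - acc_w
            fired.append(acc_h + spend * h)
            acc_w = a - spend
            acc_h = acc_w * h
    return torch.stack(fired) if fired else torch.empty(0, frames.size(1))

emb = cif(torch.randn(100, 256), torch.rand(100) * 0.3)
print(emb.shape)  # (number of fired tokens, 256)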

The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this choice has been challenged by sequence-to-sequence attention-based models, which integrate an acoustic, pronunciation and language model into a single neural network. On English ASR tasks, previous attempts have already shown that the modeling unit of graphemes can outperform that of phonemes. In this paper,...

10.48550/arxiv.1805.06239 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we attempt to extend it to speaker verification and language identification. First, we use some preliminary experiments to indicate that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate its effectiveness on the two tasks respectively. For verification,...

10.48550/arxiv.2012.06185 preprint EN other-oa arXiv (Cornell University) 2020-01-01