Yao Qian

ORCID: 0000-0003-1855-9630
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Natural Language Processing Techniques
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Topic Modeling
  • Phonetics and Phonology Research
  • Multimodal Machine Learning Applications
  • Blind Source Separation Techniques
  • Machine Fault Diagnosis Techniques
  • Plant Reproductive Biology
  • Covalent Organic Framework Applications
  • Plant Stress Responses and Tolerance
  • Subtitles and Audiovisual Media
  • Advanced Algorithms and Applications
  • Advanced Image and Video Retrieval Techniques
  • Grouting, Rheology, and Soil Mechanics
  • Plant Molecular Biology Research
  • Face and Expression Recognition
  • Text Readability and Simplification
  • Wireless Networks and Protocols
  • Explainable Artificial Intelligence (XAI)
  • Geomechanics and Mining Engineering
  • Metabolomics and Mass Spectrometry Studies
  • Video Analysis and Summarization

Affiliations

Xuzhou Medical College
2025

Research Institute of Precision Instruments (Russia)
2025

Tsinghua University
2023-2025

Yangzhou University
2025

China Agricultural University
2023-2024

Microsoft (Germany)
2024

Xiamen University
2023

Microsoft (United States)
2008-2023

Microsoft Research (United Kingdom)
2007-2023

Microsoft (Finland)
2022-2023

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means,...

10.1109/jstsp.2022.3188113 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-04
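
As a rough illustration of the joint masked-prediction-and-denoising objective described above, here is a minimal PyTorch sketch (the GRU encoder, dimensions, and pseudo-label setup are simplifying assumptions, not the actual WavLM implementation): the model receives simulated noisy input with masked frames and must predict discrete targets of the clean speech at the masked positions.

```python
# Minimal sketch of joint masked prediction + denoising pre-training
# (illustrative only; not the actual WavLM code).
import torch
import torch.nn as nn

B, T, D, V = 4, 200, 80, 320   # batch, frames, feature dim, codebook size

clean = torch.randn(B, T, D)                # clean utterance features
noisy = clean + 0.5 * torch.randn(B, T, D)  # simulated noise/overlap
targets = torch.randint(0, V, (B, T))       # pseudo-labels of the CLEAN speech

mask = torch.rand(B, T) < 0.15              # random frame mask
inp = noisy.clone()
inp[mask] = 0.0                             # blank out masked frames

encoder = nn.GRU(D, 256, batch_first=True, bidirectional=True)
head = nn.Linear(512, V)

ctx, _ = encoder(inp)
logits = head(ctx)
# Predict clean-speech targets only at masked positions, so the model
# must simultaneously denoise and infer the masked content.
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
loss.backward()
```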

Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered, context-dependent HMM-based TTS systems [1, 4]. However, the long time-span contextual effect in a speech utterance is still not easy to accommodate, due to the intrinsic, feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are commonly used to constrain the parameter trajectory generation in HMM-based TTS [2]. In this paper, Recurrent Neural Networks...

10.21437/interspeech.2014-443 article EN Interspeech 2014 2014-09-14
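
A minimal sketch of the recurrent acoustic model idea, assuming PyTorch and arbitrary feature dimensions (not the paper's actual configuration): a bidirectional LSTM maps per-frame linguistic features to acoustic features, letting the recurrence capture long time-span context that a feed-forward DNN cannot.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Map per-frame linguistic features to acoustic features (e.g. LSP, F0, V/UV)."""
    def __init__(self, in_dim=300, hidden=256, out_dim=127):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.out(h)         # (batch, frames, out_dim)

model = BLSTMAcousticModel()
x = torch.randn(2, 400, 300)       # dummy linguistic features
y = torch.randn(2, 400, 127)       # dummy acoustic targets
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
```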

Deep Neural Networks (DNNs), which can model a long-span, intricate transform compactly with a deep-layered structure, have recently been investigated for parametric TTS synthesis with a fairly large corpus (33,000 utterances) [6]. In this paper, we examine DNN TTS synthesis with a corpus of moderate size (5 hours), which is more commonly used for TTS training. DNNs are used to map input text features into output acoustic features (LSP, F0 and V/U). Experimental results show that DNNs outperform the conventional HMM, which is trained in ML first and then refined by MGE. Both objective...

10.1109/icassp.2014.6854318 article EN 2014-05-01
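
For contrast with the recurrent model above, a minimal sketch of the frame-level feed-forward mapping this paper examines (layer sizes and dimensions are illustrative assumptions): each frame's linguistic features are regressed to acoustic parameters independently, with no recurrence across frames.

```python
import torch
import torch.nn as nn

# Frame-level feed-forward DNN: linguistic features in, acoustic features out.
dnn = nn.Sequential(
    nn.Linear(300, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 127),          # e.g. spectral params + F0 + V/UV (illustrative)
)
frames = torch.randn(4096, 300)    # each frame is an independent training example
targets = torch.randn(4096, 127)
loss = nn.functional.mse_loss(dnn(frames), targets)
loss.backward()
```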

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.393 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, Yao Qian. Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 2017.

10.18653/v1/w17-5007 article EN cc-by 2017-01-01

In DNN-based TTS synthesis, the DNN's hidden layers can be viewed as a deep transformation of linguistic features, and the output layer as a representation of the acoustic space used to regress the transformed linguistic features to acoustic parameters. The deep-layered architecture of a DNN can not only represent a highly-complex transform compactly, but also take advantage of a huge amount of training data. In this paper, we propose an approach to model multiple speakers with a general DNN, where the same hidden layers are shared among different speakers while the output layer is composed of speaker-dependent nodes explaining the target...

10.1109/icassp.2015.7178817 article EN 2015-04-01
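
A minimal sketch of the shared-hidden-layer, speaker-dependent-output-layer idea, assuming PyTorch and illustrative dimensions (not the paper's actual configuration):

```python
import torch
import torch.nn as nn

class MultiSpeakerDNN(nn.Module):
    """Shared hidden layers; one speaker-dependent output layer per speaker."""
    def __init__(self, in_dim=300, hidden=512, out_dim=127, n_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(n_speakers)])

    def forward(self, x, speaker_id):
        # All speakers share the deep linguistic transformation;
        # only the final regression layer is speaker-specific.
        return self.heads[speaker_id](self.shared(x))

model = MultiSpeakerDNN()
x = torch.randn(32, 300)
y_hat = model(x, speaker_id=2)   # acoustic parameters in speaker 2's voice
```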

Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) have been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents, while word embedding has been demonstrated to be a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use a BLSTM-RNN with word embedding for the part-of-speech (POS) tagging task. When tested on the Penn Treebank WSJ test set, state-of-the-art performance of 97.40% tagging accuracy is achieved. Without using...

10.48550/arxiv.1510.06168 preprint EN other-oa arXiv (Cornell University) 2015-01-01
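
A minimal sketch of the embedding-plus-BLSTM tagging architecture, assuming PyTorch with illustrative vocabulary and tag-set sizes (the paper's actual setup and pre-trained embeddings differ):

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256, n_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, tokens):               # (batch, seq_len) word ids
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                   # per-token tag logits

tagger = BLSTMTagger()
tokens = torch.randint(0, 10000, (8, 30))
tags = torch.randint(0, 45, (8, 30))
logits = tagger(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 45), tags.reshape(-1))
loss.backward()
```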

The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning, and thus attract a lot of interest in being applied to various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are firstly averaged with learnable weights and then fed...

10.1109/icassp43922.2022.9747814 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
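
A minimal sketch of the learnable weighted average over the pre-trained model's hidden layers, assuming PyTorch; the downstream ECAPA-TDNN is omitted and the layer count and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LayerWeightedAverage(nn.Module):
    """Average the hidden states of all pre-trained layers with learnable
    weights; the result is handed to a downstream speaker model."""
    def __init__(self, n_layers=13):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_states):          # (n_layers, batch, frames, dim)
        weights = torch.softmax(self.w, dim=0)
        return (weights[:, None, None, None] * layer_states).sum(dim=0)

states = torch.randn(13, 2, 300, 768)          # frozen SSL model's layer outputs
pooled = LayerWeightedAverage()(states)        # (2, 300, 768) -> ASV backbone
```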

Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great successes in applying self-supervised learning to speech recognition, while limited exploration was attempted in applying SSL to model speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing the unsupervised speaker information extraction. First, we apply...

10.1109/icassp43922.2022.9747077 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Dermal adipocyte lineage cells are highly plastic and can undergo reversible differentiation and dedifferentiation in response to various stimuli. Using single-cell RNA sequencing of developing or wounded mouse skin, we classify dermal fibroblasts (dFBs) into distinct non-adipogenic and adipogenic cell states. Cell trajectory analyses identify IL-1-NF-κB and WNT-β-catenin as the top signaling pathways that positively and negatively associate with adipogenesis, respectively. Upon wounding, activation of adipocyte progenitors...

10.1016/j.celrep.2023.112647 article EN cc-by-nc-nd Cell Reports 2023-06-01

We introduce a new method to grade non-native spoken language tests automatically. Traditional automated response grading approaches use manually engineered time-aggregated features (such as mean length of pauses). We propose to incorporate general time-sequence features (such as pitch), which preserve more information than time-aggregated features and do not require human effort to design. We use a type of recurrent neural network to jointly optimize the learning of high-level abstractions from time-sequence features with the time-aggregated features. We first automatically learn these abstractions with a Bidirectional Long Short...

10.1109/asru.2015.7404814 article EN 2015-12-01
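
A minimal sketch of jointly using time-sequence and time-aggregated features, assuming PyTorch and illustrative feature choices (not the paper's exact model): a BLSTM summarizes the frame-level sequence, and its pooled output is concatenated with the hand-crafted aggregated features before scoring.

```python
import torch
import torch.nn as nn

class JointScorer(nn.Module):
    """BLSTM abstraction of time-sequence features (e.g. frame-level pitch)
    concatenated with time-aggregated features, then regressed to a score."""
    def __init__(self, seq_dim=3, agg_dim=20, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(seq_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Sequential(
            nn.Linear(2 * hidden + agg_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, seq, agg):
        h, _ = self.rnn(seq)
        summary = h.mean(dim=1)              # pool the sequence over time
        return self.score(torch.cat([summary, agg], dim=-1)).squeeze(-1)

model = JointScorer()
pitch_seq = torch.randn(16, 500, 3)          # e.g. pitch/energy/voicing per frame
agg_feats = torch.randn(16, 20)              # e.g. mean pause length, speech rate
scores = model(pitch_seq, agg_feats)         # predicted proficiency scores
```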

Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) have been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents. In this study, we propose to use a BLSTM-RNN as a unified tagging solution that can be applied to various tagging tasks, including part-of-speech tagging, chunking and named entity recognition. Instead of exploiting specific features carefully optimized for each task, our solution only uses one set of task-independent features and internal...

10.48550/arxiv.1511.00215 preprint EN other-oa arXiv (Cornell University) 2015-01-01

Spoken language understanding (SLU) in dialog systems is generally performed using a natural language understanding (NLU) model based on the hypotheses produced by an automatic speech recognition (ASR) system. However, when new spoken dialog applications are built from scratch in real user environments that often have sub-optimal audio characteristics, ASR performance can suffer due to factors such as the paucity of training data or a mismatch between training and test data. To address this issue, this paper proposes an ASR-free, end-to-end (E2E)...

10.1109/asru.2017.8268987 article EN 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017-12-01
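
A minimal sketch of an ASR-free, end-to-end SLU classifier in this spirit, assuming PyTorch and illustrative dimensions: the intent is predicted directly from acoustic features, with no transcript in the middle.

```python
import torch
import torch.nn as nn

class E2EIntentClassifier(nn.Module):
    """Predict the dialog intent directly from acoustic features,
    with no intermediate ASR hypothesis."""
    def __init__(self, feat_dim=40, hidden=128, n_intents=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):                 # (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.cls(h.mean(dim=1))        # utterance-level intent logits

model = E2EIntentClassifier()
logits = model(torch.randn(8, 300, 40))      # log-mel frames -> intent logits
```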

In recent years, machine learning models for automated speech scoring systems were mainly built using data-driven approaches, with handcrafted features as one of the main components. However, the remarkable success of deep learning (DL) technology in a variety of tasks has demonstrated its effectiveness in extracting features. Although there have been some efforts in utilizing DL for the speech scoring task, a thorough investigation of which features are useful is still missing. In this paper, we propose an end-to-end solution that consists of a neural network to...

10.1109/icassp.2018.8462562 article EN 2018-04-01

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized...

10.1109/icassp43922.2022.9746929 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
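
A minimal sketch of the "switched" contrastive objective, assuming PyTorch and randomly generated stand-ins for the contextual representations and quantized targets (the real method computes these with the wav2vec 2.0 network):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, target, distractors):
    """InfoNCE-style loss: the context should match its target better
    than any of the sampled distractors."""
    cands = torch.cat([target.unsqueeze(1), distractors], dim=1)  # (B, 1+K, D)
    sims = F.cosine_similarity(context.unsqueeze(1), cands, dim=-1) / 0.1
    return F.cross_entropy(sims, torch.zeros(len(sims), dtype=torch.long))

B, K, D = 8, 10, 256
c_clean, c_noisy = torch.randn(B, D), torch.randn(B, D)   # contextual reps
q_clean, q_noisy = torch.randn(B, D), torch.randn(B, D)   # quantized targets
neg = torch.randn(B, K, D)                                 # sampled distractors

# Standard task: each view predicts its own quantized target ...
loss = contrastive_loss(c_clean, q_clean, neg) + contrastive_loss(c_noisy, q_noisy, neg)
# ... plus the "switch": the clean context must also predict the noisy
# target and vice versa, pushing the representation to ignore the noise.
loss = loss + contrastive_loss(c_clean, q_noisy, neg) + contrastive_loss(c_noisy, q_clean, neg)
```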

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then...

10.1609/aaai.v37i9.26290 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26
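
A minimal sketch of the fusion stage, assuming PyTorch, illustrative encoder output dimensions, and a simple concatenate-then-transform design (the actual i-Code fusion module is more elaborate): each modality's encoder output is projected into a shared space and the concatenated sequence is fused by a small transformer.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Project each modality into a shared space, then fuse the
    concatenated sequence with a small transformer encoder."""
    def __init__(self, dims=None, d=256):
        super().__init__()
        dims = dims or {'vision': 512, 'speech': 768, 'text': 768}
        self.proj = nn.ModuleDict({m: nn.Linear(k, d) for m, k in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):                 # dict: modality -> (B, len, dim)
        seq = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        return self.fusion(seq)               # joint multimodal representation

model = SimpleFusion()
out = model({'vision': torch.randn(2, 50, 512),
             'speech': torch.randn(2, 100, 768),
             'text': torch.randn(2, 30, 768)})
```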

Purpose: This study implements and demonstrates a deep learning (DL) approach for screening referable horizontal strabismus based on primary gaze photographs, using clinical assessments as the reference. The purpose of this study was to develop and evaluate algorithms that screen for referable horizontal strabismus in children's primary gaze photographs. Methods: DL algorithms were developed and trained with data from two tertiary hospitals, collected from children with strabismus who underwent surgery as well as orthotropic children who underwent routine refractive tests. A total of 7026 images (3829 non-strabismus from 3021 orthoptics...

10.1167/tvst.10.1.33 article EN cc-by-nc-nd Translational Vision Science & Technology 2021-01-27

This paper proposes a three-tier prosodic hierarchy, including prosodic word, intermediate phrase and intonational phrase tiers, for Mandarin, which emphasizes the use of the prosodic word instead of the lexical word as the basic prosodic unit. Both surface difference and perceptual evidence show this is helpful for achieving high naturalness in text-to-speech conversion. Three approaches, a CART approach, a bottom-up hierarchical approach and a modified hierarchical approach, are presented for locating the boundaries of the three prosodic constituents in unrestricted texts. Two sets of features are used in each method: one contains...

10.30019/ijclclp.200102.0003 article EN 2001-02-01

We propose a hidden Markov model (HMM)-based bilingual (Mandarin and English) text-to-speech (TTS) system to synthesize natural speech for given bilingual text. A simple baseline system consisting of two independent monolingual HMM synthesizers is built first from corresponding Mandarin and English data recorded by a bilingual speaker. A new, mixed-language TTS system is then constructed by asking language-independent and language-specific questions and sharing states across the two languages in decision-tree based clustering. By sharing states, the new system has...

10.1109/tasl.2009.2015708 article EN IEEE Transactions on Audio Speech and Language Processing 2009-07-01

It is technically challenging to make a machine talk as naturally as a human so as to facilitate “frictionless” interactions between machine and human. We propose a trajectory tiling-based approach to high-quality speech rendering, where speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform “tiles” stored in a pre-recorded speech database. We test the proposed unified algorithm in both Text-To-Speech (TTS) synthesis and cross-lingual voice transformation...

10.1109/tasl.2012.2221460 article EN IEEE Transactions on Audio Speech and Language Processing 2012-10-01
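
A minimal sketch of trajectory-guided tile search as a dynamic program, assuming NumPy, one feature vector per segment, and a flat concatenation cost (the real system's target and concatenation costs are far richer): each segment's tile is chosen to stay close to the guide trajectory while penalizing switches between tiles.

```python
import numpy as np

def tile_search(guide, tiles, concat_cost=0.5):
    """Dynamic-programming search for the tile sequence closest to the guide
    trajectory: per-segment target cost + a flat cost for switching tiles."""
    n_seg, n_tiles = len(guide), len(tiles)
    # target cost of using tile j for guide segment t
    target = np.array([[np.linalg.norm(guide[t] - tiles[j])
                        for j in range(n_tiles)] for t in range(n_seg)])
    best = target[0].copy()
    back = np.zeros((n_seg, n_tiles), dtype=int)
    for t in range(1, n_seg):
        # switch[i, j]: cumulative cost of previous tile i followed by tile j
        switch = best[:, None] + concat_cost * (1 - np.eye(n_tiles))
        back[t] = switch.argmin(axis=0)
        best = target[t] + switch.min(axis=0)
    path = [int(best.argmin())]                # backtrace the best sequence
    for t in range(n_seg - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

guide = np.random.randn(6, 4)     # guide trajectory, one vector per segment
tiles = np.random.randn(20, 4)    # candidate waveform "tiles" (as features)
print(tile_search(guide, tiles))
```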

The current state-of-the-art TTS synthesis can produce synthesized speech with highly decent quality if rich segmental and suprasegmental information is given. However, some of these features, e.g., Tones and Break Indices (ToBI), are time-consuming to obtain, due to being manually labeled with high inconsistency among different annotators. In this paper, we investigate the use of word embedding, which represents a word as a low-dimensional, continuous-valued vector assumed to carry certain syntactic and semantic information, for bidirectional long short...

10.1109/icassp.2015.7178898 article EN 2015-04-01

Neural network (NN) based voice conversion, which employs a nonlinear function to map the features from a source to a target speaker, has been shown to outperform the GMM-based conversion approach [4-7]. However, there are still limitations to be overcome in NN-based conversion, e.g., the NN is trained on a Frame Error (FE) minimization criterion, and the corresponding weights are adjusted to minimize the error squares over the whole source-target, stereo training data set. In this paper, we use the idea of sentence-level optimization based, minimum...

10.21437/interspeech.2014-448 article EN Interspeech 2014 2014-09-14