Xiong Xiao

ORCID: 0009-0001-5128-6518
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Probabilistic and Robust Engineering Design
  • Speech and dialogue systems
  • Structural Health Monitoring Techniques
  • Geotechnical Engineering and Analysis
  • Advanced Adaptive Filtering Techniques
  • Frequency Control in Power Systems
  • Concrete Corrosion and Durability
  • Infrastructure Maintenance and Monitoring
  • Geotechnical Engineering and Soil Stabilization
  • Blind Source Separation Techniques
  • Acoustic Wave Phenomena Research
  • Algorithms and Data Compression
  • Time Series Analysis and Forecasting
  • Electric Power System Optimization
  • Microgrid Control and Optimization
  • Web Data Mining and Analysis
  • Grouting, Rheology, and Soil Mechanics
  • Advanced Text Analysis Techniques
  • Optimal Power Flow Distribution
  • Matrix Theory and Algorithms

Guangxi University
2023-2025

China Southern Power Grid (China)
2022-2025

Central South University
2010-2025

Microsoft (United States)
2018-2024

Tsinghua University
2022-2023

Jilin University
2021-2023

Zhongshan Hospital of Xiamen University
2023

Fudan University
2023

Hainan University
2022

China Electric Power Research Institute
2019-2022

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, ...

10.1109/jstsp.2022.3188113 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-04
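
The denoising half of that objective hinges on how training inputs are corrupted. Below is a minimal, hypothetical sketch (not the released WavLM pipeline) of producing one such example: a clean waveform is overlapped with a distractor at a chosen SNR, spans of frames are masked, and the prediction targets at masked positions still come from the clean signal. The function name, the 20 ms frame hop, and all parameter values are illustrative assumptions.

```python
import numpy as np

def make_wavlm_style_example(clean, distractor, mask_prob=0.065,
                             mask_span=10, mix_snr_db=5.0, seed=0):
    """Corrupt an utterance for joint masked-prediction + denoising pre-training.

    clean, distractor: 1-D waveform arrays. The model sees the overlapped
    mixture but is asked to predict (pseudo-)labels of the CLEAN signal at
    masked positions, so it must denoise and predict at the same time.
    """
    rng = np.random.default_rng(seed)
    n = len(clean)
    d = np.resize(distractor, n)

    # Scale the distractor so the mixture has the requested SNR (simulated overlap).
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(d ** 2) * 10 ** (mix_snr_db / 10) + 1e-8))
    noisy = clean + gain * d

    # Span-based masking over frames (assumed 20 ms hop at 16 kHz).
    hop = 320
    n_frames = n // hop
    starts = rng.random(n_frames) < mask_prob
    mask = np.zeros(n_frames, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + mask_span] = True
    return noisy, mask  # targets at masked frames are derived from `clean`
```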

This paper presents a learning-based approach to the task of direction of arrival (DOA) estimation from microphone array input. Traditional signal processing methods, such as the classic least square (LS) method, rely on strong assumptions about signal models and accurate estimations of time delay of arrival (TDOA). They only work well in relatively clean conditions, but suffer from noise and reverberation distortions. In this paper, we propose a learning-based approach that can learn from a large amount of simulated noisy and reverberant microphone array inputs for robust DOA estimation. ...

10.1109/icassp.2015.7178484 article EN 2015-04-01
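
A common way to realize such a learning-based DOA estimator is to feed GCC-PHAT cross-correlation vectors, computed from simulated array signals, to a classifier over discretized azimuth angles. The sketch below computes that feature; the function name and parameter choices are assumptions for illustration, not necessarily the paper's exact front end.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024, max_lag=16):
    """GCC-PHAT cross-correlation between two microphone signals.

    Instead of picking the single peak (as TDOA-based LS methods do),
    a learning-based system can feed the whole correlation vector to a
    classifier over discretized DOA angles, which is more robust to
    noise and reverberation.
    """
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-8                # PHAT weighting
    cc = np.fft.irfft(cross, n_fft)
    # Keep lags -max_lag..+max_lag, centered on lag 0.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
```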

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies use pre-segmented audio signals, which are typically generated by mixing utterances on computers so that they fully overlap. Also, the algorithms have often been evaluated based on signal-based metrics such as the signal-to-distortion ratio. However, in natural conversations, speech signals contain both overlapped and overlap-free regions. In addition, signal-based metrics have only a weak correlation with automatic ...

10.1109/icassp40776.2020.9053426 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

Although advances in close-talk speech recognition have resulted in relatively low error rates, the recognition performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, which is especially difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel ...

10.1109/slt.2018.8639593 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques that distinguish synthetic from human speech is necessary. In this study, we continue the quest to discriminate synthetic from human speech. Motivated by the facts that current analysis-synthesis techniques operate on the frame level and make a frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation ...

10.1109/icassp.2013.6639067 article EN IEEE International Conference on Acoustics, Speech and Signal Processing 2013-05-01
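
As a rough illustration of the feature family involved, the sketch below computes a magnitude modulation spectrum: the FFT along time of each frequency band's log-magnitude trajectory. Frame-by-frame synthesis tends to disturb exactly this long-term temporal structure. All names and parameter values are assumptions, not the paper's configuration.

```python
import numpy as np

def modulation_spectrum(signal, frame_len=400, hop=160, mod_fft=64):
    """Magnitude modulation spectrum of a waveform.

    Each column of the log-magnitude spectrogram is a band trajectory;
    its FFT along time describes the band's temporal structure, which
    frame-independent analysis-synthesis tends to flatten or distort.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    logspec = np.log(spec + 1e-8)            # (frames, freq bins)
    traj = logspec - logspec.mean(axis=0)    # remove per-band DC
    # Modulation spectrum per frequency bin: spectrum of its trajectory.
    return np.abs(np.fft.rfft(traj, n=mod_fft, axis=0))  # (mod bins, freq bins)
```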

The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been used almost exclusively because previously proposed speech separation techniques have critical constraints for application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and ...

10.21437/interspeech.2018-2284 preprint EN Interspeech 2018 2018-08-28

Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCM) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNN). Specifically, our methods include training an RNN ...

10.1109/icassp.2017.7952756 article EN 2017-03-01
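
To make the mask/SCM relationship concrete, here is a minimal NumPy sketch of mask-based MVDR at a single frequency bin, using the common reference-channel formulation w = (Φ_n⁻¹Φ_s / tr(Φ_n⁻¹Φ_s)) u; in the paper's setting the mask values would come from the RNN. Function and variable names are illustrative assumptions.

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask, ref=0):
    """Mask-based MVDR beamformer at one frequency bin.

    Y:           (channels, frames) complex STFT coefficients.
    *_mask:      (frames,) TF-mask values in [0, 1].
    ref:         index of the reference microphone.
    """
    # Mask-weighted spatial covariance matrices (SCMs).
    phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    # w = Phi_n^{-1} Phi_s u / tr(Phi_n^{-1} Phi_s)
    num = np.linalg.solve(phi_n, phi_s)
    w = num[:, ref] / (np.trace(num) + 1e-8)
    return w.conj() @ Y      # (frames,) beamformed output
```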

The utterance-level permutation invariant training (uPIT) technique is a state-of-the-art deep learning architecture for speaker-independent multi-talker separation. uPIT solves the label ambiguity problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, it may be sub-optimal at the segmental level because the optimization is not calculated over individual frames. In this paper, we propose constrained uPIT (cuPIT) to solve this problem by computing a weighted MSE loss using dynamic information ...

10.1109/icassp.2018.8462471 article EN 2018-04-01
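
The uPIT objective itself is compact: compute the MSE under every output-target assignment and keep the best one, with that single permutation fixed over the whole utterance. A minimal sketch (illustrative, not the paper's code):

```python
import numpy as np
from itertools import permutations

def upit_mse(outputs, targets):
    """Utterance-level PIT loss.

    outputs, targets: lists of (frames, features) arrays, one per speaker.
    The minimum over permutations resolves the label ambiguity; fixing one
    permutation per utterance is what makes this "utterance level".
    """
    S = len(outputs)
    return min(
        np.mean([(outputs[i] - targets[p[i]]) ** 2 for i in range(S)])
        for p in permutations(range(S))
    )
```

Per the abstract, cuPIT then replaces this plain MSE with a weighted MSE informed by dynamic information, to penalize segmental errors that the utterance-level criterion alone can miss.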

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method ...

10.1109/icassp39728.2021.9413832 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Neural network-based speech separation has received a surge of interest in recent years. Previously proposed methods are either speaker independent or extract a target speaker's voice by using his or her voice snippet. In applications such as home devices or office meeting transcription, a list of possible speakers is available, which can be leveraged for speech separation. This paper proposes a novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or enrollment data, in addition to that of the target speaker. ...

10.1109/icassp.2019.8682245 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the overlapping utterances. Compared with prior models, ...

10.21437/interspeech.2022-7 article EN Interspeech 2022 2022-09-16
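
The serialization rule is easy to state in code. The sketch below (a hypothetical helper, shown for the two-speaker case) sorts tokens by emission time and inserts a channel-change token whenever the speaker changes, which is what lets a single output branch represent overlapping speech.

```python
def serialize_t_sot(timed_tokens, cc_token="<cc>"):
    """Build a t-SOT-style training target (illustrative sketch).

    timed_tokens: list of (emission_time, speaker_id, token) tuples.
    Tokens are emitted in chronological order; a channel-change token
    marks every switch between the "virtual" output channels.
    """
    out, prev = [], None
    for _, spk, tok in sorted(timed_tokens):
        if prev is not None and spk != prev:
            out.append(cc_token)
        out.append(tok)
        prev = spk
    return out

# e.g. [(0.0, 'A', 'hello'), (0.4, 'B', 'hi'), (0.5, 'A', 'world')]
#  ->  ['hello', '<cc>', 'hi', '<cc>', 'world']
```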

In this paper, we study a novel technique that normalizes the modulation spectra of speech signals for robust speech recognition. The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal; hence, they describe the temporal structure of the features. The modulation spectra are distorted when the speech signal is corrupted by noise. We propose the temporal structure normalization (TSN) filter to reduce the noise effects by normalizing the modulation spectra to reference spectra. TSN is different from other normalization methods such as histogram equalization (HEQ) that only normalize the probability ...

10.1109/tasl.2008.2002082 article EN IEEE Transactions on Audio Speech and Language Processing 2008-10-22
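
A simplified frequency-domain rendition of the idea: per feature dimension, scale the modulation spectrum of the test trajectory toward a clean reference PSD and transform back. The actual TSN method designs a normalization filter rather than equalizing each utterance directly in the frequency domain, so treat this purely as a sketch; `ref_psd` and all names are assumptions.

```python
import numpy as np

def tsn_normalize(feats, ref_psd):
    """Match each feature trajectory's modulation PSD to a clean reference.

    feats:   (T, D) feature trajectories (e.g., MFCCs over time).
    ref_psd: (K, D) reference modulation PSD estimated from clean data.
    """
    T, D = feats.shape
    mean = feats.mean(axis=0)
    F = np.fft.rfft(feats - mean, axis=0)          # (T//2+1, D)
    psd = np.abs(F) ** 2
    # Interpolate the reference PSD onto this utterance's frequency grid.
    grid = np.linspace(0.0, 1.0, F.shape[0])
    src = np.linspace(0.0, 1.0, ref_psd.shape[0])
    refg = np.stack([np.interp(grid, src, ref_psd[:, d]) for d in range(D)],
                    axis=1)
    gain = np.sqrt(refg / (psd + 1e-8))            # magnitude equalizer
    return np.fft.irfft(F * gain, n=T, axis=0) + mean
```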

In this study, we develop the keyword spotting (KWS) and acoustic model (AM) components in a far-field speaker system. Specifically, we use teacher-student (T/S) learning to adapt a well-trained close-talk production AM to the far field by using parallel close-talk and simulated far-field data. We also use T/S learning to compress a large-size KWS model into a small-size one to fit the device's computational cost. Without the need for transcription, T/S learning makes good use of untranscribed data to boost the performance of both adaptation and compression. We further optimize the models with sequence ...

10.1109/icassp.2018.8462209 article EN 2018-04-01
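
The core of T/S learning is a transcription-free loss: the student, fed the simulated far-field signal, is trained toward the posteriors the teacher produces from the parallel close-talk signal. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def ts_loss(teacher_logits, student_logits):
    """Teacher-student adaptation loss (a sketch).

    Cross-entropy between the teacher's output posteriors (computed on the
    close-talk signal) and the student's posteriors (computed on the parallel
    far-field signal). No transcription appears anywhere, which is why
    untranscribed data can be used for both adaptation and compression.
    """
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```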

A nickel-catalyzed decarbonylative thioetherification of carboxylic acids with thiols was developed. Under the reaction conditions, benzoic acids, cinnamic acids, and benzylic carboxylic acids coupled with various thiols, including both aromatic and aliphatic ones, to produce the corresponding thioethers in up to 99% yields. Moreover, this reaction is applicable to the modification of bioactive molecules such as 3-methylflavone-8-carboxylic acid, probenecid, and flufenamic acid, and to the synthesis of the acaricide chlorbenside. These results well demonstrate the potential synthetic value of this new ...

10.1021/acs.joc.2c00866 article EN The Journal of Organic Chemistry 2022-06-18

This paper describes a speaker diarization model based on target-speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied along the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal ...

10.1109/icassp49357.2023.10095185 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs achieve higher recognition accuracy than deep feed-forward networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training, which incorporates speaker information during training to normalise speaker variability. We first present several architectures, and then ...

10.1109/icassp.2016.7472685 article EN 2016-03-01
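
One of the simplest speaker-aware training architectures appends a fixed utterance-level speaker vector (e.g., an i-vector) to every frame of acoustic features; whether this matches the specific architectures studied in the paper is an assumption, and the sketch below is only an illustration of the idea.

```python
import numpy as np

def speaker_aware_inputs(frames, speaker_vec):
    """Augment each frame with an utterance-level speaker representation.

    frames:      (T, D) acoustic features for one utterance.
    speaker_vec: (S,) speaker vector, e.g. an i-vector.
    returns:     (T, D + S) network input; the constant speaker component
                 lets the LSTM learn to normalise speaker variability.
    """
    T = frames.shape[0]
    return np.concatenate([frames, np.tile(speaker_vec, (T, 1))], axis=1)
```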

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. Constraints imposed by the dynamic features (i.e., the time derivatives of the coefficients) are used to enhance the smoothness of the predicted coefficient trajectories ...

10.1186/s13634-015-0300-4 article EN cc-by EURASIP Journal on Advances in Signal Processing 2016-01-13
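
In such feature-mapping setups, the DNN input is typically a spliced context window of reverberant frames, since reverberation smears energy across time, while the training target is the corresponding clean frame from the parallel corpus. A hypothetical sketch of the input construction:

```python
import numpy as np

def splice_context(reverb_logmag, left=5, right=5):
    """Build DNN inputs for clean-feature mapping (illustrative sketch).

    reverb_logmag: (T, D) log-magnitude spectrum of the reverberant signal.
    Each frame is spliced with +/- context frames so the network can undo
    temporal smearing; the target for frame t is the clean frame t.
    """
    T, D = reverb_logmag.shape
    padded = np.pad(reverb_logmag, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel()
                     for t in range(T)])   # (T, (left+right+1) * D)
```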

This paper presents a deep neural network (DNN) approach to sentence boundary detection in broadcast news. We extract prosodic and lexical features at each inter-word position in the transcripts and learn a sequential classifier to label these positions as either boundary or non-boundary. The work is realized by a hybrid DNN-CRF (conditional random field) architecture. The DNN accepts the feature inputs and non-linearly maps them into boundary/non-boundary posterior probability outputs. Subsequently, the posterior probabilities are ...

10.21437/interspeech.2014-599 article EN Interspeech 2014 2014-09-14

Spoofing detection, which discriminates spoofed speech from natural speech, has gained much attention recently. Low-dimensional features that are used in speaker recognition/verification are also used in spoofing detection. Unfortunately, they do not capture sufficient information required for spoofing detection. In this work, we investigate the use of high-dimensional features, which may be more sensitive to the artifacts in spoofed speech. Six types of high-dimensional features are employed. For each kind of feature, four different representations are extracted, i.e., the original ...

10.1109/icassp.2016.7472051 article EN 2016-03-01

While recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has achieved significant improvement, their performance for complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input mixture signal is firstly converted to a set of beamformed signals using fixed beam patterns. For the beamforming, we use differential beamformers, as they are more suitable ...

10.1109/asru.2017.8268969 article EN 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017-12-01

We propose strategies for a state-of-the-art keyword search (KWS) system developed by the SINGA team in the context of the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) using conversational Tamil provided by the IARPA Babel program. To tackle the low-resource challenges and the rich morphological nature of Tamil, we present the highlights of our current KWS system, including: (1) submodular optimization for data selection to maximize acoustic diversity through Gaussian component indexed N-grams; (2) keyword-aware language ...

10.1109/icassp.2015.7178996 article EN 2015-04-01

Overlapped speech is one of the main challenges in conversational speech applications such as meeting transcription. Blind speech separation and speech extraction are two common approaches to this problem. Both of them, however, suffer from limitations resulting from the lack of the ability either to leverage additional information or to process multiple speakers simultaneously. In this work, we propose a novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus solves their problems. SSUSI makes use of ...

10.1109/asru46091.2019.9003884 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01