Xiong Xiao

ORCID: 0009-0001-5128-6518
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Probabilistic and Robust Engineering Design
  • Speech and dialogue systems
  • Structural Health Monitoring Techniques
  • Geotechnical Engineering and Analysis
  • Advanced Adaptive Filtering Techniques
  • Frequency Control in Power Systems
  • Concrete Corrosion and Durability
  • Infrastructure Maintenance and Monitoring
  • Geotechnical Engineering and Soil Stabilization
  • Blind Source Separation Techniques
  • Acoustic Wave Phenomena Research
  • Algorithms and Data Compression
  • Time Series Analysis and Forecasting
  • Electric Power System Optimization
  • Microgrid Control and Optimization
  • Web Data Mining and Analysis
  • Grouting, Rheology, and Soil Mechanics
  • Advanced Text Analysis Techniques
  • Optimal Power Flow Distribution
  • Matrix Theory and Algorithms

Guangxi University
2023-2025

China Southern Power Grid (China)
2022-2025

Central South University
2010-2025

Microsoft (United States)
2018-2024

Tsinghua University
2022-2023

Jilin University
2021-2023

Zhongshan Hospital of Xiamen University
2023

Fudan University
2023

Hainan University
2022

China Electric Power Research Institute
2019-2022

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, ...

10.1109/jstsp.2022.3188113 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-04
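
The denoising half of that objective hinges on how training inputs are corrupted. Below is a minimal, hypothetical sketch (not the released WavLM pipeline) of producing one such example: a clean waveform is overlapped with a distractor at a chosen SNR, spans of frames are masked, and the prediction targets at masked positions still come from the clean signal. The function name, the 20 ms frame hop, and all parameter values are illustrative assumptions.

```python
import numpy as np

def make_wavlm_style_example(clean, distractor, mask_prob=0.065,
                             mask_span=10, mix_snr_db=5.0, seed=0):
    """Corrupt an utterance for joint masked-prediction + denoising pre-training.

    clean, distractor: 1-D waveform arrays. The model sees the overlapped
    mixture but is asked to predict (pseudo-)labels of the CLEAN signal at
    masked positions, so it must denoise and predict at the same time.
    """
    rng = np.random.default_rng(seed)
    n = len(clean)
    d = np.resize(distractor, n)

    # Scale the distractor so the mixture has the requested SNR (simulated overlap).
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(d ** 2) * 10 ** (mix_snr_db / 10) + 1e-8))
    noisy = clean + gain * d

    # Span-based masking over frames (assumed 20 ms hop at 16 kHz).
    hop = 320
    n_frames = n // hop
    starts = rng.random(n_frames) < mask_prob
    mask = np.zeros(n_frames, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + mask_span] = True
    return noisy, mask  # targets at masked frames are derived from `clean`
```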

This paper presents a learning-based approach to the task of direction of arrival (DOA) estimation from microphone array input. Traditional signal processing methods, such as the classic least square (LS) method, rely on strong assumptions about signal models and accurate estimations of time delay of arrival (TDOA). They only work well in relatively clean conditions, but suffer from noise and reverberation distortions. In this paper, we propose a learning-based approach that can learn from a large amount of simulated noisy and reverberant microphone array inputs for robust DOA estimation. ...

10.1109/icassp.2015.7178484 article EN 2015-04-01
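
A common way to realize such a learning-based DOA estimator is to feed GCC-PHAT cross-correlation vectors, computed from simulated array signals, to a classifier over discretized azimuth angles. The sketch below computes that feature; the function name and parameter choices are assumptions for illustration, not necessarily the paper's exact front end.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024, max_lag=16):
    """GCC-PHAT cross-correlation between two microphone signals.

    Instead of picking the single peak (as TDOA-based LS methods do),
    a learning-based system can feed the whole correlation vector to a
    classifier over discretized DOA angles, which is more robust to
    noise and reverberation.
    """
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-8                # PHAT weighting
    cc = np.fft.irfft(cross, n_fft)
    # Keep lags -max_lag..+max_lag, centered on lag 0.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
```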

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies use pre-segmented audio signals, which are typically generated by mixing utterances on computers so that they fully overlap. Also, the algorithms have often been evaluated based on signal-based metrics such as the signal-to-distortion ratio. However, in natural conversations, speech signals contain both overlapped and overlap-free regions. In addition, signal-based metrics have only a weak correlation with automatic ...

10.1109/icassp40776.2020.9053426 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

Although advances in close-talk speech recognition have resulted in relatively low error rates, the recognition performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, which is especially difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel ...

10.1109/slt.2018.8639593 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques that distinguish synthetic from human speech is necessary. In this study, we continue the quest to discriminate synthetic from human speech. Motivated by the facts that current analysis-synthesis techniques operate on the frame level and make a frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation ...

10.1109/icassp.2013.6639067 article EN IEEE International Conference on Acoustics, Speech and Signal Processing 2013-05-01
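
As a rough illustration of the feature family involved, the sketch below computes a magnitude modulation spectrum: the FFT along time of each frequency band's log-magnitude trajectory. Frame-by-frame synthesis tends to disturb exactly this long-term temporal structure. All names and parameter values are assumptions, not the paper's configuration.

```python
import numpy as np

def modulation_spectrum(signal, frame_len=400, hop=160, mod_fft=64):
    """Magnitude modulation spectrum of a waveform.

    Each column of the log-magnitude spectrogram is a band trajectory;
    its FFT along time describes the band's temporal structure, which
    frame-independent analysis-synthesis tends to flatten or distort.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    logspec = np.log(spec + 1e-8)            # (frames, freq bins)
    traj = logspec - logspec.mean(axis=0)    # remove per-band DC
    # Modulation spectrum per frequency bin: spectrum of its trajectory.
    return np.abs(np.fft.rfft(traj, n=mod_fft, axis=0))  # (mod bins, freq bins)
```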

The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been used almost exclusively because previously proposed speech separation techniques have critical constraints for application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and ...

10.21437/interspeech.2018-2284 preprint EN Interspeech 2018 2018-08-28

Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCM) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNN). Specifically, our methods include training an RNN ...

10.1109/icassp.2017.7952756 article EN 2017-03-01
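
To make the mask/SCM relationship concrete, here is a minimal NumPy sketch of mask-based MVDR at a single frequency bin, using the common reference-channel formulation w = (Φ_n⁻¹Φ_s / tr(Φ_n⁻¹Φ_s)) u; in the paper's setting the mask values would come from the RNN. Function and variable names are illustrative assumptions.

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask, ref=0):
    """Mask-based MVDR beamformer at one frequency bin.

    Y:           (channels, frames) complex STFT coefficients.
    *_mask:      (frames,) TF-mask values in [0, 1].
    ref:         index of the reference microphone.
    """
    # Mask-weighted spatial covariance matrices (SCMs).
    phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    # w = Phi_n^{-1} Phi_s u / tr(Phi_n^{-1} Phi_s)
    num = np.linalg.solve(phi_n, phi_s)
    w = num[:, ref] / (np.trace(num) + 1e-8)
    return w.conj() @ Y      # (frames,) beamformed output
```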

The utterance-level permutation invariant training (uPIT) technique is a state-of-the-art deep learning architecture for speaker-independent multi-talker separation. uPIT solves the label ambiguity problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, it may be sub-optimal at the segmental level because the optimization is not calculated over individual frames. In this paper, we propose constrained uPIT (cuPIT) to solve this problem by computing a weighted MSE loss using dynamic information ...

10.1109/icassp.2018.8462471 article EN 2018-04-01
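
The uPIT objective itself is compact: compute the MSE under every output-target assignment and keep the best one, with that single permutation fixed over the whole utterance. A minimal sketch (illustrative, not the paper's code):

```python
import numpy as np
from itertools import permutations

def upit_mse(outputs, targets):
    """Utterance-level PIT loss.

    outputs, targets: lists of (frames, features) arrays, one per speaker.
    The minimum over permutations resolves the label ambiguity; fixing one
    permutation per utterance is what makes this "utterance level".
    """
    S = len(outputs)
    return min(
        np.mean([(outputs[i] - targets[p[i]]) ** 2 for i in range(S)])
        for p in permutations(range(S))
    )
```

Per the abstract, cuPIT then replaces this plain MSE with a weighted MSE informed by dynamic information, to penalize segmental errors that the utterance-level criterion alone can miss.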

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method ...

10.1109/icassp39728.2021.9413832 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Neural network-based speech separation has received a surge of interest in recent years. Previously proposed methods are either speaker independent or extract a target speaker's voice by using his or her voice snippet. In applications such as home devices or office meeting transcription, a list of possible speakers is available, which can be leveraged for speech separation. This paper proposes a novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or enrollment data, in addition to that of the target speaker. ...

10.1109/icassp.2019.8682245 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the overlapping utterances. Compared with prior models, ...

10.21437/interspeech.2022-7 article EN Interspeech 2022 2022-09-16
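
The serialization rule is easy to state in code. The sketch below (a hypothetical helper, shown for the two-speaker case) sorts tokens by emission time and inserts a channel-change token whenever the speaker changes, which is what lets a single output branch represent overlapping speech.

```python
def serialize_t_sot(timed_tokens, cc_token="<cc>"):
    """Build a t-SOT-style training target (illustrative sketch).

    timed_tokens: list of (emission_time, speaker_id, token) tuples.
    Tokens are emitted in chronological order; a channel-change token
    marks every switch between the "virtual" output channels.
    """
    out, prev = [], None
    for _, spk, tok in sorted(timed_tokens):
        if prev is not None and spk != prev:
            out.append(cc_token)
        out.append(tok)
        prev = spk
    return out

# e.g. [(0.0, 'A', 'hello'), (0.4, 'B', 'hi'), (0.5, 'A', 'world')]
#  ->  ['hello', '<cc>', 'hi', '<cc>', 'world']
```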

In this paper, we study a novel technique that normalizes the modulation spectra of speech signals for robust speech recognition. The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal; hence, they describe the temporal structure of the features. The modulation spectra are distorted when the speech signal is corrupted by noise. We propose the temporal structure normalization (TSN) filter to reduce the noise effects by normalizing the modulation spectra to reference spectra. TSN is different from other normalization methods such as histogram equalization (HEQ) that only normalize the probability ...

10.1109/tasl.2008.2002082 article EN IEEE Transactions on Audio Speech and Language Processing 2008-10-22
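
A simplified frequency-domain rendition of the idea: per feature dimension, scale the modulation spectrum of the test trajectory toward a clean reference PSD and transform back. The actual TSN method designs a normalization filter rather than equalizing each utterance directly in the frequency domain, so treat this purely as a sketch; `ref_psd` and all names are assumptions.

```python
import numpy as np

def tsn_normalize(feats, ref_psd):
    """Match each feature trajectory's modulation PSD to a clean reference.

    feats:   (T, D) feature trajectories (e.g., MFCCs over time).
    ref_psd: (K, D) reference modulation PSD estimated from clean data.
    """
    T, D = feats.shape
    mean = feats.mean(axis=0)
    F = np.fft.rfft(feats - mean, axis=0)          # (T//2+1, D)
    psd = np.abs(F) ** 2
    # Interpolate the reference PSD onto this utterance's frequency grid.
    grid = np.linspace(0.0, 1.0, F.shape[0])
    src = np.linspace(0.0, 1.0, ref_psd.shape[0])
    refg = np.stack([np.interp(grid, src, ref_psd[:, d]) for d in range(D)],
                    axis=1)
    gain = np.sqrt(refg / (psd + 1e-8))            # magnitude equalizer
    return np.fft.irfft(F * gain, n=T, axis=0) + mean
```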

In this study, we develop the keyword spotting (KWS) and acoustic model (AM) components in a far-field speaker system. Specifically, we use teacher-student (T/S) learning to adapt a well-trained close-talk production AM to the far field by using parallel close-talk and simulated far-field data. We also use T/S learning to compress a large-size KWS model into a small-size one to fit the device's computational cost. Without the need for transcription, T/S learning makes good use of untranscribed data to boost the performance of both adaptation and compression. We further optimize the models with sequence ...

10.1109/icassp.2018.8462209 article EN 2018-04-01
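
The core of T/S learning is a transcription-free loss: the student, fed the simulated far-field signal, is trained toward the posteriors the teacher produces from the parallel close-talk signal. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def ts_loss(teacher_logits, student_logits):
    """Teacher-student adaptation loss (a sketch).

    Cross-entropy between the teacher's output posteriors (computed on the
    close-talk signal) and the student's posteriors (computed on the parallel
    far-field signal). No transcription appears anywhere, which is why
    untranscribed data can be used for both adaptation and compression.
    """
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```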

A nickel-catalyzed decarbonylative thioetherification of carboxylic acids with thiols was developed. Under the reaction conditions, benzoic acids, cinnamic acids, and benzylic carboxylic acids coupled with various thiols, including both aromatic and aliphatic ones, to produce the corresponding thioethers in up to 99% yields. Moreover, this reaction is applicable to the modification of bioactive molecules such as 3-methylflavone-8-carboxylic acid, probenecid, and flufenamic acid, and to the synthesis of the acaricide chlorbenside. These results well demonstrate the potential synthetic value of this new ...

10.1021/acs.joc.2c00866 article EN The Journal of Organic Chemistry 2022-06-18

This paper describes a speaker diarization model based on target-speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied along the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal ...

10.1109/icassp49357.2023.10095185 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs achieve higher recognition accuracy than deep feed-forward networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training, which incorporates speaker information during training to normalise speaker variability. We first present several architectures, and then ...

10.1109/icassp.2016.7472685 article EN 2016-03-01
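
One of the simplest speaker-aware training architectures appends a fixed utterance-level speaker vector (e.g., an i-vector) to every frame of acoustic features; whether this matches the specific architectures studied in the paper is an assumption, and the sketch below is only an illustration of the idea.

```python
import numpy as np

def speaker_aware_inputs(frames, speaker_vec):
    """Augment each frame with an utterance-level speaker representation.

    frames:      (T, D) acoustic features for one utterance.
    speaker_vec: (S,) speaker vector, e.g. an i-vector.
    returns:     (T, D + S) network input; the constant speaker component
                 lets the LSTM learn to normalise speaker variability.
    """
    T = frames.shape[0]
    return np.concatenate([frames, np.tile(speaker_vec, (T, 1))], axis=1)
```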

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. Constraints imposed by the dynamic features (i.e., the time derivatives of the coefficients) are used to enhance the smoothness of the predicted coefficient trajectories ...

10.1186/s13634-015-0300-4 article EN cc-by EURASIP Journal on Advances in Signal Processing 2016-01-13
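
In such feature-mapping setups, the DNN input is typically a spliced context window of reverberant frames, since reverberation smears energy across time, while the training target is the corresponding clean frame from the parallel corpus. A hypothetical sketch of the input construction:

```python
import numpy as np

def splice_context(reverb_logmag, left=5, right=5):
    """Build DNN inputs for clean-feature mapping (illustrative sketch).

    reverb_logmag: (T, D) log-magnitude spectrum of the reverberant signal.
    Each frame is spliced with +/- context frames so the network can undo
    temporal smearing; the target for frame t is the clean frame t.
    """
    T, D = reverb_logmag.shape
    padded = np.pad(reverb_logmag, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel()
                     for t in range(T)])   # (T, (left+right+1) * D)
```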

This paper presents a deep neural network (DNN) approach to sentence boundary detection in broadcast news. We extract prosodic and lexical features at each inter-word position in the transcripts and learn a sequential classifier to label these positions as either boundary or non-boundary. The work is realized by a hybrid DNN-CRF (conditional random field) architecture. The DNN accepts the feature inputs and non-linearly maps them into boundary/non-boundary posterior probability outputs. Subsequently, the posterior probabilities are ...

10.21437/interspeech.2014-599 article EN Interspeech 2014 2014-09-14

Spoofing detection, which discriminates spoofed speech from natural speech, has gained much attention recently. Low-dimensional features that are used in speaker recognition/verification are also used in spoofing detection. Unfortunately, they do not capture sufficient information required for spoofing detection. In this work, we investigate the use of high-dimensional features, which may be more sensitive to the artifacts in spoofed speech. Six types of high-dimensional features are employed. For each kind of feature, four different representations are extracted, i.e., the original ...

10.1109/icassp.2016.7472051 article EN 2016-03-01

While recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has achieved significant improvement, their performance for complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input mixture signal is firstly converted to a set of beamformed signals using fixed beam patterns. For the beamforming, we use differential beamformers, as they are more suitable ...

10.1109/asru.2017.8268969 article EN 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017-12-01

We propose strategies for a state-of-the-art keyword search (KWS) system developed by the SINGA team in the context of the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) using conversational Tamil provided by the IARPA Babel program. To tackle the low-resource challenges and the rich morphological nature of Tamil, we present the highlights of our current KWS system, including: (1) submodular optimization for data selection to maximize acoustic diversity through Gaussian component indexed N-grams; (2) keyword-aware language ...

10.1109/icassp.2015.7178996 article EN 2015-04-01

Overlapped speech is one of the main challenges in conversational speech applications such as meeting transcription. Blind speech separation and speech extraction are two common approaches to this problem. Both of them, however, suffer from limitations resulting from the lack of the ability either to leverage additional information or to process multiple speakers simultaneously. In this work, we propose a novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus solves their problems. SSUSI makes use of ...

10.1109/asru46091.2019.9003884 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01