- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Advanced Neural Network Applications
- Neural Networks and Applications
- Speech and Dialogue Systems
- Topic Modeling
- Machine Learning and ELM
- Domain Adaptation and Few-Shot Learning
- Stochastic Gradient Optimization Techniques
- Sparse and Compressive Sensing Techniques
- Advanced Data Compression Techniques
- Explainable Artificial Intelligence (XAI)
- Stock Market Forecasting Methods
- Human Pose and Action Recognition
- Imbalanced Data Classification Techniques
- Automated Road and Building Extraction
- Remote-Sensing Image Classification
- Remote Sensing and LiDAR Applications
- Evolutionary Algorithms and Applications
- Metaheuristic Optimization Algorithms Research
- Complex Network Analysis Techniques
- Generative Adversarial Networks and Image Synthesis
- Time Series Analysis and Forecasting
Tianjin Medical University
2024-2025
Tianjin Chest Hospital
2024-2025
Northwestern Polytechnical University
2021-2024
Shandong University of Science and Technology
2024
IBM (United States)
2013-2023
Yunnan University
2023
IBM Research - Thomas J. Watson Research Center
2007-2021
IBM Research (China)
2021
Altair Engineering (United States)
2021
Sohu (China)
2021
This paper investigates data augmentation for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity. Two approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for both deep neural networks (DNNs) and convolutional neural networks (CNNs). The approaches are focused on increasing speaker and speech variations of the limited training data such that the acoustic models trained with the augmented data are more robust to such variations. In addition, a two-stage data augmentation scheme based on a stacked...
One of the most difficult speech recognition tasks is accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved...
Learning with recurrent neural networks (RNNs) on long sequences is a notoriously difficult task. There are three major challenges: 1) complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneously tackles all of these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with diverse RNN cells....
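The defining idea of the DilatedRNN abstract above is the dilated recurrent skip connection: at dilation d, the hidden state at step t is computed from the state at step t-d rather than t-1. A minimal sketch of one such layer, assuming a plain tanh cell with toy weight matrices (names here are illustrative, not from the paper):

```python
import numpy as np

def dilated_rnn_layer(x, d, W_x, W_s):
    """One dilated-recurrent layer: s_t = tanh(W_x x_t + W_s s_{t-d}),
    i.e. the recurrence skips back d steps instead of one.
    x: (T, D) input sequence; returns (T, H) states."""
    T = x.shape[0]
    H = W_s.shape[0]
    s = np.zeros((T, H))
    for t in range(T):
        prev = s[t - d] if t >= d else np.zeros(H)  # no state before step d
        s[t] = np.tanh(x[t] @ W_x + prev @ W_s)
    return s
```

Stacking such layers with exponentially increasing d (1, 2, 4, ...) gives the multi-resolution structure the abstract refers to; within one layer the d independent sub-sequences can be processed in parallel.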
In Speech Emotion Recognition (SER), emotional characteristics often appear in diverse forms of energy patterns in spectrograms. Typical attention neural network classifiers for SER are usually optimized at a fixed attention granularity. In this paper, we apply multiscale area attention in a deep convolutional neural network to attend to emotional characteristics with varied granularities, and therefore the classifier can benefit from an ensemble of attentions at different scales. To deal with data sparsity, we conduct data augmentation with vocal tract length perturbation (VTLP) to improve...
This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights into how such representations are derived and used. First, we present a data sampling...
Data augmentation using label preserving transformations has been shown to be effective for training neural networks to make invariant predictions. In this paper we focus on data augmentation approaches for acoustic modeling with deep neural networks (DNNs) in automatic speech recognition (ASR). We first investigate a modified version of a previously studied approach based on vocal tract length perturbation (VTLP) and then propose a novel approach based on stochastic feature mapping (SFM) in a speaker adaptive feature space. Experiments were conducted on Bengali...
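The VTLP augmentation recurring in these abstracts warps the frequency axis of each training utterance by a random factor before the filterbank features are computed. A minimal sketch of the commonly used piecewise-linear warping function, assuming a 16 kHz sampling rate; the function and parameter names are illustrative, not taken from the paper:

```python
import numpy as np

def vtlp_warp_freqs(freqs, alpha, f_hi, sr=16000):
    """Piecewise-linear VTLP frequency warping.

    Below a boundary frequency, each frequency is scaled by the warp
    factor alpha; above it, the mapping is linear so that the Nyquist
    frequency maps to itself (keeping the warped axis in range).
    """
    nyq = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    warped = np.where(
        freqs <= boundary,
        freqs * alpha,                                   # linear scaling region
        nyq - (nyq - f_hi * min(alpha, 1.0))             # compress/stretch the
        / (nyq - boundary) * (nyq - freqs),              # remainder to Nyquist
    )
    return warped
```

In training, alpha is typically drawn at random per utterance from a narrow range around 1.0 (e.g. 0.9 to 1.1), and the warped frequencies are used to place the Mel filterbank, simulating speakers with different vocal tract lengths.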
This paper investigates data augmentation based on label-preserving transformations for deep convolutional neural network (CNN) acoustic modeling to deal with limited training data. We show how stochastic feature mapping (SFM) can be carried out when the CNN models use log-Mel features as input and compare it with vocal tract length perturbation (VTLP). Furthermore, a two-stage scheme based on a stacked architecture is proposed to combine VTLP and SFM as complementary approaches. Improved performance has been observed in...
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic...
Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval to Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show that score...
Aspect-based sentiment analysis (ABSA) is a task in natural language processing (NLP) that involves predicting the polarity towards a specific aspect of a text. Graph neural networks (GNNs) have been shown to be effective tools for such tasks, but current research often overlooks the affective information in the text, leading to irrelevant information being learned for the given aspects. To address this issue, we propose a novel GNN model, MHAKE-GCN, which is based on the graph convolutional network (GCN) and multi-head attention (MHA). Our model...
We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the task, the data on which the work was done, the four systems, and our approach to combination and search. We show that the combination of systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between an SGD step and an evolution step to improve the average fitness of the population. With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population are optimized with various SGD-based optimizers using distinct...
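The alternation described in the ESGD abstract can be sketched on a toy problem. This is a minimal stand-in, not the paper's method: the "SGD step" uses the exact gradient of a quadratic loss, and the "evolution step" is Gaussian mutation plus elitist truncation selection, which preserves the best individual across generations:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):
    """Toy loss to minimize: a quadratic bowl with optimum at 0."""
    return float(np.sum(w ** 2))

def esgd(pop, steps=5, lr=0.1, sigma=0.05):
    """Alternate a gradient step on every individual with an
    evolution step (mutate, then keep the fittest half of parents
    plus offspring, so the best fitness never degrades)."""
    for _ in range(steps):
        pop = [w - lr * 2 * w for w in pop]      # SGD step: grad of sum(w^2) is 2w
        offspring = [w + sigma * rng.standard_normal(w.shape) for w in pop]
        combined = pop + offspring               # elitist selection over both
        combined.sort(key=fitness)
        pop = combined[: len(pop)]
    return pop

pop0 = [rng.standard_normal(3) for _ in range(4)]
best0 = min(fitness(w) for w in pop0)
pop1 = esgd(pop0)
best1 = min(fitness(w) for w in pop1)
```

Because selection always retains the fittest individuals from the combined parent/offspring pool, `best1` can never be worse than `best0`, mirroring the non-degradation guarantee in the abstract.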
Data privacy and protection is a crucial issue for any automatic speech recognition (ASR) service provider when dealing with clients. In this paper, we investigate federated acoustic modeling using data from multiple clients. A client's data is stored on a local data server and the clients communicate only model parameters with a central server, not their data. The communication happens infrequently to reduce the communication cost. To mitigate the non-iid issue, client adaptive federated training (CAFT) is proposed to canonicalize data across clients. The experiments are carried...
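The central-server aggregation implied by the federated setup above is usually a data-size-weighted average of client parameters (FedAvg style). A generic sketch, assuming each client reports a flat parameter vector and its number of training samples; this illustrates the pattern, not the paper's exact training procedure:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters.

    Only parameters and sample counts reach the server; the raw
    speech data never leaves the clients.
    """
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```

The server broadcasts the averaged parameters back, and clients resume local training; doing several local epochs between rounds is what keeps the communication infrequent.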
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing methods do not scale well to large systems (due to gradient build-up) and/or fail to evaluate model fidelity (test accuracy) on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient...
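A common building block of the sparsified-gradient methods this abstract discusses is top-k compression with error feedback: transmit only the k largest-magnitude gradient entries and carry the rest forward as a residual. A generic sketch of that pattern, not the paper's specific algorithm:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad; return
    (compressed, residual). The residual is typically added to the
    next step's gradient (error feedback) so no information is lost,
    only delayed."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k magnitudes
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    residual = flat - compressed
    return compressed.reshape(grad.shape), residual.reshape(grad.shape)
```

The "gradient build-up" issue the abstract mentions arises because different workers select different top-k index sets, so the aggregated gradient grows denser as workers are added.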
A feature compensation (FC) algorithm based on polynomial regression of utterance signal-to-noise ratio (SNR) for noise robust automatic speech recognition (ASR) is proposed. In this algorithm, the bias between clean and noisy speech features is approximated by a set of polynomials which are estimated from adaptation data in the new environment by the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. In ASR, the SNR of each test signal is first estimated and the features are then compensated by the polynomials. The compensated features are decoded via acoustic HMMs trained with...
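The core regression idea above can be illustrated on synthetic data. Note the hedge: the paper estimates the polynomials with EM under an ML criterion inside an HMM framework, whereas this sketch substitutes a plain least-squares fit on a scalar feature, purely to show the compensate-by-predicted-bias step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic adaptation data: the clean-vs-noisy feature bias
# shrinks as utterance SNR rises (toy ground truth, not from the paper).
snr = rng.uniform(0, 30, 200)
true_bias = 0.5 - 0.02 * snr
noisy_minus_clean = true_bias + 0.01 * rng.standard_normal(200)

# Fit the bias as a polynomial in SNR (least-squares stand-in for EM/ML).
coefs = np.polyfit(snr, noisy_minus_clean, deg=2)

def compensate(noisy_feat, utt_snr):
    """Subtract the polynomial-predicted bias from a noisy feature."""
    return noisy_feat - np.polyval(coefs, utt_snr)
```

At test time, the utterance SNR is estimated first, the predicted bias is removed, and the compensated features are passed to the recognizer.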
Automatic speech recognition is a core component of many applications, including keyword search. In this paper we describe experiments on acoustic modeling, language modeling and decoding for keyword search on a Cantonese conversational telephony corpus collected as part of the IARPA Babel program. We show that acoustic modeling techniques such as the bootstrapped-and-restructured model and the deep neural network model significantly outperform a state-of-the-art baseline GMM/HMM model, in terms of both recognition performance and keyword search performance, with improvements of up...
In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard corpus (SWB2000), which is one of the most widely used datasets for ASR performance benchmarking. We first investigate what are the proper hyper-parameters (e.g., learning rate) to enable training with a sufficiently large batch size without impairing model accuracy. We then implement various distributed strategies, including...
Reparameterization techniques have demonstrated their efficacy in improving the efficiency of deep neural networks. However, their application has been largely confined to single-input network structures, leaving multi-input ones, commonly encountered in real-world applications, unexplored. In this paper, we formulate a reparameterization head (RepHead), the first framework designed to introduce reparameterization into multi-input networks. RepHead compresses multiple inputs into a single input and employs reconstruction operations to recover them, thereby...
To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noise are used for acoustic model training. Published hidden Markov modeling uses multiple Gaussian distributions to cover the spread of the feature distribution caused by noise, which distracts from modeling the speech event itself and possibly sacrifices performance on clean speech. In this paper, we propose a novel approach which extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling state emission parameters (mean and variance) as...
Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach to keyword search is to use an Automatic Speech Recognition (ASR) front end to produce a representation of the audio that can be indexed. The biggest drawback of this approach lies in its inability to deal with out-of-vocabulary words and query terms that are not in the ASR system output. In this paper we present an empirical study evaluating various approaches based on using confusion models as query expansion techniques to address this problem. We...
Current hidden Markov acoustic modeling for large-vocabulary continuous speech recognition (LVCSR) heavily relies on the availability of abundant labeled transcriptions. Given that labeling is both expensive and time-consuming while there is a huge amount of unlabeled data easily available nowadays, semi-supervised learning (SSL) from unlabeled data, aiming to reduce the development cost of LVCSR, becomes more important than ever. In this paper, a new SSL approach is proposed which exploits cross-view transfer through...