- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Advanced Neural Network Applications
- Neural Networks and Applications
- Speech and Dialogue Systems
- Topic Modeling
- Machine Learning and ELM
- Domain Adaptation and Few-Shot Learning
- Stochastic Gradient Optimization Techniques
- Sparse and Compressive Sensing Techniques
- Advanced Data Compression Techniques
- Explainable Artificial Intelligence (XAI)
- Stock Market Forecasting Methods
- Human Pose and Action Recognition
- Imbalanced Data Classification Techniques
- Automated Road and Building Extraction
- Remote-Sensing Image Classification
- Remote Sensing and LiDAR Applications
- Evolutionary Algorithms and Applications
- Metaheuristic Optimization Algorithms Research
- Complex Network Analysis Techniques
- Generative Adversarial Networks and Image Synthesis
- Time Series Analysis and Forecasting
Tianjin Medical University
2024-2025
Tianjin Chest Hospital
2024-2025
Northwestern Polytechnical University
2021-2024
Shandong University of Science and Technology
2024
IBM (United States)
2013-2023
Yunnan University
2023
IBM Research - Thomas J. Watson Research Center
2007-2021
IBM Research (China)
2021
Altair Engineering (United States)
2021
Sohu (China)
2021
This paper investigates data augmentation for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity. Two approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for both deep neural networks (DNNs) and convolutional neural networks (CNNs). The approaches are focused on increasing speaker and speech variations of the limited training data such that the acoustic models trained with the augmented data are more robust to such variations. In addition, a two-stage data augmentation scheme based on a stacked...
One of the most difficult speech recognition tasks is accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved...
Learning with recurrent neural networks (RNNs) on long sequences is a notoriously difficult task. There are three major challenges: 1) complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneously tackles all of these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with diverse RNN cells....
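The defining idea of the DilatedRNN abstract above is the dilated recurrent skip connection: at dilation d, the hidden state at step t is computed from the state at step t-d rather than t-1. A minimal sketch of one such layer, assuming a plain tanh cell with toy weight matrices (names here are illustrative, not from the paper):

```python
import numpy as np

def dilated_rnn_layer(x, d, W_x, W_s):
    """One dilated-recurrent layer: s_t = tanh(W_x x_t + W_s s_{t-d}),
    i.e. the recurrence skips back d steps instead of one.
    x: (T, D) input sequence; returns (T, H) states."""
    T = x.shape[0]
    H = W_s.shape[0]
    s = np.zeros((T, H))
    for t in range(T):
        prev = s[t - d] if t >= d else np.zeros(H)  # no state before step d
        s[t] = np.tanh(x[t] @ W_x + prev @ W_s)
    return s
```

Stacking such layers with exponentially increasing d (1, 2, 4, ...) gives the multi-resolution structure the abstract refers to; within one layer the d independent sub-sequences can be processed in parallel.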
In Speech Emotion Recognition (SER), emotional characteristics often appear in diverse forms of energy patterns in spectrograms. Typical attention neural network classifiers for SER are usually optimized at a fixed attention granularity. In this paper, we apply multiscale area attention in a deep convolutional neural network to attend to emotional characteristics with varied granularities, and therefore the classifier can benefit from an ensemble of attentions at different scales. To deal with data sparsity, we conduct data augmentation with vocal tract length perturbation (VTLP) to improve...
This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights into how such representations are derived and used. First, we present a data sampling...
Data augmentation using label preserving transformations has been shown to be effective for training neural networks to make invariant predictions. In this paper we focus on data augmentation approaches for acoustic modeling with deep neural networks (DNNs) in automatic speech recognition (ASR). We first investigate a modified version of a previously studied approach based on vocal tract length perturbation (VTLP) and then propose a novel approach based on stochastic feature mapping (SFM) in a speaker adaptive feature space. Experiments were conducted on Bengali...
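The VTLP augmentation recurring in these abstracts warps the frequency axis of each training utterance by a random factor before the filterbank features are computed. A minimal sketch of the commonly used piecewise-linear warping function, assuming a 16 kHz sampling rate; the function and parameter names are illustrative, not taken from the paper:

```python
import numpy as np

def vtlp_warp_freqs(freqs, alpha, f_hi, sr=16000):
    """Piecewise-linear VTLP frequency warping.

    Below a boundary frequency, each frequency is scaled by the warp
    factor alpha; above it, the mapping is linear so that the Nyquist
    frequency maps to itself (keeping the warped axis in range).
    """
    nyq = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    warped = np.where(
        freqs <= boundary,
        freqs * alpha,                                   # linear scaling region
        nyq - (nyq - f_hi * min(alpha, 1.0))             # compress/stretch the
        / (nyq - boundary) * (nyq - freqs),              # remainder to Nyquist
    )
    return warped
```

In training, alpha is typically drawn at random per utterance from a narrow range around 1.0 (e.g. 0.9 to 1.1), and the warped frequencies are used to place the Mel filterbank, simulating speakers with different vocal tract lengths.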
This paper investigates data augmentation based on label-preserving transformations for deep convolutional neural network (CNN) acoustic modeling to deal with limited training data. We show how stochastic feature mapping (SFM) can be carried out when the CNN models use log-Mel features as input and compare it with vocal tract length perturbation (VTLP). Furthermore, a two-stage scheme based on a stacked architecture is proposed to combine VTLP and SFM as complementary approaches. Improved performance has been observed in...
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic...
Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval to Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show that score...
Aspect-based sentiment analysis (ABSA) is a task in natural language processing (NLP) that involves predicting the polarity towards a specific aspect of a text. Graph neural networks (GNNs) have been shown to be effective tools for such tasks, but current research often overlooks the affective information in the text, leading to irrelevant information being learned for the given aspects. To address this issue, we propose a novel GNN model, MHAKE-GCN, which is based on the graph convolutional network (GCN) and multi-head attention (MHA). Our model...
We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the task, the data on which the work was done, the four systems, and our approach to combination and search. We show that the combination of systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between an SGD step and an evolution step to improve the average fitness of the population. With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population are optimized with various SGD-based optimizers using distinct...
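The alternation described in the ESGD abstract can be sketched on a toy problem. This is a minimal stand-in, not the paper's method: the "SGD step" uses the exact gradient of a quadratic loss, and the "evolution step" is Gaussian mutation plus elitist truncation selection, which preserves the best individual across generations:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):
    """Toy loss to minimize: a quadratic bowl with optimum at 0."""
    return float(np.sum(w ** 2))

def esgd(pop, steps=5, lr=0.1, sigma=0.05):
    """Alternate a gradient step on every individual with an
    evolution step (mutate, then keep the fittest half of parents
    plus offspring, so the best fitness never degrades)."""
    for _ in range(steps):
        pop = [w - lr * 2 * w for w in pop]      # SGD step: grad of sum(w^2) is 2w
        offspring = [w + sigma * rng.standard_normal(w.shape) for w in pop]
        combined = pop + offspring               # elitist selection over both
        combined.sort(key=fitness)
        pop = combined[: len(pop)]
    return pop

pop0 = [rng.standard_normal(3) for _ in range(4)]
best0 = min(fitness(w) for w in pop0)
pop1 = esgd(pop0)
best1 = min(fitness(w) for w in pop1)
```

Because selection always retains the fittest individuals from the combined parent/offspring pool, `best1` can never be worse than `best0`, mirroring the non-degradation guarantee in the abstract.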
Data privacy and protection is a crucial issue for any automatic speech recognition (ASR) service provider when dealing with clients. In this paper, we investigate federated acoustic modeling using data from multiple clients. A client's data is stored on a local data server and the clients communicate only model parameters with a central server, not their data. The communication happens infrequently to reduce the communication cost. To mitigate the non-iid issue, client adaptive federated training (CAFT) is proposed to canonicalize data across clients. The experiments are carried...
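The central-server aggregation implied by the federated setup above is usually a data-size-weighted average of client parameters (FedAvg style). A generic sketch, assuming each client reports a flat parameter vector and its number of training samples; this illustrates the pattern, not the paper's exact training procedure:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters.

    Only parameters and sample counts reach the server; the raw
    speech data never leaves the clients.
    """
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```

The server broadcasts the averaged parameters back, and clients resume local training; doing several local epochs between rounds is what keeps the communication infrequent.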
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing methods do not scale well to large systems (due to gradient build-up) and/or fail to evaluate model fidelity (test accuracy) on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient...
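A common building block of the sparsified-gradient methods this abstract discusses is top-k compression with error feedback: transmit only the k largest-magnitude gradient entries and carry the rest forward as a residual. A generic sketch of that pattern, not the paper's specific algorithm:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad; return
    (compressed, residual). The residual is typically added to the
    next step's gradient (error feedback) so no information is lost,
    only delayed."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k magnitudes
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    residual = flat - compressed
    return compressed.reshape(grad.shape), residual.reshape(grad.shape)
```

The "gradient build-up" issue the abstract mentions arises because different workers select different top-k index sets, so the aggregated gradient grows denser as workers are added.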
A feature compensation (FC) algorithm based on polynomial regression of utterance signal-to-noise ratio (SNR) for noise robust automatic speech recognition (ASR) is proposed. In this algorithm, the bias between clean and noisy speech features is approximated by a set of polynomials which are estimated from adaptation data in the new environment by the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. In ASR, the SNR of each test signal is first estimated and the features are then compensated by the polynomials. The compensated features are decoded via acoustic HMMs trained with...
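The core regression idea above can be illustrated on synthetic data. Note the hedge: the paper estimates the polynomials with EM under an ML criterion inside an HMM framework, whereas this sketch substitutes a plain least-squares fit on a scalar feature, purely to show the compensate-by-predicted-bias step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic adaptation data: the clean-vs-noisy feature bias
# shrinks as utterance SNR rises (toy ground truth, not from the paper).
snr = rng.uniform(0, 30, 200)
true_bias = 0.5 - 0.02 * snr
noisy_minus_clean = true_bias + 0.01 * rng.standard_normal(200)

# Fit the bias as a polynomial in SNR (least-squares stand-in for EM/ML).
coefs = np.polyfit(snr, noisy_minus_clean, deg=2)

def compensate(noisy_feat, utt_snr):
    """Subtract the polynomial-predicted bias from a noisy feature."""
    return noisy_feat - np.polyval(coefs, utt_snr)
```

At test time, the utterance SNR is estimated first, the predicted bias is removed, and the compensated features are passed to the recognizer.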
Automatic speech recognition is a core component of many applications, including keyword search. In this paper we describe experiments on acoustic modeling, language modeling and decoding for keyword search on a Cantonese conversational telephony corpus collected as part of the IARPA Babel program. We show that acoustic modeling techniques such as the bootstrapped-and-restructured model and the deep neural network model significantly outperform a state-of-the-art baseline GMM/HMM model, in terms of both recognition performance and keyword search performance, with improvements of up...
In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard corpus (SWB2000), which is one of the most widely used datasets for ASR performance benchmarking. We first investigate what are the proper hyper-parameters (e.g., learning rate) to enable training with a sufficiently large batch size without impairing model accuracy. We then implement various distributed strategies, including...
Reparameterization techniques have demonstrated their efficacy in improving the efficiency of deep neural networks. However, their application has been largely confined to single-input network structures, leaving multi-input ones, commonly encountered in real-world applications, unexplored. In this paper, we formulate a reparameterization head (RepHead), the first framework designed to introduce reparameterization into multi-input networks. RepHead compresses multiple inputs into a single input and employs reconstruction operations to recover them, thereby...
To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noise are used for acoustic model training. Published hidden Markov modeling uses multiple Gaussian distributions to cover the spread of the feature distribution caused by noise, which distracts from modeling the speech event itself and possibly sacrifices performance on clean speech. In this paper, we propose a novel approach which extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling state emission parameters (mean and variance) as...
Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach to keyword search is to use an Automatic Speech Recognition (ASR) front end to produce a representation of the audio that can be indexed. The biggest drawback of this approach lies in its inability to deal with out-of-vocabulary words and query terms that are not in the ASR system output. In this paper we present an empirical study evaluating various approaches based on using confusion models as query expansion techniques to address this problem. We...
Current hidden Markov acoustic modeling for large-vocabulary continuous speech recognition (LVCSR) heavily relies on the availability of abundant labeled transcriptions. Given that labeling is both expensive and time-consuming while there is a huge amount of unlabeled data easily available nowadays, semi-supervised learning (SSL) from unlabeled data, aiming to reduce the development cost of LVCSR, becomes more important than ever. In this paper, a new SSL approach is proposed which exploits cross-view transfer through...