- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Dialogue Systems
- Advanced Data Compression Techniques
- Neural Networks and Applications
- Time Series Analysis and Forecasting
- Video Analysis and Summarization
- Advanced Adaptive Filtering Techniques
- Advanced Text Analysis Techniques
- Digital Media Forensic Detection
- Phonetics and Phonology Research
- Algorithms and Data Compression
- Anomaly Detection Techniques and Applications
- Image and Signal Denoising Methods
- Dispute Resolution and Class Actions
- AI in Service Interactions
- Wireless Communication Networks Research
- Handwritten Text Recognition Techniques
- Bayesian Methods and Mixture Models
- Wireless Signal Modulation Classification
- Emotion and Mood Recognition
- Web Data Mining and Analysis
Brno University of Technology
2016-2025
Edip (Czechia)
2022
UniLaSalle Amiens (ESIEE-Amiens)
2002
Université Gustave Eiffel
2002
A new recurrent neural network based language model (RNN LM) with applications to speech recognition is presented. Results indicate that it is possible to obtain around 50% reduction of perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, even when the backoff model is trained on much more data than the RNN LM. We provide ample empirical evidence to suggest...
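The model described above can be illustrated with a minimal Elman-style RNN LM forward step. This is a toy sketch, not the authors' implementation: all sizes, weight initializations, and names (`rnnlm_step`, `U`, `W`, `V`) are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(w_idx, h_prev, U, W, V):
    """One step of an Elman-style RNN LM: the hidden state combines the current
    word (a column of U, i.e. a one-hot input projection) with the previous
    hidden state; the output is a distribution over the vocabulary."""
    h = 1.0 / (1.0 + np.exp(-(U[:, w_idx] + W @ h_prev)))  # sigmoid hidden layer
    y = softmax(V @ h)                                      # next-word distribution
    return h, y

rng = np.random.default_rng(0)
vocab, hidden = 10, 4
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))

h = np.zeros(hidden)
for w in [3, 7, 1]:          # toy word-index sequence
    h, y = rnnlm_step(w, h, U, W, V)
```

In a real system the vocabulary is tens of thousands of words, which is exactly why the follow-up work below focuses on reducing the cost of the output layer.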
We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is its computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show the importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end,...
We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as part of the neural network model. This leads to a significant reduction of computational complexity. We achieved around 10% relative reduction of word error rate on an English Broadcast News speech recognition task, against a 4-gram model trained on 400M tokens.
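The hash-based maximum entropy idea above can be sketched in a few lines: n-gram features are mapped into a fixed-size weight table by hashing, so the feature set never has to be enumerated. This is a simplified illustration under assumed names (`ngram_bucket`, `maxent_logit`); the actual model ties these weights into the neural network training.

```python
import numpy as np

TABLE_SIZE = 1 << 20  # fixed-size table; hash collisions are tolerated by design

def ngram_bucket(history, word, table_size=TABLE_SIZE):
    """Map one (history, word) n-gram feature to a weight index via hashing."""
    return hash((tuple(history), word)) % table_size

def maxent_logit(weights, history, word):
    """Score a word as the sum of hashed feature weights over all history
    suffixes (unigram, bigram, ... up to the full history)."""
    s = 0.0
    for n in range(len(history) + 1):
        s += weights[ngram_bucket(history[len(history) - n:], word)]
    return s

weights = np.zeros(TABLE_SIZE)
weights[ngram_bucket((5, 9), 2)] = 1.5   # pretend training set this weight
```

Because the table size is fixed, memory does not grow with the number of distinct n-grams, at the cost of occasional collisions.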
In recent years, probabilistic features became an integral part of state-of-the-art LVCSR systems. In this work, we are exploring the possibility of obtaining the features directly from a neural net without the necessity of converting output probabilities to features suitable for a subsequent GMM-HMM system. We experimented with a 5-layer MLP with a bottle-neck in the middle layer. After training such a net, we used the bottle-neck outputs as features for recognition. The benefits are twofold: first, an improvement was gained when using these features instead of probabilistic features; second, the size of the system...
We present results obtained with several advanced language modeling techniques, including a class based model, cache model, maximum entropy model, structured language model, random forest model and several types of neural network models. We show results after combining all these models by using linear interpolation. We conclude that for both small and moderately sized tasks, we obtain a new state of the art with the combination of models, which is significantly better than the performance of any individual model. Obtained perplexity reductions against a Good-Turing trigram...
The processing of speech corrupted by interfering overlapping speakers is one of the most challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for...
Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers, because nontarget speech signals share similar characteristics with the target speech, complicating their...
This paper describes and discusses the "STBU" speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (Stellenbosch, South Africa), TNO (Soesterberg, The Netherlands), BUT (Brno, Czech Republic), and the University of Stellenbosch (Stellenbosch, South Africa). The STBU system was a combination of three main kinds of subsystems: 1) GMM, with short-time Mel frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features,...
This paper deals with phoneme recognition based on neural networks (NN). First, several approaches to improve the phoneme error rate are suggested and discussed. In the experimental part, we concentrate on TempoRAl Patterns (TRAPs) and novel split temporal context (STC) phoneme recognizers. We also investigate into tandem NN architectures. The results of the final system, reported on the standard TIMIT database, compare favorably to the best published results.
In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive...
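The baseline splicing step above (13-dimensional MFCCs stacked across 9 frames, i.e. ±4 frames of context, before LDA) can be sketched as follows. This is a generic illustration, not the paper's code; edge handling by frame repetition is an assumption.

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with its +/-context neighbours (edge frames padded by
    repetition), turning (T, 13) MFCCs into (T, (2*context+1)*13) = (T, 117)
    vectors -- the input that LDA then reduces to 40 dimensions."""
    T, d = feats.shape
    padded = np.vstack([feats[:1]] * context + [feats] + [feats[-1:]] * context)
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

mfcc = np.random.default_rng(1).normal(size=(100, 13))  # toy 100-frame utterance
spliced = splice(mfcc)
```

The center 13 columns of each spliced vector are the original frame, with the temporal context laid out on either side.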
In this paper, we describe recent progress in i-vector based speaker verification. The use of universal background models (UBM) with full-covariance matrices is suggested and thoroughly experimentally tested. The i-vectors are scored using a simple cosine distance and advanced techniques such as Probabilistic Linear Discriminant Analysis (PLDA) and the heavy-tailed variant of PLDA (PLDA-HT). Finally, we investigate into dimensionality reduction of i-vectors before entering the PLDA-HT modeling. The results are very competitive: on...
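Of the scoring methods mentioned, cosine distance is the simplest and is easy to sketch: two i-vectors are compared by the cosine of the angle between them, and a threshold on that score gives the verification decision. The 400-dimensional toy vectors below are illustrative only.

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine-distance scoring of two i-vectors: higher = more likely the same
    speaker. PLDA replaces this with a proper probabilistic model."""
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

rng = np.random.default_rng(2)
enroll = rng.normal(size=400)               # toy enrollment i-vector
same = enroll + 0.1 * rng.normal(size=400)  # small perturbation ~ same speaker
diff = rng.normal(size=400)                 # unrelated vector ~ different speaker
```

In this toy setup `cosine_score(enroll, same)` comes out far higher than `cosine_score(enroll, diff)`, which is all the verification decision needs.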
In this paper, several feature extraction and channel compensation techniques found in state-of-the-art speaker verification systems are analyzed and discussed. For the NIST SRE 2006 submission, cepstral mean subtraction, feature warping, RelAtive SpecTrAl (RASTA) filtering, heteroscedastic linear discriminant analysis (HLDA), feature mapping, and eigenchannel adaptation were incrementally added to minimize the system's...
This paper presents BUT ReverbDB - a dataset of real room impulse responses (RIR), background noises and re-transmitted speech data. The re-transmitted data includes LibriSpeech test-clean, 2000 HUB5 English evaluation and part of the 2010 NIST Speaker Recognition Evaluation datasets. We provide a detailed description of the RIR collection (hardware, software, post-processing) that can serve as a "cook-book" for similar efforts. We also validate the dataset in two sets of automatic speech recognition (ASR) experiments and draw conclusions...
Recently, several nonparametric Bayesian models have been proposed to automatically discover acoustic units in unlabeled data. Most of them are trained using various versions of the Gibbs Sampling (GS) method. In this work, we consider Variational Bayes (VB) as an alternative inference process. Even though VB yields only an approximate solution of the posterior distribution, it can be easily parallelized, which makes it more suitable for large databases. Results show that, notwithstanding that VB is an order of magnitude faster,...
We present a novel technique for discriminative feature-level adaptation of an automatic speech recognition system. The concept of iVectors, popular in Speaker Recognition, is used to extract information about the speaker or acoustic environment from a speech segment. The iVector is a low-dimensional fixed-length vector representing such information. To utilize iVectors for adaptation, Region Dependent Linear Transforms (RDLT) are discriminatively trained using the MPE criterion on a large amount of annotated data to extract the relevant information and compensate...
This work studies the usage of Deep Neural Network (DNN) Bottleneck (BN) features together with traditional MFCC features in the task of i-vector-based speaker recognition. We decouple the sufficient statistics extraction by using separate GMM models for frame alignment and for normalization, and we analyze the use of BN and MFCC features (and their concatenation) in the two stages. We also show the effect of full-covariance models and, as a contrast, compare the result to a recent DNN-alignment approach. On NIST SRE2010, telephone condition, we achieve a 60% relative gain over...
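The decoupling described above hinges on how sufficient statistics are collected: per-frame posteriors (from an alignment model run on one feature stream) weight the accumulation of statistics over a possibly different feature stream. A minimal sketch of zeroth- and first-order statistics collection, with all shapes and names (`collect_stats`) assumed for illustration:

```python
import numpy as np

def collect_stats(post, feats):
    """Baum-Welch sufficient statistics for C components over a T-frame
    utterance: post is (T, C) frame posteriors from the alignment model,
    feats is (T, d) features being modeled -- the two streams need not match,
    which is the decoupling explored in the paper."""
    N = post.sum(axis=0)   # (C,)  zeroth-order stats: soft counts per component
    F = post.T @ feats     # (C, d) first-order stats: posterior-weighted sums
    return N, F

rng = np.random.default_rng(3)
T, C, d = 50, 8, 13
logits = rng.normal(size=(T, C))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # toy posteriors
N, F = collect_stats(post, rng.normal(size=(T, d)))
```

Here the posteriors could come from a BN-feature GMM while `feats` are MFCCs (or their concatenation); the statistics feed the i-vector extractor either way.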
In this paper, we propose a DNN adaptation technique, where the i-vector extractor is replaced by a Sequence Summarizing Neural Network (SSNN). Similarly to the i-vector extractor, the SSNN produces a "summary vector" representing an acoustic summary of an utterance. Such a vector is then appended to the input of the main network, while both networks are trained together by optimizing a single loss function. Both the i-vector and speaker adaptation methods are compared on AMI meeting data. The results show comparable performance of both techniques on the FBANK system with...
Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest in unsupervised and semi-supervised training of such models. This work builds upon recent results showing notable improvements from cycle-consistency and related techniques. Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In particular, this work proposes a new...