- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Bayesian Methods and Mixture Models
- Speech and Dialogue Systems
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Hearing Loss and Rehabilitation
- Handwritten Text Recognition Techniques
- Digital Media Forensic Detection
- Face Recognition and Analysis
- Wireless Signal Modulation Classification
- Image Processing and 3D Reconstruction
- Adversarial Robustness in Machine Learning
- Vehicle License Plate Recognition
- Data Management and Algorithms
- Multi-Agent Systems and Negotiation
- Emotion and Mood Recognition
- Domain Adaptation and Few-Shot Learning
- Gaussian Processes and Bayesian Inference
- Multimodal Machine Learning Applications
Athens University of Economics and Business
2023-2025
University of Nottingham
2017-2021
Brno University of Technology
2019
Computer Research Institute of Montréal
2013-2016
National Technical University of Athens
2009-2013
École Normale Supérieure - PSL
2013
École de Technologie Supérieure
2012-2013
Institute for Language and Speech Processing
2007-2011
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art.
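A minimal PyTorch sketch of the front-end/back-end combination the abstract describes (a spatiotemporal convolution, a per-frame residual trunk, and a bidirectional LSTM classifier). Layer sizes, the ResNet-18 trunk, and the mean-over-time readout are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a spatiotemporal-conv + ResNet + BiLSTM lipreader (assumed sizes).
import torch
import torch.nn as nn
import torchvision.models as models

class Lipreader(nn.Module):
    def __init__(self, num_words=500):
        super().__init__()
        # 3D convolution over (time, height, width) of the mouth-region clip
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D residual trunk applied per frame, yielding one vector per time step
        trunk = models.resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()
        self.trunk = trunk
        # bidirectional LSTM back-end over the frame-level features
        self.blstm = nn.LSTM(512, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_words)

    def forward(self, x):                       # x: (batch, 1, T, H, W)
        f = self.frontend(x)                    # (batch, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)     # (batch, T, 512)
        y, _ = self.blstm(f)
        return self.classifier(y.mean(dim=1))   # average over time, then classify

logits = Lipreader()(torch.randn(2, 1, 29, 112, 112))  # e.g. 29-frame clips
```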
Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly...
We examine the use of Deep Neural Networks (DNN) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model using the standard EM algorithm, the components are predefined and correspond to the set of triphone states, the posterior occupancy probabilities of which are modeled by a DNN. Those assignments are then combined with 60-dim MFCC features to calculate first-order Baum-Welch statistics, train the i-vector extractor and extract i-vectors. The DNN-based assignment...
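A schematic NumPy illustration of the statistics-accumulation step once a DNN supplies per-frame state posteriors in place of GMM-UBM alignments. Shapes, variable names, and the random stand-in posteriors are assumptions for illustration.

```python
# Baum-Welch statistics from DNN frame posteriors (schematic, assumed shapes).
import numpy as np

def baum_welch_stats(features, posteriors):
    """features: (T, D) acoustic frames, e.g. 60-dim MFCCs.
    posteriors: (T, C) DNN triphone-state posteriors per frame.
    Returns zeroth-order stats N (C,) and first-order stats F (C, D)."""
    N = posteriors.sum(axis=0)      # soft frame counts per component
    F = posteriors.T @ features     # posterior-weighted feature sums
    return N, F

T, D, C = 300, 60, 2000             # frames, feature dim, triphone states
feats = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(C), size=T)  # stand-in for DNN outputs
N, F = baum_welch_stats(feats, post)
# N and F are the sufficient statistics fed to the i-vector extractor.
```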
The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations, so that researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed-dimensional i-vector representation of utterances is ideal for working under such conditions, and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the problem. However, a more realistic approach seems to be...
Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem when diarizing spontaneous conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the algorithm are compared in an exhaustive practical study. We report state-of-the-art results as measured by the Diarization Error Rate and the Number...
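A toy NumPy sketch of iterative Mean Shift under the cosine distance, as one would apply it to length-normalized i-vectors: each point is shifted toward the mean of its cosine neighbourhood, and points converging to the same mode form a cluster. The threshold and merge criterion are assumptions, not the paper's settings.

```python
# Iterative Mean Shift with cosine distance (toy sketch, assumed threshold).
import numpy as np

def cosine_mean_shift(X, threshold=0.5, n_iter=10):
    """X: (N, D) i-vectors. Returns integer cluster labels per point."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # length-normalize
    modes = X.copy()
    for _ in range(n_iter):
        sims = modes @ X.T                               # cosine similarities
        weights = (sims > 1.0 - threshold).astype(float)
        modes = weights @ X / weights.sum(axis=1, keepdims=True)
        modes /= np.linalg.norm(modes, axis=1, keepdims=True)
    # merge points whose modes converged to (nearly) the same direction
    labels = -np.ones(len(X), dtype=int)
    next_label = 0
    for i in range(len(X)):
        for j in range(i):
            if modes[i] @ modes[j] > 0.99:
                labels[i] = labels[j]
                break
        if labels[i] < 0:
            labels[i] = next_label
            next_label += 1
    return labels

labels = cosine_mean_shift(np.random.randn(50, 100))
```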
Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual...
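A minimal PyTorch sketch of the loss combination the abstract describes: a weighted sum of a CTC loss and an attention-decoder cross-entropy loss. The interpolation weight and the toy shapes are assumptions.

```python
# Hybrid CTC/attention objective (sketch; lambda and shapes are assumptions).
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_log_probs, input_lengths, targets, target_lengths,
                att_logits, lam=0.2):
    """ctc_log_probs: (T, B, V) log-probabilities from the CTC branch.
    att_logits: (B, L, V) per-step logits from the attention decoder.
    CTC enforces monotonic alignment; the attention branch drops the
    conditional-independence assumption."""
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets)  # (B, V, L) vs (B, L)
    return lam * ctc + (1.0 - lam) * att

T, B, V, L = 50, 4, 30, 12
ctc_lp = torch.randn(T, B, V).log_softmax(-1)
tgt = torch.randint(1, V, (B, L))                # avoid the blank index 0
loss = hybrid_loss(ctc_lp, torch.full((B,), T), tgt, torch.full((B,), L),
                   torch.randn(B, L, V))
```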
State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented with the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given...
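A toy NumPy sketch of the augmentation step: every acoustic frame of a speaker cluster is concatenated with that cluster's i-vector before being fed to the DNN. Dimensions are illustrative assumptions.

```python
# I-vector augmentation of acoustic features (toy sketch, assumed dims).
import numpy as np

def augment_with_ivector(frames, ivector):
    """frames: (T, D) acoustic features; ivector: (K,) cluster i-vector.
    Returns (T, D + K) DNN inputs; the same i-vector is tiled per frame."""
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(500, 40)   # e.g. 40-dim features for one cluster
ivec = np.random.randn(100)         # e.g. 100-dim i-vector for that cluster
dnn_inputs = augment_with_ivector(frames, ivec)   # shape (500, 140)
```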
In this paper, we apply and enhance the i-vector/PLDA paradigm to text-dependent speaker recognition. Due to its origin in text-independent recognition, this paradigm does not make use of the phonetic content of each utterance. Moreover, the uncertainty in the i-vector estimates should be taken into account in the PLDA model, due to the short duration of the utterances. To bridge this gap, a phrase-dependent PLDA model with uncertainty propagation is introduced. We examined it on the RSR-2015 dataset and show that, despite the low channel variability, it yields improved results over the GMM-UBM...
The automatic speaker verification spoofing and countermeasures challenge 2015 provides a common framework for the evaluation of spoofing detection, or anti-spoofing, techniques in the presence of various seen and unseen attacks. This contribution proposes a system consisting of amplitude-based, phase-based, linear prediction residual-based, and combined amplitude and phase-based detection. In this task we use the following features: Mel-frequency cepstral coefficients (MFCC), product spectrum-based cepstral coefficients, modified group delay cepstral coefficients, weighted linear prediction residual...
In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia - Conversational Intelligence for the ASVspoof2019 Spoofing Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), we used our recently introduced SincNet architecture. The results on PA show that the proposed networks yield very competitive performance in all conditions and achieved an 86% relative...
In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, such methods aim at minimizing certain divergences between the distributions that utterance-level features (i.e. speaker embeddings) follow when drawn from source and target domains (i.e. languages), while preserving their capacity in recognizing speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation in an end-to-end fashion and train...
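A minimal PyTorch sketch of a gradient reversal layer, a standard building block for this kind of adversarial adaptation: the embedding extractor is trained to confuse a domain (language) classifier while the speaker classifier is trained normally. Whether the paper uses gradient reversal specifically is not stated in the abstract; module sizes are assumptions.

```python
# Gradient reversal layer for adversarial domain adaptation (sketch).
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)             # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # negated gradients flow back

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: embeddings flow normally into the speaker head, and through the
# reversal into the domain head, so minimizing the domain loss pushes the
# extractor toward domain-invariant (language-invariant) embeddings.
emb = torch.randn(8, 512, requires_grad=True)
domain_logits = torch.nn.Linear(512, 2)(grad_reverse(emb))
```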
Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the embedding inferred from another segment of the same utterance. We do so by attaching to the standard extractor a decoder network, which we feed not merely with the speaker embedding, but also with the estimated...
Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate their implementation on a more generic toolkit than Kaldi, which we anticipate will enable further improvements to the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods of preventing overfitting, as well as alternative non-linearities that can be used instead of Rectified Linear Units...
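A minimal PyTorch sketch of the statistics pooling layer at the heart of x-vector-style extractors, whose pooled statistics the abstract mentions normalizing: frame-level features are summarized by their mean and standard deviation, giving a fixed-size utterance-level vector.

```python
# Statistics pooling: frame-level (batch, T, D) -> utterance-level (batch, 2D).
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    def forward(self, x, eps=1e-5):     # x: (batch, T, D) frame features
        mean = x.mean(dim=1)
        std = x.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
        return torch.cat([mean, std], dim=1)

pooled = StatsPooling()(torch.randn(4, 200, 512))   # shape (4, 1024)
```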
In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate scheduler techniques to stabilize the fine-tuning process and further boost performance: a multi-head factorized attentive pooling is proposed...
Recently, pre-trained Transformer models have received rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in over-fitting on small datasets. In this paper, we conduct a comprehensive analysis of applying parameter-efficient transfer learning (PETL) methods to reduce the learnable parameters required for adapting to the speaker...
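A minimal PyTorch sketch of a bottleneck adapter, one typical PETL method: a small residual MLP is inserted into each Transformer layer and only the adapter parameters are trained while the pre-trained weights stay frozen. The abstract does not name the specific PETL methods analyzed; the bottleneck size here is an assumption.

```python
# Bottleneck adapter for parameter-efficient transfer learning (sketch).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, T, dim) hidden states
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

# During fine-tuning, the backbone stays frozen and only adapters train:
#   for p in backbone.parameters(): p.requires_grad = False
out = Adapter()(torch.randn(2, 100, 768))
```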
When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal, and how best to quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture this information, enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we take...
We discuss the limitations of the i-vector representation of speech segments in speaker recognition and explain how Joint Factor Analysis (JFA) can serve as an alternative feature extractor in a variety of ways. Building on the work of Zhao and Dong, we implemented a variational Bayes treatment of JFA which accommodates adaptation of universal background models (UBMs) in a natural way. This allows us to experiment with several types of features for speaker recognition: speaker factors and diagonal factors in addition to i-vectors, extracted with or without UBM adaptation in each case...