Themos Stafylakis

ORCID: 0000-0002-9227-3588
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Bayesian Methods and Mixture Models
  • Speech and dialogue systems
  • Video Analysis and Summarization
  • Advanced Image and Video Retrieval Techniques
  • Neural Networks and Applications
  • Advanced Data Compression Techniques
  • Hearing Loss and Rehabilitation
  • Handwritten Text Recognition Techniques
  • Digital Media Forensic Detection
  • Face recognition and analysis
  • Wireless Signal Modulation Classification
  • Image Processing and 3D Reconstruction
  • Adversarial Robustness in Machine Learning
  • Vehicle License Plate Recognition
  • Data Management and Algorithms
  • Multi-Agent Systems and Negotiation
  • Emotion and Mood Recognition
  • Domain Adaptation and Few-Shot Learning
  • Gaussian Processes and Bayesian Inference
  • Multimodal Machine Learning Applications

Athens University of Economics and Business
2023-2025

University of Nottingham
2017-2021

Brno University of Technology
2019

Computer Research Institute of Montréal
2013-2016

National Technical University of Athens
2009-2013

École Normale Supérieure - PSL
2013

École de Technologie Supérieure
2012-2013

Institute for Language and Speech Processing
2007-2011

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art.

10.21437/interspeech.2017-85 article EN Interspeech 2022 2017-08-16
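
A minimal sketch, assuming PyTorch and torchvision, of the kind of pipeline described in the abstract above: a spatiotemporal (3D) convolutional front-end, a per-frame residual trunk, and a bidirectional LSTM over the frame sequence. The layer sizes, the ResNet-18 trunk and the 500-way output are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # illustrative 2D residual trunk

class LipreadingNet(nn.Module):
    """Sketch: 3D conv front-end -> per-frame ResNet features -> BiLSTM -> word logits."""
    def __init__(self, num_words=500, hidden=256):
        super().__init__()
        # Spatiotemporal convolution over grayscale mouth-region crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()            # keep the 512-dim pooled feature per frame
        self.trunk = trunk
        self.lstm = nn.LSTM(512, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_words)

    def forward(self, x):                   # x: (batch, 1, frames, H, W)
        f = self.frontend(x)                # (batch, 64, frames, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1) # per-frame 512-dim features
        out, _ = self.lstm(f)
        return self.head(out.mean(dim=1))   # average over time -> word logits

logits = LipreadingNet()(torch.randn(2, 1, 29, 96, 96))  # e.g. 29-frame clips
```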

Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly...

10.1109/icassp.2018.8461326 article EN 2018-04-01

We examine the use of Deep Neural Networks (DNN) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model using the standard EM algorithm, the components are predefined and correspond to a set of triphone states, the posterior occupancy probabilities of which are modeled by a DNN. Those assignments are then combined with 60-dim MFCC features to calculate first-order Baum-Welch statistics, train the i-vector extractor and extract i-vectors. The DNN-based assignment...

10.21437/odyssey.2014-44 article EN 2014-06-16
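
A small NumPy sketch of the statistics accumulation outlined above: zeroth- and first-order Baum-Welch statistics computed from per-frame component posteriors, here assumed to come from a DNN rather than a GMM-UBM. Array names and dimensions are illustrative.

```python
import numpy as np

def baum_welch_stats(features, posteriors):
    """Accumulate sufficient statistics for i-vector extraction.

    features   : (T, D) acoustic frames, e.g. 60-dim MFCCs
    posteriors : (T, C) per-frame posterior occupancy of the C components
                 (assumed here to be DNN outputs over triphone/senone states)
    Returns zeroth-order stats N (C,) and first-order stats F (C, D).
    """
    N = posteriors.sum(axis=0)   # soft frame counts per component
    F = posteriors.T @ features  # posterior-weighted sums of frames
    return N, F

# toy usage with random numbers standing in for real MFCCs and DNN posteriors
T, D, C = 300, 60, 2000
feats = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(C), size=T)
N, F = baum_welch_stats(feats, post)
```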

The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations, so that researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed-dimensional i-vector representation of utterances is ideal for working under such conditions, and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the speaker recognition problem. However, a more realistic approach seems to be...

10.1109/icassp.2013.6639151 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01

Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem when diarizing spontaneous telephone conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the algorithm are compared in an exhaustive practical study. We report state-of-the-art results as measured by the Diarization Error Rate and the Number...

10.1109/taslp.2013.2285474 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2013-12-04
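
A compact NumPy sketch, under simplifying assumptions, of an iterative mean-shift step using cosine similarity on i-vectors, in the spirit of the clustering described above; the flat kernel, the similarity threshold and the mode-merging step are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def cosine_mean_shift(ivectors, threshold=0.5, iters=20):
    """Shift each point toward the mean of its cosine neighbourhood (flat kernel)."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    Z = X.copy()
    for _ in range(iters):
        sims = Z @ X.T                           # cosine similarities to the data
        weights = (sims > threshold).astype(float)
        Z = weights @ X                          # neighbourhood means
        Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12
    # points that converged to (almost) the same mode share a cluster label
    labels = np.unique(np.round(Z, 2), axis=0, return_inverse=True)[1]
    return labels

labels = cosine_mean_shift(np.random.randn(40, 100))
```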

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or on sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide non-sequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual...

10.1109/slt.2018.8639643 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01
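
A brief sketch, assuming PyTorch, of the weighted combination of a CTC loss and an attention (cross-entropy) loss that the hybrid architecture above is built around; the interpolation weight, vocabulary size and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_in_lens, att_logits,
                              targets, target_lens, alpha=0.3):
    """loss = alpha * CTC + (1 - alpha) * attention cross-entropy.

    ctc_log_probs : (T, batch, vocab) log-softmax outputs of the CTC branch
    att_logits    : (batch, L, vocab) outputs of the attention decoder
    targets       : (batch, L) character indices (0 reserved for blank/padding)
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, ctc_in_lens, target_lens,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=0)
    return alpha * ctc + (1.0 - alpha) * att

# toy shapes: 50 encoder frames, batch of 2, 30-char vocabulary, 12-char targets
loss = hybrid_ctc_attention_loss(
    torch.randn(50, 2, 30).log_softmax(-1), torch.full((2,), 50, dtype=torch.long),
    torch.randn(2, 12, 30), torch.randint(1, 30, (2, 12)),
    torch.full((2,), 12, dtype=torch.long))
```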


10.21437/interspeech.2015-95 article FR Interspeech 2022 2015-09-06

State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind adaptation of a hybrid DNN-HMM system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented with the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given...

10.1109/icassp.2014.6854823 article EN 2014-05-01
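
A minimal sketch of the feature-space adaptation described above: every acoustic frame of a diarized speaker cluster is augmented with that cluster's i-vector before being fed to the acoustic-model DNN. The dimensions below are illustrative.

```python
import numpy as np

def augment_with_ivector(frames, ivector):
    """Append the same cluster-level i-vector to every acoustic frame.

    frames  : (T, D) acoustic feature vectors of one diarized speaker cluster
    ivector : (K,)   i-vector representing that cluster's speaker
    Returns (T, D + K) inputs for the DNN-HMM acoustic model.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

dnn_inputs = augment_with_ivector(np.random.randn(500, 40), np.random.randn(100))
print(dnn_inputs.shape)  # (500, 140)
```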

In this paper, we apply and enhance the i-vector-PLDA paradigm to text-dependent speaker recognition. Due to its origin in text-independent recognition, this paradigm does not make use of the phonetic content of each utterance. Moreover, the uncertainty in the i-vector estimates should be taken into account in the PLDA model, due to the short duration of the utterances. To bridge this gap, a phrase-dependent PLDA model with uncertainty propagation is introduced. We examined it on the RSR-2015 dataset and show that, despite the low channel variability, it yields improved results over the GMM-UBM...

10.21437/interspeech.2013-691 article EN Interspeech 2022 2013-08-25

The automatic speaker verification spoofing and countermeasures challenge 2015 provides a common framework for the evaluation of spoofing detection or anti-spoofing techniques in the presence of various seen and unseen attacks. This contribution proposes a system consisting of amplitude-, phase-, linear-prediction-residual- and combined amplitude-phase-based detection approaches. In this task we use the following features: Mel-frequency cepstral coefficients (MFCC), product spectrum-based cepstral coefficients, modified group delay coefficients, weighted linear prediction residual...

10.21437/interspeech.2015-469 article EN Interspeech 2022 2015-09-06

In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia - Conversational Intelligence for the ASVspoof2019 Spoofing Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), we used our recently introduced SincNet architecture. The results on PA show that the proposed networks yield very competitive performance in all conditions and achieved an 86% relative...

10.21437/interspeech.2019-2892 article EN Interspeech 2022 2019-09-13

In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, such methods aim at minimizing certain divergences between the distributions that utterance-level features (i.e. speaker embeddings) follow when drawn from the source and target domains (i.e. languages), while preserving their capacity to recognize speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation in an end-to-end fashion and to train...

10.1109/icassp.2019.8683616 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
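
A short sketch, assuming PyTorch, of the gradient-reversal style of adversarial domain adaptation referenced above: a domain (language) discriminator is trained on the speaker embeddings while reversed gradients push the extractor toward domain-invariant representations. The module names and the reversal weight are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainAdversarialHead(nn.Module):
    def __init__(self, emb_dim=256, lam=0.1):
        super().__init__()
        self.lam = lam
        self.discriminator = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 2))  # source vs. target language

    def forward(self, embeddings):
        reversed_emb = GradReverse.apply(embeddings, self.lam)
        return self.discriminator(reversed_emb)

# toy usage: language logits for a batch of speaker embeddings
logits = DomainAdversarialHead()(torch.randn(8, 256, requires_grad=True))
```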

Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances in training. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the inferred embedding of another segment of the same utterance. We do this by attaching to the standard embedding extractor a decoder network, which we feed not merely with the embedding, but also with the estimated...

10.21437/interspeech.2019-2842 article EN Interspeech 2022 2019-09-13
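
A rough, stripped-down sketch, assuming PyTorch, of the reconstruction idea described above: a decoder tries to predict the frames of one segment from the speaker embedding inferred on another segment of the same utterance, so unlabelled audio can contribute to training. Here the decoder receives only the speaker embedding; the paper's decoder is also fed additional per-frame inputs, omitted in this sketch. Architecture choices and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ReconstructionTrainer(nn.Module):
    """Embed segment A, then reconstruct the frames of segment B from that embedding."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)  # stand-in embedding extractor
        self.decoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.proj = nn.Linear(emb_dim, feat_dim)

    def forward(self, seg_a, seg_b):
        _, h = self.encoder(seg_a)                      # h: (1, batch, emb_dim)
        emb = h[-1]                                     # utterance-level embedding of segment A
        emb_tiled = emb.unsqueeze(1).expand(-1, seg_b.size(1), -1)
        out, _ = self.decoder(emb_tiled)                # decode a frame sequence from the embedding
        recon = self.proj(out)
        return nn.functional.mse_loss(recon, seg_b)     # reconstruct the frames of segment B

loss = ReconstructionTrainer()(torch.randn(4, 150, 40), torch.randn(4, 150, 40))
```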

Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate their implementation on a more generic toolkit than Kaldi, which we anticipate will enable further improvements to the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods of preventing overfitting, as well as alternative non-linearities that can be used instead of Rectified Linear Units....

10.1109/icassp.2019.8683445 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
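
As a small illustration of the kind of components discussed above, here is a PyTorch sketch of the statistics-pooling layer used in x-vector-style extractors (mean and standard deviation over frames), with an optional normalization of the pooled statistics; the normalization form and dimensions are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Collapse a (batch, frames, channels) sequence into per-utterance mean and std."""
    def __init__(self, normalize=True, eps=1e-5):
        super().__init__()
        self.normalize = normalize
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=1)
        std = x.std(dim=1)
        stats = torch.cat([mean, std], dim=1)   # (batch, 2 * channels)
        if self.normalize:                      # illustrative normalization of pooled statistics
            stats = (stats - stats.mean(dim=1, keepdim=True)) / (stats.std(dim=1, keepdim=True) + self.eps)
        return stats

pooled = StatsPooling()(torch.randn(8, 300, 512))   # -> (8, 1024) utterance-level statistics
```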

In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate scheduler strategies to stabilize the fine-tuning process and further boost performance: a multi-head factorized attentive pooling is proposed...

10.1109/slt54892.2023.10022775 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09
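
One common way to build features on top of a pre-trained model, of the kind analyzed above, is a learnable weighted sum over the hidden layers of the frozen Transformer. The PyTorch sketch below shows only that generic idea; it is not the paper's multi-head factorized attentive pooling, and the layer count and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Combine the hidden states of a frozen pre-trained model with learnable layer weights."""
    def __init__(self, num_layers=13):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):               # (layers, batch, frames, dim)
        w = torch.softmax(self.layer_logits, dim=0) # convex combination of layers
        return (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)

# toy usage: 13 layers of a wav2vec-style model, 2 utterances, 200 frames, 768-dim states
features = LayerWeightedSum()(torch.randn(13, 2, 200, 768))
```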

Recently, pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in over-fitting on small datasets. In this paper, we conduct a comprehensive analysis of applying parameter-efficient transfer learning (PETL) methods to reduce the required learnable parameters for adapting to speaker...

10.1109/icassp49357.2023.10094795 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
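
A minimal sketch of a bottleneck adapter, one representative of the parameter-efficient transfer learning (PETL) methods the analysis above covers: a small down-/up-projection with a residual connection is trained while the pre-trained Transformer weights stay frozen. Written for PyTorch; sizes and initialization are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter; only these few parameters are updated during fine-tuning."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as (near) identity so fine-tuning is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, frames, dim) hidden states
        return x + self.up(self.act(self.down(x)))

adapted = BottleneckAdapter()(torch.randn(2, 200, 768))
```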

10.1109/icassp49660.2025.10889058 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal, and how best to quantify or categorize noisy, subjective emotion labels. Self-supervised pre-trained representations can robustly capture this information from speech, enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant information is likely to appear piecewise and not uniformly across the signal. For the labels, we take...

10.1109/icassp49357.2023.10094673 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

We discuss the limitations of the i-vector representation of speech segments in speaker recognition and explain how Joint Factor Analysis (JFA) can serve as an alternative feature extractor in a variety of ways. Building on the work of Zhao and Dong, we implemented a variational Bayes treatment of JFA which accommodates the adaptation of universal background models (UBMs) in a natural way. This allows us to experiment with several types of features for speaker recognition: speaker factors and diagonal factors, in addition to i-vectors, extracted with and without UBM adaptation in each case....

10.1109/icassp.2014.6853889 article EN 2014-05-01