- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Bayesian Methods and Mixture Models
- Speech and Dialogue Systems
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Hearing Loss and Rehabilitation
- Handwritten Text Recognition Techniques
- Digital Media Forensic Detection
- Face Recognition and Analysis
- Wireless Signal Modulation Classification
- Image Processing and 3D Reconstruction
- Adversarial Robustness in Machine Learning
- Vehicle License Plate Recognition
- Data Management and Algorithms
- Multi-Agent Systems and Negotiation
- Emotion and Mood Recognition
- Domain Adaptation and Few-Shot Learning
- Gaussian Processes and Bayesian Inference
- Multimodal Machine Learning Applications
Athens University of Economics and Business
2023-2025
University of Nottingham
2017-2021
Brno University of Technology
2019
Computer Research Institute of Montréal
2013-2016
National Technical University of Athens
2009-2013
École Normale Supérieure - PSL
2013
École de Technologie Supérieure
2012-2013
Institute for Language and Speech Processing
2007-2011
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art.
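A minimal PyTorch sketch of the front-end/back-end combination the abstract describes (a spatiotemporal convolution, a per-frame residual trunk, and a bidirectional LSTM classifier). Layer sizes, the ResNet-18 trunk, and the mean-over-time readout are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a spatiotemporal-conv + ResNet + BiLSTM lipreader (assumed sizes).
import torch
import torch.nn as nn
import torchvision.models as models

class Lipreader(nn.Module):
    def __init__(self, num_words=500):
        super().__init__()
        # 3D convolution over (time, height, width) of the mouth-region clip
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D residual trunk applied per frame, yielding one vector per time step
        trunk = models.resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()
        self.trunk = trunk
        # bidirectional LSTM back-end over the frame-level features
        self.blstm = nn.LSTM(512, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_words)

    def forward(self, x):                       # x: (batch, 1, T, H, W)
        f = self.frontend(x)                    # (batch, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)     # (batch, T, 512)
        y, _ = self.blstm(f)
        return self.classifier(y.mean(dim=1))   # average over time, then classify

logits = Lipreader()(torch.randn(2, 1, 29, 112, 112))  # e.g. 29-frame clips
```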
Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly...
We examine the use of Deep Neural Networks (DNN) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model using the standard EM algorithm, the components are predefined and correspond to the set of triphone states, the posterior occupancy probabilities of which are modeled by a DNN. Those assignments are then combined with 60-dim MFCC features to calculate first-order Baum-Welch statistics, train the i-vector extractor and extract i-vectors. The DNN-based assignment...
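A schematic NumPy illustration of the statistics-accumulation step once a DNN supplies per-frame state posteriors in place of GMM-UBM alignments. Shapes, variable names, and the random stand-in posteriors are assumptions for illustration.

```python
# Baum-Welch statistics from DNN frame posteriors (schematic, assumed shapes).
import numpy as np

def baum_welch_stats(features, posteriors):
    """features: (T, D) acoustic frames, e.g. 60-dim MFCCs.
    posteriors: (T, C) DNN triphone-state posteriors per frame.
    Returns zeroth-order stats N (C,) and first-order stats F (C, D)."""
    N = posteriors.sum(axis=0)      # soft frame counts per component
    F = posteriors.T @ features     # posterior-weighted feature sums
    return N, F

T, D, C = 300, 60, 2000             # frames, feature dim, triphone states
feats = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(C), size=T)  # stand-in for DNN outputs
N, F = baum_welch_stats(feats, post)
# N and F are the sufficient statistics fed to the i-vector extractor.
```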
The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations, so that researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed-dimensional i-vector representation of utterances is ideal for working under such conditions, and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the problem. However, a more realistic approach seems to be...
Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem when diarizing spontaneous conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the algorithm are compared in an exhaustive practical study. We report state-of-the-art results as measured by the Diarization Error Rate and the Number...
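A toy NumPy sketch of iterative Mean Shift under the cosine distance, as one would apply it to length-normalized i-vectors: each point is shifted toward the mean of its cosine neighbourhood, and points converging to the same mode form a cluster. The threshold and merge criterion are assumptions, not the paper's settings.

```python
# Iterative Mean Shift with cosine distance (toy sketch, assumed threshold).
import numpy as np

def cosine_mean_shift(X, threshold=0.5, n_iter=10):
    """X: (N, D) i-vectors. Returns integer cluster labels per point."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # length-normalize
    modes = X.copy()
    for _ in range(n_iter):
        sims = modes @ X.T                               # cosine similarities
        weights = (sims > 1.0 - threshold).astype(float)
        modes = weights @ X / weights.sum(axis=1, keepdims=True)
        modes /= np.linalg.norm(modes, axis=1, keepdims=True)
    # merge points whose modes converged to (nearly) the same direction
    labels = -np.ones(len(X), dtype=int)
    next_label = 0
    for i in range(len(X)):
        for j in range(i):
            if modes[i] @ modes[j] > 0.99:
                labels[i] = labels[j]
                break
        if labels[i] < 0:
            labels[i] = next_label
            next_label += 1
    return labels

labels = cosine_mean_shift(np.random.randn(50, 100))
```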
Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual...
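A minimal PyTorch sketch of the loss combination the abstract describes: a weighted sum of a CTC loss and an attention-decoder cross-entropy loss. The interpolation weight and the toy shapes are assumptions.

```python
# Hybrid CTC/attention objective (sketch; lambda and shapes are assumptions).
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_log_probs, input_lengths, targets, target_lengths,
                att_logits, lam=0.2):
    """ctc_log_probs: (T, B, V) log-probabilities from the CTC branch.
    att_logits: (B, L, V) per-step logits from the attention decoder.
    CTC enforces monotonic alignment; the attention branch drops the
    conditional-independence assumption."""
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets)  # (B, V, L) vs (B, L)
    return lam * ctc + (1.0 - lam) * att

T, B, V, L = 50, 4, 30, 12
ctc_lp = torch.randn(T, B, V).log_softmax(-1)
tgt = torch.randint(1, V, (B, L))                # avoid the blank index 0
loss = hybrid_loss(ctc_lp, torch.full((B,), T), tgt, torch.full((B,), L),
                   torch.randn(B, L, V))
```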
State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented with the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given...
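A toy NumPy sketch of the augmentation step: every acoustic frame of a speaker cluster is concatenated with that cluster's i-vector before being fed to the DNN. Dimensions are illustrative assumptions.

```python
# I-vector augmentation of acoustic features (toy sketch, assumed dims).
import numpy as np

def augment_with_ivector(frames, ivector):
    """frames: (T, D) acoustic features; ivector: (K,) cluster i-vector.
    Returns (T, D + K) DNN inputs; the same i-vector is tiled per frame."""
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(500, 40)   # e.g. 40-dim features for one cluster
ivec = np.random.randn(100)         # e.g. 100-dim i-vector for that cluster
dnn_inputs = augment_with_ivector(frames, ivec)   # shape (500, 140)
```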
In this paper, we apply and enhance the i-vector/PLDA paradigm to text-dependent speaker recognition. Due to its origin in text-independent recognition, this paradigm does not make use of the phonetic content of each utterance. Moreover, the uncertainty in the i-vector estimates should be taken into account in the PLDA model, due to the short duration of the utterances. To bridge this gap, a phrase-dependent PLDA model with uncertainty propagation is introduced. We examined it on the RSR-2015 dataset and show that, despite the low channel variability, it yields improved results over the GMM-UBM...
The automatic speaker verification spoofing and countermeasures challenge 2015 provides a common framework for the evaluation of spoofing detection, or anti-spoofing, techniques in the presence of various seen and unseen attacks. This contribution proposes a system consisting of amplitude-based, phase-based, linear prediction residual-based, and combined amplitude and phase-based detection. In this task we use the following features: Mel-frequency cepstral coefficients (MFCC), product spectrum-based cepstral coefficients, modified group delay cepstral coefficients, weighted linear prediction residual...
In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia - Conversational Intelligence for the ASVspoof2019 Spoofing Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), we used our recently introduced SincNet architecture. The results on PA show that the proposed networks yield very competitive performance in all conditions and achieved an 86% relative...
In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, such methods aim at minimizing certain divergences between the distributions that utterance-level features (i.e. speaker embeddings) follow when drawn from source and target domains (i.e. languages), while preserving their capacity in recognizing speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation in an end-to-end fashion and train...
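A minimal PyTorch sketch of a gradient reversal layer, a standard building block for this kind of adversarial adaptation: the embedding extractor is trained to confuse a domain (language) classifier while the speaker classifier is trained normally. Whether the paper uses gradient reversal specifically is not stated in the abstract; module sizes are assumptions.

```python
# Gradient reversal layer for adversarial domain adaptation (sketch).
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)             # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # negated gradients flow back

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: embeddings flow normally into the speaker head, and through the
# reversal into the domain head, so minimizing the domain loss pushes the
# extractor toward domain-invariant (language-invariant) embeddings.
emb = torch.randn(8, 512, requires_grad=True)
domain_logits = torch.nn.Linear(512, 2)(grad_reverse(emb))
```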
Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the embedding inferred from another segment of the same utterance. We do so by attaching to the standard extractor a decoder network, which we feed not merely with the speaker embedding, but also with the estimated...
Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate their implementation on a more generic toolkit than Kaldi, which we anticipate will enable further improvements to the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods of preventing overfitting, as well as alternative non-linearities that can be used instead of Rectified Linear Units...
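A minimal PyTorch sketch of the statistics pooling layer at the heart of x-vector-style extractors, whose pooled statistics the abstract mentions normalizing: frame-level features are summarized by their mean and standard deviation, giving a fixed-size utterance-level vector.

```python
# Statistics pooling: frame-level (batch, T, D) -> utterance-level (batch, 2D).
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    def forward(self, x, eps=1e-5):     # x: (batch, T, D) frame features
        mean = x.mean(dim=1)
        std = x.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
        return torch.cat([mean, std], dim=1)

pooled = StatsPooling()(torch.randn(4, 200, 512))   # shape (4, 1024)
```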
In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate scheduler techniques to stabilize the fine-tuning process and further boost performance: a multi-head factorized attentive pooling is proposed...
Recently, pre-trained Transformer models have received rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in over-fitting on small datasets. In this paper, we conduct a comprehensive analysis of applying parameter-efficient transfer learning (PETL) methods to reduce the learnable parameters required for adapting to the speaker...
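A minimal PyTorch sketch of a bottleneck adapter, one typical PETL method: a small residual MLP is inserted into each Transformer layer and only the adapter parameters are trained while the pre-trained weights stay frozen. The abstract does not name the specific PETL methods analyzed; the bottleneck size here is an assumption.

```python
# Bottleneck adapter for parameter-efficient transfer learning (sketch).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, T, dim) hidden states
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

# During fine-tuning, the backbone stays frozen and only adapters train:
#   for p in backbone.parameters(): p.requires_grad = False
out = Adapter()(torch.randn(2, 100, 768))
```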
When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal, and how best to quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture this information, enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we take...
We discuss the limitations of the i-vector representation of speech segments in speaker recognition and explain how Joint Factor Analysis (JFA) can serve as an alternative feature extractor in a variety of ways. Building on the work of Zhao and Dong, we implemented a variational Bayes treatment of JFA which accommodates adaptation of universal background models (UBMs) in a natural way. This allows us to experiment with several types of features for speaker recognition: speaker factors and diagonal factors in addition to i-vectors, extracted with or without UBM adaptation in each case...