- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Time Series Analysis and Forecasting
- Multimodal Machine Learning Applications
- Neural Networks and Applications
- Artificial Intelligence in Law
- Natural Language Processing Techniques
- Machine Learning in Materials Science
- Machine Learning and ELM
- Sentiment Analysis and Opinion Mining
- Topic Modeling
- Evolutionary Algorithms and Applications
- Explainable Artificial Intelligence (XAI)
- Image Processing and 3D Reconstruction
- Reinforcement Learning in Robotics
- Advanced Graph Neural Networks
- Anomaly Detection Techniques and Applications
- Robot Manipulation and Learning
- Computational and Text Analysis Methods
- Emotion and Mood Recognition
Mila - Quebec Artificial Intelligence Institute
2020-2025
Concordia University
2022-2024
Université de Montréal
2021
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers have emerged as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term...
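The contrast drawn in this abstract, attending to all time steps at once instead of stepping through them recurrently, can be sketched in a few lines. The following is a minimal numpy illustration of multi-head self-attention with identity projections (real models learn separate query/key/value projections per head); it is a didactic sketch, not the SepFormer architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    """Self-attention over a whole sequence in parallel (no recurrence).

    x: (seq_len, d_model) with d_model divisible by num_heads.
    Identity Q/K/V projections keep the sketch short.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Split the model dimension into heads: (num_heads, seq_len, d_head)
    q = k = v = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, t, t)
    out = softmax(scores) @ v                            # (h, t, d_head)
    # Merge the heads back into a single (seq_len, d_model) output.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))           # 6 time steps, d_model = 8
y = multi_head_attention(x, num_heads=2)
print(y.shape)                            # (6, 8)
```

Every time step's output is computed from all others in one matrix product, which is why such models parallelize where RNNs cannot.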
A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs), which are naturally able to exploit large time contexts...
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models on many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation on the WSJ0-2/3Mix datasets. This paper studies Transformers in-depth for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as...
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed by the SSL encoder, leading to faster inference. We adapt a number of...
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation on real-life mixtures, where the ground truth is not available. We bypass this issue by...
Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates...
In this article, we work on a sound recognition system that continually incorporates new classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning paradigm enables the study and implementation of a practically relevant use case where only a small amount of labels is available in a continual learning context. We also make the empirical observation that a similarity-based method is robust...
Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in the early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is the use of Simulated annealing in EArly Layers (SEAL) of the network in place of the re-initialization of later layers. Essentially, later layers go through normal gradient descent...
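To make the contrast concrete, here is a toy sketch of the annealing idea: every few steps, the early layers receive noise whose magnitude (temperature) decays over training, while the later layers are left to plain gradient descent. The function name, schedule, and constants are all illustrative; this is a loose illustration of the annealing principle, not the authors' SEAL algorithm.

```python
import numpy as np

def anneal_early_layers(layers, step, period=3, t0=0.5, decay=0.8, rng=None):
    """Every `period` steps, add Gaussian noise to the EARLY layers,
    with a temperature that decays as training progresses.

    `layers` is a list of weight matrices ordered from input to output.
    Only the first half (the early layers) is perturbed; the later
    layers are untouched, mirroring "later layers go through normal
    gradient descent". All names and constants here are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    if step % period != 0:
        return layers
    temperature = t0 * decay ** (step // period)
    cutoff = len(layers) // 2
    return [w + rng.standard_normal(w.shape) * temperature if i < cutoff else w
            for i, w in enumerate(layers)]

layers = [np.zeros((2, 2)) for _ in range(4)]
noisy = anneal_early_layers(layers, step=0)
print(np.abs(noisy[0]).sum() > 0, np.abs(noisy[3]).sum() == 0)  # True True
```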
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect this fine-grained nature and to unify various methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question "Who speaks when?", Speech Emotion Diarization answers "Which emotion appears when?". To facilitate the evaluation of performance, we establish...
An important development in deep learning, from the earliest MLPs onward, has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination...
This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest currently available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources. We also present use cases to illustrate how BIRD performs with existing speech corpora.
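The standard way such RIRs are used for augmentation is to convolve a dry (anechoic) signal with an impulse response, which simulates recording that signal in the corresponding room. A minimal single-channel sketch, with tiny arrays standing in for real audio and a BIRD-style RIR:

```python
import numpy as np

def reverberate(dry, rir):
    """Simulate a room recording by convolving a dry signal with an RIR.

    The toy arrays below stand in for real audio; a real RIR would be
    thousands of samples long and one channel per microphone.
    """
    wet = np.convolve(dry, rir)
    return wet[: len(dry)]  # trim back to the original length

dry = np.array([1.0, 0.0, 0.0, 0.0, 0.5, 0.0])
rir = np.array([1.0, 0.0, 0.3])  # direct path plus one delayed reflection
wet = reverberate(dry, rir)
print(wet)  # [1.  0.  0.3 0.  0.5 0. ]
```

Because this is a plain convolution, it can be applied on the fly during training ("online" augmentation), drawing a fresh RIR for every example.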
In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for the explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces more faithful explanations compared to standard posthoc explanation methods. Moreover, by associating phoneme representations, the methodology generates explanations that tend to be more understandable...
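One simple way to discretize a saliency map along phoneme boundaries is to pool the frame-level scores within each phoneme segment, so every frame of a phoneme shares one score. The sketch below does exactly that with mean pooling; it is a plausible simplified reading of boundary-based discretization, not the exact PDSM algorithm, and the frame spans are made up.

```python
import numpy as np

def discretize_saliency(saliency, boundaries):
    """Average a frame-level saliency map within each phoneme segment.

    `boundaries` holds (start, end) frame indices, one pair per phoneme.
    Illustrative only: the actual PDSM algorithm may pool differently.
    """
    out = np.empty_like(saliency)
    for start, end in boundaries:
        out[start:end] = saliency[start:end].mean()
    return out

saliency = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.2])
phonemes = [(0, 2), (2, 4), (4, 6)]   # frame spans of three phonemes
print(discretize_saliency(saliency, phonemes))  # [0.2 0.2 0.7 0.7 0.2 0.2]
```

The result assigns saliency to linguistically meaningful units (phonemes) rather than individual frames, which is what makes the explanation easier to interpret.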
We propose a novel approach for humming transcription that combines a CNN-based architecture with a dynamic programming-based post-processing algorithm, utilizing the recently introduced HumTrans dataset. We identify and address inherent problems with the onset and offset ground truth provided by the dataset, offering heuristics to improve these annotations, resulting in a dataset with precise annotations that will aid future research. Additionally, we compare the transcription accuracy of our method against several others, demonstrating...
In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more overparameterized, (ii) trained on unlabeled text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent researchers from transferring lessons learned in previous model contexts to LLMs. To this end, our short paper introduces...
Neural networks are typically black-boxes that remain opaque with regards to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. The methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, which produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain...
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency...
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of the classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained...
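The core mechanism named here, vector quantization as a bottleneck, means snapping each continuous latent vector to its nearest entry in a small learned codebook, so only a discrete set of codes can pass through. A miniature numpy sketch of that nearest-neighbour step (the codebook and latents are made-up toy values, and PIQ's full training setup is not shown):

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    This illustrates the bottleneck idea in miniature: continuous
    representations are replaced by a small discrete set of codes.
    """
    # (n, k) pairwise squared distances between latents and codebook rows.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # two learned codes
latents = np.array([[0.1, -0.2], [0.9, 1.2], [0.6, 0.7]])
codes, idx = quantize(latents, codebook)
print(idx)  # [0 1 1]
```

Because every input is forced onto one of very few codes, downstream processing can only depend on the coarse, class-relevant structure the codebook captures, which is what makes the bottleneck useful for interpretation.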