- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Time Series Analysis and Forecasting
- Multimodal Machine Learning Applications
- Neural Networks and Applications
- Artificial Intelligence in Law
- Natural Language Processing Techniques
- Machine Learning in Materials Science
- Machine Learning and ELM
- Sentiment Analysis and Opinion Mining
- Topic Modeling
- Evolutionary Algorithms and Applications
- Explainable Artificial Intelligence (XAI)
- Image Processing and 3D Reconstruction
- Reinforcement Learning in Robotics
- Advanced Graph Neural Networks
- Anomaly Detection Techniques and Applications
- Robot Manipulation and Learning
- Computational and Text Analysis Methods
- Emotion and Mood Recognition
Mila - Quebec Artificial Intelligence Institute
2020-2025
Concordia University
2022-2024
Université de Montréal
2021
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers have emerged as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term...
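The contrast drawn in this abstract, attending to all time steps at once instead of stepping through them recurrently, can be sketched in a few lines. The following is a minimal numpy illustration of multi-head self-attention with identity projections (real models learn separate query/key/value projections per head); it is a didactic sketch, not the SepFormer architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    """Self-attention over a whole sequence in parallel (no recurrence).

    x: (seq_len, d_model) with d_model divisible by num_heads.
    Identity Q/K/V projections keep the sketch short.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Split the model dimension into heads: (num_heads, seq_len, d_head)
    q = k = v = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, t, t)
    out = softmax(scores) @ v                            # (h, t, d_head)
    # Merge the heads back into a single (seq_len, d_model) output.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))           # 6 time steps, d_model = 8
y = multi_head_attention(x, num_heads=2)
print(y.shape)                            # (6, 8)
```

Every time step's output is computed from all others in one matrix product, which is why such models parallelize where RNNs cannot.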
A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs), which are naturally able to exploit large time contexts...
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models on many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation on the WSJ0-2/3Mix datasets. This paper studies Transformers in-depth for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as...
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed by the SSL encoder, leading to faster inference. We adapt a number of...
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation on real-life mixtures, where the ground truth is not available. We bypass this issue by...
Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates...
In this article, we work on a sound recognition system that continually incorporates new classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning paradigm enables the study and implementation of a practically relevant use case where only a small amount of labels is available in a continual learning context. We also make the empirical observation that a similarity-based method is robust...
Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in the early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is the use of Simulated annealing in EArly Layers (SEAL) of the network in place of the re-initialization of later layers. Essentially, later layers go through normal gradient descent...
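To make the contrast concrete, here is a toy sketch of the annealing idea: every few steps, the early layers receive noise whose magnitude (temperature) decays over training, while the later layers are left to plain gradient descent. The function name, schedule, and constants are all illustrative; this is a loose illustration of the annealing principle, not the authors' SEAL algorithm.

```python
import numpy as np

def anneal_early_layers(layers, step, period=3, t0=0.5, decay=0.8, rng=None):
    """Every `period` steps, add Gaussian noise to the EARLY layers,
    with a temperature that decays as training progresses.

    `layers` is a list of weight matrices ordered from input to output.
    Only the first half (the early layers) is perturbed; the later
    layers are untouched, mirroring "later layers go through normal
    gradient descent". All names and constants here are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    if step % period != 0:
        return layers
    temperature = t0 * decay ** (step // period)
    cutoff = len(layers) // 2
    return [w + rng.standard_normal(w.shape) * temperature if i < cutoff else w
            for i, w in enumerate(layers)]

layers = [np.zeros((2, 2)) for _ in range(4)]
noisy = anneal_early_layers(layers, step=0)
print(np.abs(noisy[0]).sum() > 0, np.abs(noisy[3]).sum() == 0)  # True True
```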
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect this fine-grained nature and to unify various methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question "Who speaks when?", Speech Emotion Diarization answers "Which emotion appears when?". To facilitate the evaluation of performance, we establish...
An important development in deep learning, from the earliest MLPs onward, has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination...
This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest currently available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources. We also present use cases to illustrate how BIRD performs with existing speech corpora.
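The standard way such RIRs are used for augmentation is to convolve a dry (anechoic) signal with an impulse response, which simulates recording that signal in the corresponding room. A minimal single-channel sketch, with tiny arrays standing in for real audio and a BIRD-style RIR:

```python
import numpy as np

def reverberate(dry, rir):
    """Simulate a room recording by convolving a dry signal with an RIR.

    The toy arrays below stand in for real audio; a real RIR would be
    thousands of samples long and one channel per microphone.
    """
    wet = np.convolve(dry, rir)
    return wet[: len(dry)]  # trim back to the original length

dry = np.array([1.0, 0.0, 0.0, 0.0, 0.5, 0.0])
rir = np.array([1.0, 0.0, 0.3])  # direct path plus one delayed reflection
wet = reverberate(dry, rir)
print(wet)  # [1.  0.  0.3 0.  0.5 0. ]
```

Because this is a plain convolution, it can be applied on the fly during training ("online" augmentation), drawing a fresh RIR for every example.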
In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for the explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces more faithful explanations compared to standard posthoc explanation methods. Moreover, by associating phoneme representations, the methodology generates explanations that tend to be more understandable...
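One simple way to discretize a saliency map along phoneme boundaries is to pool the frame-level scores within each phoneme segment, so every frame of a phoneme shares one score. The sketch below does exactly that with mean pooling; it is a plausible simplified reading of boundary-based discretization, not the exact PDSM algorithm, and the frame spans are made up.

```python
import numpy as np

def discretize_saliency(saliency, boundaries):
    """Average a frame-level saliency map within each phoneme segment.

    `boundaries` holds (start, end) frame indices, one pair per phoneme.
    Illustrative only: the actual PDSM algorithm may pool differently.
    """
    out = np.empty_like(saliency)
    for start, end in boundaries:
        out[start:end] = saliency[start:end].mean()
    return out

saliency = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.2])
phonemes = [(0, 2), (2, 4), (4, 6)]   # frame spans of three phonemes
print(discretize_saliency(saliency, phonemes))  # [0.2 0.2 0.7 0.7 0.2 0.2]
```

The result assigns saliency to linguistically meaningful units (phonemes) rather than individual frames, which is what makes the explanation easier to interpret.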
We propose a novel approach for humming transcription that combines a CNN-based architecture with a dynamic programming-based post-processing algorithm, utilizing the recently introduced HumTrans dataset. We identify and address inherent problems with the onset and offset ground truth provided by the dataset, offering heuristics to improve these annotations, resulting in a dataset with precise annotations that will aid future research. Additionally, we compare the transcription accuracy of our method against several others, demonstrating...
In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more overparameterized, (ii) trained on unlabeled text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent researchers from transferring lessons learned in previous model contexts to LLMs. To this end, our short paper introduces...
Neural networks are typically black-boxes that remain opaque with regards to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. The methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, which produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain...
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency...
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of the classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained...
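The core mechanism named here, vector quantization as a bottleneck, means snapping each continuous latent vector to its nearest entry in a small learned codebook, so only a discrete set of codes can pass through. A miniature numpy sketch of that nearest-neighbour step (the codebook and latents are made-up toy values, and PIQ's full training setup is not shown):

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    This illustrates the bottleneck idea in miniature: continuous
    representations are replaced by a small discrete set of codes.
    """
    # (n, k) pairwise squared distances between latents and codebook rows.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # two learned codes
latents = np.array([[0.1, -0.2], [0.9, 1.2], [0.6, 0.7]])
codes, idx = quantize(latents, codebook)
print(idx)  # [0 1 1]
```

Because every input is forced onto one of very few codes, downstream processing can only depend on the coarse, class-relevant structure the codebook captures, which is what makes the bottleneck useful for interpretation.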