Gabriel Synnaeve

ORCID: 0000-0003-1715-3356
Research Areas
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Speech and Audio Processing
  • Artificial Intelligence in Games
  • Natural Language Processing Techniques
  • Reinforcement Learning in Robotics
  • Topic Modeling
  • Domain Adaptation and Few-Shot Learning
  • Digital Games and Media
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Speech and dialogue systems
  • Generative Adversarial Networks and Image Synthesis
  • Time Series Analysis and Forecasting
  • Software Engineering Research
  • Intelligence, Security, War Strategy
  • Terrorism, Counterterrorism, and Political Violence
  • Software Testing and Debugging Techniques
  • AI-based Problem Solving and Planning
  • Semantic Web and Ontologies
  • Parallel Computing and Optimization Techniques
  • Advanced Image and Video Retrieval Techniques
  • Sports Analytics and Performance
  • Phonetics and Phonology Research
  • Model-Driven Software Engineering Techniques

Menlo School
2020-2024

Laboratoire de Sciences Cognitives et Psycholinguistique
2013-2024

Alpha Omega Alpha Medical Honor Society
2024

Université Paris-Saclay
2023

Institut national de recherche en informatique et en automatique
2011-2023

Laboratoire Lorrain de Recherche en Informatique et ses Applications
2023

Meta (Israel)
2017-2022

Laboratoire d'Informatique de Grenoble
2010-2021

Université Grenoble Alpes
2010-2021

Collège de France
2011-2021

Transformers have been recently adapted for large scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of vision transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose...

10.1109/iccv48922.2021.00010 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...

10.1109/tpami.2022.3206148 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-09-12
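The alternation the abstract describes, a cross-patch linear layer followed by a per-patch feed-forward network, each with a residual connection, can be sketched in a few lines. This is a minimal NumPy illustration under simplifying assumptions: the paper's Affine normalization and LayerScale are omitted, and all shapes are toy values.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def resmlp_block(x, W_patch, W1, W2):
    """One ResMLP-style block on x of shape (num_patches, dim).

    (i)  cross-patch linear: patches interact, identically across channels
    (ii) per-patch two-layer MLP: channels interact, independently per patch
    """
    x = x + W_patch @ x          # (i) linear mixing along the patch axis
    x = x + gelu(x @ W1) @ W2    # (ii) feed-forward along the channel axis
    return x

# toy shapes: 4 patches, 8 channels, hidden width 16
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = resmlp_block(x,
                   rng.normal(size=(4, 4)) * 0.1,
                   rng.normal(size=(8, 16)) * 0.1,
                   rng.normal(size=(16, 8)) * 0.1)
print(out.shape)  # (4, 8)
```

Note how the only operations are matrix multiplications and an elementwise nonlinearity: no attention, no convolutions beyond the initial patchification.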

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine...

10.48550/arxiv.2304.07193 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use...

10.1109/iccv48922.2021.00180 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero...

10.1109/icassp40776.2020.9052942 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further...

10.21437/interspeech.2020-2409 article EN Interspeech 2020 2020-10-25
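A hedged sketch of the kind of combined time/frequency-domain objective the abstract refers to: an L1 loss on the raw waveform plus an L1 loss on spectrogram magnitudes. A single fixed FFT size here stands in for the multi-resolution STFT losses used in practice, and the signal lengths are toy assumptions.

```python
import numpy as np

def enhancement_loss(clean, estimate, n_fft=64):
    # L1 loss on the raw waveform (time domain)
    time_loss = np.mean(np.abs(clean - estimate))
    # L1 loss on magnitude spectra (frequency domain); non-overlapping
    # frames of n_fft samples, a simplification of a real STFT
    spec = lambda x: np.abs(np.fft.rfft(x.reshape(-1, n_fft), axis=-1))
    freq_loss = np.mean(np.abs(spec(clean) - spec(estimate)))
    return time_loss + freq_loss

t = np.linspace(0.0, 1.0, 128)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=t.shape)
print(enhancement_loss(clean, clean))      # 0.0 for a perfect estimate
print(enhancement_loss(clean, noisy) > 0)  # True
```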

This paper presents an overview of the existing work on AI for real-time strategy (RTS) games. Specifically, we focus on the work around the game StarCraft, which has emerged in the past few years as the unified test bed for this research. We describe the specific AI challenges posed by RTS games and the solutions that have been explored to address them. Additionally, we also present a summary of the results of the recent StarCraft AI competitions, describing the architectures used by the participants. Finally, we conclude with a discussion emphasizing which problems...

10.1109/tciaig.2013.2286295 article EN IEEE Transactions on Computational Intelligence and AI in Games 2013-10-18

This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from the raw waveform.

10.48550/arxiv.1609.03193 preprint EN other-oa arXiv (Cornell University) 2016-01-01
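As a rough illustration of the letter-level decoding this line of work relies on, here is a CTC-style greedy collapse of frame-level letter predictions. Note this is a simplification for illustration only: the paper's ASG criterion avoids the blank token by modeling letter repetitions explicitly instead.

```python
def greedy_collapse(frame_letters, blank="_"):
    """Collapse per-frame letter predictions into a transcript:
    merge consecutive repeats, then drop the blank symbol."""
    out, prev = [], None
    for c in frame_letters:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

print(greedy_collapse("hh_eel_lll_oo"))  # "hello"
```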

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research.

10.21437/interspeech.2020-2826 article EN Interspeech 2020 2020-10-25

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...

10.48550/arxiv.2106.09681 preprint EN cc-by arXiv (Cornell University) 2021-01-01
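The "transposed" self-attention the abstract refers to replaces the N x N token-to-token attention map with a d x d feature-to-feature (cross-covariance) map, making the cost linear in the number of tokens. A minimal NumPy sketch follows; the exact softmax axis, temperature handling, and normalization are simplifying assumptions of this sketch rather than the paper's precise formulation.

```python
import numpy as np

def xca(Q, K, V, tau=1.0):
    """Cross-covariance ('transposed') attention sketch.
    Q, K, V: (N, d). The attention map is d x d, so the cost is
    linear in the number of tokens N rather than quadratic."""
    # l2-normalize each feature across tokens
    Qh = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)
    Kh = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    A = Kh.T @ Qh / tau                        # (d, d) cross-covariance
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)       # softmax over key features
    return V @ A                               # (N, d)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(100, 16)) for _ in range(3))
print(xca(Q, K, V).shape)  # (100, 16)
```

Doubling the number of tokens here only doubles the work, since the softmax is taken over a fixed d x d map.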

We release Code Llama, a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements...

10.48550/arxiv.2308.12950 preprint EN cc-by arXiv (Cornell University) 2023-01-01

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions, and bridges much of the gaps between...

10.48550/arxiv.1911.08460 preprint EN other-oa arXiv (Cornell University) 2019-01-01
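The pseudo-labeling loop itself is simple: train on the labeled set, transcribe the unlabeled pool with the resulting model, then retrain on the union. A toy sketch, with a hypothetical 1-nearest-neighbour "model" standing in for the acoustic model:

```python
def train(dataset):
    """Toy stand-in for an acoustic model: 1-nearest-neighbour over (x, y) pairs."""
    def model(x):
        return min(dataset, key=lambda pair: abs(pair[0] - x))[1]
    return model

def pseudo_label_round(labeled, unlabeled):
    """One pseudo-labeling round: fit on labeled data, label the unlabeled
    pool with the model's own predictions, retrain on the union."""
    model = train(labeled)
    pseudo = [(x, model(x)) for x in unlabeled]
    return train(labeled + pseudo)

labeled = [(0.0, "a"), (10.0, "b")]
model = pseudo_label_round(labeled, [1.0, 2.0, 9.0])
print(model(2.5))  # "a" -- 2.5 is closest to the pseudo-labeled point 2.0
```

In practice this round is iterated, often with filtering of low-confidence pseudo-labels between rounds.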

This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million...

10.1109/icassp.2019.8683535 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to performance improvements across a variety of setups. On a large-scale competitive setup,...

10.21437/interspeech.2021-236 article EN Interspeech 2021 2021-08-27

Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light as well as 53k hours of unlabeled data from LibriVox achieves word error rates (WER) of 2.8%/4.8% on the clean and other test sets...

10.1109/icassp39728.2021.9414641 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
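Word error rate, the metric quoted above, is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A standard dynamic-programming sketch:

```python
def wer(ref, hyp):
    """Word error rate: minimum substitutions + insertions + deletions
    to turn hyp into ref, divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + (rw != hw))  # substitution / match
        prev = cur
    return prev[-1] / max(len(r), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```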

We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus...

10.48550/arxiv.2210.13438 preprint EN other-oa arXiv (Cornell University) 2022-01-01
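The loss balancer described above can be sketched as follows: each loss's gradient is renormalized so that its contribution to the combined gradient is a fixed fraction (its weight) of a reference norm, decoupling the balance between losses from their natural gradient scales. The fixed `ref_norm` below stands in for the moving-average gradient norm used in practice, which is an assumption of this sketch.

```python
import numpy as np

def balance_gradients(grads, weights, ref_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each loss contributes exactly
    weights[name] * ref_norm in norm (weights assumed to sum to 1)."""
    total = np.zeros_like(next(iter(grads.values())))
    for name, g in grads.items():
        total += weights[name] * ref_norm * g / (np.linalg.norm(g) + eps)
    return total

grads = {"reconstruction": np.array([3.0, 4.0]),
         "adversarial": np.array([0.0, 2.0])}
g = balance_gradients(grads, {"reconstruction": 0.75, "adversarial": 0.25})
print(g)  # combined direction with a fixed 3:1 norm ratio between the losses
```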

We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while...

10.48550/arxiv.2306.05284 preprint EN other-oa arXiv (Cornell University) 2023-01-01
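One of the interleaving patterns this family of work uses is a "delay" pattern: codebook stream k is offset by k steps, so a single-stage LM can emit all K codebooks in one pass while still conditioning later codebooks on earlier ones. A minimal sketch (the padding token value is an arbitrary assumption):

```python
def delay_interleave(codes, pad=-1):
    """Shift codebook stream k right by k steps.
    codes: K lists of T tokens -> K lists of T + K - 1 tokens."""
    K, T = len(codes), len(codes[0])
    out = [[pad] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

print(delay_interleave([[1, 2, 3], [4, 5, 6]]))
# [[1, 2, 3, -1], [-1, 4, 5, 6]]
```

At generation time the shift is simply undone to recover aligned codebook streams for the audio decoder.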

We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; Bob then attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will "propose" the task by doing a sequence of actions, and Bob must then undo or repeat them, respectively. Via an appropriate reward structure, this automatically generates a curriculum of exploration,...

10.48550/arxiv.1703.05407 preprint EN other-oa arXiv (Cornell University) 2017-01-01
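A sketch of the reward structure behind this self-play scheme: Bob is penalized for the time he takes, while Alice is rewarded when Bob needs longer than she did. Alice is therefore pushed to propose tasks just beyond Bob's current ability, which is what generates the automatic curriculum. The scaling factor here is an arbitrary assumption of this sketch.

```python
def self_play_rewards(t_alice, t_bob, gamma=0.5):
    """Reward sketch for asymmetric Alice/Bob self-play.
    t_alice: steps Alice took to set up the task.
    t_bob: steps Bob took to undo (or repeat) it."""
    r_bob = -gamma * t_bob                     # Bob wants to finish fast
    r_alice = gamma * max(0, t_bob - t_alice)  # Alice wants tasks that slow Bob
    return r_alice, r_bob

print(self_play_rewards(t_alice=5, t_bob=9))  # (2.0, -4.5)
```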

We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of data by language (from 100 hours to 1100 hours). We compare three variants of multilingual training, from a single joint model without knowing the input language, to using this information, to multiple heads (one per language "cluster"). We show that multilingual models...

10.21437/interspeech.2020-2831 article EN Interspeech 2020 2020-10-25

We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning...

10.1109/icassp.2018.8462015 article EN ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018-04-01
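Initializing a learnable complex filterbank as an approximation of mel-filterbanks, as described above, can be sketched with Gabor wavelets whose center frequencies follow the mel scale. The fixed bandwidth below is a crude assumption of this sketch, not the paper's mel-matched fit, and all sizes are illustrative.

```python
import numpy as np

def gabor_filters(n_filters=40, size=401, sr=16000, fmin=100.0, fmax=7800.0):
    """Bank of complex Gabor filters with mel-spaced center frequencies,
    suitable as an initialization for learnable time-domain filterbanks."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    centers = imel(np.linspace(mel(fmin), mel(fmax), n_filters))  # Hz
    t = (np.arange(size) - size // 2) / sr                        # seconds
    sigma = 0.01  # fixed Gaussian width in seconds -- simplifying assumption
    env = np.exp(-0.5 * (t / sigma) ** 2)
    return np.stack([env * np.exp(2j * np.pi * fc * t) for fc in centers])

filters = gabor_filters()
print(filters.shape)  # (40, 401)
```

Convolving the waveform with these filters and taking squared moduli yields mel-like energies, after which every parameter can be fine-tuned jointly with the network.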

Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and in language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words....

10.48550/arxiv.1812.06864 preprint EN other-oa arXiv (Cornell University) 2018-01-01