- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Artificial Intelligence in Games
- Natural Language Processing Techniques
- Reinforcement Learning in Robotics
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Digital Games and Media
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Speech and Dialogue Systems
- Generative Adversarial Networks and Image Synthesis
- Time Series Analysis and Forecasting
- Software Engineering Research
- Intelligence, Security, War Strategy
- Terrorism, Counterterrorism, and Political Violence
- Software Testing and Debugging Techniques
- AI-based Problem Solving and Planning
- Semantic Web and Ontologies
- Parallel Computing and Optimization Techniques
- Advanced Image and Video Retrieval Techniques
- Sports Analytics and Performance
- Phonetics and Phonology Research
- Model-Driven Software Engineering Techniques
Menlo School
2020-2024
Laboratoire de Sciences Cognitives et Psycholinguistique
2013-2024
Alpha Omega Alpha Medical Honor Society
2024
Université Paris-Saclay
2023
Institut national de recherche en informatique et en automatique
2011-2023
Laboratoire Lorrain de Recherche en Informatique et ses Applications
2023
Meta (Israel)
2017-2022
Laboratoire d'Informatique de Grenoble
2010-2021
Université Grenoble Alpes
2010-2021
Collège de France
2011-2021
Transformers have recently been adapted for large-scale image classification, achieving high scores and shaking up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose...
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...
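The two alternating sublayers described above can be sketched in a few lines of NumPy. This is a shape-level illustration only: it uses ReLU instead of GELU and omits the paper's affine normalization and LayerScale-style rescaling, and all weight names are ours.

```python
import numpy as np

def resmlp_block(x, w_patch, w1, w2):
    """One ResMLP block (shape-level sketch, illustrative weight names).
    x: (num_patches, dim); w_patch: (num_patches, num_patches) mixes patches;
    w1: (dim, hidden) and w2: (hidden, dim) form the per-patch feed-forward."""
    # (i) cross-patch linear layer: patches interact, and the same w_patch is
    # applied to every channel (channels treated independently and identically)
    x = x + w_patch @ x
    # (ii) two-layer feed-forward: channels interact within each patch
    h = np.maximum(x @ w1, 0.0)  # the paper uses GELU; ReLU keeps the sketch short
    return x + h @ w2
```

Both residual branches preserve the `(num_patches, dim)` shape, so blocks can be stacked freely.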
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine...
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use...
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero...
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further...
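The skip-connected encoder-decoder layout mentioned above can be sketched at the data-flow level. This is a generic U-Net-style wiring, not the model's actual layers: the last encoder stage pairs with the first decoder stage.

```python
def encoder_decoder_with_skips(x, encoders, decoders):
    """Data-flow sketch of a skip-connected encoder-decoder, as used by
    waveform enhancement models: decoder stage i consumes the activation
    of the matching encoder stage, in reverse order."""
    skips = []
    for enc in encoders:
        x = enc(x)
        skips.append(x)  # remember each encoder output
    for dec in decoders:
        x = dec(x + skips.pop())  # additive skip from the matching stage
    return x
```

The skips let fine-grained waveform detail bypass the bottleneck, which matters for reconstruction tasks like denoising.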
This paper presents an overview of the existing work on AI for real-time strategy (RTS) games. Specifically, we focus on the work around the game StarCraft, which has emerged in the past few years as the unified test bed for this research. We describe the specific AI challenges posed by RTS games, and overview the solutions that have been explored to address them. Additionally, we also present a summary of the results of the recent StarCraft AI competitions, describing the architectures used by the participants. Finally, we conclude with a discussion emphasizing which problems...
This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from the raw waveform.
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research.
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...
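A minimal sketch of such "transposed" attention is below: the attention map is built over feature channels rather than tokens, so its size is independent of sequence length and the overall cost is linear in the number of tokens. The normalization details here are our simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transposed_attention(q, k, v):
    """Channel-wise ("transposed") attention sketch.
    q, k, v: (n, d). The attention map is (d, d), not (n, n)."""
    # normalize each channel over the tokens so channel dot products are bounded
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = softmax(qn.T @ kn)  # (d, d): size independent of sequence length n
    return v @ attn.T          # (n, d): cost linear in n
```

Doubling the number of tokens only doubles the work, instead of quadrupling it as in standard self-attention.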
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements...
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the gap between...
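The pseudo-labeling recipe can be sketched as a single round of the loop below. The helper names (`decode`, `retrain`) and the confidence filter are illustrative assumptions, not the paper's pipeline.

```python
def pseudo_label_round(decode, retrain, labeled, unlabeled, min_score=0.0):
    """One round of pseudo-labeling (hypothetical helper names):
    decode unlabeled utterances with the current model, keep confident
    hypotheses as synthetic transcripts, retrain on the union of both sets."""
    pseudo = [(utt, hyp)
              for utt, (hyp, score) in ((u, decode(u)) for u in unlabeled)
              if score >= min_score]  # keep only confident pseudo-labels
    return retrain(labeled + pseudo)
```

In practice this is iterated: the retrained model produces better pseudo-labels, which feed the next round.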
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million...
Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to performance improvements across a variety of setups. On a large-scale competitive setup,...
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light as well as 53k hours of unlabeled data from LibriVox achieves word error rates (WER) of 2.8%/4.8% on the clean and other test sets...
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus...
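The loss balancer idea can be sketched as follows: each loss's gradient is first rescaled to unit norm, so the assigned weight directly sets that loss's fraction of the combined gradient regardless of the loss's natural scale. This is a simplified one-step version under our own assumptions (the actual mechanism may track running norm statistics).

```python
import numpy as np

def balanced_gradient(grads, weights):
    """Loss-balancer sketch: rescale each loss gradient to unit norm, then
    combine with weights, so weights[i] is the fraction of the overall
    gradient contributed by loss i, independent of its raw magnitude."""
    total = sum(weights)
    out = np.zeros_like(grads[0], dtype=float)
    for g, w in zip(grads, weights):
        out += (w / total) * g / (np.linalg.norm(g) + 1e-12)
    return out
```

This avoids the usual problem where a loss with a much larger raw scale silently dominates training.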
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while...
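One such interleaving scheme, a "delay" pattern, can be sketched as below: codebook k is shifted right by k steps, so at each position the model predicts one token per codebook while each codebook still conditions on the previous codebooks' earlier tokens. This is our simplified illustration of the idea, not the model's exact implementation.

```python
def delay_interleave(streams, pad=-1):
    """Delay-pattern sketch: shift codebook stream k right by k steps so K
    parallel token streams can be handled by a single-stage LM.
    streams: list of K equal-length token lists; pad marks empty slots."""
    K, T = len(streams), len(streams[0])
    out = []
    for t in range(T + K - 1):
        # frame t holds token t of codebook 0, token t-1 of codebook 1, ...
        frame = [streams[k][t - k] if 0 <= t - k < T else pad for k in range(K)]
        out.append(frame)
    return out
```

The flattened sequence is only K - 1 steps longer than the original streams, versus K times longer for naive flattening.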
We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; Bob then attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will "propose" the task by doing a sequence of actions and Bob must then undo or repeat them, respectively. Via an appropriate reward structure, this automatically generates a curriculum of exploration,...
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants of multilingual training, from a single joint model without knowing the input language, to using this information, to multiple heads (one per language "cluster"). We show that multilingual models...
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning...
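A hypothetical version of such an initialization is sketched below: complex Gabor wavelets with mel-spaced center frequencies, which approximate a mel-filterbank analysis and can then be fine-tuned as ordinary convolution weights. The mel formulas are standard; the envelope-width heuristic and all parameter values are our illustrative choices, not the paper's.

```python
import numpy as np

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def gabor_init(n_filters=40, size=401, sr=16000):
    """Illustrative TD-filterbank initialization: complex Gabor wavelets at
    mel-spaced center frequencies. Returns (n_filters, size) complex taps
    intended to be fine-tuned jointly with the rest of the network."""
    centers = mel_to_hz(np.linspace(hz_to_mel(60.0),
                                    hz_to_mel(sr / 2 - 100.0), n_filters))
    t = (np.arange(size) - size // 2) / sr  # time axis centered at 0
    filters = []
    for fc in centers:
        sigma = 10.0 / fc  # wider envelope at low frequencies (assumed heuristic)
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        filters.append(envelope * np.exp(2j * np.pi * fc * t))  # complex carrier
    return np.stack(filters)
```

Convolving the waveform with these taps and taking the squared modulus yields a learnable spectrogram-like front end.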
Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words.