- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Artificial Intelligence in Games
- Natural Language Processing Techniques
- Reinforcement Learning in Robotics
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Digital Games and Media
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Speech and Dialogue Systems
- Generative Adversarial Networks and Image Synthesis
- Time Series Analysis and Forecasting
- Software Engineering Research
- Intelligence, Security, War Strategy
- Terrorism, Counterterrorism, and Political Violence
- Software Testing and Debugging Techniques
- AI-based Problem Solving and Planning
- Semantic Web and Ontologies
- Parallel Computing and Optimization Techniques
- Advanced Image and Video Retrieval Techniques
- Sports Analytics and Performance
- Phonetics and Phonology Research
- Model-Driven Software Engineering Techniques
Menlo School
2020-2024
Laboratoire de Sciences Cognitives et Psycholinguistique
2013-2024
Alpha Omega Alpha Medical Honor Society
2024
Université Paris-Saclay
2023
Institut national de recherche en informatique et en automatique
2011-2023
Laboratoire Lorrain de Recherche en Informatique et ses Applications
2023
Meta (Israel)
2017-2022
Laboratoire d'Informatique de Grenoble
2010-2021
Université Grenoble Alpes
2010-2021
Collège de France
2011-2021
Transformers have recently been adapted for large-scale image classification, achieving high scores and shaking up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose...
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...
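The two alternating sublayers described above can be sketched in a few lines of NumPy. This is a shape-level illustration only: it uses ReLU instead of GELU and omits the paper's affine normalization and LayerScale-style rescaling, and all weight names are ours.

```python
import numpy as np

def resmlp_block(x, w_patch, w1, w2):
    """One ResMLP block (shape-level sketch, illustrative weight names).
    x: (num_patches, dim); w_patch: (num_patches, num_patches) mixes patches;
    w1: (dim, hidden) and w2: (hidden, dim) form the per-patch feed-forward."""
    # (i) cross-patch linear layer: patches interact, and the same w_patch is
    # applied to every channel (channels treated independently and identically)
    x = x + w_patch @ x
    # (ii) two-layer feed-forward: channels interact within each patch
    h = np.maximum(x @ w1, 0.0)  # the paper uses GELU; ReLU keeps the sketch short
    return x + h @ w2
```

Both residual branches preserve the `(num_patches, dim)` shape, so blocks can be stacked freely.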
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine...
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use...
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero...
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further...
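The skip-connected encoder-decoder layout mentioned above can be sketched at the data-flow level. This is a generic U-Net-style wiring, not the model's actual layers: the last encoder stage pairs with the first decoder stage.

```python
def encoder_decoder_with_skips(x, encoders, decoders):
    """Data-flow sketch of a skip-connected encoder-decoder, as used by
    waveform enhancement models: decoder stage i consumes the activation
    of the matching encoder stage, in reverse order."""
    skips = []
    for enc in encoders:
        x = enc(x)
        skips.append(x)  # remember each encoder output
    for dec in decoders:
        x = dec(x + skips.pop())  # additive skip from the matching stage
    return x
```

The skips let fine-grained waveform detail bypass the bottleneck, which matters for reconstruction tasks like denoising.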
This paper presents an overview of the existing work on AI for real-time strategy (RTS) games. Specifically, we focus on the work around the game StarCraft, which has emerged in the past few years as the unified test bed for this research. We describe the specific AI challenges posed by RTS games, and overview the solutions that have been explored to address them. Additionally, we also present a summary of the results of the recent StarCraft AI competitions, describing the architectures used by the participants. Finally, we conclude with a discussion emphasizing which problems...
This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from the raw waveform.
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research.
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...
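A minimal sketch of such "transposed" attention is below: the attention map is built over feature channels rather than tokens, so its size is independent of sequence length and the overall cost is linear in the number of tokens. The normalization details here are our simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transposed_attention(q, k, v):
    """Channel-wise ("transposed") attention sketch.
    q, k, v: (n, d). The attention map is (d, d), not (n, n)."""
    # normalize each channel over the tokens so channel dot products are bounded
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = softmax(qn.T @ kn)  # (d, d): size independent of sequence length n
    return v @ attn.T          # (n, d): cost linear in n
```

Doubling the number of tokens only doubles the work, instead of quadrupling it as in standard self-attention.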
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements...
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the gap between...
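The pseudo-labeling recipe can be sketched as a single round of the loop below. The helper names (`decode`, `retrain`) and the confidence filter are illustrative assumptions, not the paper's pipeline.

```python
def pseudo_label_round(decode, retrain, labeled, unlabeled, min_score=0.0):
    """One round of pseudo-labeling (hypothetical helper names):
    decode unlabeled utterances with the current model, keep confident
    hypotheses as synthetic transcripts, retrain on the union of both sets."""
    pseudo = [(utt, hyp)
              for utt, (hyp, score) in ((u, decode(u)) for u in unlabeled)
              if score >= min_score]  # keep only confident pseudo-labels
    return retrain(labeled + pseudo)
```

In practice this is iterated: the retrained model produces better pseudo-labels, which feed the next round.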
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million...
Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to performance improvements across a variety of setups. On a large-scale competitive setup,...
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light as well as 53k hours of unlabeled data from LibriVox achieves word error rates (WER) of 2.8%/4.8% on the clean and other test sets...
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus...
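The loss balancer idea can be sketched as follows: each loss's gradient is first rescaled to unit norm, so the assigned weight directly sets that loss's fraction of the combined gradient regardless of the loss's natural scale. This is a simplified one-step version under our own assumptions (the actual mechanism may track running norm statistics).

```python
import numpy as np

def balanced_gradient(grads, weights):
    """Loss-balancer sketch: rescale each loss gradient to unit norm, then
    combine with weights, so weights[i] is the fraction of the overall
    gradient contributed by loss i, independent of its raw magnitude."""
    total = sum(weights)
    out = np.zeros_like(grads[0], dtype=float)
    for g, w in zip(grads, weights):
        out += (w / total) * g / (np.linalg.norm(g) + 1e-12)
    return out
```

This avoids the usual problem where a loss with a much larger raw scale silently dominates training.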
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while...
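One such interleaving scheme, a "delay" pattern, can be sketched as below: codebook k is shifted right by k steps, so at each position the model predicts one token per codebook while each codebook still conditions on the previous codebooks' earlier tokens. This is our simplified illustration of the idea, not the model's exact implementation.

```python
def delay_interleave(streams, pad=-1):
    """Delay-pattern sketch: shift codebook stream k right by k steps so K
    parallel token streams can be handled by a single-stage LM.
    streams: list of K equal-length token lists; pad marks empty slots."""
    K, T = len(streams), len(streams[0])
    out = []
    for t in range(T + K - 1):
        # frame t holds token t of codebook 0, token t-1 of codebook 1, ...
        frame = [streams[k][t - k] if 0 <= t - k < T else pad for k in range(K)]
        out.append(frame)
    return out
```

The flattened sequence is only K - 1 steps longer than the original streams, versus K times longer for naive flattening.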
We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; Bob then attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will "propose" the task by doing a sequence of actions and Bob must then undo or repeat them, respectively. Via an appropriate reward structure, this automatically generates a curriculum of exploration,...
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants of multilingual training, from a single joint model without knowing the input language, to using this information, to multiple heads (one per language "cluster"). We show that multilingual models...
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning...
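A hypothetical version of such an initialization is sketched below: complex Gabor wavelets with mel-spaced center frequencies, which approximate a mel-filterbank analysis and can then be fine-tuned as ordinary convolution weights. The mel formulas are standard; the envelope-width heuristic and all parameter values are our illustrative choices, not the paper's.

```python
import numpy as np

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def gabor_init(n_filters=40, size=401, sr=16000):
    """Illustrative TD-filterbank initialization: complex Gabor wavelets at
    mel-spaced center frequencies. Returns (n_filters, size) complex taps
    intended to be fine-tuned jointly with the rest of the network."""
    centers = mel_to_hz(np.linspace(hz_to_mel(60.0),
                                    hz_to_mel(sr / 2 - 100.0), n_filters))
    t = (np.arange(size) - size // 2) / sr  # time axis centered at 0
    filters = []
    for fc in centers:
        sigma = 10.0 / fc  # wider envelope at low frequencies (assumed heuristic)
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        filters.append(envelope * np.exp(2j * np.pi * fc * t))  # complex carrier
    return np.stack(filters)
```

Convolving the waveform with these taps and taking the squared modulus yields a learnable spectrogram-like front end.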
Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words.