- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Metallurgy and Material Forming
- Topic Modeling
- Music and Audio Processing
- Metal Alloys Wear and Properties
- Diverse Industrial Engineering Technologies
- Speech and Audio Processing
- Information Retrieval and Search Behavior
- Language and cultural evolution
- Recommender Systems and Techniques
- Mobile Crowdsensing and Crowdsourcing
- Expert finding and Q&A systems
- Ferroelectric and Negative Capacitance Devices
- Web Data Mining and Analysis
- Domain Adaptation and Few-Shot Learning
- Microstructure and Mechanical Properties of Steels
- Genomics and Phylogenetic Studies
- Reinforcement Learning in Robotics
- Machine Learning and Data Classification
- Neural Networks and Applications
- Algorithms and Data Compression
- Engineering Technology and Methodologies
- Advanced Memory and Neural Computing
- Machine Learning and Algorithms
Google (Switzerland)
2023
Samara State Technical University
2023
Meta (Israel)
2019-2022
École des hautes études en sciences sociales
2022
Institute of Forensic Science
2021
École Normale Supérieure
2021
National University of Science and Technology
2013-2019
National University of Science and Technology
2005-2019
Meta (United States)
2019
University of Glasgow
2012-2015
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero...
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec...
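The core idea above — treating audio generation as language modeling over discrete tokens — can be illustrated with a minimal sketch. This is not AudioLM's actual model; it is a toy bigram next-token model over hypothetical token sequences standing in for a real tokenizer's output:

```python
import random
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count next-token statistics over discrete token sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def continue_sequence(counts, prefix, length, rng):
    """Sample a continuation, choosing each next token in proportion
    to how often it followed the previous token during training."""
    seq = list(prefix)
    for _ in range(length):
        options = counts.get(seq[-1])
        if not options:
            break  # unseen context: stop generating
        tokens, weights = zip(*options.items())
        seq.append(rng.choices(tokens, weights=weights)[0])
    return seq

# Toy "semantic token" sequences (hypothetical data, not real codec output).
data = [[0, 1, 2, 3, 0, 1, 2, 3], [0, 1, 2, 0, 1, 2]]
model = train_bigram(data)
out = continue_sequence(model, [0, 1], 4, random.Random(0))
```

A real system replaces the bigram counts with a Transformer and the toy tokens with learned semantic and acoustic codes, but the generation loop has the same shape: condition on the prefix, sample the next discrete token, repeat.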
We propose using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction,...
Contrastive Predictive Coding (CPC), based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of the speech signal. However, it still under-performs compared to other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, which we adapt and optimize for the specificities of CPC (raw waveform input, contrastive loss, past-versus-future structure). We find that applying augmentation only to the segment from which the prediction is performed yields better...
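Time-domain augmentation of the kind described above operates directly on the raw waveform. The following is a minimal sketch of that idea using NumPy; the specific transforms (random gain, circular time shift, additive noise) and parameter values are illustrative assumptions, not WavAugment's actual API:

```python
import numpy as np

def augment_waveform(wave, rng, noise_scale=0.01, max_shift=160):
    """Apply simple time-domain augmentations to a raw waveform:
    random gain, circular time shift, and additive Gaussian noise."""
    out = wave * rng.uniform(0.8, 1.2)                        # random gain
    out = np.roll(out, rng.integers(-max_shift, max_shift))   # time shift
    out = out + rng.normal(0.0, noise_scale, size=out.shape)  # noise
    return out.astype(np.float32)

# One second of a 440 Hz tone at 16 kHz, standing in for real speech.
rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
aug = augment_waveform(wave, rng)
```

Because the transforms act on samples rather than on spectral features, the augmented signal can be fed unchanged into a raw-waveform encoder such as CPC's.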
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to “reading”) and from semantic tokens to low-level acoustic tokens (“speaking”). Decoupling these two tasks enables training of the “speaking” module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce...
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM, and the linguistic knowledge present only in text-based models such as PaLM-2. We demonstrate...
We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously, and reproduces more fluid turn taking compared to a text-based cascaded model.
We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by...
Words categorize the semantic fields they refer to in ways that maximize communication accuracy while minimizing complexity. Focusing on the well-studied color domain, we show that artificial neural networks trained with deep-learning techniques to play a discrimination game develop color-naming systems whose distribution on the accuracy/complexity plane closely matches that of human languages. The observed variation among emergent color-naming systems is explained by different degrees of discriminative need, of a sort that might also characterize...
Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. 2019.
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer ($k$-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical...
Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that the networks develop an \emph{anti-efficient} encoding...
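The ZLA pattern referenced above is easy to check mechanically on a corpus: under ZLA, more frequent words should be shorter on average. A minimal sketch, using a made-up toy token list rather than any real emergent-language data:

```python
from collections import Counter

def zla_check(corpus_tokens):
    """Crude ZLA check: compare the mean word length of the more-frequent
    half of the vocabulary against the less-frequent half."""
    freqs = Counter(corpus_tokens)
    pairs = sorted(((f, len(w)) for w, f in freqs.items()), reverse=True)
    half = len(pairs) // 2
    top_mean = sum(length for _, length in pairs[:half]) / half
    bottom_mean = sum(length for _, length in pairs[half:]) / (len(pairs) - half)
    return top_mean, bottom_mean  # ZLA-consistent when top_mean < bottom_mean

# ZLA-consistent toy corpus: the frequent words ("the", "a", "of") are short.
tokens = ("the a the of a the extraordinarily "
          "considerations the of a miscellaneous").split()
top_mean, bottom_mean = zla_check(tokens)
```

An anti-efficient code, as the paper's emergent languages turn out to be, would show the opposite inequality: the most frequent messages being among the longest.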
Online Learning to Rank is a powerful paradigm that allows us to train ranking models using only online feedback from users. In this work, we consider a Federated Online Learning to Rank setup (FOLtR) where on-mobile ranking models are trained in a way that respects the users' privacy. We require that user data, such as queries, results, and their feature representations, are never communicated for the purpose of the ranker's training. We believe this setup is interesting, as it combines unique requirements for the learning algorithm: (a) preserving user privacy, (b) low communication...
Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations. 2022.
Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as \emph{compositionality}. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation...
Online evaluation methods, such as A/B and interleaving experiments, are widely used for search engine evaluation. Since they rely on noisy implicit user feedback, running each experiment takes a considerable time. Recently, the problem of reducing the duration of online experiments has received substantial attention from the research community. However, the possibility of using sequential statistical testing procedures to reduce the time required remains less studied. Such procedures allow an experiment to stop early, once the data collected is...
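Sequential testing of the kind mentioned above can be illustrated with Wald's classic sequential probability ratio test (SPRT) for a Bernoulli success rate; this is a generic textbook procedure, not necessarily the one studied in the paper, and the hypothesized rates and error levels below are illustrative assumptions:

```python
import math

def sprt(samples, p0=0.5, p1=0.6, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli rate: accumulate the log-likelihood
    ratio of H1 (p=p1) vs H0 (p=p0) per observation, and stop as soon
    as it crosses either decision boundary."""
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0
    for n, x in enumerate(samples, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", len(samples)

# A run of consistent successes lets the test stop well before
# all 40 observations are consumed.
decision, n_used = sprt([1] * 40)
```

The appeal for online experimentation is exactly this early stopping: a clearly winning (or losing) variant ends the experiment after far fewer user interactions than a fixed-horizon test would require.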
Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
Query suggestion or auto-completion mechanisms are widely used by search engines and are increasingly attracting interest from the research community. However, the lack of a commonly accepted evaluation methodology and metrics means that it is not possible to compare the results of approaches in the literature. Moreover, the metrics often used to evaluate query suggestions tend to be an adaptation from other domains without a proper justification. Hence, it is not necessarily clear if the improvements reported in the literature would result in an actual improvement...
Despite their failure to solve the compositional SCAN dataset, seq2seq architectures still achieve astonishing success on more practical tasks. This observation pushes us to question the usefulness of SCAN-style compositional generalization in realistic NLP tasks. In this work, we study the benefit that such compositionality brings to several machine translation tasks. We present several focused modifications of the Transformer that greatly improve its generalization capabilities on SCAN, and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance...
Query suggestion or auto-completion mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions to improve the mechanisms' quality can be in opposition: while the latter aims to promote suggestions that address search intents the user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce...
Studies of discrete languages emerging when neural agents communicate to solve a joint task often look for evidence of compositional structure. This stems from the expectation that such a structure would allow a language to be acquired faster by the agents and enable them to generalize better. We argue that these beneficial properties are only loosely connected to compositionality. In two experiments, we demonstrate that, depending on the task, non-compositional languages might show equal, or better, generalization performance and acquisition speed than...