Eugene Kharitonov

ORCID: 0009-0000-8653-721X
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Metallurgy and Material Forming
  • Topic Modeling
  • Music and Audio Processing
  • Metal Alloys Wear and Properties
  • Diverse Industrial Engineering Technologies
  • Speech and Audio Processing
  • Information Retrieval and Search Behavior
  • Language and cultural evolution
  • Recommender Systems and Techniques
  • Mobile Crowdsensing and Crowdsourcing
  • Expert finding and Q&A systems
  • Ferroelectric and Negative Capacitance Devices
  • Web Data Mining and Analysis
  • Domain Adaptation and Few-Shot Learning
  • Microstructure and Mechanical Properties of Steels
  • Genomics and Phylogenetic Studies
  • Reinforcement Learning in Robotics
  • Machine Learning and Data Classification
  • Neural Networks and Applications
  • Algorithms and Data Compression
  • Engineering Technology and Methodologies
  • Advanced Memory and Neural Computing
  • Machine Learning and Algorithms

Google (Switzerland)
2023

Samara State Technical University
2023

Meta (Israel)
2019-2022

École des hautes études en sciences sociales
2022

Institute of Forensic Science
2021

École Normale Supérieure
2021

National University of Science and Technology
2013-2019

National University of Science and Technology
2005-2019

Meta (United States)
2019

University of Glasgow
2012-2015

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero...

10.1109/icassp40776.2020.9052942 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure, and the discrete codes produced by a neural audio codec...

10.1109/taslp.2023.3288409 article EN IEEE/ACM Transactions on Audio, Speech, and Language Processing 2023-01-01

We propose using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, we separately extract low-bitrate representations for the speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction,...

10.21437/interspeech.2021-475 article EN Interspeech 2021 2021-08-27

Contrastive Predictive Coding (CPC), based on predicting future segments of speech from past ones, is emerging as a powerful algorithm for representation learning of the speech signal. However, it still under-performs compared to other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, which we adapt and optimize to the specificities of CPC (raw waveform input, contrastive loss, past versus future structure). We find that applying augmentation only to the segments from which the prediction is performed yields better...
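To illustrate the kind of time-domain augmentation discussed above (this is a minimal pure-Python sketch, not WavAugment's actual API; the function names, SNR value, and shift range are hypothetical), the idea is to perturb only the past segment while leaving the prediction targets untouched:

```python
import random

def add_noise(wave, snr_db, rng):
    """Additive Gaussian noise at a given signal-to-noise ratio (dB)."""
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_power = signal_power / (10 ** (snr_db / 10))
    scale = noise_power ** 0.5
    return [x + rng.gauss(0, scale) for x in wave]

def time_shift(wave, max_shift, rng):
    """Randomly rotate the waveform by up to max_shift samples."""
    k = rng.randint(-max_shift, max_shift)
    return wave[k:] + wave[:k]

def augment_past(past, future, snr_db=15, max_shift=160, rng=None):
    """Augment the 'past' segment only; the prediction targets
    ('future') are returned untouched, mirroring CPC's asymmetry."""
    rng = rng or random.Random(0)
    return time_shift(add_noise(past, snr_db, rng), max_shift, rng), future
```

Keeping the future segment clean matters here: the contrastive loss compares predictions against the unmodified targets, so only the encoder input is perturbed.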

10.1109/slt48900.2021.9383605 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to “reading”) and from semantic tokens to low-level acoustic tokens (“speaking”). Decoupling these two tasks enables training of the “speaking” module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce...

10.1162/tacl_a_00618 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM, and the linguistic knowledge present only in text-based models such as PaLM-2. We demonstrate...

10.48550/arxiv.2306.12925 preprint EN other-oa arXiv (Cornell University) 2023-01-01

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously, and reproduces more fluid turn taking compared to a text-based cascaded model.

10.1162/tacl_a_00545 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale to longer sequences by...
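The confidence-based parallel decoding idea can be sketched in a few lines. The toy loop below (a simplified MaskGIT-style schedule, not SoundStorm's actual implementation; `score_fn`, the linear unmasking schedule, and the token values are assumptions for illustration) starts from a fully masked sequence and, at each step, commits only the most confident predictions:

```python
def parallel_decode(seq_len, score_fn, steps=4):
    """Confidence-based parallel decoding sketch: predict all masked
    positions in parallel, keep the most confident fraction per step,
    re-mask the rest, and unmask everything by the final step."""
    MASK = None
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        proposals = []
        for i, t in enumerate(tokens):
            if t is MASK:
                probs = score_fn(tokens, i)          # dict: token -> probability
                tok = max(probs, key=probs.get)
                proposals.append((probs[tok], i, tok))
        if not proposals:
            break
        # linear schedule: commit a growing share, everything on the last step
        n_keep = len(proposals) if step == steps else max(1, round(len(proposals) * step / steps))
        for _, i, tok in sorted(proposals, reverse=True)[:n_keep]:
            tokens[i] = tok
    return tokens
```

Because each step fills many positions at once, the number of model calls depends on the step count rather than the sequence length, which is the source of the speed-up over token-by-token autoregressive decoding.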

10.48550/arxiv.2305.09636 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Words categorize the semantic fields they refer to in ways that maximize communication accuracy while minimizing complexity. Focusing on the well-studied color domain, we show that artificial neural networks trained with deep-learning techniques to play a discrimination game develop color-naming systems whose distribution on the accuracy/complexity plane closely matches that of human languages. The observed variation among emergent color-naming systems is explained by different degrees of discriminative need, of a sort that might also characterize...

10.1073/pnas.2016569118 article EN cc-by Proceedings of the National Academy of Sciences 2021-03-15

10.17513/vaael.4043 article EN Bulletin of the Altai Academy of Economics and Law 2025-01-01

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. 2019.

10.18653/v1/d19-3010 preprint EN cc-by 2019-01-01

We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer (k-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical...
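The quantizer stage of such a pipeline can be made concrete with a tiny self-contained sketch: plain Lloyd's k-means turns continuous frame vectors into a sequence of discrete unit IDs that a downstream BERT/LSTM language model could consume. The function names and toy 2-D "frames" are illustrative, not the challenge's baseline code:

```python
import random

def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means over frame vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(frames, k)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for f in frames:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])))
            clusters[j].append(f)
        # update step: move each center to its cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centers

def quantize(frames, centers):
    """Map each frame to the index of its nearest centroid, producing
    a pseudo-text of discrete units for language modeling."""
    return [min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])))
            for f in frames]
```

In the actual challenge baseline the frames would be CPC activations rather than raw features, but the discretization step has the same shape: continuous vectors in, unit IDs out.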

10.21437/interspeech.2021-1755 article EN Interspeech 2021 2021-08-27

Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated with shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that networks develop an anti-efficient encoding...
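Zipf's Law of Abbreviation is easy to check on a toy corpus: compute the correlation between each word type's frequency and its length, which ZLA predicts to be negative. A minimal sketch (`zla_correlation` and the toy corpus are made-up illustrations, not the paper's measurement):

```python
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def zla_correlation(corpus_tokens):
    """Correlation between word-type frequency and word length.
    Under ZLA this should come out negative."""
    counts = Counter(corpus_tokens)
    words = list(counts)
    freqs = [counts[w] for w in words]
    lengths = [len(w) for w in words]
    return pearson(freqs, lengths)
```

An "anti-efficient" code, as found in the paper's emergent languages, would yield a positive value on the messages the agents produce: frequent messages end up longer, not shorter.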

10.48550/arxiv.1905.12561 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Online Learning to Rank is a powerful paradigm that allows training ranking models using only online feedback from their users. In this work, we consider a Federated Online Learning to Rank setup (FOLtR) where on-mobile ranking models are trained in a way that respects the users' privacy. We require that user data, such as queries, results, and their feature representations, are never communicated for the purpose of the ranker's training. We believe this setup is interesting, as it combines unique requirements for the learning algorithm: (a) preserving user privacy, (b) low communication...

10.1145/3289600.3290968 article EN 2019-01-30

Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations. 2022.

10.18653/v1/2022.naacl-demo.1 article EN cc-by 2022-01-01

Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.593 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages, inspired by disentanglement in representation...

10.18653/v1/2020.acl-main.407 preprint EN 2020-01-01

Online evaluation methods, such as A/B and interleaving experiments, are widely used for search engine evaluation. Since they rely on noisy implicit user feedback, running each experiment takes a considerable time. Recently, the problem of reducing the duration of online experiments has received substantial attention from the research community. However, the possibility of using sequential statistical testing procedures to reduce the time required remains less studied. Such procedures allow an experiment to stop early, once the data collected is...
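A classic instance of such a sequential procedure is Wald's SPRT, which monitors a log-likelihood ratio and stops as soon as it crosses a decision boundary. The sketch below (Bernoulli outcomes, hypothetical parameter values; a generic textbook SPRT, not the paper's exact procedure) shows the early-stopping behavior:

```python
import math

def sprt(observations, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Wald's Sequential Probability Ratio Test on a stream of 0/1
    outcomes (e.g. per-query preference for the treatment ranker in
    an interleaving experiment).
    Returns ('accept_h1' | 'accept_h0' | 'continue', samples_used)."""
    upper = math.log((1 - beta) / alpha)   # accept H1 (p = p1) above this
    lower = math.log(beta / (1 - alpha))   # accept H0 (p = p0) below this
    llr = 0.0
    for n, x in enumerate(observations, 1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)
```

The appeal for online evaluation is exactly what the abstract describes: when the effect is large, the boundary is crossed after a few dozen observations instead of a fixed-horizon sample, so the experiment ends early.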

10.1145/2766462.2767729 article EN Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-04

Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu Anh Nguyen, Morgan Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.

10.18653/v1/2022.emnlp-main.769 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Query suggestion or auto-completion mechanisms are widely used by search engines and are increasingly attracting interest from the research community. However, the lack of a commonly accepted evaluation methodology and metrics means that it is not possible to compare results across approaches in the literature. Moreover, the metrics often used to evaluate query suggestions tend to be an adaptation of those from other domains, without a proper justification. Hence, it is not necessarily clear if improvements reported in the literature would result in an actual improvement...

10.1145/2484028.2484041 article EN Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval 2013-07-28

Despite their failure to solve the compositional SCAN dataset, seq2seq architectures still achieve astonishing success on more practical tasks. This observation pushes us to question the usefulness of SCAN-style compositional generalization in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that greatly improve generalization capabilities on SCAN, and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance...

10.18653/v1/2021.blackboxnlp-1.9 preprint EN cc-by 2021-01-01

The query suggestion or auto-completion mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions to improve the mechanisms' quality can be in opposition: while the latter aims to promote suggestions that address the intents the user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce...
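The frequency-ranking baseline and its redundancy problem can be made concrete with a small sketch: rank completions of the typed prefix by log frequency, then greedily skip candidates that merely extend an already-selected suggestion. The function name and toy query log are illustrative only, not the paper's method:

```python
from collections import Counter

def suggest(query_log, prefix, k=3):
    """Rank completions of `prefix` by log frequency, then greedily
    drop candidates that are refinements of an already-kept suggestion."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    selected = []
    for q, _ in counts.most_common():
        if any(q.startswith(s + " ") for s in selected):
            continue  # redundant: just extends an existing suggestion
        selected.append(q)
        if len(selected) == k:
            break
    return selected
```

Dropping the refinement "cheap flights to paris" once "cheap flights" is shown frees a slot for a different intent ("cheap hotels"), which is the diversification direction the abstract contrasts with personalisation.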

10.1145/2505515.2505661 article EN 2013-10-27

Studies of discrete languages emerging when neural agents communicate to solve a joint task often look for evidence of compositional structure. This stems from the expectation that such a structure would allow languages to be acquired faster by the agents and enable them to generalize better. We argue that these beneficial properties are only loosely connected to compositionality. In two experiments, we demonstrate that, depending on the task, non-compositional languages might show equal, or better, generalization performance and acquisition speed than...

10.18653/v1/2020.blackboxnlp-1.2 article EN cc-by 2020-01-01