Alena Fenogenova

ORCID: 0000-0003-3139-1668
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Multimodal Machine Learning Applications
  • Authorship Attribution and Profiling
  • Computational and Text Analysis Methods
  • Advanced Text Analysis Techniques
  • Linguistics, Language Diversity, and Identity
  • Sentiment Analysis and Opinion Mining
  • Speech Recognition and Synthesis
  • Speech and Dialogue Systems
  • Scientific Research and Philosophical Inquiry
  • Adversarial Robustness in Machine Learning
  • Innovative Educational Technologies
  • Language, Metaphor, and Cognition
  • Information Systems and Technology Applications
  • Advanced Research in Systems and Signal Processing
  • Lexicography and Language Studies
  • Linguistics and Terminology Studies
  • Educational Games and Gamification
  • Second Language Acquisition and Learning
  • Image Retrieval and Classification Techniques
  • Semantic Web and Ontologies
  • Foreign Language Teaching Methods
  • Artificial Intelligence in Games

Custom MMIC (United States)
2024

National Research University Higher School of Economics
2016-2023

Siberian Academy of Finance and Banking
2020-2021

Abstract This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo an intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with contemporaneous language models while covering a larger number...

10.1162/tacl_a_00633 article EN cc-by Transactions of the Association for Computational Linguistics 2024-01-01
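The in-context learning evaluated for mGPT rests on formatting a handful of demonstrations into a single prompt that the model then continues. A minimal sketch of that prompt construction; the `=>` separator and the toy country task are illustrative assumptions, not the paper's actual format:

```python
def build_few_shot_prompt(demonstrations, query, sep="\n"):
    """Join (input, output) demonstrations and append the unanswered query;
    the model's continuation of the last line is read off as its prediction."""
    lines = [f"{x} => {y}" for x, y in demonstrations]
    lines.append(f"{query} =>")
    return sep.join(lines)

demos = [("Moscow", "Russia"), ("Paris", "France")]
prompt = build_few_shot_prompt(demos, "Berlin")
```

No gradient updates are involved: the model is conditioned on the prompt alone, which is what lets a single pretrained checkpoint be probed across all covered languages.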

Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using the pre-trained models. This paper introduces two GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture from GPT-2 sources and the sparse attention mechanism; the Deepspeed and Megatron frameworks allow us to parallelize the training...

10.48550/arxiv.2204.07580 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Evlampiev. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

10.18653/v1/2020.emnlp-main.381 article EN cc-by 2020-01-01

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code...

10.48550/arxiv.2502.13595 preprint EN arXiv (Cornell University) 2025-02-19
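Retrieval tasks of the kind MMTEB aggregates reduce to ranking candidate documents by similarity between embedding vectors. A self-contained sketch of that core operation; the toy 2-d vectors are illustrative stand-ins for real embeddings with hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity to the
    query - the core operation behind embedding-benchmark retrieval tasks."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda p: cosine(query_vec, p[1]),
                    reverse=True)
    return [i for i, _ in scored]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query = [0.9, 0.1]
order = rank_documents(query, docs)
```

The benchmark then scores the ranking against gold relevance labels (e.g. with nDCG); only the embedding model itself varies between submissions.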

Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding...

10.48550/arxiv.2309.10931 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing of intellectual skills - detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a set of nine tasks, collected and organized analogically to the SuperGLUE methodology, was...

10.48550/arxiv.2010.15925 preprint EN other-oa arXiv (Cornell University) 2020-01-01

The paper introduces two Russian machine reading comprehension (MRC) datasets, called MuSeRC and RuCoS, which require reasoning over multiple sentences and commonsense knowledge to infer the answer. The former follows the design of MultiRC, while the latter is a counterpart of the ReCoRD dataset. The datasets are included in RussianSuperGLUE, the Russian general language understanding benchmark. We provide a comparative analysis and demonstrate that the proposed tasks are relatively more complex compared to the original ones for English. Besides,...

10.18653/v1/2020.coling-main.570 article EN cc-by Proceedings of the 28th International Conference on Computational Linguistics 2020-01-01

We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The dataset includes texts from 14 text generators, i.e., one human writer and 13 generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, summarization, and simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple...

10.28995/2075-7182-2022-21-497-511 article EN Computational Linguistics and Intellectual Technologies 2022-06-18
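Detectors submitted to such a shared task are trained classifiers; as a toy illustration of the task's input/output contract only, here is a single-feature heuristic. The type-token-ratio feature and the 0.5 threshold are assumptions for illustration, not anything from the task itself:

```python
def type_token_ratio(text):
    """Ratio of unique tokens to total tokens - a crude lexical-diversity signal."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def flag_generated(text, threshold=0.5):
    """Toy heuristic: flag texts with low lexical diversity as machine-generated.
    Hypothetical threshold; real detectors learn features from labeled data."""
    return type_token_ratio(text) < threshold
```

The shared-task setting replaces this hand-set rule with models trained on the 14-generator dataset, but the binary human-vs-machine decision is the same.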

Abstract With recent advances in natural language generation, the risks associated with the rapid proliferation and misuse of generative models for malicious purposes steadily increase. Artificial text detection (ATD) has emerged to develop resources and computational methods to mitigate these risks, such as the generation of fake news and scientific article reviews. This paper introduces the corpus of artificial texts (CoAT), a large-scale corpus of human-written and generated texts in the Russian language. CoAT spans six domains and comprises...

10.1017/nlp.2024.38 article EN Natural Language Processing 2024-09-06

Text detoxification is the task of rewriting a toxic text into a neutral one while preserving its original content. It has a wide range of applications, e.g. moderation of the output of neural chatbots or suggesting a less emotional version of posts on social networks. This paper provides a description of the RUSSE-2022 competition of detoxification methods for the Russian language. It is the first competition which features (i) parallel training data and (ii) manual evaluation. We describe the setup of the competition and the solutions of the participating teams, and analyse their performance. In...

10.28995/2075-7182-2022-21-114-131 article EN Computational Linguistics and Intellectual Technologies 2022-06-18
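The simplest baseline family for detoxification is lexicon substitution: swap listed toxic words for neutral synonyms and leave everything else intact. A sketch with a tiny hypothetical dictionary; the competition's actual systems are trained sequence-to-sequence models, not word lists:

```python
# Hypothetical toxic-to-neutral lexicon, for illustration only.
NEUTRAL = {"idiotic": "unreasonable", "trash": "poor"}

def detoxify(text):
    """Naive word-substitution detoxifier: replaces listed toxic words with
    neutral synonyms while leaving the rest of the sentence untouched."""
    return " ".join(NEUTRAL.get(word, word) for word in text.split())
```

Even this baseline exposes the task's central tension, which manual evaluation in RUSSE-2022 measures directly: removing toxicity while preserving the original meaning and fluency.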

This article is devoted to the problem of Anglicisms in Russian texts: the tasks of their detection and of automatic text rewriting with the substitution of Anglicisms by their Russian-language equivalents. Within the framework of the study, we present a parallel corpus and models that identify Anglicisms and replace them with a Russian equivalent, preserving the stylistics of the original text.

10.28995/2075-7182-2023-22-295-306 article EN Computational Linguistics and Intellectual Technologies 2023-06-19

DaNetQA, a new question-answering corpus, follows the (Clark et al., 2019) design: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer, derived from the paragraph. The task is to take both as input and come up with a yes/no answer, i.e. produce a binary output. In this paper, we present a reproducible approach to DaNetQA creation and investigate transfer learning methods for language transferring. For transferring, we leverage three similar sentence modelling tasks: 1) a corpus of...

10.48550/arxiv.2010.02605 preprint EN other-oa arXiv (Cornell University) 2020-01-01
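The binary input/output contract of DaNetQA can be illustrated with a trivial lexical-overlap baseline: answer "yes" when the question's words mostly appear in the paragraph. The 0.5 overlap threshold is an illustrative assumption; actual systems for the corpus are trained models:

```python
def yes_no_baseline(question, paragraph):
    """Toy yes/no QA baseline: predict True when at least half of the
    question's words occur in the paragraph. Illustrates only the task
    format (question + paragraph in, binary answer out)."""
    q_words = set(question.lower().rstrip("?").split())
    p_words = set(paragraph.lower().split())
    overlap = len(q_words & p_words) / len(q_words) if q_words else 0.0
    return overlap >= 0.5
```

Such overlap heuristics are exactly what a well-constructed corpus of natural questions is designed to defeat, which is why transfer learning from related sentence-modelling tasks is investigated instead.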

We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The dataset includes texts from 14 text generators, i.e., one human writer and 13 generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, summarization, and simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple...

10.48550/arxiv.2206.01583 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Spiridonova, Valentina Kurenshchikova, Artemova, Vladislav Mikhailov. Findings of the Association for Computational Linguistics: EMNLP 2022.

10.18653/v1/2022.findings-emnlp.183 article EN cc-by 2022-01-01

Over the past few years, one of the most notable advancements in AI research has been foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language...

10.48550/arxiv.2401.04531 preprint EN cc-by arXiv (Cornell University) 2024-01-01

Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying...

10.48550/arxiv.2406.19232 preprint EN arXiv (Cornell University) 2024-06-27
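Minimal-pair benchmarks score a model by whether it prefers the grammatical member of each pair, typically by comparing sentence probabilities. The metric can be sketched as follows; the length-based toy scorer is a stand-in assumption for a language model's summed token log-probabilities:

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the scorer
    ranks the grammatical sentence higher - the standard metric for
    minimal-pair benchmarks."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)

# Toy stand-in scorer that prefers shorter sentences; a real evaluation
# plugs in an LM's total log-probability for the sentence instead.
toy_score = lambda s: -len(s.split())

pairs = [("He runs fast", "He run fast quickly"),
         ("She reads", "She reads reads")]
acc = minimal_pair_accuracy(pairs, toy_score)
```

Because each pair isolates a single morphological, syntactic, or semantic phenomenon, accuracy can be reported per phenomenon, pinpointing which grammatical contrasts a model actually encodes.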