Ximena Gutierrez-Vasques

ORCID: 0000-0002-1486-2774
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Authorship Attribution and Profiling
  • Syntax, Semantics, Linguistic Variation
  • Spanish Linguistics and Language Studies
  • Semantic Web and Ontologies
  • Language and cultural evolution
  • Speech and dialogue systems
  • Literary and Cultural Studies
  • Music and Audio Processing
  • Translation Studies and Practices
  • Cultural and political discourse analysis
  • Phonetics and Phonology Research
  • Biomedical Text Mining and Ontologies
  • Linguistic Variation and Morphology
  • Journalism and Media Studies

Universidad Nacional Autónoma de México
2016-2024

Centro de Investigaciones Interdisciplinarias en Ciencias y Humanidades
2024

University of Zurich
2018-2023

Dartmouth College
2021

Meta (United States)
2021

Universidad de la República
2021

Carnegie Mellon University
2021

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, Katharina Kann. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. 2021.

10.18653/v1/2021.americasnlp-1.23 article EN cc-by 2021-01-01

Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, digital resources and available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant, low-resource scenarios are faced. We would like to encourage research in the linguistically rich and diverse areas of the Americas.

10.48550/arxiv.1806.04291 preprint EN cc-by arXiv (Cornell University) 2018-01-01

In linguistics, there is little consensus on how to define, measure, and compare complexity across languages. We propose to take the diversity of viewpoints as a given, and to capture language complexity by a vector of measurements rather than a single value. We then assess the statistical support for two controversial hypotheses: the trade-off hypothesis and the equi-complexity hypothesis. We furnish meta-analyses of 28 metrics applied to texts written in overall 80 typologically diverse languages. The trade-off hypothesis is partially supported, in the sense that around one third...

10.1515/lingvan-2021-0054 article EN cc-by Linguistics Vanguard 2022-10-14
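The trade-off analysis described above can be pictured with a small, hedged sketch: represent each language by complexity scores on two metrics and test whether they correlate negatively across languages. The metric names, the simulated values, and the use of a Spearman correlation are illustrative assumptions, not the paper's 28-metric, 80-language setup.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical complexity scores for 20 "languages" on two metrics;
# values are simulated, with a built-in negative trend for illustration.
rng = np.random.default_rng(0)
morph_complexity = rng.normal(size=20)
syntax_complexity = -0.5 * morph_complexity + rng.normal(scale=0.8, size=20)

rho, p = spearmanr(morph_complexity, syntax_complexity)
# A significantly negative rank correlation across languages would count as
# evidence for a trade-off between the two complexity dimensions.
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```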

Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.302 article EN cc-by 2021-01-01

Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel...

10.1162/coli_a_00489 article EN cc-by-nc-nd Computational Linguistics 2023-07-07
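For readers unfamiliar with the merge operations mentioned above, here is a minimal, self-contained sketch of the standard BPE procedure: repeatedly find the most frequent adjacent symbol pair and merge it into a new subword. The toy vocabulary and the number of merges are illustrative assumptions, not the corpora or settings used in the paper.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-tuple-of-symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus (illustrative only): words split into characters, with frequencies.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(10):  # the merge count here is arbitrary
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair = next BPE merge
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```

The earliest merges absorb the most frequent character pairs, which is why they account for most of the compression the abstract refers to.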

The aim of this thesis proposal is to perform bilingual lexicon extraction for cases in which small parallel corpora are available and it is not easy to obtain a monolingual corpus for at least one of the languages. Moreover, the languages are typologically distant and there is no seed lexicon available. We focus on the language pair Spanish-Nahuatl, and we propose to work with morpheme-based representations in order to reduce sparseness and facilitate the task of finding lexical correspondences between a highly agglutinative language and a fusional one. We take into...

10.3115/v1/n15-2021 article EN cc-by 2015-01-01

Little attention has been paid to the development of human language technology for truly low-resource languages, i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely focused on shallow token-level tasks. It remains unclear if the learning of deeper semantic tasks is...

10.3389/frai.2022.995667 article EN cc-by Frontiers in Artificial Intelligence 2022-12-02

We propose a quantitative approach for measuring the morphological complexity of a language based on text. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also the predictability of those processes. We use a model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the internal structure of words. Our results show that it is important to integrate these two dimensions when...

10.3390/e22010048 article EN cc-by Entropy 2019-12-30
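The entropy-rate idea above can be illustrated with a short sketch: fit a character-level model over word-internal sequences and use its per-character cross-entropy as a predictability score. The choice of a trigram model, add-one smoothing, and the toy word lists are assumptions made for illustration; the paper's actual model and corpora may differ.

```python
import math
from collections import Counter

def char_trigram_entropy_rate(train_words, test_words):
    """Cross-entropy (bits per character) of an add-one smoothed character trigram
    model, used here as a rough proxy for word-internal predictability."""
    BOS, EOS = "<", ">"
    tri, bi = Counter(), Counter()
    charset = {EOS}
    for w in train_words:
        seq = BOS + BOS + w + EOS
        charset.update(w)
        for i in range(2, len(seq)):
            tri[seq[i - 2:i + 1]] += 1
            bi[seq[i - 2:i]] += 1
    V = len(charset)
    total_bits, total_chars = 0.0, 0
    for w in test_words:
        seq = BOS + BOS + w + EOS
        for i in range(2, len(seq)):
            p = (tri[seq[i - 2:i + 1]] + 1) / (bi[seq[i - 2:i]] + V)  # add-one smoothing
            total_bits += -math.log2(p)
            total_chars += 1
    return total_bits / total_chars

# Toy usage (placeholder word lists, not the corpora used in the paper):
train = ["cantar", "cantaba", "cantaremos", "comer", "comimos"]
test = ["cantamos", "comeremos"]
print(round(char_trigram_entropy_rate(train, test), 3), "bits per character")
```

Lower values indicate more predictable word-internal structure; higher values indicate that knowing part of a word tells you little about the rest.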

In linguistics, interlinear glossing is an essential procedure for analyzing the morphology of languages. This type of annotation is useful for language documentation, and it can also provide valuable data for NLP applications. We perform automatic glossing for Otomi, an under-resourced language. Our work comprises the pre-processing of a corpus. We implement different sequential labelers. CRF models represented an efficient and good solution for our task. Two main observations emerged from our work: 1) models with a higher number of parameters (RNNs)...

10.18653/v1/2021.americasnlp-1.5 article EN cc-by 2021-01-01
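As a rough illustration of CRF-based sequence labelling for glossing, the sketch below uses the sklearn-crfsuite library. The feature template, the placeholder tokens, and the gloss tags are my own assumptions for demonstration; they are not the Otomi corpus or the feature set used in the paper.

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(sent, i):
    """Simple surface features for one token; a real glossing system would use richer features."""
    tok = sent[i]
    return {
        "lower": tok.lower(),
        "prefix2": tok[:2],
        "suffix2": tok[-2:],
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Placeholder training data: token sequences paired with gloss labels
# (synthetic strings, not actual Otomi material).
sentences = [["tok1", "tok2", "tok3"], ["tok4", "tok2", "tok5"]]
glosses = [["DET", "N", "V"], ["PRON", "N", "V"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = glosses

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1,
    max_iterations=100, all_possible_transitions=True,
)
crf.fit(X, y)

new_sent = ["tok1", "tok2", "tok5"]
print(crf.predict([[token_features(new_sent, i) for i in range(len(new_sent))]]))
```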

Large language models are technologies that have shown a remarkable capacity for producing text that simulates human writing; they are behind conversational agents such as ChatGPT or Gemini. Although their impact and use have extended to numerous sectors of society, the technical and scientific foundations underlying these artificial intelligence developments are not always discussed. This article offers an introduction to how language models work, from the first proposals up to...

10.22201/dgtic.26832968e.2024.10.18 article ES cc-by-nc TIES Revista de Tecnología e Innovación en Educación Superior 2024-06-21

We use two small parallel corpora for comparing the morphological complexity of Spanish, Otomi and Nahuatl. These are languages that belong to different linguistic families, the latter two being low-resourced. We take into account two quantitative criteria: on the one hand, the distribution of types over tokens in a corpus; on the other, perplexity and entropy as indicators of word structure predictability. We show that a language can be complex in terms of how many word forms it can produce, and yet have less predictability in the internal structure of its words.

10.48550/arxiv.1808.04314 preprint EN other-oa arXiv (Cornell University) 2018-01-01
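The two criteria mentioned above can be computed in a few lines. The sketch below uses the type-token ratio for productivity and word-level unigram entropy as one simple predictability-related statistic; the token lists are placeholders, not the parallel corpora used in the paper, and the exact measures there may differ.

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Share of distinct word forms: a crude indicator of morphological productivity."""
    return len(set(tokens)) / len(tokens)

def unigram_entropy(tokens):
    """Shannon entropy (bits) of the word-form distribution in a token list."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

# Illustrative snippets only: a short Spanish phrase list and a synthetic
# placeholder standing in for a morphologically richer language.
corpus_a = "la casa grande la casa vieja las casas grandes".split()
corpus_b = "w1 w2 w3 w4 w5 w2 w6 w7 w8".split()

for name, toks in [("corpus_a", corpus_a), ("corpus_b", corpus_b)]:
    print(name, "TTR:", round(type_token_ratio(toks), 3),
          "entropy:", round(unigram_entropy(toks), 3), "bits")
```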

The aim of this work is to extract word translation pairs from a small parallel corpus and to measure the impact of dealing with morphology for improving this task. We focus on the language pair Spanish-Nahuatl; both languages are morphologically rich and distant from each other. We generate semi-supervised morphological segmentation models, and we compare two approaches (estimation, association) for extracting bilingual correspondences. We show that taking into account the typological properties of the languages, such as morphology, helps...

10.26342/2019-63-4 article EN Procesamiento del lenguaje natural 2019-09-01

Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardzic. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.278 article EN cc-by 2021-01-01

Tanja Samardžić, Ximena Gutierrez-Vasques, Rob van der Goot, Max Müller-Eberstein, Olga Pelloni, Barbara Plank. Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). 2022.

10.18653/v1/2022.conll-1.18 article EN cc-by 2022-01-01

In this work we focus on the task of automatically extracting a bilingual lexicon for the language pair Spanish-Nahuatl. This is a low-resource setting where only a small amount of parallel corpus is available. Most downstream methods do not perform well under low-resource conditions. This is especially true for approaches that use vectorial representations like Word2Vec. Our proposal is to construct word vectors from a graph. This graph is generated using translation pairs obtained with an unsupervised alignment method. We show that, in...

10.48550/arxiv.1710.02569 preprint EN other-oa arXiv (Cornell University) 2017-01-01
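A minimal sketch of the graph-based idea described above: build a graph whose edges are aligned translation pairs and use each word's adjacency row as its vector, so that words sharing translation links become similar. The pairs below, the adjacency-row representation, and the cosine comparison are illustrative assumptions; the paper's actual graph construction and representation may differ.

```python
import numpy as np

# Hypothetical translation pairs, standing in for the output of an
# unsupervised word aligner (target words are placeholders t1, t2, ...).
pairs = [("casa", "t1"), ("casa", "t2"), ("hogar", "t1"), ("perro", "t3")]

nodes = sorted({w for p in pairs for w in p})
index = {w: i for i, w in enumerate(nodes)}

# Symmetric adjacency matrix of the translation graph; each row is a word vector.
A = np.zeros((len(nodes), len(nodes)))
for src, tgt in pairs:
    A[index[src], index[tgt]] += 1
    A[index[tgt], index[src]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Words that share translation neighbours get similar vectors.
print(cosine(A[index["casa"]], A[index["hogar"]]))  # shared neighbour t1 -> similarity > 0
print(cosine(A[index["casa"]], A[index["perro"]]))  # no shared neighbours -> similarity 0
```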