NFDI4DS | UHH-SEMS - Publication Details

Ximena Gutierrez-Vasques

ORCID: 0000-0002-1486-2774

About

Contact & Profiles

A5004693481

Research Areas

Natural Language Processing Techniques
Topic Modeling
Text Readability and Simplification
Authorship Attribution and Profiling
Syntax, Semantics, Linguistic Variation
Spanish Linguistics and Language Studies
Semantic Web and Ontologies
Language and cultural evolution
Speech and dialogue systems
Literary and Cultural Studies
Music and Audio Processing
Translation Studies and Practices
Cultural and political discourse analysis
Phonetics and Phonology Research
Biomedical Text Mining and Ontologies
Linguistic Variation and Morphology
Journalism and Media Studies

Universidad Nacional Autónoma de México
2016-2024

Centro de Investigaciones Interdisciplinarias en Ciencias y Humanidades
2024

University of Zurich
2018-2023

Dartmouth College
2021

Meta (United States)
2021

Universidad de la República
2021

Carnegie Mellon University
2021

Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas

OPENALEX - Publications

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John E. Ortega, Annette Rios, and 13 more

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, Katharina Kann. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. 2021.

10.18653/v1/2021.americasnlp-1.23 article EN cc-by 2021-01-01

Challenges of language technologies for the indigenous languages of the Americas

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, Iván Meza

Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We would like to encourage NLP research in linguistically rich and diverse areas like the Americas.

10.48550/arxiv.1806.04291 preprint EN cc-by arXiv (Cornell University) 2018-01-01

Complexity trade-offs and equi-complexity in natural languages: a meta-analysis

Christian Bentz, Ximena Gutierrez-Vasques, Olga Sozinova, Tanja Samardžić

In linguistics, there is little consensus on how to define, measure, and compare complexity across languages. We propose to take the diversity of viewpoints as a given, and to capture the complexity of a language by a vector of measurements, rather than a single value. We then assess the statistical support for two controversial hypotheses: the trade-off hypothesis and the equi-complexity hypothesis. We furnish meta-analyses of 28 complexity metrics applied to texts written in overall 80 typologically diverse languages. The trade-off hypothesis is partially supported, in the sense that around one third...

10.1515/lingvan-2021-0054 article EN cc-by Linguistics Vanguard 2022-10-14

From characters to words: the turning point of BPE merges

Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardžić

Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardžić. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.302 article EN cc-by 2021-01-01

Languages Through the Looking Glass of BPE Compression

Ximena Gutierrez-Vasques, Christian Bentz, Tanja Samardžić

Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel...

10.1162/coli_a_00489 article EN cc-by-nc-nd Computational Linguistics 2023-07-07

Bilingual lexicon extraction for a distant language pair using a small parallel corpus

Ximena Gutierrez-Vasques

The aim of this thesis proposal is to perform bilingual lexicon extraction for cases in which small parallel corpora are available and it is not easy to obtain a monolingual corpus for at least one of the languages. Moreover, the languages are typologically distant and there is no seed lexicon available. We focus on the language pair Spanish-Nahuatl, and we propose to work with morpheme-based representations in order to reduce the sparseness and to facilitate the task of finding lexical correspondences between a highly agglutinative language and a fusional one. We take into...

10.3115/v1/n15-2021 article EN cc-by 2015-01-01

AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas

Katharina Kann, Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, John E. Ortega, and 13 more

Little attention has been paid to the development of human language technology for truly low-resource languages, i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is...

10.3389/frai.2022.995667 article EN cc-by Frontiers in Artificial Intelligence 2022-12-02

Productivity and Predictability for Measuring Morphological Complexity

Ximena Gutierrez-Vasques, Víctor Mijangos

We propose a quantitative approach for quantifying the morphological complexity of a language based on text. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also the predictability of those processes. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when...

10.3390/e22010048 article EN cc-by Entropy 2019-12-30

Automatic Interlinear Glossing for Otomi language

Diego Barriga Martínez, Víctor Mijangos, Ximena Gutierrez-Vasques

In linguistics, interlinear glossing is an essential procedure for analyzing the morphology of languages. This type of annotation is useful for language documentation, and it can also provide valuable data for NLP applications. We perform automatic glossing for Otomi, an under-resourced language. Our work comprises the pre-processing of the corpus. We implement different sequential labelers. CRF models represented an efficient and good solution for our task. Two main observations emerged from our work: 1) models with a higher number of parameters (RNNs)...

10.18653/v1/2021.americasnlp-1.5 article EN cc-by 2021-01-01

De las ideas verdes incoloras hasta ChatGpt: los grandes modelos del lenguaje

Ximena Gutierrez-Vasques, Víctor Germán Mijangos de la Cruz

Large language models are technologies that have shown a notable capacity to produce text that resembles human-written text; these models are behind conversational agents such as ChatGPT or Gemini. Although the impact and use of these models has spread to numerous sectors of society, the technical and scientific foundations underlying these developments in artificial intelligence are not always discussed. This article aims to give an introduction to how large language models work, from the first proposals to...

10.22201/dgtic.26832968e.2024.10.18 article ES cc-by-nc TIES Revista de Tecnología e Innovación en Educación Superior 2024-06-21

Comparing morphological complexity of Spanish, Otomi and Nahuatl

Ximena Gutierrez-Vasques, Víctor Mijangos

We use two small parallel corpora for comparing the morphological complexity of Spanish, Otomi and Nahuatl. These are languages that belong to different linguistic families; the latter two are low-resourced. We take into account two quantitative criteria: on the one hand, the distribution of types over tokens in a corpus; on the other, perplexity and entropy as indicators of word structure predictability. We show that a language can be complex in terms of how many different morphological word forms it can produce; however, it may be less complex in terms of the predictability of the internal structure of its words.

10.48550/arxiv.1808.04314 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Morphological segmentation for extracting Spanish-Nahuatl bilingual lexicon

Ximena Gutierrez-Vasques, Alfonso Medina, Gerardo Sierra

The aim of this work is to extract word translation pairs from a small parallel corpus and to measure the impact of dealing with morphology for improving this task. We focus on the language pair Spanish-Nahuatl; both languages are morphologically rich and distant from each other. We generate semi-supervised morphological segmentation models and we compare two approaches (estimation, association) for extracting bilingual correspondences. We show that taking into account typological properties of the languages, such as morphology, helps...

10.26342/2019-63-4 article EN Procesamiento del lenguaje natural 2019-09-01

Interpretability for Morphological Inflection: from Character-level Predictions to Subword-level Rules

Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardžić

Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardžić. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.278 article EN cc-by 2021-01-01

On Language Spaces, Scales and Cross-Lingual Transfer of UD Parsers

Tanja Samardžić, Ximena Gutierrez-Vasques, Rob van der Goot, Max Müller-Eberstein, Olga Pelloni, and 1 more

Tanja Samardžić, Ximena Gutierrez-Vasques, Rob van der Goot, Max Müller-Eberstein, Olga Pelloni, Barbara Plank. Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). 2022.

10.18653/v1/2022.conll-1.18 article EN cc-by 2022-01-01

Low-resource bilingual lexicon extraction using graph based word embeddings

Ximena Gutierrez-Vasques, Víctor Mijangos

In this work we focus on the task of automatically extracting a bilingual lexicon for the language pair Spanish-Nahuatl. This is a low-resource setting where only a small amount of parallel corpus is available. Most of the downstream methods do not work well under low-resource conditions. This is especially true for the approaches that use vectorial representations like Word2Vec. Our proposal is to construct bilingual word vectors from a graph. This graph is generated using translation pairs obtained from an unsupervised word alignment method. We show that, in...

10.48550/arxiv.1710.02569 preprint EN other-oa arXiv (Cornell University) 2017-01-01
