- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Authorship Attribution and Profiling
- Syntax, Semantics, Linguistic Variation
- Spanish Linguistics and Language Studies
- Semantic Web and Ontologies
- Language and cultural evolution
- Speech and dialogue systems
- Literary and Cultural Studies
- Music and Audio Processing
- Translation Studies and Practices
- Cultural and political discourse analysis
- Phonetics and Phonology Research
- Biomedical Text Mining and Ontologies
- Linguistic Variation and Morphology
- Journalism and Media Studies
Universidad Nacional Autónoma de México
2016-2024
Centro de Investigaciones Interdisciplinarias en Ciencias y Humanidades
2024
University of Zurich
2018-2023
Dartmouth College
2021
Meta (United States)
2021
Universidad de la República
2021
Carnegie Mellon University
2021
Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, Katharina Kann. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. 2021.
Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We would like to encourage NLP research in linguistically rich and diverse areas like the Americas.
In linguistics, there is little consensus on how to define, measure, and compare complexity across languages. We propose to take the diversity of viewpoints as a given, and to capture the complexity of a language by a vector of measurements, rather than a single value. We then assess the statistical support for two controversial hypotheses: the trade-off hypothesis and the equi-complexity hypothesis. We furnish meta-analyses of 28 complexity metrics applied to texts written in overall 80 typologically diverse languages. The trade-off hypothesis is partially supported, in the sense that around one third...
Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel...
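The merge procedure that BPE relies on can be illustrated with a minimal sketch. This is a generic textbook-style version of BPE, not the code used in the paper; the function names (`learn_bpe`, `merge_pair`, `token_count`) and the toy word frequencies are hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_frequent_pair(vocab):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(vocab)
        if pair is None:
            break
        vocab = merge_pair(vocab, pair)
        merges.append(pair)
    return merges, vocab


def token_count(vocab):
    """Total number of symbols needed to write the corpus with this vocabulary."""
    return sum(len(symbols) * freq for symbols, freq in vocab.items())


# Hypothetical toy corpus: the word "abab" seen 3 times, "cd" seen once.
merges, vocab = learn_bpe({"abab": 3, "cd": 1}, num_merges=2)
print(merges, token_count(vocab))
```

In this toy corpus the symbol count falls from 14 characters to 5 subword tokens after two merges, which is the kind of compression effect the abstract attributes to the earliest merge operations.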
The aim of this thesis proposal is to perform bilingual lexicon extraction for cases in which small parallel corpora are available and it is not easy to obtain a monolingual corpus for at least one of the languages. Moreover, the languages are typologically distant and there is no seed lexicon available. We focus on the language pair Spanish-Nahuatl, and we propose to work with morpheme-based representations in order to reduce the sparseness and to facilitate the task of finding lexical correspondences between a highly agglutinative language and a fusional one. We take into...
Little attention has been paid to the development of human language technology for truly low-resource languages, i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is...
We propose a quantitative approach for quantifying the morphological complexity of a language based on text. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also the predictability of those morphological processes. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and we use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when...
In linguistics, interlinear glossing is an essential procedure for analyzing the morphology of languages. This type of annotation is useful for language documentation, and it can also provide valuable data for NLP applications. We perform automatic glossing for Otomi, an under-resourced language. Our work comprises the pre-processing of the corpus. We implement different sequential labelers. CRF models represented an efficient and good solution for our task. Two main observations emerged from our work: 1) models with a higher number of parameters (RNNs)...
Large language models are technologies that have shown a notable capacity to produce text that simulates written human language; these models are behind conversational agents such as ChatGPT or Gemini. Although the impact and use of these models has spread to numerous sectors of society, the technical and scientific foundations underlying these developments in artificial intelligence are not always discussed. This article offers an introduction to how large language models work, from the earliest proposals to...
We use two small parallel corpora for comparing the morphological complexity of Spanish, Otomi and Nahuatl. These are languages that belong to different linguistic families, and the latter two are low-resourced. We take into account two quantitative criteria: on the one hand, the distribution of types over tokens in a corpus; on the other, perplexity and entropy as indicators of word structure predictability. We show that a language can be complex in terms of how many different word forms it can produce; however, it may be less complex in terms of the predictability of the internal structure of its words.
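The two criteria can be sketched in a few lines. This is a simplified illustration, not the authors' setup: it assumes a non-empty list of word tokens, uses a maximum-likelihood character bigram model as a stand-in for whatever sub-word model the paper trained, and treats `^` and `$` as hypothetical word-boundary markers that do not occur in the words themselves.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def char_bigram_entropy(tokens):
    """Conditional entropy H(c_i | c_{i-1}) in bits of characters within words.

    Probabilities are maximum-likelihood estimates from the same tokens,
    so this measures how predictable word-internal structure is in the sample.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for word in tokens:
        padded = "^" + word + "$"  # assumed boundary markers
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    total = sum(pair_counts.values())
    entropy = 0.0
    for (prev, cur), n in pair_counts.items():
        p_joint = n / total
        p_cond = n / context_counts[prev]
        entropy -= p_joint * math.log2(p_cond)
    return entropy


# Hypothetical toy sample of word tokens.
sample = ["casa", "casas", "casa", "gato", "gatos"]
ttr = type_token_ratio(sample)
entropy_bits = char_bigram_entropy(sample)
perplexity = 2 ** entropy_bits  # perplexity corresponding to the entropy in bits
print(ttr, entropy_bits, perplexity)
```

A high type-token ratio with a low entropy would correspond to the case described above: many distinct word forms whose internal structure is nonetheless predictable.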
The aim of this work is to extract word translation pairs from a small parallel corpus and to measure the impact of dealing with morphology for improving this task. We focus on the language pair Spanish-Nahuatl; both languages are morphologically rich and distant from each other. We generate semi-supervised morphological segmentation models and we compare two approaches (estimation, association) for extracting bilingual correspondences. We show that taking into account typological properties of the languages, such as morphology, helps...
Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardzic. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
Tanja Samardžić, Ximena Gutierrez-Vasques, Rob van der Goot, Max Müller-Eberstein, Olga Pelloni, Barbara Plank. Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). 2022.
In this work we focus on the task of automatically extracting a bilingual lexicon for the language pair Spanish-Nahuatl. This is a low-resource setting where only a small amount of parallel corpus is available. Most of the downstream methods do not work well under low-resource conditions. This is especially true for the approaches that use vectorial representations like Word2Vec. Our proposal is to construct bilingual word vectors from a graph. This graph is generated using translation pairs obtained from an unsupervised word alignment method. We show that, in...