NFDI4DS | UHH-SEMS - Publication Details

Rodolfo Zevallos

ORCID: 0000-0003-0192-7740

About

Contact & Profiles

A5006038779

Research Areas

Natural Language Processing Techniques
Speech Recognition and Synthesis
Topic Modeling
Speech and dialogue systems
Speech and Audio Processing
Biomedical Text Mining and Ontologies
Second Language Acquisition and Learning
Seismology and Earthquake Studies
Advanced Data Processing Techniques
Sociology, Governance, and Technology
Digital Communication and Language
Text Readability and Simplification
Music and Audio Processing
Authorship Attribution and Profiling
GNSS positioning and interference
Educational Technology in Learning
ICT in Developing Communities
Mental Health via Writing
Language and cultural evolution
E-Learning and Knowledge Management

Universitat Pompeu Fabra
2021-2023

National Agrarian University
2022

Universidad Nacional del Callao
2020

Pontifical Catholic University of Peru
2020

Findings of the IWSLT 2023 Evaluation Campaign

OPENALEX - Publications

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar and 57 more

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Kevin Duh, Yannick Estève, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny...

10.18653/v1/2023.iwslt-1.1 article EN cc-by 2023-01-01

Findings of the IWSLT 2024 Evaluation Campaign

OPENALEX - Publications

Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat and 40 more

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly...

10.48550/arxiv.2411.05088 preprint EN arXiv (Cornell University) 2024-11-07

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua

OPENALEX - Publications

Rodolfo Zevallos, J. V. Ortega, William Chen, R.G. Castro, Núria Bel and 3 more

Rodolfo Zevallos, John Ortega, William Chen, Richard Castro, Núria Bel, Cesar Toshio, Renzo Venturas, Hilario Aradiel and Nelsi Melgarejo. Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing. 2022.

10.18653/v1/2022.deeplo-1.1 article EN cc-by 2022-01-01

Implementation of a Bilingual Participative Argumentation Web Platform for collection of Spanish Text and Quechua Speech

OPENALEX - Publications

Yudi Guzmán-Monteza, Alexis Tavara, Rodolfo Zevallos, Hugo Vega-Huerta

Web development began in the 1990s. The versatility and flexibility of this technology have made it possible to use and apply it to enhance technological development in different fields around the world. However, in Latin American countries such as Peru, there is a lack of culturally relevant web applications applied to the socio-political field. On the other hand, since 2018 to date, many social and political movements have been making efforts to obtain the perspectives and demands of citizens to promote a constituent process that to date has not...

10.1109/icecce52056.2021.9514251 article EN 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE) 2021-06-12

QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks

OPENALEX - Publications

John E. Ortega, Rodolfo Zevallos, William Chen

This article describes the QUESPA team speech translation (ST) submissions for the Quechua to Spanish (QUE–SPA) track featured in the Evaluation Campaign of IWSLT 2023: low-resource and dialect speech translation. Two main submission types were supported in the campaign: constrained and unconstrained. We submitted six total systems, of which our best (primary) system consisted of an ST model based on the Fairseq S2T framework, where the audio representations were created using log mel-scale filter banks as features and the translations were performed...

10.18653/v1/2023.iwslt-1.23 article EN cc-by 2023-01-01

Hints on the data for language modeling of synthetic languages with transformers

OPENALEX - Publications

Rodolfo Zevallos, Núria Bel

Language Models (LM) are becoming more and more useful for providing representations upon which to train Natural Language Processing applications. However, there is now clear evidence that attention-based transformers require a critical amount of language data to produce good enough LMs. The question we have addressed in this paper is to what extent the critical amount of data varies for languages of different morphological typology, in particular those that have a rich inflectional morphology, and whether the tokenization method used to preprocess the data can make a difference. These...

10.18653/v1/2023.acl-long.699 article EN cc-by 2023-01-01

Text-To-Speech Data Augmentation for Low Resource Speech Recognition

OPENALEX - Publications

Rodolfo Zevallos

Nowadays, the main problem of deep learning techniques used in the development of automatic speech recognition (ASR) models is the lack of transcribed data. The goal of this research is to propose a new data augmentation method to improve ASR models for agglutinative and low-resource languages. This novel data augmentation method generates both synthetic text and synthetic audio. Some experiments were conducted using the corpus of the Quechua language, which is an agglutinative and low-resource language. In this study, a sequence-to-sequence (seq2seq) model was applied to generate synthetic text, in addition to generating...

10.48550/arxiv.2204.00291 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

Findings of the IWSLT 2024 Evaluation Campaign

OPENALEX - Publications

Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat and 39 more

10.18653/v1/2024.iwslt-1.1 article EN 2024-01-01

TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models

OPENALEX - Publications

Rodolfo Zevallos, Núria Bel, Mireia Farrús

10.18653/v1/2024.emnlp-main.638 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation

OPENALEX - Publications

Haohao Huan Wang, Adam Meyers, John E. Ortega, Rodolfo Zevallos

Translating between languages with drastically different grammatical conventions poses challenges, not just for human interpreters but also for machine translation systems. In this work, we specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle X ('DE') in news article titles from the Penn Chinese Discourse Treebank, we developed a targeted dataset to fine-tune Hugging Face models, improving...

10.48550/arxiv.2412.14323 preprint EN arXiv (Cornell University) 2024-12-18

The First Multilingual Model For The Detection of Suicide Texts

OPENALEX - Publications

Rodolfo Zevallos, Annika Marie Schoene, John E. Ortega

Suicidal ideation is a serious health problem affecting millions of people worldwide. Social networks provide information about these mental health problems through users' emotional expressions. We propose a multilingual model leveraging transformer architectures like mBERT, XLM-R, and mT5 to detect suicidal text across posts in six languages: Spanish, English, German, Catalan, Portuguese and Italian. A Spanish suicide tweet dataset was translated into five other languages using SeamlessM4T. Each model was fine-tuned on...

10.48550/arxiv.2412.15498 preprint EN arXiv (Cornell University) 2024-12-19

Language technology into high schools for revitalization of endangered languages

OPENALEX - Publications

Luis Camacho, Rodolfo Zevallos

Language technology is the missing piece of the puzzle that will bring us closer to a complete revitalization of endangered languages. Almost every digital product uses and is dependent on language; language technology is no longer an option but a key enabler and a solution for boosting future growth. Technical issues are hard, but they are the lesser problems in building a corpus of endangered languages; centuries of oppression have managed to dent the pride and sense of belonging, which is reflected in the lack of awareness of the loss of one's own language. In order to reach a revitalization based on technology, powered by...

10.1109/intercon50315.2020.9220197 article EN 2020-09-01

Evaluating Self-Supervised Speech Representations for Indigenous American Languages

OPENALEX - Publications

Chihchen Chen, William Chen, Rodolfo Zevallos, J. V. Ortega

The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much of the progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American...

10.48550/arxiv.2310.03639 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Data Augmentation for Low-Resource Quechua ASR Improvement

OPENALEX - Publications

Rodolfo Zevallos, Núria Bel, Guillermo Cámbara, Mireia Farrús, Jordi Luque

Automatic Speech Recognition (ASR) is a key element in new services that helps users to interact with an automated system. Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English. However, the use of these methods is only available for languages with hundreds or thousands of hours of audio and their corresponding transcriptions. For the so-called low-resource languages, to speed up the availability of resources that can improve the performance of their ASR systems, methods of creating new resources on the basis of existing ones are being investigated. In...

10.21437/interspeech.2022-770 article EN Interspeech 2022 2022-09-16

Lingüística computacional para la revitalización y el poliglotismo

OPENALEX - Publications

Luis Camacho, Rodolfo Zevallos

Despite existing laws, in practice the Peruvian State ignores multiculturalism and behaves as a monolingual and monocultural entity. Because this mistaken paradigm still prevails, not enough has been invested in developing language skills in order to serve all citizens equally. The consequences are a lack of promotion, discrimination and, ultimately, the isolation that leads to the extinction of native languages. Our initiative is to change this mistaken paradigm, to awaken pride...

10.30920/letras.91.134.9 article ES cc-by Letras (Lima) 2020-11-16

Frequency Balanced Datasets Lead to Better Language Models

OPENALEX - Publications

Rodolfo Zevallos, Mireia Farrús, Núria Bel

This paper reports on the experiments aimed to improve our understanding of the role of the amount of data required for training attention-based transformer language models. Specifically, we investigate the impact of reducing the immense amounts of required pre-training data through sampling strategies that identify and reduce high-frequency tokens, as different studies have indicated that the existence of very high-frequency tokens in pre-training data might bias learning, causing undesired effects. In this light, we describe our sampling algorithm that iteratively assesses token frequencies and removes...

10.18653/v1/2023.findings-emnlp.527 article EN cc-by 2023-01-01

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

OPENALEX - Publications

Alp Öktem, Rodolfo Zevallos, Yasmin Moslem, Güneş Öztürk, Karen Şarhon

We develop machine translation and speech synthesis systems to complement the efforts of revitalizing Judeo-Spanish, the exiled language of Sephardic Jews, which survived for centuries, but now faces the threat of extinction in the digital age. Building on resources created by the Sephardic community of Turkey and elsewhere, we create corpora and tools that would help preserve this language for future generations. For machine translation, we first develop a Spanish to Judeo-Spanish rule-based system, in order to generate large volumes of synthetic parallel data in the relevant language pairs:...

10.48550/arxiv.2205.15599 preprint EN cc-by-nc-nd arXiv (Cornell University) 2022-01-01

Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

OPENALEX - Publications

Rodolfo Zevallos, Luis Camacho, Nelsi Melgarejo

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification and text-to-speech tools. In order to achieve corpus collection sustainably, we employ the crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected that by the end of the year 2022, it can reach up to 20 native languages out of the 48 native languages in Peru. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making...

10.48550/arxiv.2207.05498 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Data Augmentation for Low-Resource Quechua ASR Improvement

OPENALEX - Publications

Rodolfo Zevallos, Núria Bel, Guillermo Cámbara, Mireia Farrús, Jordi Luque

Automatic Speech Recognition (ASR) is a key element in new services that helps users to interact with an automated system. Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English. However, the use of these methods is only available for languages with hundreds or thousands of hours of audio and their corresponding transcriptions. For the so-called low-resource languages, to speed up the availability of resources that can improve the performance of their ASR systems, methods of creating new resources on the basis of existing ones are being...

10.48550/arxiv.2207.06872 preprint EN cc-by arXiv (Cornell University) 2022-01-01
