- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Text and Document Classification Technologies
- Digital Humanities and Scholarship
- Speech and dialogue systems
- Language, Linguistics, Cultural Analysis
- Genetics, Bioinformatics, and Biomedical Research
- Semantic Web and Ontologies
- Multimodal Machine Learning Applications
- Data Quality and Management
- African history and culture analysis
- Translation Studies and Practices
- Image Processing and 3D Reconstruction
- Information Retrieval and Search Behavior
- Bayesian Methods and Mixture Models
- Interpreting and Communication in Healthcare
University of Waterloo
2022-2024
Leiden University
2022
Johns Hopkins University
2022
University of Washington
2022
Emmanuel College - Massachusetts
2022
Université d'Orléans
2022
Boston College
2022
Indiana University Bloomington
2022
The University of Melbourne
2022
Martin University
2022
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In...
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both...
Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade...
David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne,...
African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily...
Pretrained language models represent the state of the art in NLP, but the successful construction of such models often requires large amounts of data and computational resources. Thus, the paucity of data for low-resource languages impedes the development of robust NLP capabilities for these languages. There has been some recent success in pretraining encoder-only models solely on a combination of low-resource African languages, exemplified by AfriBERTa. In this work, we extend the approach of "small data" pretraining to encoder-decoder models. We introduce AfriTeVa,...
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. There are a total of 9,000 turns, each language having 1,500 turns, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we benchmark by investigating...
Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Owodunni, Odunayo Ogundepo, David Adelani, Jimmy Lin. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
Odunayo Ogundepo, Tajuddeen Gwadabe, Clara Rivera, Jonathan Clark, Sebastian Ruder, David Adelani, Bonaventure Dossou, Abdou Diop, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Kahira, Shamsuddeen Muhammad, Akintunde Oladipo, Abraham Owodunni, Atnafu Tonja, Iyanuoluwa Shode, Akari Asai, Anuoluwapo Aremu, Ayodele Awokoya, Bernard Opoku, Chiamaka Chukwuneke, Christine Mwase, Clemencia Siro, Stephen Arthur, Tunde Ajayi,...
This paper provides a short overview of the CIRAL track at the Forum for Information Retrieval Evaluation (FIRE) 2023. The track focused on cross-lingual information retrieval (CLIR) between English and four African languages: Hausa, Somali, Swahili, and Yoruba. In a bid to promote CLIR research and curate a test collection for African languages, community evaluations were carried out via pooling. We briefly discuss the details of the task, the dataset, the relevance assessment, and the results from the track in this paper.
We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier that was built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e. high-quality parallel sentences) from a gold-standard curated dataset and extract negative samples (i.e. low-quality parallel sentences) from automatically aligned parallel data by choosing sentences with low alignment...
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the effectiveness of modelling through...
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an...
Large language models (LLMs) have shown impressive zero-shot capabilities in various document reranking tasks. Despite their successful implementations, there is still a gap in the existing literature on their effectiveness in low-resource languages. To address this gap, we investigate how LLMs function as rerankers in cross-lingual information retrieval (CLIR) systems for African languages. Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba), and we examine cross-lingual reranking with queries in English and passages in the...