- Natural Language Processing Techniques
- Topic Modeling
- Multimodal Machine Learning Applications
- Speech Recognition and Synthesis
- Sentiment Analysis and Opinion Mining
- Hate Speech and Cyberbullying Detection
- Text Readability and Simplification
- Privacy-Preserving Technologies in Data
- Text and Document Classification Technologies
- Speech and Dialogue Systems
- Machine Learning and Data Classification
- Authorship Attribution and Profiling
- ICT in Developing Communities
- Semantic Web and Ontologies
- Research Data Management Practices
- Scientific Computing and Data Management
- ICT Impact and Policies
- Language and Cultural Evolution
- Adversarial Robustness in Machine Learning
- Linguistics and Language Analysis
- Cognitive Science and Education Research
- Second Language Learning and Teaching
- Imbalanced Data Classification Techniques
- Simulation Techniques and Applications
- Linguistic Studies and Language Acquisition
McGill University
2024-2025
Canadian Institute for Advanced Research
2025
University College London
2022-2024
Mila - Quebec Artificial Intelligence Institute
2024
Saarland University
2020-2023
IT University of Copenhagen
2023
Tokyo Institute of Technology
2023
Administration for Community Living
2023
American Jewish Committee
2023
Huazhong University of Science and Technology
2023
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...
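The "few demonstrations" behaviour described above is usually exercised by packing labeled examples directly into the prompt. A minimal, model-agnostic sketch of assembling such an in-context-learning prompt follows; the `Input:`/`Output:` formatting is illustrative only, not BLOOM's actual prompt template.

```python
def build_few_shot_prompt(instruction, demos, query):
    """Assemble an in-context learning prompt: a natural-language
    instruction, a few input/output demonstrations, then the new query
    left open for the model to complete."""
    lines = [instruction, ""]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    # The unanswered query: the model continues after the final "Output:".
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
```

The same string can be sent to any text-completion endpoint; only the demonstrations and instruction change per task, which is what makes the approach attractive for new tasks without fine-tuning.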
Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While inference of demographic attributes could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification...
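Post-stratification, mentioned above, corrects a non-representative sample by reweighting per-stratum estimates with known population shares (e.g. from a census). A minimal sketch with hypothetical age strata and made-up numbers:

```python
def post_stratify(sample_means, sample_shares, population_shares):
    """Reweight per-stratum estimates by known population shares.

    sample_means: stratum -> mean outcome observed in the (biased) sample
    sample_shares: stratum -> fraction of the sample in that stratum
    population_shares: stratum -> fraction of the real population
    Returns (raw sample estimate, post-stratified estimate).
    """
    raw = sum(sample_means[s] * sample_shares[s] for s in sample_means)
    adjusted = sum(sample_means[s] * population_shares[s] for s in sample_means)
    return raw, adjusted

# Toy example: young users are over-represented in the sample,
# so the raw estimate is pulled towards their outcome.
means  = {"18-29": 0.70, "30-49": 0.50, "50+": 0.30}
sample = {"18-29": 0.60, "30-49": 0.30, "50+": 0.10}
census = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}
raw, adjusted = post_stratify(means, sample, census)
```

Here the raw estimate (0.60) over-weights the young stratum, while the post-stratified estimate (0.47) reflects the census shares.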
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both...
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken to assemble the Responsible Open-science Open-collaboration...
Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to...
Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) -- fine-tuning a multilingual PLM on monolingual texts of the language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits...
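The "pre-training objective" that LAFT reuses is, for encoder PLMs, masked language modeling. A self-contained sketch of BERT-style input masking (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) shows what the adaptation step trains on; this is a generic illustration, not the paper's exact pipeline.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style MLM masking. Returns (masked_tokens, labels), where
    labels[i] is the original token at masked positions and None elsewhere;
    the model is trained to recover the labels from the masked input."""
    rng = rng or random.Random(0)
    vocab = list(set(tokens))  # stand-in for a real tokenizer vocabulary
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)             # 10%: keep original
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat while the dog slept".split()
masked, labels = mask_tokens(tokens)
```

LAFT then simply continues training the multilingual PLM on such masked monolingual text in the target language, with no new task head.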
David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez...
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo,...
Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant...
David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne,...
Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Messelle, Hailu Balcha, Sisay Chala, Hagos Gebremichael, Bernard Opoku, Stephen Arthur. Proceedings of the 2023 Conference on Empirical Methods in...
Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Seid Muhie Yimam, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Nedjma Ousidhoum, Abinew Ali Ayele, Saif Mohammad, Meriem Beloucif, Sebastian Ruder. Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 2023.
While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better spoken language understanding (SLU) can strengthen the robustness of massively multilingual ASR by leveraging semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in...
Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) an absence of moderation and (2) censorship due to reliance on keyword spotting out of context. Further, high-profile individuals are frequently at the center of the moderation process, while large-scale targeted hate campaigns against minorities are overlooked. These limitations are mainly due to the lack of high-quality data...
Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining....
People worldwide use language in subtle and complex ways to express emotions. While emotion recognition -- an umbrella term for several NLP tasks -- significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. This has led to major disparities in research and proposed solutions, especially for low-resource languages, which suffer from a lack of high-quality datasets. In this paper, we present BRIGHTER -- a collection of multilabeled, emotion-annotated...
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps, such as keywords, outlines, or reasoning chains, can significantly improve performance, coherence, and interpretability. However, these methods depend on predefined intermediate formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables...
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual corpora. In this work, we find that machine-translated texts from a single source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, an English web dataset, into nine languages, resulting in a 1.7-trillion-token corpus, which we call TransWebEdu, and pretrain...
David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda...
Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques - modeling, cleaning, or filtering the noisy instances - are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve performance, and may even deteriorate it,...
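One common member of the "cleaning or filtering" family the abstract refers to is confidence-based filtering: drop training examples whose annotated label disagrees with a confident auxiliary classifier. A minimal sketch, with `predict_proba` standing in for any such classifier (typically trained on held-out cross-validation folds); the toy data and threshold are illustrative:

```python
def filter_by_confidence(dataset, predict_proba, threshold=0.9):
    """Keep an example unless a confident auxiliary model contradicts
    its annotated label.

    dataset: list of (text, label) pairs
    predict_proba: callable text -> dict mapping label -> probability
    """
    kept = []
    for text, label in dataset:
        probs = predict_proba(text)
        top = max(probs, key=probs.get)
        # Drop only when the model is both confident and in disagreement.
        if top == label or probs[top] < threshold:
            kept.append((text, label))
    return kept

# Toy usage: the third example carries a (simulated) wrong label.
data = [("good movie", "pos"), ("bad movie", "neg"), ("good film", "neg")]

def toy_model(text):
    return {"pos": 0.95, "neg": 0.05} if "good" in text else {"pos": 0.05, "neg": 0.95}

cleaned = filter_by_confidence(data, toy_model)
```

The paper's point is precisely that such filtering is not guaranteed to help with models like BERT; the sketch only makes concrete what these baselines do.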
What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust...
Miaoran Zhang, Marius Mosbach, David Adelani, Michael Hedderich, Dietrich Klakow. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.