David Ifeoluwa Adelani

ORCID: 0000-0002-0193-2083
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Multimodal Machine Learning Applications
  • Speech Recognition and Synthesis
  • Sentiment Analysis and Opinion Mining
  • Hate Speech and Cyberbullying Detection
  • Text Readability and Simplification
  • Privacy-Preserving Technologies in Data
  • Text and Document Classification Technologies
  • Speech and dialogue systems
  • Machine Learning and Data Classification
  • Authorship Attribution and Profiling
  • ICT in Developing Communities
  • Semantic Web and Ontologies
  • Research Data Management Practices
  • Scientific Computing and Data Management
  • ICT Impact and Policies
  • Language and cultural evolution
  • Adversarial Robustness in Machine Learning
  • Linguistics and Language Analysis
  • Cognitive Science and Education Research
  • Second Language Learning and Teaching
  • Imbalanced Data Classification Techniques
  • Simulation Techniques and Applications
  • Linguistic Studies and Language Acquisition

McGill University
2024-2025

Canadian Institute for Advanced Research
2025

University College London
2022-2024

Mila - Quebec Artificial Intelligence Institute
2024

Saarland University
2020-2023

IT University of Copenhagen
2023

Tokyo Institute of Technology
2023

Administration for Community Living
2023

American Jewish Committee
2023

Huazhong University of Science and Technology
2023

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas J. Wang, Benoît Sagot, Niklas Muennighoff, A. Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Du Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Chinenye Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar González-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jianguo Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Tze Han Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad Ali Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona De Gibert, Paulo Villegas, Peter Henderson

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...

10.48550/arxiv.2211.05100 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While inference of demographic attributes could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification...

10.1145/3308558.3313684 preprint EN 2019-05-13
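The post-stratification idea above (reweighting stratum-level estimates from a biased sample by known population shares) can be sketched in a few lines. The strata, observed values, and population shares below are hypothetical illustrations, not figures from the paper:

```python
def post_stratify(sample, population_share):
    """Post-stratified estimate of a population mean.
    sample: {stratum: list of observed values}
    population_share: {stratum: share of the population in that stratum}"""
    estimate = 0.0
    for stratum, values in sample.items():
        stratum_mean = sum(values) / len(values)
        estimate += population_share[stratum] * stratum_mean
    return estimate

# A sample that over-represents stratum "A" relative to the population.
sample = {"A": [1, 1, 1, 0], "B": [0, 0]}
share = {"A": 0.3, "B": 0.7}  # assumed population shares
naive = sum(v for vs in sample.values() for v in vs) / 6  # 0.5
adjusted = post_stratify(sample, share)                   # 0.225
```

Reweighting pulls the naive estimate (0.5) down to 0.225 because stratum "B", where the measured behaviour is absent, makes up most of the assumed population.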

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both...

10.1162/tacl_a_00416 article EN cc-by Transactions of the Association for Computational Linguistics 2021-01-01

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...

10.48550/arxiv.2303.03915 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to...

10.48550/arxiv.2201.08277 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) -- fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to a target language individually takes large disk space and limits...

10.48550/arxiv.2204.06487 preprint EN cc-by arXiv (Cornell University) 2022-01-01
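LAFT continues pre-training on monolingual target-language text using the model's original objective, i.e. masked language modeling for encoder models. A framework-free sketch of the masking step, assuming whitespace tokenization and a 15% mask rate (both simplifications of real subword pipelines):

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; during adaptive
    fine-tuning the model is trained to recover the original token at
    each masked position. Returns (masked tokens, {position: target})."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # training target for position i
        else:
            masked.append(tok)
    return masked, targets

sentence = "monolingual text in the target language drives adaptation".split()
masked, targets = mask_for_mlm(sentence)
```

In practice this runs over a full monolingual corpus with the PLM's own subword tokenizer, updating the model weights with the same loss used at pre-training time.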

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez...

10.18653/v1/2022.naacl-main.223 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo,...

10.48550/arxiv.2302.08956 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant...

10.18653/v1/2020.emnlp-main.204 preprint EN cc-by 2020-01-01

David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne,...

10.18653/v1/2022.emnlp-main.298 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Messelle, Hailu Balcha, Sisay Chala, Hagos Gebremichael, Bernard Opoku, Stephen Arthur. Proceedings of the 2023 Conference on Empirical Methods in...

10.18653/v1/2023.emnlp-main.862 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Seid Muhie Yimam, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Nedjma Ousidhoum, Abinew Ali Ayele, Saif Mohammad, Meriem Beloucif, Sebastian Ruder. Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 2023.

10.18653/v1/2023.semeval-1.315 article EN cc-by Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) 2023-01-01

While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better spoken language understanding (SLU) can strengthen the robustness of massively multilingual ASR by leveraging semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in...

10.48550/arxiv.2501.06117 preprint EN arXiv (Cornell University) 2025-01-10

Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.653 article EN cc-by 2023-01-01

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals are frequently at the center of the moderation process, while large-scale hate campaigns targeted against minorities are overlooked. These limitations are mainly due to the lack of high-quality data...

10.48550/arxiv.2501.08284 preprint EN arXiv (Cornell University) 2025-01-14

Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining....

10.48550/arxiv.2502.09814 preprint EN arXiv (Cornell University) 2025-02-13

People worldwide use language in subtle and complex ways to express emotions. While emotion recognition -- an umbrella term for several NLP tasks -- significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. This has led to major disparities in research and proposed solutions, especially for low-resource languages, which suffer from a lack of high-quality datasets. In this paper, we present BRIGHTER -- a collection of multilabeled emotion-annotated...

10.48550/arxiv.2502.11926 preprint EN arXiv (Cornell University) 2025-02-17

Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps, such as keywords, outlines, or reasoning chains, can significantly improve performance, coherence, and interpretability. However, these methods depend on predefined formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables...

10.48550/arxiv.2502.12304 preprint EN arXiv (Cornell University) 2025-02-17

High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform on non-English languages, likely due to a gap in the quality and diversity of the available multilingual corpora. In this work, we find that machine-translated texts from a single source language can contribute significantly to multilingual LLMs. We translate FineWeb-Edu, an English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain...

10.48550/arxiv.2502.13252 preprint EN arXiv (Cornell University) 2025-02-18

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda...

10.18653/v1/2023.ijcnlp-main.10 article EN cc-by 2023-01-01

Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques -- by modeling, cleaning or filtering the noisy instances -- are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve performance, and may even deteriorate it,...

10.18653/v1/2022.insights-1.8 article EN cc-by 2022-01-01
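One simple member of the cleaning/filtering family evaluated above is to drop training examples whose label disagrees with the nearest class centroid in some feature space. The toy 2-D features below are illustrative only, not the paper's setup:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def filter_noisy(data):
    """data: list of (feature_vector, label) pairs. Keep only examples
    whose label matches the nearest class centroid (squared Euclidean)."""
    by_label = {}
    for vec, lab in data:
        by_label.setdefault(lab, []).append(vec)
    cents = {lab: centroid(vs) for lab, vs in by_label.items()}

    def nearest(vec):
        return min(cents, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(vec, cents[lab])))

    return [(v, l) for v, l in data if nearest(v) == l]

# Toy data: the last example carries a flipped ("noisy") label.
data = [([0.0, 0.0], "neg"), ([0.1, 0.0], "neg"),
        ([1.0, 1.0], "pos"), ([0.9, 1.0], "pos"),
        ([0.0, 0.1], "pos")]
clean = filter_noisy(data)  # the mislabeled point is dropped
```

On this toy data the single flipped label is removed; the paper's finding is that, for real text with BERT-scale models, such extra handling does not always pay off.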

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust...

10.18653/v1/2022.findings-acl.6 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

Miaoran Zhang, Marius Mosbach, David Adelani, Michael Hedderich, Dietrich Klakow. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.436 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01