- Natural Language Processing Techniques
- Topic Modeling
- Multimodal Machine Learning Applications
- Speech Recognition and Synthesis
- Sentiment Analysis and Opinion Mining
- Hate Speech and Cyberbullying Detection
- Text Readability and Simplification
- Privacy-Preserving Technologies in Data
- Text and Document Classification Technologies
- Speech and Dialogue Systems
- Machine Learning and Data Classification
- Authorship Attribution and Profiling
- ICT in Developing Communities
- Semantic Web and Ontologies
- Research Data Management Practices
- Scientific Computing and Data Management
- ICT Impact and Policies
- Language and Cultural Evolution
- Adversarial Robustness in Machine Learning
- Linguistics and Language Analysis
- Cognitive Science and Education Research
- Second Language Learning and Teaching
- Imbalanced Data Classification Techniques
- Simulation Techniques and Applications
- Linguistic Studies and Language Acquisition
McGill University
2024-2025
Canadian Institute for Advanced Research
2025
University College London
2022-2024
Mila - Quebec Artificial Intelligence Institute
2024
Saarland University
2020-2023
IT University of Copenhagen
2023
Tokyo Institute of Technology
2023
Administration for Community Living
2023
American Jewish Committee
2023
Huazhong University of Science and Technology
2023
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...
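The "few demonstrations" behaviour described above is usually exercised by packing labeled examples directly into the prompt. A minimal, model-agnostic sketch of assembling such an in-context-learning prompt follows; the `Input:`/`Output:` formatting is illustrative only, not BLOOM's actual prompt template.

```python
def build_few_shot_prompt(instruction, demos, query):
    """Assemble an in-context learning prompt: a natural-language
    instruction, a few input/output demonstrations, then the new query
    left open for the model to complete."""
    lines = [instruction, ""]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    # The unanswered query: the model continues after the final "Output:".
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
```

The same string can be sent to any text-completion endpoint; only the demonstrations and instruction change per task, which is what makes the approach attractive for new tasks without fine-tuning.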
Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While inference of demographic attributes could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification...
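Post-stratification, mentioned above, corrects a non-representative sample by reweighting per-stratum estimates with known population shares (e.g. from a census). A minimal sketch with hypothetical age strata and made-up numbers:

```python
def post_stratify(sample_means, sample_shares, population_shares):
    """Reweight per-stratum estimates by known population shares.

    sample_means: stratum -> mean outcome observed in the (biased) sample
    sample_shares: stratum -> fraction of the sample in that stratum
    population_shares: stratum -> fraction of the real population
    Returns (raw sample estimate, post-stratified estimate).
    """
    raw = sum(sample_means[s] * sample_shares[s] for s in sample_means)
    adjusted = sum(sample_means[s] * population_shares[s] for s in sample_means)
    return raw, adjusted

# Toy example: young users are over-represented in the sample,
# so the raw estimate is pulled towards their outcome.
means  = {"18-29": 0.70, "30-49": 0.50, "50+": 0.30}
sample = {"18-29": 0.60, "30-49": 0.30, "50+": 0.10}
census = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}
raw, adjusted = post_stratify(means, sample, census)
```

Here the raw estimate (0.60) over-weights the young stratum, while the post-stratified estimate (0.47) reflects the census shares.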
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both...
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken to assemble the Responsible Open-science Open-collaboration...
Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to...
Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) -- fine-tuning a multilingual PLM on monolingual texts of the language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits...
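The "pre-training objective" that LAFT reuses is, for encoder PLMs, masked language modeling. A self-contained sketch of BERT-style input masking (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) shows what the adaptation step trains on; this is a generic illustration, not the paper's exact pipeline.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style MLM masking. Returns (masked_tokens, labels), where
    labels[i] is the original token at masked positions and None elsewhere;
    the model is trained to recover the labels from the masked input."""
    rng = rng or random.Random(0)
    vocab = list(set(tokens))  # stand-in for a real tokenizer vocabulary
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)             # 10%: keep original
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat while the dog slept".split()
masked, labels = mask_tokens(tokens)
```

LAFT then simply continues training the multilingual PLM on such masked monolingual text in the target language, with no new task head.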
David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez...
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo,...
Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant...
David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne,...
Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Messelle, Hailu Balcha, Sisay Chala, Hagos Gebremichael, Bernard Opoku, Stephen Arthur. Proceedings of the 2023 Conference on Empirical Methods in...
Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Seid Muhie Yimam, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Nedjma Ousidhoum, Abinew Ali Ayele, Saif Mohammad, Meriem Beloucif, Sebastian Ruder. Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 2023.
While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better spoken language understanding (SLU) can strengthen the robustness of massively multilingual ASR by leveraging semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in...
Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) an absence of moderation and (2) censorship due to reliance on keyword spotting out of context. Further, high-profile individuals are frequently at the center of the moderation process, while large-scale targeted hate campaigns against minorities are overlooked. These limitations are mainly due to the lack of high-quality data...
Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining....
People worldwide use language in subtle and complex ways to express emotions. While emotion recognition -- an umbrella term for several NLP tasks -- significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. This has led to major disparities in research and proposed solutions, especially for low-resource languages, which suffer from a lack of high-quality datasets. In this paper, we present BRIGHTER -- a collection of multilabeled, emotion-annotated...
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps, such as keywords, outlines, or reasoning chains, can significantly improve performance, coherence, and interpretability. However, these methods depend on predefined intermediate formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables...
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual corpora. In this work, we find that machine-translated texts from a single source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, an English web dataset, into nine languages, resulting in a 1.7-trillion-token corpus, which we call TransWebEdu, and pretrain...
David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda...
Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques - modeling, cleaning, or filtering the noisy instances - are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve performance, and may even deteriorate it,...
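One common member of the "cleaning or filtering" family the abstract refers to is confidence-based filtering: drop training examples whose annotated label disagrees with a confident auxiliary classifier. A minimal sketch, with `predict_proba` standing in for any such classifier (typically trained on held-out cross-validation folds); the toy data and threshold are illustrative:

```python
def filter_by_confidence(dataset, predict_proba, threshold=0.9):
    """Keep an example unless a confident auxiliary model contradicts
    its annotated label.

    dataset: list of (text, label) pairs
    predict_proba: callable text -> dict mapping label -> probability
    """
    kept = []
    for text, label in dataset:
        probs = predict_proba(text)
        top = max(probs, key=probs.get)
        # Drop only when the model is both confident and in disagreement.
        if top == label or probs[top] < threshold:
            kept.append((text, label))
    return kept

# Toy usage: the third example carries a (simulated) wrong label.
data = [("good movie", "pos"), ("bad movie", "neg"), ("good film", "neg")]

def toy_model(text):
    return {"pos": 0.95, "neg": 0.05} if "good" in text else {"pos": 0.05, "neg": 0.95}

cleaned = filter_by_confidence(data, toy_model)
```

The paper's point is precisely that such filtering is not guaranteed to help with models like BERT; the sketch only makes concrete what these baselines do.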
What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust...
Miaoran Zhang, Marius Mosbach, David Adelani, Michael Hedderich, Dietrich Klakow. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.