NFDI4DS | UHH-SEMS - Publication Details

The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes

OPENALEX - Publications

Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, and 6 more

We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD), an image and text dataset consisting of culinary dishes and their associated customs from around the world. WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. WWD responds to calls in ML to address the limitations of web-scraped Internet datasets with curated, high-quality data incorporating localised...

10.48550/arxiv.2502.05961 preprint EN arXiv (Cornell University) 2025-02-09

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, and 6 more

Africa has a very poor doctor-to-patient ratio. At busy clinics, doctors could see 30+ patients per day, a heavy patient burden compared with developed countries, but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy....

10.1162/tacl_a_00627 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

Towards Biologically Plausible and Private Gene Expression Data Generation

Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and 1 more

Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only on elementary metrics and relatively simple data distributions. In this paper, we initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data....

10.48550/arxiv.2402.04912 preprint EN arXiv (Cornell University) 2024-02-07

Towards Biologically Plausible and Private Gene Expression Data Generation

Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and 1 more

Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only on elementary metrics and relatively simple data distributions. In this paper, we initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data....

10.56553/popets-2024-0062 article EN cc-by Proceedings on Privacy Enhancing Technologies 2024-04-01

AfriNames: Most ASR Models "Butcher" African Names

Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, and 2 more

10.21437/interspeech.2023-2122 article EN Interspeech 2023 2023-08-14

Proceedings of the NeurIPS 2020 Workshop on Machine Learning for the Developing World: Improving Resilience

Tejumade Afonja, Konstantin Klemmer, Aya Salama, Paula Rodríguez Díaz, Niveditha Kalavakonda, and 1 more

These are the proceedings of the 4th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS) on Saturday, December 12th 2020.

10.48550/arxiv.2101.04347 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus

Tejumade Afonja, Oladimeji Mudele, Iroro Orife, Kenechi Dukor, Lawrence Francis, and 4 more

This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks...

10.48550/arxiv.2112.06199 preprint EN cc-by arXiv (Cornell University) 2021-01-01

You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes

Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, and 7 more

Foundation models are increasingly ubiquitous in our daily lives, used in everyday tasks such as text-image searches, interactions with chatbots, and content generation. As use increases, so does concern over the disparities in performance and fairness of these models for different people in different parts of the world. To assess these growing regional disparities, we present World Wide Dishes, a mixed text and image dataset consisting of 765 dishes, with dish names collected in 131 local languages. World Wide Dishes has been collected purely through human...

10.48550/arxiv.2406.09496 preprint EN arXiv (Cornell University) 2024-06-13

1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis

Sewade Ogun, Abraham Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, and 4 more

Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English...

10.48550/arxiv.2406.11727 preprint EN arXiv (Cornell University) 2024-06-17

Performant ASR Models for Medical Entities in Accented Speech

Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Owodunni, and 1 more

Recent strides in automatic speech recognition (ASR) have accelerated their application in the medical domain, where their performance on accented medical named entities (NE) such as drug names, diagnoses, and lab results is largely unknown. We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents. Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety. To empirically...

10.48550/arxiv.2406.12387 preprint EN arXiv (Cornell University) 2024-06-18

Performant ASR Models for Medical Entities in Accented Speech

Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Owodunni, and 1 more

Recent strides in automatic speech recognition (ASR) have accelerated their application in the medical domain, where their performance on accented medical named entities (NE) such as drug names, diagnoses, and lab results is largely unknown. We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents. Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety. To empirically...

10.21437/interspeech.2024-2261 article EN Interspeech 2024 2024-09-01

1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis

Sewade Ogun, Abraham Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, and 4 more

Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English...

10.21437/interspeech.2024-2281 article EN Interspeech 2024 2024-09-01

LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation

Tejumade Afonja, Ivaxi Sheth, Ruta Binkyte, Waqar Hanif, Thomas Ulas, and 2 more

Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation...

10.48550/arxiv.2410.15828 preprint EN arXiv (Cornell University) 2024-10-21

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Tejumade Afonja, Hui‐Po Wang, Raouf Kerkouche, Mario Fritz

Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs), even those at the scale of GPT-2, have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying...

10.48550/arxiv.2412.02467 preprint EN arXiv (Cornell University) 2024-12-03

Generative Extraction of Audio Classifiers for Speaker Identification

Tejumade Afonja, Lucas Bourtoule, Varun Chandrasekaran, Sageev Oore, Nicolas Papernot

It is perhaps no longer surprising that machine learning models, especially deep neural networks, are particularly vulnerable to attacks. One such vulnerability that has been well studied is model extraction: a phenomenon in which the attacker attempts to steal a victim's model by training a surrogate model to mimic the decision boundaries of the victim model. Previous works have demonstrated the effectiveness of such an attack and its devastating consequences, but much of this work has been done primarily for image and text processing tasks. Our work is the first...

10.48550/arxiv.2207.12816 preprint EN public-domain arXiv (Cornell University) 2022-01-01

Proceedings of the NeurIPS 2021 Workshop on Machine Learning for the Developing World: Global Challenges

Paula Rodríguez Díaz, Tejumade Afonja, Konstantin Klemmer, Aya Salama, Niveditha Kalavakonda, and 2 more

These are the proceedings of the 5th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) on December 14th, 2021.

10.48550/arxiv.2301.04007 preprint EN other-oa arXiv (Cornell University) 2023-01-01

AfriNames: Most ASR models "butcher" African Names

Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, and 2 more

Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem...

10.48550/arxiv.2306.00253 preprint EN cc-by arXiv (Cornell University) 2023-01-01

MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime

Tejumade Afonja, Dingfan Chen, Mario Fritz

The potential of realistic and useful synthetic data is significant. However, current evaluation methods for synthetic tabular data generation predominantly focus on downstream task usefulness, often neglecting the importance of statistical properties. This oversight becomes particularly prominent in low sample scenarios, accompanied by a swift deterioration of these statistical measures. In this paper, we address this issue by conducting an evaluation of three state-of-the-art synthetic tabular data generators based on their marginal distribution, column-pair...

10.48550/arxiv.2307.07997 preprint EN other-oa arXiv (Cornell University) 2023-01-01

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, and 6 more

Africa has a very low doctor-to-patient ratio. At busy clinics, doctors could see 30+ patients per day, a heavy patient burden compared with developed countries, but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps...

10.48550/arxiv.2310.00274 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Proceedings of NeurIPS 2019 Workshop on Machine Learning for the Developing World: Challenges and Risks of ML4D

Maria De‐Arteaga, Tejumade Afonja, Amanda Coston

These are the proceedings of the 3rd ML4D workshop, which was held in Vancouver, Canada on December 13, 2019 as part of the Neural Information Processing Systems conference.

10.48550/arxiv.2001.00249 preprint EN other-oa arXiv (Cornell University) 2020-01-01