From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning

Keywords: Crowdsourcing, Training set, Data set
DOI: 10.48550/arxiv.2401.13229 Publication Date: 2024-01-01
ABSTRACT
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. One option is the use of crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to the annotators' experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advancements driven by large language models show potential, but they struggle to adapt to specialized domains with severely limited data. The most common approaches therefore involve humans themselves randomly annotating a set of datapoints to build initial datasets. But random sampling is often inefficient, as it ignores the characteristics of the data and the specific needs of the model. The situation worsens when working with imbalanced datasets, where random sampling tends to bias heavily towards the majority classes, leading to excessive annotation effort. To address these issues, this paper contributes an automatic and informed data selection architecture for building a small dataset. Our proposal minimizes the quantity and maximizes the diversity of data selected for human annotation, while improving model performance.
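To make the abstract's core idea concrete, the sketch below illustrates one common form of diversity-based selection: embedding an unlabeled pool and greedily picking the points farthest from those already chosen (a k-center style heuristic). This is an illustrative assumption, not the paper's actual architecture; the function name `select_diverse`, the TF-IDF representation, and the toy text pool are all hypothetical.

```python
# Minimal sketch of diversity-based data selection for annotation (assumed approach,
# not the paper's exact method): embed unlabeled texts with TF-IDF and pick a small,
# diverse subset via greedy farthest-point (k-center) selection.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def select_diverse(texts, budget, seed=0):
    """Return indices of `budget` texts chosen to maximize diversity."""
    X = TfidfVectorizer().fit_transform(texts)
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(texts)))]        # random starting point
    # distance of every point to its nearest already-selected point
    min_dist = cosine_distances(X, X[selected]).ravel()
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))                # farthest from current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, cosine_distances(X, X[nxt]).ravel())
    return selected

if __name__ == "__main__":
    pool = ["the service was great", "terrible battery life",
            "loved the camera quality", "shipping took forever",
            "awful customer support", "screen is bright and sharp"]
    print(select_diverse(pool, budget=3))   # indices of texts to send for annotation
```

Compared with random sampling, this kind of selection spreads the annotation budget across dissimilar regions of the data, which is the behavior the abstract argues is missing from purely random annotation.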