Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2502.06434 Publication Date: 2025-02-10
ABSTRACT
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies from both literatures. Notably, our benchmark reveals that in the mainstream setting for large-scale datasets, which relies heavily on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to images, PCA paves the way for more balanced and accessible dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
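The abstract only names the three PCA stages; the following is a minimal illustrative sketch of such a pipeline, not the authors' implementation. All function names and the concrete choices (random pruning, which the benchmark shows is a strong baseline; pixel-wise averaging as the combine step; horizontal flips as the augmentation) are assumptions made for illustration.

```python
import numpy as np

def prune(images, labels, keep_ratio=0.1, rng=None):
    """Prune: keep a subset of the images (here: random selection,
    which the paper's benchmark finds surprisingly competitive)."""
    rng = rng or np.random.default_rng(0)
    n_keep = max(1, int(len(images) * keep_ratio))
    idx = rng.choice(len(images), size=n_keep, replace=False)
    return images[idx], labels[idx]

def combine(images, labels, group=2):
    """Combine: merge groups of kept images into single composites
    (illustrated here as pixel-wise averaging within each group)."""
    n = (len(images) // group) * group
    merged = images[:n].reshape(-1, group, *images.shape[1:]).mean(axis=1)
    return merged, labels[:n:group]  # hard labels only, per the paper's setup

def augment(images):
    """Augment: expand the compressed set with cheap transforms
    (illustrated here as horizontal flips)."""
    flipped = images[..., ::-1]
    return np.concatenate([images, flipped], axis=0)

# Toy data: 100 "images" of shape 8x8 with hard (integer) labels.
rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))
labels = rng.integers(0, 10, size=100)

pruned, plabels = prune(images, labels, keep_ratio=0.2, rng=rng)
combined, clabels = combine(pruned, plabels)
final = augment(combined)
print(len(pruned), len(combined), len(final))  # 20 10 20
```

The point of the sketch is the interface, not the operations: each stage consumes and produces plain image arrays with hard labels, so no pre-trained teacher model or stored soft-label tensors are needed at any step.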