Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2502.06434 Publication Date: 2025-02-10
ABSTRACT
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies from both literatures. Notably, our benchmark reveals that in the mainstream setting for large-scale datasets, which relies heavily on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to images, PCA paves the way for more balanced and accessible dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
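The abstract only names the three PCA stages; the following is a minimal illustrative sketch of such a pipeline, not the authors' implementation. All function names and the concrete choices (random pruning, which the benchmark shows is a strong baseline; pixel-wise averaging as the combine step; horizontal flips as the augmentation) are assumptions made for illustration.

```python
import numpy as np

def prune(images, labels, keep_ratio=0.1, rng=None):
    """Prune: keep a subset of the images (here: random selection,
    which the paper's benchmark finds surprisingly competitive)."""
    rng = rng or np.random.default_rng(0)
    n_keep = max(1, int(len(images) * keep_ratio))
    idx = rng.choice(len(images), size=n_keep, replace=False)
    return images[idx], labels[idx]

def combine(images, labels, group=2):
    """Combine: merge groups of kept images into single composites
    (illustrated here as pixel-wise averaging within each group)."""
    n = (len(images) // group) * group
    merged = images[:n].reshape(-1, group, *images.shape[1:]).mean(axis=1)
    return merged, labels[:n:group]  # hard labels only, per the paper's setup

def augment(images):
    """Augment: expand the compressed set with cheap transforms
    (illustrated here as horizontal flips)."""
    flipped = images[..., ::-1]
    return np.concatenate([images, flipped], axis=0)

# Toy data: 100 "images" of shape 8x8 with hard (integer) labels.
rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))
labels = rng.integers(0, 10, size=100)

pruned, plabels = prune(images, labels, keep_ratio=0.2, rng=rng)
combined, clabels = combine(pruned, plabels)
final = augment(combined)
print(len(pruned), len(combined), len(final))  # 20 10 20
```

The point of the sketch is the interface, not the operations: each stage consumes and produces plain image arrays with hard labels, so no pre-trained teacher model or stored soft-label tensors are needed at any step.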