RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.48550/arXiv.2401.10815
Publication Date:
2024-01-01
AUTHORS (15)
ABSTRACT
Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images, serving as a foundational element in multimodal systems within the computer vision and medical imaging domains. However, the resulting models are limited by the information contained within the text. This is particularly problematic in medical imaging, where radiologists' written findings focus on specific observations; the challenge is compounded by the scarcity of paired imaging-text data due to concerns over the leakage of personal health information. In this work, we fundamentally challenge the prevailing reliance on language supervision for learning general-purpose biomedical image encoders. We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal imaging data that obtains similar or greater performance than state-of-the-art language-supervised models on a diverse range of benchmarks. Specifically, the quality of the learned representations is evaluated on standard imaging tasks (classification and semantic segmentation) and a vision-language alignment task (text report generation from images). To further demonstrate the drawback of language supervision, we show that features of RAD-DINO correlate with other medical records (e.g., sex or age) better than those of language-supervised models, as such attributes are generally not mentioned in radiology reports. Finally, we conduct a series of ablations determining the factors behind RAD-DINO's performance; notably, we observe that its downstream performance scales well with the quantity and diversity of training data, demonstrating that image-only supervision is a scalable approach for training a foundational biomedical image encoder.
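The abstract positions RAD-DINO as a standalone image encoder whose global and patch-level features feed downstream classification, segmentation, and report-generation systems. As a minimal sketch of how such an encoder is typically consumed, the snippet below extracts both feature types via the Hugging Face transformers API; the checkpoint identifier microsoft/rad-dino, the input file name, and the commented output shapes are assumptions for illustration, not details stated on this page.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo = "microsoft/rad-dino"  # assumed Hub identifier for the released checkpoint
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo).eval()

# Placeholder input: any chest X-ray image file on disk.
image = Image.open("chest_xray.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)

# Global image embedding, suitable for linear-probe classification.
cls_embedding = outputs.pooler_output               # (1, hidden_dim)
# Patch-token features, suitable for dense tasks such as segmentation.
patch_tokens = outputs.last_hidden_state[:, 1:, :]  # (1, n_patches, hidden_dim)
```

A frozen encoder used this way matches the paper's evaluation setting: the CLS embedding feeds lightweight classifiers, while patch tokens support dense prediction heads and report-generation decoders.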