DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arXiv.2502.05091 Publication Date: 2025-02-07
ABSTRACT
Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D imaging remains computationally challenging. Existing approaches rely on Vision Transformers (ViTs), which are expensive due to self-attention's quadratic complexity, or on convolutions, which demand excessive parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width axes. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired chest CT volumes and radiology reports, for multi-abnormality detection across 18 pathologies. Compared to ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using far fewer parameters. These results highlight DCFormer's potential for scalable, clinically deployable VLMs. Our codes will be publicly available.
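IMPLEMENTATION SKETCH
As a rough illustration of the decomposed-convolution idea in the abstract, the PyTorch sketch below replaces a single k x k x k 3D convolution with three parallel 1D convolutions along depth, height, and width. The class name, the depthwise grouping, the kernel size, and the fusion of the three branches by summation are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn as nn


class DecomposedConv3d(nn.Module):
    """Approximates a k x k x k 3D convolution with three parallel
    depthwise 1D convolutions along depth, height, and width, summing
    their outputs. Per-channel weight count scales as 3k instead of k^3.
    (Hypothetical sketch; fusion by summation is an assumption.)"""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        # One depthwise 1D convolution per spatial axis (D, H, W).
        self.conv_d = nn.Conv3d(channels, channels, (kernel_size, 1, 1),
                                padding=(pad, 0, 0), groups=channels)
        self.conv_h = nn.Conv3d(channels, channels, (1, kernel_size, 1),
                                padding=(0, pad, 0), groups=channels)
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, kernel_size),
                                padding=(0, 0, pad), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, depth, height, width); shape is preserved.
        return self.conv_d(x) + self.conv_h(x) + self.conv_w(x)


if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 64, 64)  # e.g. a 3D CT feature map
    block = DecomposedConv3d(channels=32, kernel_size=7)
    print(block(x).shape)  # torch.Size([1, 32, 64, 64, 64])
```

With a 7x7x7 depthwise kernel, the full 3D convolution needs 343 weights per channel, while the three 1D branches need 21, which is the source of the parameter and FLOP savings the abstract claims.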