DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arXiv.2502.05091 Publication Date: 2025-02-07
ABSTRACT
Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D imaging remains computationally challenging. Existing approaches rely on Vision Transformers (ViTs), which are expensive due to self-attention's quadratic complexity, or on convolutions, which demand excessive parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width axes. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired chest CT volumes and radiology reports, for multi-abnormality detection across 18 pathologies. Compared to ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using far fewer parameters. These results highlight DCFormer's potential for scalable, clinically deployable VLMs. Our codes will be publicly available.
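IMPLEMENTATION SKETCH
As a rough illustration of the decomposed-convolution idea in the abstract, the PyTorch sketch below replaces a single k x k x k 3D convolution with three parallel 1D convolutions along depth, height, and width. The class name, the depthwise grouping, the kernel size, and the fusion of the three branches by summation are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn as nn


class DecomposedConv3d(nn.Module):
    """Approximates a k x k x k 3D convolution with three parallel
    depthwise 1D convolutions along depth, height, and width, summing
    their outputs. Per-channel weight count scales as 3k instead of k^3.
    (Hypothetical sketch; fusion by summation is an assumption.)"""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        # One depthwise 1D convolution per spatial axis (D, H, W).
        self.conv_d = nn.Conv3d(channels, channels, (kernel_size, 1, 1),
                                padding=(pad, 0, 0), groups=channels)
        self.conv_h = nn.Conv3d(channels, channels, (1, kernel_size, 1),
                                padding=(0, pad, 0), groups=channels)
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, kernel_size),
                                padding=(0, 0, pad), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, depth, height, width); shape is preserved.
        return self.conv_d(x) + self.conv_h(x) + self.conv_w(x)


if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 64, 64)  # e.g. a 3D CT feature map
    block = DecomposedConv3d(channels=32, kernel_size=7)
    print(block(x).shape)  # torch.Size([1, 32, 64, 64, 64])
```

With a 7x7x7 depthwise kernel, the full 3D convolution needs 343 weights per channel, while the three 1D branches need 21, which is the source of the parameter and FLOP savings the abstract claims.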