Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.48550/arxiv.2407.19394
Publication Date:
2024-07-28
AUTHORS (4)
ABSTRACT
The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring fine-grained local details. Consequently, ViT lacks inductive bias during training on image and video datasets. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants: one allows the Depth-Wise Convolution modules to be applied to multiple Transformer blocks for parameter savings, and the other incorporates independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach significantly boosts the performance of image classification, object detection, and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet for image classification, and on COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.
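The abstract only describes the mechanism at a high level. As a rough illustration, the sketch below shows one plausible way the idea could look in PyTorch: a depth-wise convolution applied to the patch-token grid and added as a shortcut around a Transformer block, together with the parallel multi-kernel variant mentioned in the abstract. The class names (DepthWiseShortcut, BlockWithDWShortcut, ParallelDWShortcut), kernel sizes, and tensor layout are assumptions for illustration only; the authoritative implementation is in the linked repository.

```python
import torch
import torch.nn as nn


class DepthWiseShortcut(nn.Module):
    """Hypothetical sketch: a lightweight depth-wise convolution applied to the
    patch-token grid, usable as a shortcut around a Transformer block."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise (one small spatial filter
        # per channel), keeping the added parameter count minimal.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, C) patch tokens; grid_hw: (H, W) patch grid with N == H * W
        b, n, c = tokens.shape
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(b, c, h, w)   # back to a 2-D feature map
        x = self.dwconv(x)                               # local, per-channel filtering
        return x.reshape(b, c, n).transpose(1, 2)        # back to a token sequence


class BlockWithDWShortcut(nn.Module):
    """Wraps an existing Transformer block: output = block(x) + depth-wise shortcut(x)."""

    def __init__(self, block, dim, kernel_size=3):
        super().__init__()
        self.block = block
        self.shortcut = DepthWiseShortcut(dim, kernel_size)

    def forward(self, tokens, grid_hw):
        return self.block(tokens) + self.shortcut(tokens, grid_hw)


class ParallelDWShortcut(nn.Module):
    """Variant sketch: independent parallel depth-wise convolutions with different
    kernel sizes, summed to gather local context at several scales."""

    def __init__(self, dim, kernel_sizes=(3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            DepthWiseShortcut(dim, k) for k in kernel_sizes)

    def forward(self, tokens, grid_hw):
        return sum(branch(tokens, grid_hw) for branch in self.branches)


# Usage example with a generic PyTorch encoder layer standing in for a ViT block
# (the real model, patch size, and dimensions are assumptions):
block = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
wrapped = BlockWithDWShortcut(block, dim=192)
out = wrapped(torch.randn(2, 196, 192), grid_hw=(14, 14))  # 14x14 patch grid
```

In this reading, groups=dim is what makes the branch depth-wise: each channel gets its own small spatial filter, so the extra cost per block is roughly dim * k * k parameters, consistent with the abstract's claim of a lightweight, low-overhead shortcut.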