Improving Referring Image Segmentation using Vision-Aware Text Features
DOI: 10.48550/arXiv.2404.08590
Publication Date: 2024-04-12
AUTHORS (6)
ABSTRACT
Referring image segmentation is a challenging task that involves generating pixel-wise masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework, VATEX, that improves referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric heatmap with the text description, which can be used as an initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that there are multiple ways to describe the same instance in an image, we enforce feature similarity between text variations referring to the same visual input through two components: a Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint that further ensures a coherent and consistent interpretation of language expressions with the context understanding obtained from the image. Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. Code is available at: https://nero1342.github.io/VATEX_RIS
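Two ideas from the abstract lend themselves to a short illustration: an object-centric heatmap obtained by matching a CLIP-style text embedding against patch-level image features, and a consistency loss that pulls together the features of two expressions describing the same instance. The following is a minimal PyTorch sketch of these ideas, not the authors' implementation; the tensor names (patch_feats, text_feat) and the cosine-distance loss are assumptions, and the actual VATEX formulation may differ.

    # Minimal sketch (not the authors' code) of two ideas from the abstract.
    import torch
    import torch.nn.functional as F

    def clip_prior_heatmap(patch_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        """Object-centric heatmap from CLIP-style features.

        patch_feats: (B, N, D) patch embeddings from an image encoder.
        text_feat:   (B, D) sentence embedding from a text encoder.
        Returns:     (B, N) similarity map, softmax-normalized over patches.
        """
        p = F.normalize(patch_feats, dim=-1)
        t = F.normalize(text_feat, dim=-1)
        sim = torch.einsum("bnd,bd->bn", p, t)  # cosine similarity per patch
        return sim.softmax(dim=-1)

    def meaning_consistency_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        """Encourage two paraphrases of the same instance to yield similar
        vision-aware text features; feat_a and feat_b are both (B, D)."""
        a = F.normalize(feat_a, dim=-1)
        b = F.normalize(feat_b, dim=-1)
        return (1.0 - (a * b).sum(dim=-1)).mean()  # mean cosine distance

    # Toy usage with random tensors standing in for encoder outputs.
    B, N, D = 2, 196, 512
    heatmap = clip_prior_heatmap(torch.randn(B, N, D), torch.randn(B, D))
    loss = meaning_consistency_loss(torch.randn(B, D), torch.randn(B, D))
    print(heatmap.shape, loss.item())

In the paper's pipeline, a heatmap of this kind would inform the initial queries of the DETR-based decoder, and the consistency loss would be applied to the vision-aware text features produced by the Contextual Multimodal Decoder.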