LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba

Occupancy
DOI: 10.1609/aaai.v39i3.32264 Publication Date: 2025-04-11T09:46:50Z
ABSTRACT
Vision-based 3D occupancy prediction has become a popular research task due to its versatility and affordability. Conventional methods usually project image-based vision features into 3D space and learn geometric information through an attention mechanism, enabling 3D semantic occupancy prediction. However, these works face two main challenges: 1) Limited geometric information. Due to the lack of geometric information in the image itself, it is challenging to directly predict 3D space information, especially in large-scale outdoor scenes. 2) Local restricted interaction. Due to the quadratic complexity of the attention mechanism, they often use modified local attention to fuse features, resulting in restricted fusion. To address these problems, in this paper, we propose a language-assisted 3D semantic occupancy prediction network, named LOMA. In the proposed vision-language framework, we first introduce a VL-aware Scene Generator (VSG) module to generate the 3D language feature of the scene. By leveraging a vision-language model, this module provides implicit geometric knowledge and explicit semantic information from language. Furthermore, we present a Tri-plane Fusion Mamba (TFM) block to efficiently fuse the 3D language feature and the 3D vision feature. The proposed module not only fuses the two features with global modeling but also avoids excessive computation costs. Experiments on the SemanticKITTI and SSCBench-KITTI360 datasets show that our algorithm achieves new state-of-the-art performance in both geometric and semantic completion tasks. Our code will be open soon.
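The efficiency argument in the abstract (quadratic attention over voxels versus a tri-plane representation) can be made concrete with a rough sketch. The code below is not the paper's TFM block and contains no Mamba layer; it only illustrates, under assumptions, how a voxel grid can be collapsed into three axis-aligned planes and how the token counts compare. The function names, the use of mean pooling, and the 128 x 128 x 16 grid size are all hypothetical choices made for illustration.

```python
def triplane_token_counts(x, y, z):
    """Token counts for a full x*y*z voxel grid and its three axis-aligned planes."""
    return {"volume": x * y * z, "xy": x * y, "xz": x * z, "yz": y * z}


def pairwise_cost(n_tokens):
    """Number of token pairs scored by dense self-attention (quadratic in length)."""
    return n_tokens * n_tokens


def mean_pool_to_planes(vol):
    """Collapse vol[x][y][z] (scalars) into XY, XZ and YZ planes.

    Mean pooling along the remaining axis is an assumption for this sketch;
    the paper may build its planes differently.
    """
    X, Y, Z = len(vol), len(vol[0]), len(vol[0][0])
    xy = [[sum(vol[i][j][k] for k in range(Z)) / Z for j in range(Y)] for i in range(X)]
    xz = [[sum(vol[i][j][k] for j in range(Y)) / Y for k in range(Z)] for i in range(X)]
    yz = [[sum(vol[i][j][k] for i in range(X)) / X for k in range(Z)] for j in range(Y)]
    return xy, xz, yz


# Illustrative grid size (an assumption, not taken from the paper).
counts = triplane_token_counts(128, 128, 16)
plane_tokens = counts["xy"] + counts["xz"] + counts["yz"]
print("voxel tokens:", counts["volume"])
print("tri-plane tokens:", plane_tokens)
print("dense attention pairs over voxels:", pairwise_cost(counts["volume"]))
```

With these illustrative numbers the three planes together hold far fewer tokens than the full volume, which is why a sequence model with cost linear in length can afford global context over the planes, whereas dense attention over every voxel pair cannot.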