- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- 3D Surveying and Cultural Heritage
- Advanced Image and Video Retrieval Techniques
- Robotics and Sensor-Based Localization
- 3D Shape Modeling and Analysis
- Robotics and Automated Systems
- Natural Language Processing Techniques
University of Hong Kong
2023-2024
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough in 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pretrained vision-language (VL) foundation models through captioning...
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor behind the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs....
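The caption-driven distillation described in the two abstracts above boils down to aligning 3D features with text embeddings produced by a frozen vision-language model. A minimal sketch of such an alignment objective is below; the InfoNCE-style loss, the function names, and the feature shapes are illustrative assumptions, not the papers' actual implementation.

```python
import numpy as np

def contrastive_caption_loss(point_feats, caption_feats, temperature=0.07):
    """InfoNCE-style loss pulling each 3D point-group feature toward the
    embedding of its paired caption (from a frozen 2D VL model).
    Shapes: both inputs are (N, D); row i of each is a matched pair.
    All names here are hypothetical, for illustration only."""
    # L2-normalize both modalities so similarity is cosine similarity
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = caption_feats / np.linalg.norm(caption_feats, axis=1, keepdims=True)
    logits = p @ t.T / temperature           # (N, N) similarity matrix
    # matched point/caption pairs lie on the diagonal
    labels = np.arange(len(p))
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_softmax[labels, labels].mean()

# toy check: correctly paired features yield a lower loss than mismatched ones
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
aligned = contrastive_caption_loss(feats, feats)
shuffled = contrastive_caption_loss(feats, feats[::-1])
```

The point of the contrastive form is that it needs no category labels at all: the caption text supplies open-vocabulary supervision, which is why such methods can recognize classes absent from the annotated 3D label space.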
Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models exhibit sensitivity to the style of language input, struggling to understand sentences with the same semantic meaning but written in different variants. This observation raises a critical question: Can 3D-VL models truly understand natural language? To test the language understandability of 3D-VL models, we first propose a robustness...
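The robustness probe hinted at in the abstract above can be sketched as a paraphrase-consistency check: feed a model several rewordings of the same query and count how often its prediction is unchanged. The interface (`model` mapping a sentence to a predicted object id) and the toy data are assumptions for illustration, not the paper's benchmark.

```python
def consistency_rate(model, variant_groups):
    """Fraction of paraphrase groups for which a grounding model returns
    the same target object for every rewording of the query.
    `model` is any callable sentence -> object id (hypothetical interface)."""
    consistent = 0
    for variants in variant_groups:
        preds = {model(v) for v in variants}
        consistent += len(preds) == 1       # one unique prediction = robust
    return consistent / len(variant_groups)

# toy model keyed on a single keyword, hence brittle to phrasing changes
toy = lambda s: 1 if "chair" in s.lower() else 0
groups = [
    ["the chair near the window", "the seat near the window"],  # inconsistent
    ["the chair by the desk", "that chair next to the desk"],   # consistent
]
rate = consistency_rate(toy, groups)  # -> 0.5
```

A low consistency rate under meaning-preserving rewrites is exactly the style sensitivity the abstract describes: the model reacts to surface wording rather than semantics.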