Learning Robust 3D Representation from CLIP via Dual Denoising

DOI: 10.48550/arxiv.2407.00905 Publication Date: 2024-06-30
ABSTRACT
In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representations from pre-trained vision-language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resultant network is still vulnerable to adversarial attacks, especially iterative attacks. In this work, we propose Dual Denoising, a novel framework for learning robust 3D representations. It combines a denoising-based proxy task with feature denoising for pre-training. Additionally, we utilize parallel noise inference to enhance the generalization of point cloud features under cross-domain settings. Experiments show that our model effectively improves performance and robustness under zero-shot settings without adversarial training. Our code is available at https://github.com/luoshuqing2001/Dual_Denoising.