- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Robotics and Sensor-Based Localization
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- 2D Materials and Applications
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Handwritten Text Recognition Techniques
- Visual Attention and Saliency Detection
- Graphene Research and Applications
- 3D Surveying and Cultural Heritage
- Remote-Sensing Image Classification
- Rock Mechanics and Modeling
- MXene and MAX Phase Materials
- Optical Wireless Communication Technologies
- Advanced Image Processing Techniques
- Image Processing Techniques and Applications
- Digital Media Forensic Detection
- Industrial Vision Systems and Defect Detection
- Ga2O3 and Related Materials
- Advanced Photocatalysis Techniques
- Medical Image Segmentation Techniques
- Advanced Optical Sensing Technologies
- Grouting, Rheology, and Soil Mechanics
- Sun Yat-sen University (2025)
- University of Science and Technology of China (2018-2024)
- Tongji University (2022-2024)
- Australian Centre for Robotic Vision (2023-2024)
- The University of Adelaide (2023-2024)
- National University of Defense Technology (2024)
- North China Electric Power University (2018-2024)
- The University of Sydney (2022-2024)
- Guizhou University (2024)
- Shanghai University (2024)
Recent advances on 3D object detection heavily rely on how the data are represented, i.e., voxel-based or point-based representation. Many existing high-performance detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. In contrast, the voxel-based structure is better suited for feature extraction but often yields lower accuracy because the input is divided into grids. In this paper, we take a slightly different viewpoint --- we find that...
3D object detection is receiving increasing attention from both industry and academia thanks to its wide applications in various fields. In this paper, we propose the Point-Voxel Region-based Convolution Neural Networks (PV-RCNNs) for 3D object detection on point clouds. First, we propose a novel detector, PV-RCNN, which boosts the performance by deeply integrating the feature learning of point-based set abstraction and voxel-based sparse convolution through two steps, i.e., the voxel-to-keypoint scene encoding and the keypoint-to-grid...
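The voxel-to-keypoint idea above can be illustrated with a toy example. The sketch below is not the PV-RCNN implementation; it only assumes that a point cloud is averaged into voxel features and that a handful of sampled keypoints then pool the features of nearby voxels (the `voxelize` and `voxel_to_keypoint` helpers, the radius, and the voxel size are all illustrative stand-ins).

```python
# Minimal sketch (not the official PV-RCNN code): voxel-to-keypoint encoding,
# i.e. pooling voxel features around a small set of sampled keypoints.
import torch

def voxelize(points, voxel_size=0.2):
    """Assign each point to a voxel and average the point features per voxel."""
    coords = torch.floor(points[:, :3] / voxel_size).long()           # (N, 3) voxel indices
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)  # unique occupied voxels
    feats = torch.zeros(len(uniq), points.shape[1])
    feats.index_add_(0, inverse, points)                               # sum features per voxel
    counts = torch.bincount(inverse, minlength=len(uniq)).clamp(min=1)
    return uniq.float() * voxel_size, feats / counts.unsqueeze(1)      # voxel positions, mean features

def voxel_to_keypoint(keypoints, voxel_centers, voxel_feats, radius=1.0):
    """Aggregate features of voxels that fall within `radius` of each keypoint."""
    dist = torch.cdist(keypoints, voxel_centers)                       # (K, V) pairwise distances
    mask = (dist < radius).float()                                     # neighbourhood indicator
    weights = mask / mask.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return weights @ voxel_feats                                       # (K, C) pooled keypoint features

points = torch.rand(2048, 4) * 10                    # toy point cloud: x, y, z, intensity
keypoints = points[torch.randperm(2048)[:64], :3]    # stand-in for furthest point sampling
centers, vfeats = voxelize(points)
kp_feats = voxel_to_keypoint(keypoints, centers, vfeats)
print(kp_feats.shape)                                 # torch.Size([64, 4])
```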
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region on an image. The state-of-the-art methods, including two-stage or one-stage ones, rely on a complex module with manually-designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in the fusion module design, such as query decomposition and image scene graph, makes the models easily overfit to datasets...
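To make the contrast with manually-designed fusion modules concrete, here is a hedged sketch of a TransVG-style pipeline: visual tokens, language tokens, and a learnable [REG] token are fused by a plain Transformer encoder, and the box is regressed directly from the [REG] token. `ToyTransVG` and the random token inputs are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the TransVG idea: joint Transformer fusion plus direct box
# regression from a learnable [REG] token. Backbones are replaced by random inputs.
import torch
import torch.nn as nn

class ToyTransVG(nn.Module):
    def __init__(self, dim=256, layers=6, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.reg_token = nn.Parameter(torch.randn(1, 1, dim))          # learnable [REG] token
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4), nn.Sigmoid()) # (cx, cy, w, h) in [0, 1]

    def forward(self, visual_tokens, text_tokens):
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)   # joint token sequence
        fused = self.encoder(tokens)
        return self.box_head(fused[:, 0])                               # read out the [REG] token

model = ToyTransVG()
vis = torch.randn(2, 400, 256)   # e.g. a 20x20 flattened visual feature map
txt = torch.randn(2, 20, 256)    # e.g. 20 language token embeddings
print(model(vis, txt).shape)     # torch.Size([2, 4])
```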
It has been well recognized that modeling object-to-object relations would be helpful for object detection. Nevertheless, the problem is not trivial, especially when exploring the interactions between objects to boost video object detectors. The difficulty originates from the aspect that reliable object relations in a video should depend on not only the objects in the present frame but also all the supportive objects extracted over a long-range span of the video. In this paper, we introduce a new design to capture the interactions across objects in the spatio-temporal context. Specifically, Relation...
It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is non-trivial to explore the inherently unnatural interaction between the sparse 3D points and the dense 2D pixels. To ease this difficulty, recent approaches generally project the 3D points onto the image plane to sample the image data and then aggregate the data at the points. However, these approaches often suffer from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal...
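The projection-and-sampling strategy discussed above can be sketched in a few lines. The snippet below assumes a generic 3x4 camera projection matrix and bilinear sampling with `grid_sample`; `sample_image_feats` and all numeric values are hypothetical, not taken from the paper.

```python
# Illustrative sketch (assumed camera model, not the paper's pipeline): project
# LiDAR points into the image plane and bilinearly sample image features there.
import torch
import torch.nn.functional as F

def sample_image_feats(points, image_feats, P, img_hw):
    """points: (N, 3); image_feats: (1, C, H, W); P: (3, 4) projection matrix."""
    homog = torch.cat([points, torch.ones(len(points), 1)], dim=1)    # (N, 4) homogeneous coords
    uvw = homog @ P.T                                                  # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                      # pixel coordinates
    h, w = img_hw
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=1) * 2 - 1  # to [-1, 1]
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(image_feats, grid, align_corners=True)    # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).T                             # (N, C) per-point image features

points = torch.rand(100, 3) * torch.tensor([4.0, 2.0, 20.0]) + torch.tensor([-2.0, -1.0, 5.0])
feats = torch.randn(1, 64, 48, 160)                                    # toy image feature map
P = torch.tensor([[100.0, 0.0, 80.0, 0.0],
                  [0.0, 100.0, 24.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])                               # toy pinhole projection
print(sample_image_feats(points, feats, P, (48, 160)).shape)           # torch.Size([100, 64])
```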
Coral reef limestones at different depositional depths and facies differ remarkably in their textural and mineralogical characteristics, owing to complex sedimentary diagenesis. To explore the effects of pore structure and mineral composition associated with diagenetic variation on the mechanical behavior of coral reef limestone, a series of quasi-static and dynamic compression tests, along with microscopic examinations, were performed on samples from shallow and deep burial depths. It is revealed that the shallow reef limestone (SRL) is classified as a porous aragonite-type carbonate rock...
As an emerging data modality with precise distance sensing, LiDAR point clouds carry great expectations for 3D scene understanding. However, point clouds are always sparsely distributed in 3D space and unstructured in storage, which makes it difficult to represent them for effective object detection. To this end, in this work, we regard point clouds as hollow-3D data and propose a new architecture, namely the Hallucinated Hollow-3D R-CNN (H²3D R-CNN)...
Temporal language grounding (TLG) is a fundamental and challenging problem for vision-language understanding. Existing methods mainly focus on the fully supervised setting with temporal boundary labels for training, which, however, suffers from the expensive cost of annotation. In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given for an untrimmed video without temporal boundary labels. In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end,...
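As a minimal illustration of cross-modal alignment without boundary labels, the sketch below scores a video by the maximum segment-sentence cosine similarity and trains matched pairs to outscore mismatched ones with a margin ranking loss. This is an assumed, generic weakly supervised formulation, not the method proposed in the paper.

```python
# Toy weakly supervised alignment sketch (assumed ranking-loss formulation):
# the video-level score is the max over per-segment similarities, so gradients
# flow only through the best-matching segments.
import torch
import torch.nn.functional as F

def video_score(segment_embs, sentence_emb):
    """Max over per-segment cosine similarities acts as the video-level score."""
    return F.cosine_similarity(segment_embs, sentence_emb, dim=1).max()

segments_pos = torch.randn(32, 256, requires_grad=True)   # segments of the described video
segments_neg = torch.randn(32, 256, requires_grad=True)   # segments of an unrelated video
sentence = torch.randn(1, 256)                             # pooled sentence embedding

margin = 0.2
loss = F.relu(margin - video_score(segments_pos, sentence) + video_score(segments_neg, sentence))
loss.backward()
print(loss.item())
```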
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of grounding, i.e., multi-modal fusion and reasoning, with manually-designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box...
Single-shot detectors that are potentially faster and simpler than two-stage detectors tend to be more applicable to object detection in videos. Nevertheless, the extension of such detectors from image to video is not trivial, especially when appearance deterioration exists in videos, e.g., motion blur or occlusion. A valid question is how to explore temporal coherence across frames for boosting detection. In this paper, we propose to address the problem by enhancing per-frame features through aggregation of neighboring frames....
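A toy version of per-frame feature aggregation might look like the following: the target frame's features are enhanced by a similarity-weighted average over neighbouring frames. The weighting scheme and the `aggregate_frames` helper are assumptions for illustration, not the paper's module.

```python
# Toy sketch of cross-frame feature aggregation (assumed scheme): weight each
# neighbouring frame by per-pixel cosine similarity to the target frame.
import torch
import torch.nn.functional as F

def aggregate_frames(frame_feats, target_idx):
    """frame_feats: (T, C, H, W) per-frame features; returns an enhanced (C, H, W) map."""
    target = frame_feats[target_idx]                                       # (C, H, W)
    sims = []
    for t in range(frame_feats.size(0)):
        sims.append(F.cosine_similarity(target, frame_feats[t], dim=0))    # (H, W) per-pixel similarity
    weights = torch.softmax(torch.stack(sims), dim=0)                      # (T, H, W), sums to 1 over T
    return (weights.unsqueeze(1) * frame_feats).sum(dim=0)                 # similarity-weighted aggregation

feats = torch.randn(5, 128, 32, 32)   # 5 neighbouring frames of toy features
enhanced = aggregate_frames(feats, target_idx=2)
print(enhanced.shape)                  # torch.Size([128, 32, 32])
```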
In this work, we propose a new framework, called Document Image Transformer (DocTr), to address the issue of geometry and illumination distortion in document images. Specifically, DocTr consists of a geometric unwarping transformer and an illumination correction transformer. By setting a set of learned query embeddings, the geometric unwarping transformer captures the global context of the document image via the self-attention mechanism and decodes the pixel-wise displacement solution to correct the distortion. After unwarping, our illumination correction transformer further removes the shading artifacts to improve the visual quality and OCR...
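Only the final unwarping step lends itself to a compact sketch: given a predicted pixel-wise displacement field, the rectified image is obtained by warping the input with `grid_sample`. The zero displacement used below is a stand-in for the transformer's output; none of this reproduces DocTr itself.

```python
# Minimal sketch of the unwarping step only (assumed decoder output): apply a
# pixel-wise displacement field to the distorted document via grid_sample.
import torch
import torch.nn.functional as F

def unwarp(image, displacement):
    """image: (1, 3, H, W); displacement: (1, 2, H, W) offsets in normalized coordinates."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)            # identity sampling grid
    grid = base_grid + displacement.permute(0, 2, 3, 1)               # add predicted offsets
    return F.grid_sample(image, grid, align_corners=True)

image = torch.rand(1, 3, 128, 96)             # toy distorted document image
displacement = torch.zeros(1, 2, 128, 96)     # stand-in for the transformer's prediction
print(unwarp(image, displacement).shape)       # torch.Size([1, 3, 128, 96])
```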
In pixel-based reinforcement learning (RL), the states are raw video frames, which are mapped into a hidden representation before feeding to a policy network. To improve the sample efficiency of state representation learning, the most prominent recent work is based on contrastive unsupervised representation learning. Witnessing that consecutive frames in a game are highly correlated, to further improve data efficiency, we propose a new algorithm, i.e., masked contrastive representation learning for RL (M-CURL), which takes the correlation among consecutive inputs into consideration. In our architecture,...
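The masked-contrastive idea can be illustrated with a minimal sketch: mask a fraction of the frame embeddings in a stack, encode the masked stack, and pull the masked positions toward momentum-encoded targets with an InfoNCE loss. The linear encoders, shapes, and masking ratio below are assumptions, not the M-CURL architecture.

```python
# Hedged sketch of masked contrastive learning over consecutive frame embeddings.
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.1):
    """query, key: (N, D); positives are matching rows, negatives are all other rows."""
    query, key = F.normalize(query, dim=1), F.normalize(key, dim=1)
    logits = query @ key.T / temperature            # (N, N) similarity matrix
    labels = torch.arange(len(query))               # i-th query matches i-th key
    return F.cross_entropy(logits, labels)

frames = torch.randn(16, 8, 64)            # batch of 16 stacks of 8 frame embeddings
mask = torch.rand(16, 8) < 0.3             # mask roughly 30% of the frames
masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)

online = torch.nn.Linear(64, 64)           # stand-in for the online encoder (e.g. a Transformer)
target = torch.nn.Linear(64, 64)           # stand-in for the momentum/target encoder
with torch.no_grad():
    keys = target(frames)[mask]            # targets come from the unmasked frames
queries = online(masked)[mask]             # predictions at the masked positions
print(info_nce(queries, keys))             # contrastive loss to minimize
```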
Recent progress on weakly supervised object detection (WSOD) is characterized by formulating WSOD as a Multiple Instance Learning (MIL) problem and performing online refinement with the region proposals selected from MIL. However, MIL inclines to select the most discriminative part rather than the entire instance as the top-scoring proposals, which leads to weak localization capability for detectors. We attribute this to the limited intra-class diversity within a single image. Specifically, due to the lack of annotated bounding...
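For context on the MIL formulation mentioned above, here is a minimal two-stream MIL sketch in the style of WSDDN: proposal scores are aggregated into image-level class scores, so training needs only image-level labels. The branch names and sizes are illustrative; this is the common baseline formulation, not the refinement method proposed in the paper.

```python
# Minimal two-stream MIL sketch for WSOD (assumed baseline formulation).
import torch
import torch.nn as nn

num_classes, num_proposals = 20, 300
proposal_feats = torch.randn(num_proposals, 512)        # features of region proposals
image_labels = torch.zeros(num_classes)                  # image-level labels only
image_labels[[3, 7]] = 1.0                               # e.g. classes 3 and 7 are present

cls_branch = nn.Linear(512, num_classes)                 # which class a proposal looks like
det_branch = nn.Linear(512, num_classes)                 # how much a proposal contributes

cls = torch.softmax(cls_branch(proposal_feats), dim=1)   # softmax over classes per proposal
det = torch.softmax(det_branch(proposal_feats), dim=0)   # softmax over proposals per class
image_scores = (cls * det).sum(dim=0).clamp(1e-6, 1 - 1e-6)  # (num_classes,) image-level scores

loss = nn.functional.binary_cross_entropy(image_scores, image_labels)
print(loss.item())
# Top-scoring proposals per present class act as pseudo boxes, which is where the
# "most discriminative part" bias described above comes from.
```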
LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from the fact that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. In this work, we introduce a bi-directional...
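A schematic way to realize bi-directional interaction is cross-attention in both directions, as in the assumed sketch below: LiDAR tokens query Radar features and vice versa before a detection head consumes either stream. `BiDirectionalFusion` and the token shapes are hypothetical, not the paper's network.

```python
# Schematic sketch of bi-directional LiDAR-Radar feature fusion via cross-attention.
import torch
import torch.nn as nn

class BiDirectionalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.radar_to_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_to_radar = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lidar_tokens, radar_tokens):
        # LiDAR queries attend to Radar features (velocity hints, long range)
        lidar_enh, _ = self.radar_to_lidar(lidar_tokens, radar_tokens, radar_tokens)
        # Radar queries attend to LiDAR features (shape and height information)
        radar_enh, _ = self.lidar_to_radar(radar_tokens, lidar_tokens, lidar_tokens)
        return lidar_tokens + lidar_enh, radar_tokens + radar_enh

fusion = BiDirectionalFusion()
lidar = torch.randn(2, 1024, 128)   # flattened LiDAR BEV tokens
radar = torch.randn(2, 256, 128)    # sparser Radar tokens
l_out, r_out = fusion(lidar, radar)
print(l_out.shape, r_out.shape)     # torch.Size([2, 1024, 128]) torch.Size([2, 256, 128])
```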
A recent trend is to combine multiple sensors (i.e., cameras, LiDARs and millimeter-wave Radars) to achieve robust multi-modal perception for autonomous systems such as self-driving vehicles. Although quite a few sensor fusion algorithms have been proposed, some of which are top-ranked on various leaderboards, a systematic study of how to integrate these three types of sensors to develop effective 3D object...
Atom-substituting doping by atmospheric-pressure chemical vapor deposition (AP-CVD) is an effective and promising strategy for changing the properties of two-dimensional transition-metal dichalcogenides (2D TMDs). In this paper, we successfully grew V-doped MoSe2 films. The photoluminescence (PL) spectra gradually red-shifted with increasing doping concentration, the X-ray photoelectron spectroscopy (XPS) peaks shifted toward a lower binding energy after doping, and the change of polarity before and after doping can be seen in the transfer...
Two-dimensional (2D) WSe2 has received increasing attention due to its unique optical properties and bipolar behavior. Several WSe2-based heterojunctions exhibit bidirectional rectification characteristics, but most devices have a low rectification ratio. In this work, the Bi2O2Se/WSe2 heterojunction prepared by us has a type-II band alignment, which can vastly suppress the channel current through the interface barrier, so that the device has a large rectification ratio of about 10⁵. Meanwhile, under different gate voltage modulations,...
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on...