- Advanced Neural Network Applications
- Computer Graphics and Visualization Techniques
- 3D Shape Modeling and Analysis
- Domain Adaptation and Few-Shot Learning
- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Topic Modeling
- Spectroscopy and Chemometric Analyses
- Prenatal Screening and Diagnostics
- Remote Sensing and Land Use
- Advanced Image and Video Retrieval Techniques
- Video Surveillance and Tracking Methods
- Natural Language Processing Techniques
- Image Processing and 3D Reconstruction
- Industrial Vision Systems and Defect Detection
- 3D Surveying and Cultural Heritage
- Fetal and Pediatric Neurological Disorders
- Explainable Artificial Intelligence (XAI)
- Traditional Chinese Medicine Analysis
- Advanced Chemical Sensor Technologies
- Generative Adversarial Networks and Image Synthesis
- Advanced Image Processing Techniques
- Congenital Diaphragmatic Hernia Studies
- Text and Document Classification Technologies
- Analytical Chemistry and Sensors
Peking University
2019-2024
Sun Yat-sen University
2024
Children’s Hospital of Fudan University Xiamen Branch
2017-2024
King University
2023-2024
North China University of Water Resources and Electric Power
2023
Microsoft Research (India)
2023
Carnegie Mellon University
2023
ETH Zurich
2023
Wuhan University of Technology
2021
Hunan Normal University
2021
In this paper, we study the semi-supervised semantic segmentation problem via exploring both labeled data and extra unlabeled data. We propose a novel consistency regularization approach, called cross pseudo supervision (CPS). Our approach imposes consistency on two networks perturbed with different initialization for the same input image. The one-hot label map, output from one network, is used to supervise the other network with the standard cross-entropy loss, and vice versa. CPS has two roles: encourage high similarity...
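The cross supervision described in the abstract above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `cross_entropy`, `argmax`, `cps_loss`) and the list-of-logits input layout are assumptions made for the example, and a real training setup would compute this on tensors and treat the pseudo labels as constants.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Standard cross-entropy of one pixel against a hard (one-hot) label."""
    return -math.log(softmax(logits)[label])

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def cps_loss(logits_a, logits_b):
    """Hypothetical helper: cross pseudo supervision over a list of pixels.

    logits_a, logits_b: per-pixel class-logit lists from the two networks.
    Each network's hard pseudo label supervises the other network.
    """
    total = 0.0
    for la, lb in zip(logits_a, logits_b):
        pseudo_a = argmax(la)  # one-hot label from network A
        pseudo_b = argmax(lb)  # one-hot label from network B
        total += cross_entropy(lb, pseudo_a) + cross_entropy(la, pseudo_b)
    return total / len(logits_a)

agree = cps_loss([[5.0, 0.0]], [[5.0, 0.0]])
disagree = cps_loss([[5.0, 0.0]], [[0.0, 5.0]])
print(agree < disagree)
```

When the two networks agree confidently the loss is small, and it grows when their predictions conflict, which is the consistency pressure the abstract refers to.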
The recently-developed DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by the observation that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from...
In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of the ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, the similarity relation between each...
Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. While the naive one-to-many assignment does not work for DETR, it remains challenging to apply one-to-many assignment to DETR training. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that introduces a group-wise way for one-to-many assignment. This approach involves using...
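The one-to-one assignment mentioned in the abstract above can be illustrated with a small brute-force matcher. This is a sketch under stated assumptions: the function name `one_to_one_assign`, the cost matrix values, and the exhaustive search are all hypothetical choices for the example (practical detectors typically use an efficient Hungarian-style solver), and it assumes there are at least as many predictions as ground-truth objects.

```python
from itertools import permutations

def one_to_one_assign(cost):
    """Hypothetical helper: minimum-cost one-to-one assignment.

    cost[g][p] is the matching cost between ground-truth object g and
    prediction p. Returns a list whose entry g is the index of the single
    prediction assigned to g; no prediction is used twice.
    Exhaustive search, so only suitable for tiny examples.
    """
    num_gt = len(cost)
    num_pred = len(cost[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(num_pred), num_gt):
        total = sum(cost[g][perm[g]] for g in range(num_gt))
        if total < best_cost:
            best, best_cost = list(perm), total
    return best

# Both ground-truth objects prefer prediction 0, but only one may take it.
cost = [
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.3],
]
print(one_to_one_assign(cost))
```

Because each prediction can be matched at most once, the second object falls back to its next-best prediction, which is what removes the need for duplicate suppression afterwards.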
Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework...
Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction. However, their implicit volumetric representations differ significantly from the widely-adopted polygonal meshes and lack support from common software and hardware, making their rendering and manipulation inefficient. To overcome this limitation, we present a novel framework that generates textured surface meshes from images. Our approach begins by efficiently initializing the geometry and view-dependency decomposed appearance...
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation. Since the computational cost generally increases explosively along with the growth of resolution, most current state-of-the-art methods have to tailor their framework into a low-resolution representation at the sacrifice of detail prediction. Thus, resolution becomes one of the crucial difficulties that lead to the performance bottleneck. In this...
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to a larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Guided depth super-resolution is a practical task where a low-resolution and noisy input depth map is restored to a high-resolution version, with the help of a high-resolution RGB guide image. Existing methods usually view this task either as a generalized guided filtering problem that relies on designing explicit filters and objective functions, or as a dense regression problem that directly predicts the target image via deep neural networks. These methods suffer from either limited model capability or limited interpretability. Inspired by the recent progress in implicit neural representation, we...
Neural Radiance Field (NeRF) has emerged as a compelling method to represent 3D objects and scenes for photo-realistic rendering. However, its implicit representation makes the models harder to manipulate than an explicit mesh representation. Several recent advances in NeRF manipulation are usually restricted by a shared renderer network, or suffer from large model size. To circumvent this hurdle, in this paper, we present an explicit neural field representation that enables efficient and convenient manipulation of models. To achieve this goal,...
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as a cross-modal feature fusion to obtain better feature representations and achieve more accurate segmentation. This, however, may not lead to satisfactory results, as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper. In this paper, we...
This paper studies the 3D instance segmentation problem, which has a variety of real-world applications such as robotics and augmented reality. Since the surroundings of 3D objects are of high complexity, separating different objects is very difficult. To address this challenging problem, we propose a novel framework to group and refine the 3D instances. In practice, we first learn an offset vector for each point and shift it to its predicted instance center. To better group these points, we propose a Hierarchical Point Grouping algorithm to merge the centrally aggregated points...
In this paper, we revisit Semantic Scene Completion (SSC), a useful task to predict the semantic and occupancy representation of 3D scenes. A number of methods for this task are based on voxelized scene representations. Although voxel representations keep the local structures of the scene, these methods suffer from heavy computation redundancy due to the existence of visible empty voxels when the network goes deeper. To address this dilemma, we propose our novel point-voxel aggregation network for this task. We first transfer the voxelized scenes to point clouds by...
Convolutional neural networks (CNN) have achieved great success in RGB semantic segmentation. RGB-D images provide additional depth information, which can improve segmentation performance. To take full advantage of the 3D geometry relations provided by RGB-D images, in this paper, we propose 2.5D convolution, which mimics one 3D convolution kernel with several masked 2D convolution kernels. Our 2.5D convolution can effectively process spatial relations between pixels in a manner similar to 3D convolution while still sampling pixels on the 2D plane, and thus saves computational cost. And...
In RGB-D semantic segmentation tasks, it has been shown that HHA embeddings effectively encode rich depth features and that using them together with RGB images can improve performance. In this paper, we propose a novel method to integrate RGB and HHA features. By replacing the identity mappings in a ResNet-based two-stream network with idempotent mappings, we couple the originally separated two branches and mix features from the two modalities, while still keeping the good information flow nature of ResNet. Moreover, our method does not bring any additional blocks...
In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence that presented box queries (originally called spatial queries) for internal decoder layers, we reformulate the object query into the format of the box query, which is a composition of the embeddings of the reference point and the transformation of the box with respect to the reference point. This...
We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves 64.5 mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard...
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates...
While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically,...