Xiaokang Chen

ORCID: 0000-0002-6188-5821
Research Areas
  • Advanced Neural Network Applications
  • Computer Graphics and Visualization Techniques
  • 3D Shape Modeling and Analysis
  • Domain Adaptation and Few-Shot Learning
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • Topic Modeling
  • Spectroscopy and Chemometric Analyses
  • Prenatal Screening and Diagnostics
  • Remote Sensing and Land Use
  • Advanced Image and Video Retrieval Techniques
  • Video Surveillance and Tracking Methods
  • Natural Language Processing Techniques
  • Image Processing and 3D Reconstruction
  • Industrial Vision Systems and Defect Detection
  • 3D Surveying and Cultural Heritage
  • Fetal and Pediatric Neurological Disorders
  • Explainable Artificial Intelligence (XAI)
  • Traditional Chinese Medicine Analysis
  • Advanced Chemical Sensor Technologies
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image Processing Techniques
  • Congenital Diaphragmatic Hernia Studies
  • Text and Document Classification Technologies
  • Analytical Chemistry and Sensors

Peking University
2019-2024

Sun Yat-sen University
2024

Children’s Hospital of Fudan University Xiamen Branch
2017-2024

King University
2023-2024

North China University of Water Resources and Electric Power
2023

Microsoft Research (India)
2023

Carnegie Mellon University
2023

ETH Zurich
2023

Wuhan University of Technology
2021

Hunan Normal University
2021

In this paper, we study the semi-supervised semantic segmentation problem via exploring both labeled data and extra unlabeled data. We propose a novel consistency regularization approach, called cross pseudo supervision (CPS). Our approach imposes the consistency on two segmentation networks perturbed with different initialization for the same input image. The pseudo one-hot label map, output from one network, is used to supervise the other network with the standard cross-entropy loss, and vice versa. The CPS consistency has two roles: encourage high similarity...

10.1109/cvpr46437.2021.00264 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
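The CPS scheme described above can be illustrated with a minimal NumPy sketch (not the authors' implementation): each network's hard pseudo-label map supervises the other network through a standard cross-entropy loss. The helper names and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # mean cross-entropy of per-pixel logits (N, C) against integer labels (N,)
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def cps_loss(logits_a, logits_b):
    # cross pseudo supervision: each network is trained with the hard
    # pseudo-labels (argmax) produced by the other, perturbed network
    pseudo_a = logits_a.argmax(axis=-1)  # one-hot label map from network A
    pseudo_b = logits_b.argmax(axis=-1)  # one-hot label map from network B
    # in real training, gradients flow only into the supervised network
    return cross_entropy(logits_a, pseudo_b) + cross_entropy(logits_b, pseudo_a)

rng = np.random.default_rng(0)
la = rng.normal(size=(64, 21))  # 64 pixels, 21 classes, network A
lb = rng.normal(size=(64, 21))  # same pixels, network B
loss = cps_loss(la, lb)
```

Minimizing this loss pulls the two perturbed networks toward agreeing predictions, which is exactly the consistency role the abstract names.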

The recently-developed DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by the fact that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from...

10.1109/iccv48922.2021.00363 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
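The decomposition at the heart of conditional cross-attention can be sketched in a few lines of NumPy (a simplification, not the paper's code): query and key each carry a content part and a spatial part, so the attention logits split into a content term plus a spatial term, and the spatial query derived from the reference point relieves the content query of the localization burden. All array names here are illustrative.

```python
import numpy as np

def conditional_cross_attention(content_q, spatial_q, content_k, spatial_k, v):
    # logits decompose into content-content plus spatial-spatial terms,
    # equivalent to dotting the concatenated [content; spatial] vectors
    logits = content_q @ content_k.T + spatial_q @ spatial_k.T  # (n_q, n_k)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                           # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
n_q, n_k, d = 4, 9, 8  # toy sizes: 4 queries, 9 keys, 8-dim embeddings
out = conditional_cross_attention(rng.normal(size=(n_q, d)), rng.normal(size=(n_q, d)),
                                  rng.normal(size=(n_k, d)), rng.normal(size=(n_k, d)),
                                  rng.normal(size=(n_k, d)))
```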

In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of the ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, we compute the relation between each...

10.48550/arxiv.1909.11065 preprint EN other-oa arXiv (Cornell University) 2019-01-01
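The three steps above can be condensed into a small NumPy sketch (a hedged illustration under simplified assumptions, not the paper's code): soft object regions come from a coarse segmentation, region representations are pixel-feature aggregates, and each pixel is augmented with its relations to all regions. Function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_contextual_repr(pixel_feats, region_logits):
    # pixel_feats: (N, D) pixel features; region_logits: (N, C) coarse class scores
    regions = softmax(region_logits, axis=0)        # soft object regions (per class)
    region_repr = regions.T @ pixel_feats           # (C, D): one vector per class
    relation = softmax(pixel_feats @ region_repr.T, axis=1)  # (N, C) pixel-region affinity
    context = relation @ region_repr                # (N, D) object-contextual feature
    return np.concatenate([pixel_feats, context], axis=1)   # augmented representation

rng = np.random.default_rng(0)
augmented = object_contextual_repr(rng.normal(size=(16, 8)), rng.normal(size=(16, 4)))
```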

Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. While the naive one-to-many assignment does not work for DETR, it remains challenging to apply one-to-many assignment in DETR training. In this paper, we introduce Group DETR, a simple yet efficient training approach that introduces a group-wise way for one-to-many assignment. This involves using...

10.1109/iccv51070.2023.00610 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
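The group-wise assignment idea can be sketched as follows (a simplification: a greedy matcher stands in for the Hungarian algorithm, and all sizes are toy values): the queries are split into groups, each group is matched one-to-one with the ground truths independently, so every ground truth ends up supervising one query per group.

```python
import numpy as np

def one_to_one_assign(cost):
    # greedy stand-in for Hungarian matching: each ground truth takes its
    # cheapest unused query (illustration only, not globally optimal)
    used, match = set(), {}
    for g in np.argsort(cost.min(axis=1)):          # match easiest gt first
        q = next(int(q) for q in np.argsort(cost[g]) if q not in used)
        used.add(q)
        match[int(g)] = q
    return match

def group_assign(cost, n_groups):
    # Group DETR-style sketch: split the query set into n_groups groups and
    # run an independent one-to-one matching inside each group
    n_gt, n_q = cost.shape
    per = n_q // n_groups
    matches = []
    for k in range(n_groups):
        m = one_to_one_assign(cost[:, k * per:(k + 1) * per])
        matches.append({g: q + k * per for g, q in m.items()})  # global query ids
    return matches

rng = np.random.default_rng(0)
cost = rng.random((3, 12))                 # 3 ground truths, 12 queries
matches = group_assign(cost, n_groups=3)   # 3 groups of 4 queries each
```

Each ground truth is matched once per group, so it receives three supervision signals per image while each group internally stays one-to-one.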

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework...

10.48550/arxiv.2305.11175 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction. However, their implicit volumetric representations differ significantly from the widely-adopted polygonal meshes and lack support from common 3D software and hardware, making their rendering and manipulation inefficient. To overcome this limitation, we present a novel framework that generates textured surface meshes from images. Our approach begins by efficiently initializing the geometry and view-dependency decomposed appearance...

10.1109/iccv51070.2023.01626 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation. Since the computational cost generally increases explosively along with the growth of voxel resolution, most current state-of-the-arts have to tailor their framework into a low-resolution representation and sacrifice the detail prediction. Thus, voxel resolution becomes one of the crucial difficulties that lead to the performance bottleneck. In this...

10.1109/cvpr42600.2020.00425 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to a larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

10.48550/arxiv.2501.17811 preprint EN arXiv (Cornell University) 2025-01-29

Guided depth super-resolution is a practical task where a low-resolution and noisy input depth map is restored to a high-resolution version, with the help of an RGB guide image. Existing methods usually view this as a generalized guided filtering problem that relies on designing explicit filters and objective functions, or as a dense regression problem that directly predicts the target image via deep neural networks. These methods suffer from either limited model capability or poor interpretability. Inspired by the recent progress in implicit neural representation, we...

10.1145/3474085.3475584 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17

Neural Radiance Field (NeRF) has emerged as a compelling method to represent 3D objects and scenes for photo-realistic rendering. However, its implicit representation causes difficulty in manipulating the models, unlike an explicit mesh representation. Several recent advances in NeRF manipulation are usually restricted by a shared renderer network, or suffer from large model size. To circumvent the hurdle, in this paper, we present an explicit neural field representation that enables efficient and convenient manipulation of models. To achieve this goal,...

10.48550/arxiv.2205.14870 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion to obtain better feature representations and achieve more accurate segmentation. This, however, may not lead to satisfactory results, as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper. In this paper, we...

10.48550/arxiv.2007.09183 preprint EN other-oa arXiv (Cornell University) 2020-01-01

This paper studies the 3D instance segmentation problem, which has a variety of real-world applications such as robotics and augmented reality. Since the surroundings of objects are of high complexity, separating different objects is very difficult. To address this challenging problem, we propose a novel framework to group and refine the instances. In practice, we first learn an offset vector for each point to shift it to its predicted instance center. To better group these points, we propose a Hierarchical Point Grouping algorithm to merge the centrally aggregated points...

10.1109/icme52920.2022.9859996 article EN 2022 IEEE International Conference on Multimedia and Expo (ICME) 2022-07-18

We revisit Semantic Scene Completion (SSC), a useful task to predict the semantic and occupancy representation of 3D scenes, in this paper. A number of methods for this task are always based on voxelized scene representations. Although voxel representations keep the local structures of the scene, these methods suffer from heavy computation redundancy due to the existence of visible empty voxels when the network goes deeper. To address this dilemma, we propose a novel point-voxel aggregation network for this task. We first transfer the voxelized scenes to point clouds by...

10.1609/aaai.v36i2.20134 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28
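The voxel-to-point transfer step mentioned above can be sketched in NumPy (a minimal illustration under the assumption that only occupied voxels are kept, with hypothetical names): dropping the empty voxels and keeping occupied voxel centers as points avoids spending computation on empty space.

```python
import numpy as np

def voxels_to_points(grid, voxel_size=1.0):
    # keep only occupied voxels and use their centers as 3D points,
    # discarding the empty voxels that cause computation redundancy
    idx = np.argwhere(grid > 0)        # (M, 3) indices of occupied voxels
    return (idx + 0.5) * voxel_size    # voxel centers in metric coordinates

grid = np.zeros((4, 4, 4))
grid[0, 1, 2] = grid[3, 3, 3] = 1      # two occupied voxels out of 64
pts = voxels_to_points(grid, voxel_size=0.5)
```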

Convolutional neural networks (CNN) have achieved great success in RGB semantic segmentation. RGB-D images provide additional depth information, which can improve segmentation performance. To take full advantage of the 3D geometry relations provided by RGB-D images, in this paper, we propose 2.5D convolution, which mimics one 3D convolution kernel with several masked 2D convolution kernels. Our approach can effectively process the spatial relations between pixels in a manner similar to 3D convolution while still sampling on the 2D plane, and thus saves computational cost. And...

10.1109/icip.2019.8803757 article EN 2019 IEEE International Conference on Image Processing (ICIP) 2019-08-26
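A minimal NumPy sketch of the masked-kernel idea (a simplification of the paper's method; the binning rule and all names are assumptions): each 3x3 neighbor contributes through the 2D kernel whose depth plane it falls into, so K masked 2D kernels together emulate one 3D kernel while still sampling on the image plane.

```python
import numpy as np

def conv2_5d(feat, depth, kernels, grid=0.1):
    # kernels: list of K 3x3 kernels, one per depth plane; each neighbor is
    # routed to the kernel matching its relative depth (binned by `grid`)
    K = len(kernels)
    H, W = depth.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    rel = (depth[i + di, j + dj] - depth[i, j]) / grid
                    b = int(np.clip(np.rint(rel), -(K // 2), K // 2)) + K // 2
                    acc += kernels[b][di + 1, dj + 1] * feat[i + di, j + dj]
            out[i - 1, j - 1] = acc
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(5, 5))
depth = np.zeros((5, 5))                   # flat depth: only the middle kernel fires
kernels = [rng.normal(size=(3, 3)) for _ in range(3)]
out = conv2_5d(feat, depth, kernels)
```

With a flat depth map every neighbor lands in the central plane, so the operation reduces to an ordinary 2D convolution with the middle kernel, which matches the "sampling on the plane" claim in the abstract.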

In RGB-D semantic segmentation tasks, it has been shown that HHA embeddings effectively encode rich depth features, and using them together with RGB images can improve segmentation performance. In this paper, we propose a novel method to integrate RGB and depth features. By replacing the identity mappings in a ResNet-based two-stream network with idempotent mappings, we couple the originally separated two branches and mix the features from the two modalities, while still keeping the good information flow nature of ResNet. Moreover, our method does not bring any additional blocks...

10.1109/icip.2019.8803146 article EN 2019 IEEE International Conference on Image Processing (ICIP) 2019-08-26

In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence, which presented box queries (originally called spatial queries) for the internal decoder layers, we reformulate the object query into the format of a box query, which is a composition of the embeddings of the reference point and the transformation of the box with respect to the reference point. This...

10.48550/arxiv.2207.08914 preprint EN cc-by arXiv (Cornell University) 2022-01-01

We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge, a DETR variant DINO, and an efficient DETR training method Group DETR. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining and finetuning the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves 64.5 mAP on COCO test-dev, and establishes a new SoTA on the leaderboard...

10.48550/arxiv.2211.03594 preprint EN other-oa arXiv (Cornell University) 2022-01-01

This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates...

10.48550/arxiv.2305.16233 preprint EN cc-by arXiv (Cornell University) 2023-01-01

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits and faster convergence by leveraging the recent grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically,...

10.48550/arxiv.2211.12368 preprint EN other-oa arXiv (Cornell University) 2022-01-01