Donglai Xiang

ORCID: 0000-0002-6487-1935
Research Areas
  • Human Pose and Action Recognition
  • Advanced Vision and Imaging
  • 3D Shape Modeling and Analysis
  • Computer Graphics and Visualization Techniques
  • Face recognition and analysis
  • Human Motion and Animation
  • Video Surveillance and Tracking Methods
  • Advanced Image and Video Retrieval Techniques
  • Image Processing Techniques and Applications
  • Advanced Image Processing Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Virtual Reality Applications and Impacts
  • Handwritten Text Recognition Techniques
  • Advanced Neural Network Applications
  • Multimodal Machine Learning Applications
  • Anatomy and Medical Technology
  • Prosthetics and Rehabilitation Robotics
  • Energy Harvesting in Wireless Networks
  • Face and Expression Recognition
  • Advanced Memory and Neural Computing
  • Data Visualization and Analytics
  • Diabetic Foot Ulcer Assessment and Management
  • Radiation Effects in Electronics
  • Natural Language Processing Techniques
  • 3D Surveying and Cultural Heritage

Nvidia (United States)
2024

Carnegie Mellon University
2019-2024

Meta (Israel)
2021

Tsinghua University
2016-2017

National University of Singapore
2015-2016

We present the first method to capture the 3D total motion of a target person from monocular view input. Given an image or video, our method reconstructs the body, face, and fingers represented by a deformable mesh model. We use an efficient representation called Part Orientation Fields (POFs), which encode the orientations of all body parts in a common 2D space. POFs are predicted by a Fully Convolutional Network, along with joint confidence maps. To train the network, we collect a new human motion dataset capturing the diverse motion of 40 subjects in a multiview...

10.1109/cvpr.2019.01122 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
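
The Part Orientation Fields above can be pictured as storing a body part's 3D unit direction at every pixel the part covers in the 2D image. Below is a minimal NumPy sketch of that encoding, assuming a hypothetical per-part pixel mask and known parent/child joint positions; it is only an illustration of the representation, not the paper's network targets.

```python
import numpy as np

def encode_pof(part_mask: np.ndarray, joint_parent_3d: np.ndarray,
               joint_child_3d: np.ndarray) -> np.ndarray:
    """Encode one body part's 3D orientation into a 3-channel 2D map.

    part_mask:        (H, W) boolean mask of pixels covered by the part.
    joint_parent_3d:  (3,) 3D position of the part's parent joint.
    joint_child_3d:   (3,) 3D position of the part's child joint.
    Returns a (H, W, 3) map storing the part's unit orientation vector at
    every pixel inside the mask and zeros elsewhere.
    """
    direction = joint_child_3d - joint_parent_3d
    direction = direction / (np.linalg.norm(direction) + 1e-8)

    h, w = part_mask.shape
    pof = np.zeros((h, w, 3), dtype=np.float32)
    pof[part_mask] = direction  # broadcast the unit vector over the support
    return pof

# Toy usage: a small "limb" occupying the top-left corner of the image.
mask = np.zeros((64, 64), dtype=bool)
mask[:4, :4] = True
pof = encode_pof(mask, np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0]))
print(pof.shape, pof[0, 0])  # (64, 64, 3), roughly [0.577, 0.577, 0.577]
```

In the actual method these maps are regressed by a Fully Convolutional Network together with joint confidence maps rather than constructed from known 3D joints.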

Semantic object parsing is a fundamental task for understanding objects in detail in the computer vision community, where incorporating multi-level contextual information is critical for achieving such fine-grained pixel-level recognition. Prior methods often leverage this information through post-processing of predicted confidence maps. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into...

10.1109/cvpr.2016.347 article EN 2016-06-01

We present the first single-network approach for 2D whole-body pose estimation, which entails the simultaneous localization of body, face, hands, and feet keypoints. Due to the bottom-up formulation, our method maintains constant real-time performance regardless of the number of people in the image. The network is trained in a single stage using multi-task learning, through an improved architecture that can handle scale differences between body/foot and face/hand keypoints. Our approach considerably improves upon...

10.1109/iccv.2019.00708 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
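
The multi-task training described above can be illustrated as a weighted sum of per-group heatmap regression losses. The sketch below is a generic stand-in rather than the paper's exact loss; the group names, keypoint counts, and weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def whole_body_heatmap_loss(pred, target, weights=None):
    """Multi-task heatmap regression loss summed over keypoint groups.

    pred / target: dicts mapping a part group (e.g. "body", "foot", "face",
    "hand") to (B, K, H, W) confidence-map tensors. Optional weights can
    rebalance the groups, since face/hand keypoints occupy a much smaller
    image area than body keypoints.
    """
    weights = weights or {k: 1.0 for k in pred}
    return sum(weights[k] * F.mse_loss(pred[k], target[k]) for k in pred)

# Toy usage with random maps standing in for two of the network branches.
pred = {"body": torch.rand(2, 25, 64, 64, requires_grad=True),
        "hand": torch.rand(2, 42, 64, 64, requires_grad=True)}
target = {k: torch.rand_like(v) for k, v in pred.items()}
whole_body_heatmap_loss(pred, target).backward()
```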

The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person---whose pose we can directly observe---as a signal inherently linked to the pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series...

10.1109/cvpr42600.2020.00991 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

We present a method to capture temporally coherent dynamic clothing deformation from monocular RGB video input. In contrast to the existing literature, our method does not require a pre-scanned personalized mesh template, and thus can be applied to in-the-wild videos. To constrain the output to a valid space, we build statistical models for three types of clothing: T-shirt, short pants and long pants. A differentiable renderer is utilized to align the captured shapes to the input frames by minimizing the difference in both silhouette...

10.1109/3dv50981.2020.00042 article EN 2020 International Conference on 3D Vision (3DV) 2020-11-01
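
The silhouette term of a differentiable-rendering objective like the one described above can be sketched as a soft IoU between a rendered clothing mask and the observed segmentation. Here `pred_silhouette` is assumed to come from some differentiable renderer (not shown), and the loss form is an illustrative choice rather than the paper's exact formulation.

```python
import torch

def silhouette_loss(pred_silhouette: torch.Tensor,
                    target_mask: torch.Tensor) -> torch.Tensor:
    """Soft IoU-style loss between a rendered silhouette and an observed mask.

    pred_silhouette: (H, W) differentiable soft mask in [0, 1], produced by a
                     differentiable renderer from the current clothing mesh.
    target_mask:     (H, W) binary clothing segmentation of the input frame.
    """
    intersection = (pred_silhouette * target_mask).sum()
    union = (pred_silhouette + target_mask - pred_silhouette * target_mask).sum()
    return 1.0 - intersection / (union + 1e-6)

# Toy usage with random tensors standing in for the renderer output and mask.
pred = torch.rand(256, 256, requires_grad=True)
target = (torch.rand(256, 256) > 0.5).float()
loss = silhouette_loss(pred, target)
loss.backward()  # in the real pipeline, gradients reach the mesh through the renderer
print(float(loss))
```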

Despite recent progress in developing animatable full-body avatars, realistic modeling of clothing - one of the core aspects of human self-expression - remains an open challenge. State-of-the-art physical simulation methods can generate realistically behaving clothing geometry at interactive rates. Modeling photorealistic appearance, however, usually requires physically-based rendering, which is too expensive for interactive applications. On the other hand, data-driven deep appearance models are capable of efficiently producing...

10.1145/3550454.3555456 article EN ACM Transactions on Graphics 2022-11-30

We have recently seen great progress in building photorealistic animatable full-body codec avatars, but generating high-fidelity animation of clothing is still difficult. To address these difficulties, we propose a method to build an animatable clothed body avatar with an explicit representation of the clothing on the upper body from multi-view captured videos. We use a two-layer mesh representation to register each 3D scan separately with the body and clothing templates. In order to improve photometric correspondence across different frames, texture alignment is then...

10.1145/3478513.3480545 article EN ACM Transactions on Graphics 2021-12-01

We study the problem of single-image depth estimation for images in the wild. We collect human-annotated surface normals and use them to help train a neural network that directly predicts pixel-wise depth. We propose two novel loss functions for training with surface normal annotations. Experiments on NYU Depth, KITTI, and our own dataset demonstrate that our approach can significantly improve the quality of depth estimation.

10.1109/iccv.2017.173 article EN 2017-10-01
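
One plausible way to couple pixel-wise depth predictions to surface-normal annotations is to derive normals from the predicted depth by finite differences and penalize their angular disagreement with the annotations. The sketch below is a generic version of this idea, not necessarily either of the paper's two loss functions.

```python
import torch
import torch.nn.functional as F

def depth_to_normals(depth: torch.Tensor) -> torch.Tensor:
    """Approximate per-pixel surface normals from a (B, 1, H, W) depth map
    using finite differences, under a simplified image-plane parameterization."""
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # (B, 1, H, W-1)
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # (B, 1, H-1, W)
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))                 # pad back to (H, W)
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))
    # The normal of a surface z = d(x, y) is proportional to (-dz/dx, -dz/dy, 1).
    normals = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    return F.normalize(normals, dim=1)

def normal_consistency_loss(pred_depth, gt_normals):
    """Penalize angular disagreement between normals derived from the
    predicted depth and annotated surface normals of shape (B, 3, H, W)."""
    pred_normals = depth_to_normals(pred_depth)
    cos = (pred_normals * F.normalize(gt_normals, dim=1)).sum(dim=1)
    return (1.0 - cos).mean()

# Toy usage with random tensors standing in for network output and annotations.
depth = torch.rand(2, 1, 32, 32, requires_grad=True)
normals = F.normalize(torch.rand(2, 3, 32, 32), dim=1)
normal_consistency_loss(depth, normals).backward()
```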

We propose a novel sparse constrained formulation and from it derive a real-time optimization method for 3D human pose and shape estimation. Our method, SCOPE (Sparse Constrained Optimization for Pose and shapE estimation), is orders of magnitude faster (avg. 4 ms convergence) than existing methods, while being mathematically equivalent to their dense unconstrained formulation under mild assumptions. We achieve this by exploiting the underlying sparsity and constraints of our formulation to efficiently compute the Gauss-Newton direction. We show...

10.1109/iccv48922.2021.01126 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
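
The central computational idea, exploiting sparsity when computing the Gauss-Newton direction, can be sketched with a generic sparse normal-equations solve. This is the textbook step, not SCOPE's specific constrained formulation; the damping term and toy Jacobian below are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gauss_newton_step(J: sp.csr_matrix, r: np.ndarray,
                      damping: float = 1e-6) -> np.ndarray:
    """One (damped) Gauss-Newton step for a least-squares problem
    min ||r(x)||^2, given the sparse Jacobian J = dr/dx at the current x.

    Keeping J sparse makes forming and solving the normal equations cheap
    even when the parameter vector (pose + shape) is large.
    """
    JtJ = (J.T @ J).tocsc()
    rhs = -J.T @ r
    H = JtJ + damping * sp.identity(JtJ.shape[0], format="csc")
    delta = spla.spsolve(H, rhs)
    return delta

# Toy usage: a random sparse Jacobian with ~1% non-zero entries.
rng = np.random.default_rng(0)
J = sp.random(500, 80, density=0.01, format="csr", random_state=0)
r = rng.standard_normal(500)
print(gauss_newton_step(J, r).shape)  # (80,)
```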

Semantic object parsing is a fundamental task for understanding objects in detail in the computer vision community, where incorporating multi-level contextual information is critical for achieving such fine-grained pixel-level recognition. Prior methods often leverage this information through post-processing of predicted confidence maps. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into...

10.48550/arxiv.1511.04510 preprint EN other-oa arXiv (Cornell University) 2015-01-01

We propose a novel multi-view camera pipeline for the reconstruction and registration of dynamic clothing. Our proposed method relies on a specifically designed pattern that allows precise video tracking in each view. We triangulate the tracked points and register the cloth surface at fine-grained geometric resolution with low localization error. Compared to state-of-the-art methods, our registration exhibits stable correspondence, following the same points on the deforming surface along the temporal sequence. As an application, we demonstrate how its use greatly improves...

10.1145/3550454.3555448 article EN ACM Transactions on Graphics 2022-11-30
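
The triangulation stage can be illustrated with the standard linear (DLT) method for intersecting rays from calibrated views. The camera matrices and observations below are synthetic, and the pipeline's pattern tracking and registration steps are not shown.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one tracked point.

    projections: list of 3x4 camera projection matrices P_i.
    points_2d:   list of corresponding (x, y) observations, one per view.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy usage: two views observing the point (0.1, 0.2, 3.0).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.1, 0.2, 3.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate_point([P1, P2], [x1, x2]))  # approximately [0.1, 0.2, 3.0]
```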

Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm to efficiently track the coarse garment shape given depth input. Given the tracking results, the input images are then remapped to texel-aligned features, which are fed into the drivable avatar models...

10.1145/3610548.3618136 preprint EN cc-by 2023-12-10
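
For context, the classical rigid point-to-point ICP step that an ICP-style tracker builds on looks roughly as follows. This is the textbook baseline, not the paper's neural N-ICP, and the brute-force matching is used only for brevity.

```python
import numpy as np

def icp_step(source: np.ndarray, target: np.ndarray):
    """One rigid point-to-point ICP iteration (Kabsch/Procrustes alignment).

    source, target: (N, 3) and (M, 3) point clouds. Correspondences are taken
    by brute-force nearest neighbors; a KD-tree would be used in practice.
    Returns the rotation R and translation t aligning source to target.
    """
    # Nearest-neighbor correspondences.
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    matched = target[d2.argmin(axis=1)]

    # Closed-form rigid alignment of source to its matched points.
    mu_s, mu_m = source.mean(0), matched.mean(0)
    H = (source - mu_s).T @ (matched - mu_m)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_m - R @ mu_s
    return R, t

# Toy usage: a full ICP loop would alternate icp_step and re-matching.
rng = np.random.default_rng(1)
src = rng.standard_normal((200, 3))
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
R, t = icp_step(src, src @ R_true.T + 0.05)
print(R.shape, np.round(np.linalg.det(R), 3))  # (3, 3), det = 1.0
```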

We present the first method to capture the 3D total motion of a target person from monocular view input. Given an image or video, our method reconstructs the body, face, and fingers represented by a deformable mesh model. We use an efficient representation called Part Orientation Fields (POFs), which encode the orientations of all body parts in a common 2D space. POFs are predicted by a Fully Convolutional Network (FCN), along with joint confidence maps. To train the network, we collect a new human motion dataset capturing the diverse motion of 40 subjects...

10.48550/arxiv.1812.01598 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. However, previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach to enabling accurate surface registration of texture-less clothes with large deformation. Our key idea is to effectively leverage a shape prior learned from pre-captured...

10.1109/3dv62453.2024.00042 article EN 2024 International Conference on 3D Vision (3DV) 2024-03-18

Virtual telepresence is the future of online communication. Clothing is an essential part of a person's identity and self-expression. Yet, ground truth data of registered clothes is currently unavailable in the required resolution and accuracy for training models for realistic cloth animation. Here, we propose an end-to-end pipeline for building drivable representations of clothing. The core of our approach is a multi-view patterned cloth tracking algorithm capable of capturing deformations with high accuracy. We further rely on high-quality...

10.48550/arxiv.2206.03373 preprint EN cc-by arXiv (Cornell University) 2022-01-01

This paper investigates local discriminant training and global optimization methods for Convolutional Neural Networks (CNNs) to improve recognition accuracy. For training, we propose to combine a triplet loss with softmax cross-entropy as the loss function. The triplet loss is incorporated into an additional fully-connected layer before the final layer of a CNN model. For optimization, we use a Conditional Random Field (CRF) to further utilize the pairwise distances of feature vectors trained with the triplet loss. Experiments with different models on...

10.1109/icdar.2017.70 article EN 2017-11-01
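
Combining a triplet loss on an embedding layer with softmax cross-entropy on the classifier can be sketched as below. The batch-hard mining, margin, and weighting factor are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def combined_loss(embeddings, logits, labels, margin=0.2, weight=0.1):
    """Softmax cross-entropy on the classifier logits plus a triplet loss on
    the embedding layer, combined with a weighting factor.

    embeddings: (B, D) features from the fully-connected layer carrying the
                triplet loss.
    logits:     (B, C) outputs of the final classification layer.
    labels:     (B,) integer class labels.
    """
    ce = F.cross_entropy(logits, labels)

    # Batch-hard triplet mining: hardest positive and hardest negative per anchor.
    dist = torch.cdist(embeddings, embeddings)          # (B, B) pairwise L2
    same = labels[:, None] == labels[None, :]
    pos_dist = (dist * same.float()).max(dim=1).values  # farthest same-class sample
    neg_dist = dist.masked_fill(same, float("inf")).min(dim=1).values
    triplet = F.relu(pos_dist - neg_dist + margin).mean()

    return ce + weight * triplet

# Toy usage with random features standing in for a CNN's penultimate layer.
emb = torch.randn(16, 64, requires_grad=True)
logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
combined_loss(emb, logits, labels).backward()
```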

Nonvolatile processors have manifested strong vitality in battery-less energy harvesting sensor nodes due to their characteristics of zero standby power, resilience to power failures, and fast read/write operations. However, I/O sensing operations cannot store system states after power-off, hence they are sensitive to the high switching overhead induced during power oscillation, which significantly degrades the performance. In this paper, we propose a novel performance-aware task scheduling technique...

10.1145/2897937.2898059 article EN 2016-05-25

Modeling and rendering photorealistic avatars is of crucial importance in many applications. Existing methods that build a 3D avatar from visual observations, however, struggle to reconstruct clothed humans. We introduce PhysAvatar, a novel framework that combines inverse rendering with physics to automatically estimate the shape and appearance of a human from multi-view video data, along with the physical parameters of the fabric of their clothes. For this purpose, we adopt a mesh-aligned 4D Gaussian technique for spatio-temporal mesh tracking...

10.48550/arxiv.2404.04421 preprint EN arXiv (Cornell University) 2024-04-05
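
The idea of recovering physical parameters by differentiating through a simulator can be illustrated with a toy one-dimensional damped spring standing in for a cloth simulator. This is only an analogy for the inverse-physics component, with made-up dynamics and constants.

```python
import torch

def simulate_spring(stiffness, steps=50, dt=0.01, mass=1.0, x0=1.0):
    """Tiny differentiable simulation of a damped spring, standing in for a
    cloth simulator: returns the trajectory of a single degree of freedom."""
    x = torch.tensor(x0)
    v = torch.tensor(0.0)
    traj = []
    for _ in range(steps):
        a = (-stiffness * x - 0.1 * v) / mass   # spring + damping force
        v = v + dt * a
        x = x + dt * v
        traj.append(x)
    return torch.stack(traj)

# "Observed" trajectory produced with a ground-truth stiffness of 4.0.
with torch.no_grad():
    observed = simulate_spring(torch.tensor(4.0))

# Estimate the stiffness by gradient descent through the simulator.
log_k = torch.tensor(0.0, requires_grad=True)   # optimize in log space
opt = torch.optim.Adam([log_k], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((simulate_spring(log_k.exp()) - observed) ** 2).mean()
    loss.backward()
    opt.step()
print(float(log_k.exp()))  # approaches the ground-truth value 4.0
```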

We introduce GenUSD, an end-to-end text-to-scene generation framework that transforms natural language queries into realistic 3D scenes, including objects and layouts. The process involves two main steps: 1) A Large Language Model (LLM) generates a scene layout hierarchically. It first proposes a high-level plan to decompose the scene into multiple functionally and spatially distinct subscenes. Then, for each subscene, the LLM specifies objects with detailed positions, poses, sizes, and descriptions. To manage complex object...

10.1145/3641520.3665306 article EN 2024-07-25
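
The hierarchical layout the LLM produces can be represented with a simple nested data structure. The schema below (subscenes containing object specifications) is a hypothetical illustration of the decomposition described above, not GenUSD's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectSpec:
    """One object proposed by the LLM for a subscene."""
    description: str          # e.g. "a wooden dining chair"
    position: List[float]     # [x, y, z] in subscene coordinates
    rotation_deg: float       # yaw around the up axis
    size: List[float]         # [width, depth, height] in meters

@dataclass
class SubScene:
    """A functionally and spatially distinct region of the full scene."""
    name: str                 # e.g. "dining area"
    origin: List[float]       # placement within the parent scene
    objects: List[ObjectSpec] = field(default_factory=list)

@dataclass
class SceneLayout:
    """Top level of the hierarchy: the LLM's high-level decomposition."""
    prompt: str
    subscenes: List[SubScene] = field(default_factory=list)

# Toy layout that a hierarchical LLM response might be parsed into.
layout = SceneLayout(
    prompt="a cozy studio apartment",
    subscenes=[
        SubScene(
            name="dining area",
            origin=[0.0, 0.0, 0.0],
            objects=[ObjectSpec("a small round table", [0.0, 0.0, 0.0], 0.0,
                                [0.9, 0.9, 0.75])],
        )
    ],
)
print(layout.subscenes[0].objects[0].description)
```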

We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no interactions, or requires calibrated multi-view captures or personalized template scans, which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic priors about articulated body shape (learned from large-scale training data)...

10.48550/arxiv.2409.20563 preprint EN arXiv (Cornell University) 2024-09-30