- Human Pose and Action Recognition
- Advanced Vision and Imaging
- 3D Shape Modeling and Analysis
- Computer Graphics and Visualization Techniques
- Face Recognition and Analysis
- Human Motion and Animation
- Video Surveillance and Tracking Methods
- Advanced Image and Video Retrieval Techniques
- Image Processing Techniques and Applications
- Advanced Image Processing Techniques
- Generative Adversarial Networks and Image Synthesis
- Virtual Reality Applications and Impacts
- Handwritten Text Recognition Techniques
- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Anatomy and Medical Technology
- Prosthetics and Rehabilitation Robotics
- Energy Harvesting in Wireless Networks
- Face and Expression Recognition
- Advanced Memory and Neural Computing
- Data Visualization and Analytics
- Diabetic Foot Ulcer Assessment and Management
- Radiation Effects in Electronics
- Natural Language Processing Techniques
- 3D Surveying and Cultural Heritage
Nvidia (United States), 2024
Carnegie Mellon University, 2019-2024
Meta (Israel), 2021
Tsinghua University, 2016-2017
National University of Singapore, 2015-2016
We present the first method to capture the 3D total motion of a target person from monocular view input. Given an image or video, our method reconstructs the body, face, and fingers represented by a deformable mesh model. We use an efficient representation called Part Orientation Fields (POFs), which encode the orientations of all body parts in a common 2D space. POFs are predicted by a Fully Convolutional Network, along with joint confidence maps. To train the network, we collect a new human motion dataset capturing the diverse motion of 40 subjects in a multiview...
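The Part Orientation Field idea can be sketched in a few lines: each limb's 3D orientation is a unit vector, written into every 2D pixel along that limb's projection. This is a minimal illustrative sketch, not the paper's implementation; the dict-based field and the helper names are assumptions.

```python
import math

def limb_orientation(parent3d, child3d):
    """Unit 3D direction of a limb: the value a POF stores for that limb."""
    d = [c - p for p, c in zip(parent3d, child3d)]
    n = math.sqrt(sum(x * x for x in d))
    return [x / n for x in d]

def rasterize_pof(pof, parent2d, child2d, orient):
    """Write the limb's 3D orientation into each 2D pixel along the limb's
    projection; `pof` is a dict (x, y) -> 3-vector standing in for the
    3-channel field image a real network would regress."""
    steps = max(abs(child2d[0] - parent2d[0]), abs(child2d[1] - parent2d[1]), 1)
    for i in range(steps + 1):
        t = i / steps
        x = round(parent2d[0] + t * (child2d[0] - parent2d[0]))
        y = round(parent2d[1] + t * (child2d[1] - parent2d[1]))
        pof[(x, y)] = orient
    return pof

# A forearm pointing right (+x) and toward the camera (+z), projected
# onto the horizontal pixel segment from (10, 20) to (14, 20).
orient = limb_orientation((0.0, 0.0, 0.0), (3.0, 0.0, 4.0))
field = rasterize_pof({}, (10, 20), (14, 20), orient)
```

At inference time the network would regress such fields and the orientations would be read back out at the detected joint locations.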
Semantic object parsing is a fundamental task for understanding objects in detail in the computer vision community, where incorporating multi-level contextual information is critical to achieving such fine-grained pixel-level recognition. Prior methods often leverage this information through post-processing of predicted confidence maps. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into...
We present the first single-network approach for 2D whole-body pose estimation, which entails the simultaneous localization of body, face, hand, and foot keypoints. Due to its bottom-up formulation, our method maintains constant real-time performance regardless of the number of people in the image. The network is trained in a single stage using multi-task learning, through an improved architecture that can handle scale differences between body/foot and face/hand keypoints. Our approach considerably improves upon...
The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person---whose pose we can directly observe---as a signal inherently linked to the pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series...
We present a method to capture temporally coherent dynamic clothing deformation from monocular RGB video input. In contrast to the existing literature, our method does not require a pre-scanned personalized mesh template, and thus can be applied to in-the-wild videos. To constrain the output to a valid clothing space, we build statistical deformation models for three types of clothing: T-shirt, short pants, and long pants. A differentiable renderer is utilized to align the captured shapes to the input frames by minimizing the difference in both silhouette,...
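The silhouette term in such an objective can be illustrated with a soft intersection-over-union between the rendered and observed masks. This is a common formulation; whether this paper uses exactly this form is an assumption.

```python
def silhouette_loss(pred, target):
    """1 - soft IoU between a rendered silhouette and the observed mask.
    Both are equal-size 2D lists with per-pixel coverage values in [0, 1];
    min/max act as a soft intersection/union, so the loss stays
    differentiable almost everywhere for gradient-based alignment."""
    inter = sum(min(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(max(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return (1.0 - inter / union) if union > 0 else 0.0

# Identical masks give zero loss; disjoint masks give loss 1.
mask = [[0.0, 1.0], [1.0, 0.0]]
```

In a full pipeline `pred` would come from a differentiable renderer so that the loss gradient reaches the mesh vertices.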
Despite recent progress in developing animatable full-body avatars, realistic modeling of clothing - one of the core aspects of human self-expression - remains an open challenge. State-of-the-art physical simulation methods can generate realistically behaving clothing geometry at interactive rates. Modeling photorealistic appearance, however, usually requires physically-based rendering, which is too expensive for interactive applications. On the other hand, data-driven deep appearance models are capable of efficiently producing...
We have recently seen great progress in building photorealistic animatable full-body codec avatars, but generating high-fidelity animation of clothing is still difficult. To address these difficulties, we propose a method to build an animatable clothed body avatar with an explicit representation of the clothing on the upper body from multi-view captured videos. We use a two-layer mesh representation to register each 3D scan separately with the body and clothing templates. In order to improve the photometric correspondence across different frames, texture alignment is then...
We study the problem of single-image depth estimation for images in the wild. We collect human-annotated surface normals and use them to help train a neural network that directly predicts pixel-wise depth. We propose two novel loss functions for training with surface normal annotations. Experiments on NYU Depth, KITTI, and our own dataset demonstrate that our approach can significantly improve the quality of the estimated depth.
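One way such a normal-based loss could look is sketched below, under the simplifying assumption of an orthographic depth map (the paper's exact formulation may differ): normals are recovered from the predicted depth by finite differences and compared against the annotations.

```python
import math

def depth_to_normal(depth, x, y):
    """Surface normal at pixel (x, y) from central differences on an
    orthographic depth map: n is proportional to (-dz/dx, -dz/dy, 1)."""
    dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
    dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
    n = (-dzdx, -dzdy, 1.0)
    s = math.sqrt(sum(c * c for c in n))
    return tuple(c / s for c in n)

def normal_loss(depth, annotations):
    """Mean (1 - cos angle) between normals derived from the predicted
    depth and human-annotated unit normals, given as (x, y) -> normal."""
    total = 0.0
    for (x, y), gt in annotations.items():
        n = depth_to_normal(depth, x, y)
        total += 1.0 - sum(u * v for u, v in zip(n, gt))
    return total / len(annotations)

# A constant-depth plane has normal (0, 0, 1), so this annotation costs 0.
flat = [[2.0] * 5 for _ in range(5)]
loss = normal_loss(flat, {(2, 2): (0.0, 0.0, 1.0)})
```

A perspective camera would additionally fold the intrinsics into the finite-difference step; the structure of the loss is unchanged.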
We propose a novel sparse constrained formulation and from it derive a real-time optimization method for 3D human pose and shape estimation. Our method, SCOPE (Sparse Constrained Optimization for Pose and shapE estimation), is orders of magnitude faster (avg. 4 ms convergence) than existing methods, while being mathematically equivalent to their dense unconstrained formulation under mild assumptions. We achieve this by exploiting the underlying sparsity and constraints of our formulation to efficiently compute the Gauss-Newton direction. We show...
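The Gauss-Newton direction at the heart of such solvers is easy to state on a toy problem. The sketch below fits y = a * exp(b * x) with undamped Gauss-Newton from a nearby initialization; SCOPE's contribution is computing this direction efficiently under sparsity and constraints, which this toy deliberately omits.

```python
import math

def gauss_newton_step(params, data):
    """One undamped Gauss-Newton step for fitting y = a * exp(b * x):
    accumulate J^T J and J^T r over the residuals, then solve the 2x2
    normal equations explicitly and add the resulting direction."""
    a, b = params
    JtJ = [[0.0, 0.0], [0.0, 0.0]]
    Jtr = [0.0, 0.0]
    for x, y in data:
        e = math.exp(b * x)
        r = y - a * e                    # residual of this observation
        J = (e, a * x * e)               # d(model)/da, d(model)/db
        for i in range(2):
            Jtr[i] += J[i] * r
            for j in range(2):
                JtJ[i][j] += J[i] * J[j]
    det = JtJ[0][0] * JtJ[1][1] - JtJ[0][1] * JtJ[1][0]
    da = (JtJ[1][1] * Jtr[0] - JtJ[0][1] * Jtr[1]) / det
    db = (JtJ[0][0] * Jtr[1] - JtJ[1][0] * Jtr[0]) / det
    return (a + da, b + db)

# Noise-free data from a = 2, b = 0.5; start nearby and iterate.
data = [(k / 4.0, 2.0 * math.exp(0.5 * k / 4.0)) for k in range(9)]
params = (1.8, 0.4)
for _ in range(20):
    params = gauss_newton_step(params, data)
```

For zero-residual problems like this one, Gauss-Newton converges quadratically near the optimum; real pose/shape solvers add damping and, as in SCOPE, structure-exploiting linear algebra.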
We propose a novel multi-view camera pipeline for the reconstruction and registration of dynamic clothing. Our proposed method relies on a specifically designed pattern that allows for precise video tracking in each camera view. We triangulate the tracked points and register the cloth surface at a fine-grained geometric resolution with low localization error. Compared to state-of-the-art methods, our registration exhibits stable correspondence, tracking the same points of the deforming surface along the temporal sequence. As an application, we demonstrate how our registration greatly improves...
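Triangulating a tracked pattern point from two calibrated views reduces to intersecting two viewing rays; a standard midpoint method (illustrative, not necessarily this pipeline's exact solver) is:

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint triangulation: given two rays (camera center c, direction d,
    not necessarily unit length), solve a 2x2 system for the closest point
    on each ray and return the midpoint between them."""
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    den = a * c - b * b                  # -> 0 when the rays are parallel
    s = (b * e - c * d) / den
    t = (a * e - b * d) / den
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]

# Two cameras one unit apart along x, both observing the point (0.5, 0, 2).
point = triangulate_midpoint([0.0, 0.0, 0.0], [0.5, 0.0, 2.0],
                             [1.0, 0.0, 0.0], [-0.5, 0.0, 2.0])
```

With many views, the same idea generalizes to a linear least-squares problem over all rays, which is what multi-camera capture systems typically solve.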
Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm to efficiently track the coarse garment shape given depth input. Given the tracking results, the input images are then remapped to texel-aligned features, which are fed into the drivable avatar models...
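For context, the sketch below shows one step of a classical ICP variant (translation-only, for brevity): nearest-neighbor matching followed by a mean-residual update. This classical sketch is mine, not the paper's algorithm; N-ICP, per the abstract, makes this kind of tracking loop neural and efficient.

```python
def icp_translation_step(source, target):
    """One translation-only ICP step: match each source point to its
    nearest target point, then shift all source points by the mean
    residual of the matches."""
    dist2 = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    shift = [0.0] * len(source[0])
    for s in source:
        t = min(target, key=lambda p: dist2(s, p))   # nearest neighbor
        for i in range(len(shift)):
            shift[i] += (t[i] - s[i]) / len(source)  # running mean residual
    return [[a + b for a, b in zip(s, shift)] for s in source]

# A 2D point pair offset by 0.5 along x moves halfway toward alignment
# in one step (the second source point is already matched exactly).
moved = icp_translation_step([[0.0, 0.0], [1.0, 0.0]],
                             [[1.0, 0.0], [2.0, 0.0]])
```

Full ICP additionally estimates a rotation per step (e.g. via the Kabsch algorithm) and iterates until the matches stop changing.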
Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. However, previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach to enabling accurate surface registration of texture-less clothes undergoing large deformation. Our key idea is to effectively leverage a shape prior learned from pre-captured...
Virtual telepresence is the future of online communication. Clothing is an essential part of a person's identity and self-expression. Yet, ground truth data of registered clothes is currently unavailable at the resolution and accuracy required for training models of realistic cloth animation. Here, we propose an end-to-end pipeline for building drivable representations of clothing. The core of our approach is a multi-view patterned tracking algorithm capable of capturing deformations with high accuracy. We further rely on high-quality...
This paper investigates local discriminant training and global optimization methods for Convolutional Neural Networks (CNNs) to improve their recognition accuracy. For training, we propose to combine a triplet loss and a softmax with cross-entropy loss as the objective function. The triplet loss is incorporated into an additional fully-connected layer before the final layer of a CNN model. For global optimization, we use a Conditional Random Field (CRF) to further utilize the pairwise distances of feature vectors trained with the triplet loss. Experiments with different models on...
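The combined objective described above can be sketched directly; the margin and weighting values below are assumptions, not the paper's hyperparameters.

```python
import math

def softmax_xent(logits, label):
    """Cross-entropy of a softmax over logits against an integer label,
    with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[label] / sum(exps))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: same-class pairs are pulled together,
    different-class pairs pushed at least `margin` apart."""
    d = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

def combined_loss(logits, label, anchor, positive, negative, weight=0.5):
    """Joint objective: classification term plus weighted metric term."""
    return softmax_xent(logits, label) + weight * triplet_loss(anchor, positive, negative)
```

In training, `anchor`/`positive`/`negative` would be feature vectors from the extra fully-connected layer, while `logits` come from the final classification layer.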
Nonvolatile processors have manifested strong vitality in battery-less energy-harvesting sensor nodes due to their characteristics of zero standby power, resilience to power failures, and fast read/write operations. However, I/O and sensing operations cannot store system states after power-off; hence they are sensitive to power oscillation, during which high switching overhead is induced, significantly degrading performance. In this paper, we propose a novel performance-aware task scheduling technique...
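The scheduling idea can be illustrated with a toy policy (an illustration of the concept only, not the paper's algorithm): since I/O tasks lose their state on a power failure, they are deferred to slots where the harvested supply voltage is above a safety threshold, while checkpointable compute tasks fill the unstable slots.

```python
def schedule(tasks, voltage, threshold=2.5):
    """Toy performance-aware scheduler: run one task per voltage sample,
    dispatching I/O tasks only when the supply is above `threshold` and
    filling the remaining slots with checkpointable compute tasks.
    Task format and threshold value are invented for illustration."""
    pending_io = [t for t in tasks if t["io"]]
    pending_cpu = [t for t in tasks if not t["io"]]
    order = []
    for v in voltage:                        # one task slot per sample
        if pending_io and v >= threshold:
            order.append(pending_io.pop(0)["name"])
        elif pending_cpu:
            order.append(pending_cpu.pop(0)["name"])
    return order

plan = schedule(
    [{"name": "sense", "io": True},
     {"name": "fft", "io": False},
     {"name": "tx", "io": True}],
    [2.0, 3.0, 2.0, 3.0],
)
```

A real scheduler would also model per-task energy budgets and the switching overhead the abstract mentions; this sketch only captures the deferral policy.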
Modeling and rendering photorealistic avatars is of crucial importance in many applications. Existing methods that build a 3D avatar from visual observations, however, struggle to reconstruct clothed humans. We introduce PhysAvatar, a novel framework that combines inverse rendering with physics to automatically estimate the shape and appearance of a human from multi-view video data, along with the physical parameters of the fabric of their clothes. For this purpose, we adopt a mesh-aligned 4D Gaussian technique for spatio-temporal mesh tracking...
We introduce GenUSD, an end-to-end text-to-scene generation framework that transforms natural language queries into realistic 3D scenes, including objects and layouts. The process involves two main steps: 1) A Large Language Model (LLM) generates a scene layout hierarchically. It first proposes a high-level plan to decompose the scene into multiple functionally and spatially distinct subscenes. Then, for each subscene, the LLM generates a detailed layout with object positions, poses, sizes, and descriptions. To manage complex object...
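The two-step hierarchical generation reads naturally as code. In this sketch the LLM is a stubbed callable with canned responses; the prompt schema and field names are invented for illustration and are not GenUSD's actual interface.

```python
def plan_scene(query, llm):
    """Two-step hierarchical layout in the style described above: the LLM
    first decomposes the scene into subscenes, then populates each one.
    `llm` is any callable(prompt) -> structured result; stubbed below."""
    subscenes = llm({"task": "decompose", "query": query})
    scene = {"query": query, "subscenes": []}
    for name in subscenes:
        objs = llm({"task": "populate", "subscene": name})
        scene["subscenes"].append({"name": name, "objects": objs})
    return scene

def stub_llm(prompt):
    # Canned responses standing in for a real LLM backend.
    if prompt["task"] == "decompose":
        return ["cooking area", "dining area"]
    return [{"name": "table", "position": [0, 0, 0], "size": [1.2, 0.8, 0.75]}]

layout = plan_scene("a small kitchen", stub_llm)
```

Keeping the LLM behind a plain callable makes the hierarchy testable without a model in the loop; a production pipeline would validate the returned layouts before instantiating 3D assets.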
We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans, which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic priors about articulated human shape (learned from large-scale training data)...