- Human Pose and Action Recognition
- Robot Manipulation and Learning
- 3D Shape Modeling and Analysis
- Reinforcement Learning in Robotics
- Advanced Vision and Imaging
- Human Motion and Animation
- Tactile and Sensory Interactions
- Computer Graphics and Visualization Techniques
- Hand Gesture Recognition Systems
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Image Processing Techniques and Applications
- Robotic Locomotion and Control
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Advanced Image Processing Techniques
- Teleoperation and Haptic Systems
- Advanced Algorithms and Applications
- Fluid Dynamics and Heat Transfer
- Fault Detection and Control Systems
- Virtual Reality Applications and Impacts
- Interactive and Immersive Displays
- Image Processing and 3D Reconstruction
- Advanced Sensor and Control Systems
- Astronomical Observations and Instrumentation
University of California, San Diego (2020-2024)
National Synchrotron Radiation Laboratory (2024)
University of Science and Technology of China (2024)
Northwestern Polytechnical University (2024)
UC San Diego Health System (2021-2023)
Yanshan University (2023)
China Aerodynamics Research and Development Center (2022)
Jiangnan University (2021)
Ocean University of China (2021)
Shanghai Jiao Tong University (2020)
How to represent an image? While the visual world is presented in a continuous manner, machines store and see images in a discrete way with 2D arrays of pixels. In this paper, we seek to learn a continuous representation for images. Inspired by recent progress in 3D reconstruction with implicit neural representation, we propose Local Implicit Image Function (LIIF), which takes an image coordinate and the deep features around that coordinate as inputs, and predicts the RGB value at the given coordinate as output. Since the coordinates are continuous, LIIF can be presented in arbitrary...
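The core LIIF idea — decode the RGB value at a continuous coordinate from a nearby latent feature plus the relative offset to it — can be sketched as below. This is a minimal illustration, not the trained model: the feature grid and MLP weights are random stand-ins, and the real method uses a learned encoder with a local ensemble rather than a single nearest-cell lookup.

```python
import numpy as np

def liif_query(feat_grid, coord, w1, b1, w2, b2):
    """Predict an RGB value at a continuous coordinate in [0, 1]^2.

    feat_grid: (H, W, C) feature map from an encoder (random here).
    coord:     (2,) continuous (y, x) query coordinate.
    The nearest latent code and the relative offset to its cell center
    are concatenated and fed to a small MLP decoder.
    """
    H, W, C = feat_grid.shape
    # nearest feature cell (cell centers sit at (i + 0.5) / H, (j + 0.5) / W)
    i = min(int(coord[0] * H), H - 1)
    j = min(int(coord[1] * W), W - 1)
    z = feat_grid[i, j]
    # relative coordinate: the continuous part of the decoder input
    rel = np.array([coord[0] - (i + 0.5) / H, coord[1] - (j + 0.5) / W])
    x = np.concatenate([z, rel])
    h = np.maximum(w1 @ x + b1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # RGB squashed into (0, 1)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))
w1, b1 = rng.normal(size=(32, 18)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(3, 32)) * 0.1, np.zeros(3)
rgb = liif_query(feat, np.array([0.37, 0.62]), w1, b1, w2, b2)
```

Because the query coordinate is continuous, the same decoder can be sampled on any grid, which is what enables arbitrary-resolution outputs.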
Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and the annotations are scarce, as even humans cannot directly label ground-truths perfectly. To tackle these challenges, we propose a unified framework for estimating the poses with semi-supervised learning. We build a joint learning framework where we perform explicit contextual reasoning between hand and object representations. Going beyond the limited annotations in a single image, we leverage spatial-temporal...
While predicting robot grasps with parallel jaw grippers has been well studied and widely applied in manipulation tasks, generating natural human grasps with a multi-finger hand remains a very challenging problem. In this paper, we propose to generate human grasps given a 3D object in the world. Our key observation is that it is crucial to model the consistency between hand contact points and object contact regions. That is, we encourage the prior hand contact points to be close to the object surface, and the common object contact regions to be touched by the hand at the same time. Based on this hand-object consistency, we design novel...
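The two-way consistency described above can be illustrated with a simple symmetric loss: hand contact points should lie near the object surface, and likely object contact regions should be near the hand. This is only a sketch of the idea, not the paper's exact objective; `contact_prob` is a hypothetical per-point contact prior.

```python
import numpy as np

def contact_consistency_loss(hand_pts, obj_pts, contact_prob):
    """Illustrative two-way contact consistency term.

    hand_pts:     (N, 3) predicted hand contact points.
    obj_pts:      (M, 3) sampled object surface points.
    contact_prob: (M,) scores marking likely contact regions (assumed prior).
    """
    # pairwise distances between every hand point and every object point
    d = np.linalg.norm(hand_pts[:, None, :] - obj_pts[None, :, :], axis=-1)
    # pull each hand contact point toward the object surface
    hand_to_obj = d.min(axis=1).mean()
    # pull likely contact regions toward the hand, weighted by the prior
    obj_to_hand = (contact_prob * d.min(axis=0)).sum() / contact_prob.sum()
    return hand_to_obj + obj_to_hand
```

Minimizing both terms jointly discourages grasps that either float off the surface or ignore the regions an object is typically held by.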
Synthesizing 3D human motion plays an important role in many graphics applications as well as in understanding human activity. While many efforts have been made on generating realistic and natural motion, most approaches neglect the importance of modeling human-scene interactions and affordance. On the other hand, affordance reasoning (e.g., standing on the floor or sitting on a chair) has mainly been studied with static poses and gestures, and is rarely addressed for motion. In this paper, we propose to bridge human motion synthesis and scene affordance reasoning. We...
Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, optimization becomes increasingly challenging and empirically may result in lower sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning....
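The decoupling can be sketched as an auxiliary consistency objective: the policy only ever sees clean observations, while a separate loss pulls the embedding of an augmented view toward the clean embedding. A minimal version, assuming placeholder `encode`, `project`, and `augment` callables (the real method uses learned networks and a stop-gradient on the clean target), might look like:

```python
import numpy as np

def soda_loss(encode, project, obs, augment):
    """Consistency term between clean and augmented embeddings.

    The policy gradient flows only through clean observations elsewhere;
    this auxiliary loss aligns the projected augmented embedding with the
    clean one (whose gradient would be stopped in a full implementation).
    """
    z_clean = encode(obs)                     # target embedding
    z_aug = project(encode(augment(obs)))     # prediction from augmented view
    # negative cosine similarity: minimized when the two embeddings align
    num = (z_clean * z_aug).sum()
    den = np.linalg.norm(z_clean) * np.linalg.norm(z_aug) + 1e-8
    return -num / den
```

Because the augmented view never feeds the policy loss directly, heavier augmentations perturb only this auxiliary term rather than destabilizing policy optimization.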
Videos typically record streaming, continuous visual data as discrete consecutive frames. Since the storage cost is expensive for videos of high fidelity, most of them are stored at a relatively low resolution and frame rate. Recent works on Space-Time Video Super-Resolution (STVSR) were developed to incorporate temporal interpolation and spatial super-resolution in a unified framework. However, they only support a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following...
Recent work has made significant progress on using implicit functions as a continuous representation for 3D rigid object shape reconstruction. However, much less effort has been devoted to modeling general articulated objects. Compared to rigid objects, articulated objects have higher degrees of freedom, which makes it hard to generalize to unseen shapes. To deal with the large shape variance, we introduce Articulated Signed Distance Functions (A-SDF) to represent articulated shapes with a disentangled latent space, with separate codes encoding...
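The disentanglement can be illustrated with a toy signed-distance decoder that conditions on two separate codes: holding the shape code fixed while varying the articulation code should move the same object through different articulation states. The dimensions and weights below are arbitrary stand-ins for a trained MLP, not the paper's architecture.

```python
import numpy as np

def asdf(x, shape_code, artic_code, W, b):
    """Toy A-SDF-style decoder: signed distance at 3D point x,
    conditioned on disentangled shape and articulation codes."""
    inp = np.concatenate([x, shape_code, artic_code])
    h = np.tanh(W[0] @ inp + b[0])          # single hidden layer
    return float(W[1] @ h + b[1])           # scalar signed distance

rng = np.random.default_rng(0)
# input dim = 3 (point) + 4 (shape code) + 1 (articulation code) = 8
W = [rng.normal(size=(8, 8)) * 0.5, rng.normal(size=(8,)) * 0.5]
b = [np.zeros(8), 0.0]
# same shape code, two articulation states -> two different surfaces
sdf_closed = asdf(np.ones(3), np.zeros(4), np.array([0.0]), W, b)
sdf_open = asdf(np.ones(3), np.zeros(4), np.array([1.0]), W, b)
```

At test time one can hold the shape code fixed and sweep the articulation code to synthesize unseen articulation states of a reconstructed instance.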
We propose to perform imitation learning for dexterous manipulation with a multi-finger robot hand from human demonstrations, and transfer the policy to a real robot hand. We introduce a novel single-camera teleoperation system that collects 3D demonstrations efficiently with only an iPad and a computer. One key contribution of our system is that we construct a customized robot hand for each user in the simulator, a manipulator resembling the same kinematic structure as the operator's hand. It provides an intuitive interface and avoids unstable human-robot hand retargeting during data...
... the real robot hand and rotate novel objects that are not present in training. Extensive ablations ...
Recent development of neural implicit functions has shown tremendous success on high-quality 3D shape reconstruction. However, most works divide the space into the inside and outside of the shape, which limits their representing power to single-layer, watertight shapes. This limitation leads to tedious data processing (converting non-watertight raw shapes to watertight ones) as well as the incapability of representing general object shapes in the real world. In this work, we propose a novel method to represent general shapes, including those with multi-layer...
We propose to learn to generate grasping motion for manipulation with a dexterous hand using implicit functions. With continuous time inputs, the model can generate a continuous and smooth grasping plan. We name the proposed model Continuous Grasping Function (CGF). CGF is learned via generative modeling with a Conditional Variational Autoencoder using 3D human demonstrations. We first convert large-scale human-object interaction trajectories to robot demonstrations via retargeting, and then use these demonstrations to train CGF. During inference, we perform sampling with different...
Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives both a static third-person view and an egocentric view from a camera mounted on the robot's wrist. While the third-person view is static, the egocentric camera lets the robot actively control its vision to aid in precise...
The recent advancements in the visual reasoning capabilities of large multimodal models (LMMs) and the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies utilizing 3D feature fields. In this work, we introduce LMM-3DP, a framework that integrates LMM planners and 3D skill policies. Our approach consists of three key perspectives: high-level planning, low-level control, and effective...
Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as the computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and a terminal value function to estimate...
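The planning loop described above can be sketched with a simple random-shooting planner: roll out candidate action sequences through the learned latent dynamics, sum per-step rewards over a short horizon, and bootstrap everything beyond the horizon with a terminal value estimate. This is a minimal sketch assuming placeholder `dynamics`, `reward`, and `value` callables; the actual method uses a more sophisticated sampling-based optimizer.

```python
import numpy as np

def plan(z0, dynamics, reward, value, horizon=3, n_samples=64, act_dim=2, seed=0):
    """Short-horizon shooting planner over a learned latent model.

    Samples random action sequences, scores each by short-horizon reward
    plus a terminal value at the final latent state, and returns the
    first action of the best-scoring sequence (MPC-style replanning).
    """
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1, 1, size=(n_samples, horizon, act_dim))
    returns = np.zeros(n_samples)
    for k in range(n_samples):
        z = z0
        for t in range(horizon):
            returns[k] += reward(z, actions[k, t])
            z = dynamics(z, actions[k, t])
        returns[k] += value(z)   # estimate of return beyond the horizon
    return actions[returns.argmax(), 0]
```

The terminal value is what keeps the horizon short and cheap: without it, the planner would be blind to any reward beyond the rollout.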
A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which has the potential benefit of improved sample efficiency and generalization through the additional learning signal and inductive biases. However, while the real world is inherently 3D, prior efforts have largely focused on leveraging 2D computer vision techniques as auxiliary self-supervision. In this work, we present a unified framework for learning 3D representations for motor...
To enable general-purpose robots, we will require the robot to operate daily articulated objects as humans do. Current robot manipulation has heavily relied on using a parallel gripper, which restricts the robot to a limited set of objects. On the other hand, operating with a multi-finger hand allows better approximation of human behavior and more diverse manipulation. To this end, we propose a new benchmark called DexArt, which involves Dexterous manipulation with Articulated objects in a physical simulator. In our benchmark, we define multiple complex tasks in which the hand needs to manipulate diverse articulated objects within...
Recent works on generalizable NeRFs have shown promising results for novel view synthesis from single or few images. However, such models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing. In this paper, we propose a framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF lifts 2D pre-trained foundation models to 3D space via neural rendering, and then extracts deep features for 3D query points from the NeRF MLPs. Consequently, it allows one to map...
Teleoperation serves as a powerful method for collecting on-robot data essential for robot learning from demonstrations. The intuitiveness and ease of use of the teleoperation system are crucial for ensuring high-quality, diverse, and scalable data. To achieve this, we propose an immersive teleoperation system, Open-TeleVision, that allows operators to actively perceive the robot's surroundings in a stereoscopic manner. Additionally, the system mirrors the operator's arm and hand movements on the robot, creating an immersive experience as if the operator's mind is transmitted to a robot embodiment....
This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of meshes, 2D keypoints, or camera poses for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt...