- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Natural Language Processing Techniques
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Medical Imaging Techniques and Applications
- Human Motion and Animation
- Robot Manipulation and Learning
- 3D Surveying and Cultural Heritage
- Advanced Neural Network Applications
- 3D Shape Modeling and Analysis
- Speech and Dialogue Systems
- Video Surveillance and Tracking Methods
- Hand Gesture Recognition Systems
- Advanced MRI Techniques and Applications
- Anomaly Detection Techniques and Applications
- Remote Sensing and LiDAR Applications
- Optical Measurement and Interference Techniques
- Software Engineering Research
- Gaze Tracking and Assistive Technology
- Soft Robotics and Applications
- Neuroscience and Neuropharmacology Research
Shanghai Artificial Intelligence Laboratory
2023-2025
Shanghai Open University
2024-2025
ShangHai JiAi Genetics & IVF Institute
2024
Beijing Institute for General Artificial Intelligence
2022-2024
Chinese People's Armed Police Force
2024
Beijing Academy of Artificial Intelligence
2022-2024
Shanghai Electric (China)
2024
Chinese People's Armed Police Force Engineering University
2024
Sichuan University
2021-2023
PLA Army Engineering University
2020-2022
Patients with Parkinson's disease tend to have a reduced response to levodopa after 5 to 20 years of therapy, with "on-off" fluctuations consisting of dyskinesia alternating with immobility. In an effort to modify the motor disability of advanced disease, we implanted embryonic mesencephalic tissue containing dopamine cells into the caudate and putamen of seven patients. Two patients received unilateral grafts on the side opposite the worse symptoms. Five received bilateral grafts. In six patients, fetal tissue was obtained from a single embryo...
To date, various 3D scene understanding tasks still lack practical and generalizable pre-trained models, primarily due to the intricate nature of these tasks and the immense variations introduced by camera views, lighting, occlusions, etc. In this paper, we tackle this challenge by introducing a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion. Inspired by how infants learn from visual data in the wild, we explore the rich spatio-temporal cues derived from the 3D data. Specifically, STRL...
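The self-supervised idea behind such a framework can be illustrated with a toy sketch: embed two augmented views of the same unlabeled point cloud and compare them against a view of a clearly different cloud. The linear "encoder", augmentations, and all sizes below are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(points):
    """One 'view': a small random rotation about z plus per-point jitter."""
    theta = rng.uniform(-0.1, 0.1)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T + rng.normal(scale=0.01, size=points.shape)

def embed(points, W):
    """Toy encoder: per-point linear map followed by mean pooling."""
    return (points @ W).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

W = rng.normal(size=(3, 8))              # stand-in for a learned backbone
cloud = rng.normal(size=(256, 3)) + 1.0  # an unlabeled point cloud
other = -cloud                           # a mirrored cloud as a distinct negative

z1 = embed(augment(cloud), W)  # positive pair: two views of the same cloud
z2 = embed(augment(cloud), W)
z3 = embed(augment(other), W)  # a view of the different cloud

pos, neg = cosine(z1, z2), cosine(z1, z3)
# Self-supervised training would pull pos toward 1 and push neg down.
```

A real pipeline would replace the linear map with a point-cloud backbone and optimize a contrastive or predictive loss over many such pairs.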
Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance benefited from contrastive language-image pre-training. We then question whether more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates the prior knowledge of various pre-training paradigms for...
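As an illustration of combining pre-trained knowledge for few-shot recognition, the sketch below blends zero-shot text-image similarity with a few-shot feature cache, in the style of cache-based CLIP adapters. All features here are random stand-ins for real foundation-model embeddings, and the blending scheme is a generic example, not CaFo's actual cascade.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

n_class, dim, shots = 3, 16, 4

# Stand-ins for frozen foundation-model features (all hypothetical):
# class prototypes from a text encoder, plus a few-shot visual cache.
text_feats = l2norm(rng.normal(size=(n_class, dim)))
cache_keys = l2norm(text_feats.repeat(shots, axis=0)
                    + 0.1 * rng.normal(size=(n_class * shots, dim)))
cache_vals = np.eye(n_class).repeat(shots, axis=0)  # one-hot labels

def predict(img_feat, alpha=1.0, beta=5.0):
    """Blend zero-shot logits with cache-model logits."""
    img_feat = l2norm(img_feat)
    zero_shot = img_feat @ text_feats.T                # text-image similarity
    affinity = np.exp(-beta * (1.0 - img_feat @ cache_keys.T))
    few_shot = affinity @ cache_vals                   # weighted label votes
    return zero_shot + alpha * few_shot

# A query feature near class 1's prototype should be classified as class 1.
query = text_feats[1] + 0.05 * rng.normal(size=dim)
pred = int(np.argmax(predict(query)))
```

The point of the sketch is that two sources of pre-trained knowledge (language alignment and cached visual exemplars) contribute complementary logits that a simple weighted sum can combine.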
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior work, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among...
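The iterative denoising process such models rely on can be sketched on a toy 1D problem where the noise predictor is analytic rather than learned. A real model would replace `pred_noise` with a network conditioned on the scene and the goal; the schedule and target distribution below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 100
betas = np.linspace(1e-4, 0.05, T)      # toy noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

mu0, s0 = 2.0, 0.1  # the "goal" distribution the sampler should reach

def pred_noise(x, t):
    """Analytic epsilon-prediction for a Gaussian target; a trained
    denoiser would additionally condition on scene and goal."""
    var_t = alpha_bar[t] * s0**2 + (1.0 - alpha_bar[t])
    score = -(x - np.sqrt(alpha_bar[t]) * mu0) / var_t
    return -np.sqrt(1.0 - alpha_bar[t]) * score

def sample(n):
    x = rng.normal(size=n)  # start from pure noise
    for t in range(T - 1, -1, -1):
        eps = pred_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:           # add noise at every step except the last
            x += np.sqrt(betas[t]) * rng.normal(size=n)
    return x

samples = sample(2000)      # should concentrate near mu0 = 2.0
```

Because every step is a differentiable function of the current state, gradient-based objectives (e.g. physics or goal costs) can in principle be injected into the same loop, which is the property the abstract highlights.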
We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, to obtain large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including the room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an...
This paper addresses a new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human interactions. To tackle this novel and challenging problem, we contribute a large-scale video dataset, VACATION, which covers diverse daily scenes and gaze behaviors with complete annotations of objects and faces, human attention, and communication structures and labels at the event level. Together we propose a spatio-temporal graph neural network to explicitly represent the gaze interactions and infer...
We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction, i.e., 3D estimations of object bounding boxes, camera pose, and room layout, and (ii) 3D human pose estimation. The intuition behind this is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We exploit the critical and essential connections...
This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs...
Classification is an important technique for remotely sensed hyperspectral image (HSI) exploitation. Often, the presence of wrong (noisy) labels presents a drawback for accurate supervised classification. In this article, we introduce a new framework for noisy label detection that combines a superpixel-to-pixel weighting distance (SPWD) and density peak clustering. The proposed method is able to accurately detect and remove noisy labels in the training set before HSI classification. It considers two weak assumptions when exploiting...
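The density-peak component of such a framework can be sketched on toy 2D data: each sample gets a local density rho and a distance delta to its nearest higher-density neighbor, and isolated samples (a stand-in for mislabeled training pixels) score near zero on gamma = rho * delta. The cutoff and data below are illustrative; this is plain density peak clustering, not the SPWD weighting itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two tight clusters plus one stray point standing in for a noisy label.
cluster_a = rng.normal(loc=(0, 0), scale=0.1, size=(20, 2))
cluster_b = rng.normal(loc=(3, 3), scale=0.1, size=(20, 2))
stray = np.array([[1.5, 1.5]])
X = np.vstack([cluster_a, cluster_b, stray])

d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
dc = 0.3                                              # cutoff distance

# rho: number of neighbors within the cutoff (local density).
rho = (d < dc).sum(axis=1) - 1.0
rho += 1e-6 * rng.random(len(X))  # tiny jitter to break density ties

# delta: distance to the nearest point of strictly higher density;
# the global density maximum instead gets its largest distance overall.
delta = np.empty(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = d[i].max() if len(higher) == 0 else d[i, higher].min()

gamma = rho * delta  # large for cluster centers, near zero for the stray point
```

Ranking samples by gamma separates plausible cluster centers (high rho and high delta) from outliers, which is the mechanism a noisy-label detector can exploit within each labeled class.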
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both...
Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite the great success, there is a lack of holistic evaluation of their efficacy. This paper presents a comprehensive evaluation of publicly available large models by building an LVLM Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs such as InstructBLIP and LLaVA, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates five categories...
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, Song-Chun Zhu. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Previous neural solvers of math word problems (MWPs) are learned with full supervision and fail to generate diverse solutions. In this paper, we address this issue by introducing a weakly-supervised paradigm for learning MWPs. Our method only requires the annotations of the final answers and can generate various solutions for a single problem. To boost weakly-supervised learning, we propose a novel learning-by-fixing (LBF) framework, which corrects the misperceptions of the neural network via symbolic reasoning. Specifically, for an incorrect solution tree generated...
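The symbolic "fixing" of an incorrect solution tree can be sketched as a search over minimal operator edits until the tree reproduces the ground-truth answer. This is a toy enumeration for illustration, not the paper's actual tree-diagnosis procedure; the example expression is hypothetical.

```python
import itertools

OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b, '/': lambda a, b: a / b}

def evaluate(tree):
    """tree is either a number or a tuple (op, left, right)."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

def op_paths(tree, path=()):
    """Yield the path to every operator node in the tree."""
    if isinstance(tree, tuple):
        yield path
        yield from op_paths(tree[1], path + (1,))
        yield from op_paths(tree[2], path + (2,))

def replace_op(tree, path, op):
    """Return a copy of tree with the operator at `path` replaced by `op`."""
    if not path:
        return (op, tree[1], tree[2])
    sub = replace_op(tree[path[0]], path[1:], op)
    return tree[:path[0]] + (sub,) + tree[path[0] + 1:]

def fix(tree, target, max_edits=2):
    """Search trees reachable by few operator edits that evaluate to target."""
    paths = list(op_paths(tree))
    for k in range(1, max_edits + 1):           # fewest edits first
        for combo in itertools.combinations(paths, k):
            for ops in itertools.product(OPS, repeat=k):
                cand = tree
                for p, op in zip(combo, ops):
                    cand = replace_op(cand, p, op)
                try:
                    if evaluate(cand) == target:
                        return cand
                except ZeroDivisionError:
                    pass
    return None

predicted = ('-', ('+', 5, 3), 2)   # network's guess: (5 + 3) - 2 = 6
fixed = fix(predicted, target=16)   # only the final answer 16 is supervised
```

Here the minimal fix swaps the root operator to '*', giving (5 + 3) * 2 = 16; the corrected tree can then serve as a pseudo-label for the network.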
Generating dexterous grasping has been a long-standing and challenging robotic task. Despite recent progress, existing methods primarily suffer from two issues. First, most prior art focuses on a specific type of robot hand, lacking the generalizable capability of handling unseen ones. Second, prior arts oftentimes fail to rapidly generate diverse grasps with a high success rate. To jointly tackle these challenges with a unified solution, we propose GenDexGrasp, a novel hand-agnostic algorithm for grasping...
For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, and provide rich, part-level annotations (semantics, poses) for 8,489 part instances on...
Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs the LLM to generate Python programs that constitute a comprehensive perception, planning, and action loop. In the perception section,...
Fine-grained capture of 3D Human-Object Interactions (HOIs) enhances human activity comprehension and supports various downstream visual tasks. However, previous models often assume that humans interact with rigid objects using only a few body parts, constraining their applicability. In this paper, we address the intricate challenge of Full-Body Articulated Human-Object Interaction (f-AHOI), wherein complete human bodies interact with articulated objects having interconnected movable joints. We introduce CHAIRS, an extensive...
The k-nearest neighbor (k-NN) method relies on Euclidean distance as a classification measure to obtain the labels of test samples. Recently, many studies have shown that joint region samples can make full use of the spatial information of a hyperspectral image. However, the traditional k-NN algorithm holds that the weight of each sample in a local region is identical, which is not reasonable, since different samples may have different importance and distributions. To solve this problem, a weighted nearest-neighbor sparse representation method is proposed in this paper, which consists...
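The equal-weight criticism can be made concrete with a distance-weighted variant of k-NN, in which closer neighbors vote more strongly. This is a generic sketch on synthetic "spectra", not the paper's sparse-representation method.

```python
import numpy as np

rng = np.random.default_rng(4)

def weighted_knn(X_train, y_train, x, k=5):
    """k-NN where each neighbor votes with weight 1/d instead of equally,
    so nearby samples matter more than distant ones."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    votes = {}
    for i in idx:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-12)
    return max(votes, key=votes.get)

# Toy "pixels": two classes of 3-band spectra around different means.
X0 = rng.normal(loc=0.0, scale=0.3, size=(30, 3))
X1 = rng.normal(loc=1.0, scale=0.3, size=(30, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 30 + [1] * 30)

label = weighted_knn(X, y, np.array([0.9, 1.1, 1.0]), k=5)
```

With plain k-NN a single distant neighbor counts as much as an adjacent one; the 1/d weighting removes that assumption, which is the gap the weighted method in the abstract targets.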
Recent progress in deep learning is essentially based on a "big data for small tasks" paradigm, under which massive amounts of data are used to train a classifier for a single narrow task. In this paper, we call for a shift that flips this paradigm upside down. Specifically, we propose a "small data for big tasks" paradigm, wherein an artificial intelligence (AI) system is challenged to develop "common sense," enabling it to solve a wide range of tasks with little training data. We illustrate the potential power of this new paradigm by reviewing models of common sense that synthesize...
Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite the great success, there is a lack of holistic evaluation of their efficacy. This paper presents a comprehensive evaluation of publicly available large models by building an LVLM Hub (LVLM-eHub). Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates 6 categories...
The intricate kinematics of the human hand enable simultaneous grasping and manipulation of multiple objects, essential for tasks such as object transfer and in-hand manipulation. Despite its significance, the domain of robotic multi-object grasping is relatively unexplored and presents notable challenges in kinematics, dynamics, and object configurations. This letter introduces MultiGrasp, a novel two-stage approach using...
Reconstructing detailed 3D scenes from single-view images remains a challenging task due to limitations in existing approaches, which primarily focus on geometric shape recovery, overlooking object appearances and fine details. To address these challenges, we propose a novel framework for the simultaneous high-fidelity recovery of object shapes and textures from single-view images. Our approach utilizes the proposed Single-view neural implicit Shape and Radiance field (SSR) representations to leverage both explicit supervision...