- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Reinforcement Learning in Robotics
- Anomaly Detection Techniques and Applications
- Robotics and Sensor-Based Localization
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Topic Modeling
- Robot Manipulation and Learning
- 3D Surveying and Cultural Heritage
- Remote Sensing and LiDAR Applications
- Child and Animal Learning Development
- Natural Language Processing Techniques
- Machine Learning and Algorithms
- Image Retrieval and Classification Techniques
- Hand Gesture Recognition Systems
- Explainable Artificial Intelligence (XAI)
- Computer Graphics and Visualization Techniques
- Multi-Agent Systems and Negotiation
- Molecular Biology Techniques and Applications
- Cancer Cells and Metastasis
- Text Readability and Simplification
Beijing Academy of Artificial Intelligence
2023-2024
Beijing Institute for General Artificial Intelligence
2023-2024
Henan Normal University
2022
Google (United States)
2020-2021
University of California, Los Angeles
2017-2020
UCLA Health
2019-2020
University of Indianapolis
2019
Indiana University – Purdue University Indianapolis
2019
Renmin University of China
2015
We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, to obtain large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an...
This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles the information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively....
Rapid progress has been witnessed for human-object interaction (HOI) recognition, but most existing models are confined to single-stage reasoning pipelines. Considering the intrinsic complexity of the task, we introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. Each of the two networks is also connected to its predecessor at the previous stage, enabling cross-stage...
We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm...
We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction (3D estimations of object bounding boxes, camera pose, and room layout), and (ii) 3D human pose estimation. The intuition behind this is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We propose to exploit critical and essential connections...
Forms of explanation that are best suited to foster trust do not necessarily correspond to those components contributing to the task performance.
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information...
This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs...
Modeling the human structure is central for human parsing, which extracts pixel-wise semantic information from images. We start by analyzing three types of inference processes over the hierarchical structure of human bodies: direct inference (directly predicting human parts using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). We then formulate the problem as a compositional neural information fusion (CNIF) framework, which assembles the information from the three inference processes in a conditional manner, i.e., considering...
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection...
As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E <sup...
Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock...
Recent progress in deep learning is essentially based on a "big data for small tasks" paradigm, under which massive amounts of data are used to train a classifier for a single narrow task. In this paper, we call for a shift that flips this paradigm upside down. Specifically, we propose a "small data for big tasks" paradigm, wherein a single artificial intelligence (AI) system is challenged to develop "common sense," enabling it to solve a wide range of tasks with little training data. We illustrate the potential power of this new paradigm by reviewing models of common sense that synthesize...
Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory's Shapley...
Detection, parsing, and future predictions on sequence data (e.g., videos) require the algorithms to capture non-Markovian and compositional properties of high-level semantics. Context-free grammars are natural choices to capture such properties, but traditional grammar parsers (e.g., the Earley parser) only take symbolic sentences as inputs. In this paper, we generalize the Earley parser to parse sequence data which is neither segmented nor labeled. Given the output of an arbitrary probabilistic classifier, this generalized Earley parser finds the optimal segmentation...
The release of the generative pre-trained transformer (GPT) series has brought artificial general intelligence (AGI) to the forefront of the artificial intelligence (AI) field once again. However, the questions of how to define and evaluate AGI remain unclear. This perspective article proposes that the evaluation of AGI should be rooted in dynamic embodied physical and social interactions (DEPSI). More specifically, we propose five critical characteristics to be considered as AGI benchmarks and suggest the Tong test as an AGI evaluation system. The Tong test describes a value- and ability-oriented...
Holistic 3D indoor scene understanding refers to jointly recovering the i) object bounding boxes, ii) room layout, and iii) camera pose, all in 3D. The existing methods either are ineffective or only tackle the problem partially. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real-time given only a single RGB image. The essence of the proposed method is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and ii) cooperative training...
This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes the belief of other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that for humans, other people's intents will pose an influence on our utility for a goal. The proposed...
We propose VRGym, a virtual reality (VR) testbed for realistic human-robot interaction. Different from existing toolkits and VR environments, VRGym emphasizes building and training both physical and interactive agents for robotics, machine learning, and cognitive science. VRGym leverages mechanisms that can generate diverse 3D scenes with high realism through physics-based simulation. We demonstrate that VRGym is able to (i) collect human interactions and fine manipulations, (ii) accommodate various robots with a ROS bridge, (iii)...