Siyuan Qi

ORCID: 0000-0002-4070-733X
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Advanced Vision and Imaging
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image and Video Retrieval Techniques
  • Reinforcement Learning in Robotics
  • Anomaly Detection Techniques and Applications
  • Robotics and Sensor-Based Localization
  • Advanced Neural Network Applications
  • Video Surveillance and Tracking Methods
  • Topic Modeling
  • Robot Manipulation and Learning
  • 3D Surveying and Cultural Heritage
  • Remote Sensing and LiDAR Applications
  • Child and Animal Learning Development
  • Natural Language Processing Techniques
  • Machine Learning and Algorithms
  • Image Retrieval and Classification Techniques
  • Hand Gesture Recognition Systems
  • Explainable Artificial Intelligence (XAI)
  • Computer Graphics and Visualization Techniques
  • Multi-Agent Systems and Negotiation
  • Molecular Biology Techniques and Applications
  • Cancer Cells and Metastasis
  • Text Readability and Simplification

Beijing Academy of Artificial Intelligence
2023-2024

Beijing Institute for General Artificial Intelligence
2023-2024

Henan Normal University
2022

Google (United States)
2020-2021

University of California, Los Angeles
2017-2020

UCLA Health
2019-2020

University of Indianapolis
2019

Indiana University – Purdue University Indianapolis
2019

Renmin University of China
2015

We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, to obtain large-scale 2D/3D image data with the perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an...

10.1109/cvpr.2018.00618 preprint EN 2018-06-01

This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles the information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively....

10.1109/iccv.2019.00580 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

Rapid progress has been witnessed for human-object interaction (HOI) recognition, but most existing models are confined to single-stage reasoning pipelines. Considering the intrinsic complexity of the task, we introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. Each of the two networks is also connected to its predecessor at the previous stage, enabling cross-stage...

10.1109/cvpr42600.2020.00432 preprint EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm...

10.1109/cvpr.2019.00683 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction (3D estimations of object bounding boxes, camera pose, and room layout), and (ii) 3D human pose estimation. The intuition behind is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We propose to exploit two critical and essential connections...

10.1109/iccv.2019.00874 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

Forms of explanation that are best suited to foster trust do not necessarily correspond to those components contributing to the task performance.

10.1126/scirobotics.aay4663 article EN Science Robotics 2019-12-18

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into an interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information...

10.1109/tpami.2021.3049156 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-01-05

This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs...

10.1109/iccv.2017.132 article EN 2017-10-01

Modeling the human structure is central for human parsing, which extracts pixel-wise semantic information from images. We start with analyzing three types of inference processes over the hierarchical structure of human bodies: direct inference (directly predicting human semantic parts using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). We then formulate the problem as a compositional neural information fusion (CNIF) framework, which assembles the information from the three inference processes in a conditional manner, i.e., considering...

10.1109/tpami.2021.3055780 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-01-01

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection...

10.48550/arxiv.1808.07962 preprint EN other-oa arXiv (Cornell University) 2018-01-01

As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E...

10.1109/iccv51070.2023.01604 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock...

10.1109/iros.2017.8206196 article EN 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2017-09-01

Recent progress in deep learning is essentially based on a "big data for small tasks" paradigm, under which massive amounts of data are used to train a classifier for a single narrow task. In this paper, we call for a shift that flips this paradigm upside down. Specifically, we propose a "small data for big tasks" paradigm, wherein a single artificial intelligence (AI) system is challenged to develop "common sense," enabling it to solve a wide range of tasks with little training data. We illustrate the potential power of this new paradigm by reviewing models of common sense that synthesize...

10.1016/j.eng.2020.01.011 article EN cc-by-nc-nd Engineering 2020-02-22

Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory's Shapley...

10.48550/arxiv.2502.00510 preprint EN arXiv (Cornell University) 2025-02-01
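
The abstract above is truncated, so the exact attribution procedure used in CapaBench is not shown here. As a rough illustration of the general idea it names (Shapley-style credit assignment from cooperative game theory), the following minimal Python sketch computes exact Shapley values for four modules. The module names are taken from the abstract; the `toy_score` function, the `GAINS` table, and all numbers are hypothetical assumptions made up for the example, not results or definitions from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a coalition value function.

    `value` maps a frozenset of players to a real number.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical setup: each module is either "baseline" or "upgraded".
MODULES = ["planning", "reasoning", "action", "reflection"]
GAINS = {"planning": 0.20, "reasoning": 0.15, "action": 0.10, "reflection": 0.05}


def toy_score(upgraded):
    """Made-up task score for a set of upgraded modules (assumption)."""
    score = 0.30 + sum(GAINS[m] for m in upgraded)
    # Assumed synergy: planning and reasoning help each other.
    if "planning" in upgraded and "reasoning" in upgraded:
        score += 0.10
    return score


values = shapley_values(MODULES, toy_score)
for module in MODULES:
    print(f"{module}: {values[module]:.3f}")
```

In this toy setup the synergy bonus is split evenly between planning and reasoning, and the four values add up to the score difference between upgrading all modules and upgrading none. Exact enumeration visits every coalition, so it is practical only for a handful of modules.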

Detection, parsing, and future predictions on sequence data (e.g., videos) require the algorithms to capture non-Markovian and compositional properties of high-level semantics. Context-free grammars are natural choices to capture such properties, but traditional grammar parsers (e.g., Earley parser) only take symbolic sentences as inputs. In this paper, we generalize the Earley parser to parse sequence data which is neither segmented nor labeled. Given the output of an arbitrary probabilistic classifier, this generalized Earley parser finds the optimal segmentation...

10.1109/tpami.2020.2976971 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2020-02-28

The release of the generative pre-trained transformer (GPT) series has brought artificial general intelligence (AGI) to the forefront of the artificial intelligence (AI) field once again. However, the questions of how to define and evaluate AGI remain unclear. This perspective article proposes that the evaluation of AGI should be rooted in dynamic embodied physical and social interactions (DEPSI). More specifically, we propose five critical characteristics to be considered as AGI benchmarks and suggest the Tong test as an AGI evaluation system. The Tong test describes a value- and ability-oriented...

10.1016/j.eng.2023.07.006 article EN cc-by-nc-nd Engineering 2023-08-09

Holistic 3D indoor scene understanding refers to jointly recovering the i) object bounding boxes, ii) room layout, and iii) camera pose, all in 3D. The existing methods either are ineffective or only tackle the problem partially. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real-time given only a single RGB image. The essence of the proposed method is to improve the prediction by parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and cooperative training...

10.48550/arxiv.1810.13049 preprint EN other-oa arXiv (Cornell University) 2018-01-01

This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes the belief of other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that for humans, other people's intents will pose an influence on our utility for a goal. The proposed...

10.1109/icra.2018.8463211 preprint EN 2018-05-01

We propose VRGym, a virtual reality (VR) testbed for realistic human-robot interaction. Different from existing toolkits and VR environments, VRGym emphasizes building and training both physical and interactive agents for robotics, machine learning, and cognitive science. VRGym leverages mechanisms that can generate diverse 3D scenes with high realism through physics-based simulation. We demonstrate that VRGym is able to (i) collect human interactions and fine manipulations, (ii) accommodate various robots with a ROS bridge, (iii)...

10.1145/3321408.3322633 article EN Proceedings of the ACM Turing Celebration Conference - China 2019-05-17

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm...

10.48550/arxiv.1904.05548 preprint EN other-oa arXiv (Cornell University) 2019-01-01