Zilong Zheng

ORCID: 0000-0003-1219-5151
Research Areas
  • Generative Adversarial Networks and Image Synthesis
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Topic Modeling
  • Advanced Vision and Imaging
  • 3D Shape Modeling and Analysis
  • Computer Graphics and Visualization Techniques
  • Natural Language Processing Techniques
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Image Processing and 3D Reconstruction
  • Advanced Image Processing Techniques
  • Speech and Dialogue Systems
  • Optical Measurement and Interference Techniques
  • Video Analysis and Summarization
  • Hand Gesture Recognition Systems
  • Machine Learning in Materials Science
  • Radiomics and Machine Learning in Medical Imaging
  • Machine Learning and Data Classification
  • Schizophrenia Research and Treatment
  • Ethics and Social Impacts of AI
  • Visual Attention and Saliency Detection
  • Explainable Artificial Intelligence (XAI)
  • Aesthetic Perception and Analysis
  • Speech Recognition and Synthesis

Beijing Institute for General Artificial Intelligence
2022-2024

Beijing Academy of Artificial Intelligence
2022-2024

Inner Mongolia Agricultural University
2024

Second Xiangya Hospital of Central South University
2023

Central South University
2023

University of California, Los Angeles
2018-2022

Baidu (China)
2020-2021

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with missing value. We first introduce an Expectation Maximization algorithm...

10.1109/cvpr.2019.00683 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an "analysis by synthesis" scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate...

10.1109/cvpr.2018.00900 article EN 2018-06-01

A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback...

10.1126/scirobotics.abm4183 article EN Science Robotics 2022-07-13

We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network. The energy function learns a coordinate encoding of each point and then aggregates all individual point features into an energy for the whole point cloud. We call our model the Generative PointNet because it can be derived from the discriminative PointNet. Our model can be trained by MCMC-based maximum likelihood learning (as well as its variants), without the help of any assisting...

10.1109/cvpr46437.2021.01473 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

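This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through...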

10.1609/aaai.v33i01.33015498 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2019-07-17

We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with missing value. We first introduce an Expectation Maximization algorithm...

10.48550/arxiv.1904.05548 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Due to the intractable partition function, training energy-based models (EBMs) by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling to approximate the gradient of the Kullback-Leibler divergence between data and model distributions. However, it is non-trivial to sample from an EBM because of the difficulty of mixing between modes. In this paper, we propose to learn a variational auto-encoder (VAE) to initialize the finite-step MCMC, such as Langevin dynamics that is derived from the energy function, for efficient amortized sampling of the EBM. With...

10.1609/aaai.v35i12.17250 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (pg) is learned from a single-view spatiotemporal parsing by aggregating various object states along the time; such a learned representation is accumulated as the robot's knowledge. An inference algorithm is derived to fuse individual pg from all robots across multi-views into a joint...

10.1109/icra40945.2020.9197355 article EN 2020-05-01

3D data that contains rich geometry information of objects and scenes is valuable for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a deep 3D energy-based model to represent volumetric shapes. The maximum likelihood training of the model follows an "analysis by synthesis" scheme. The benefits of the proposed model are six-fold: first, unlike GANs and VAEs, the model training does not rely on any auxiliary models;...

10.1109/tpami.2020.3045010 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2020-01-01

Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and the state generates the observed image frames according to an emission model. To model the motions explicitly, it is natural for the model to be based on the motions or the displacement fields of the pixels. Thus in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame...

10.1609/aaai.v34i07.6931 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03

This paper studies the problem of learning the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different domains, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more...

10.1109/tpami.2021.3069023 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-01-01

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. More challenging but worthwhile, we introduce a new task that targets inducing such a joint VL structure in an...

10.1109/cvpr52688.2022.01516 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

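This paper studies the unsupervised cross-domain translation problem by proposing a generative framework, in which the probability distribution of each domain is represented by a generative cooperative network that consists of an energy-based model and a latent variable model. The use of the generative cooperative network enables maximum likelihood learning of the domain model by MCMC teaching, where the energy-based model seeks to fit the data distribution of the domain and distills its knowledge to the latent variable model via MCMC. Specifically, in the MCMC teaching process, the latent variable model parameterized by an encoder-decoder maps examples from the source domain to the target domain, while the energy-based model further refines...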

10.1609/aaai.v35i12.17249 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Exploiting internal statistics of a single natural image has long been recognized as a significant research paradigm where the goal is to learn the internal distribution of patches within the image without relying on external training data. Different from prior works that model such a distribution implicitly with a top-down latent variable model (e.g., generator), this paper proposes to explicitly represent the statistical distribution within a single natural image by using an energy-based generative framework, where a pyramid of energy functions, each parameterized by a bottom-up deep neural network, are...

10.1109/cvpr46437.2021.00298 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Humans possess a unique social cognition capability [43], [20]; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing in the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents' mental states from pure visual inputs. Crucially, such a mental representation takes the agent's belief into account so that it represents what the true...

10.1109/cvpr46437.2021.00723 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an "analysis by synthesis" scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate...

10.48550/arxiv.1804.00586 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Recent breakthroughs in large language models (LLMs) have brought remarkable success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is that the information processed by LLMs is consistently honest, neglecting the pervasive deceptive or misleading information in human society and AI-generated content. This oversight makes LLMs susceptible to malicious manipulations, potentially resulting in detrimental outcomes. This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments. Avalon, full...

10.48550/arxiv.2310.01320 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper studies the problem of learning the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different domains, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more...

10.48550/arxiv.1902.02812 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Conventional saliency prediction models typically learn a deterministic mapping from an image to its saliency map, and thus fail to explain the subjective nature of human attention. In this paper, to model the uncertainty of visual saliency, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over the saliency map given an input image, and treating the saliency prediction as a sampling process from the learned distribution. Specifically, we propose a generative cooperative saliency prediction framework, where a conditional latent variable model (LVM) and a conditional energy-based model (EBM) are...

10.1609/aaai.v36i3.20237 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28

We introduce RAM, an innovative RAG-based framework with an ever-improving memory. Inspired by humans' pedagogical process, RAM utilizes recursively reasoning-based retrieval and experience reflections to continually update the memory and learn from users' communicative feedback, namely communicative learning. Extensive experiments with both simulated and real users demonstrate significant improvements over traditional RAG and self-knowledge methods, particularly excelling in handling false premise and multi-hop questions....

10.48550/arxiv.2404.12045 preprint EN arXiv (Cornell University) 2024-04-18