Tianjun Xiao

ORCID: 0000-0003-4705-1545
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Multimodal Machine Learning Applications
  • Video Surveillance and Tracking Methods
  • Brain Tumor Detection and Classification
  • Human Pose and Action Recognition
  • Advanced Graph Neural Networks
  • Visual Attention and Saliency Detection
  • Anomaly Detection Techniques and Applications
  • Computer Graphics and Visualization Techniques
  • Advanced Vision and Imaging
  • Data Management and Algorithms
  • Video Analysis and Summarization
  • Parallel Computing and Optimization Techniques
  • Natural Language Processing Techniques
  • Topic Modeling
  • Machine Learning and Data Classification
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Database Systems and Queries
  • IoT and Edge/Fog Computing
  • Sparse and Compressive Sensing Techniques
  • Health, Environment, Cognitive Aging
  • Robotics and Automated Systems
  • Traffic Prediction and Management Techniques

Amazon (United States)
2021-2024

Peking University
2014-2021

Nankai University
2011

MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how...

10.48550/arxiv.1512.01274 preprint EN cc-by arXiv (Cornell University) 2015-01-01
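
To make the blend of imperative execution and automatic differentiation concrete, here is a minimal sketch using MXNet's NDArray and autograd APIs; the tensor values are illustrative only.

```python
import mxnet as mx
from mxnet import autograd, nd

# Imperative tensor computation: NDArray operations execute eagerly.
x = nd.array([[1.0, 2.0], [3.0, 4.0]])
x.attach_grad()  # allocate storage for the gradient of x

# Recording the imperative ops builds a graph for auto differentiation.
with autograd.record():
    y = (x * x).sum()

y.backward()   # derive gradients automatically
print(x.grad)  # dy/dx = 2x
```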

Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to the fine-grained classification task using a deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level attention...

10.1109/cvpr.2015.7298685 article EN 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015-06-01
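
As a rough illustration of how object-level attention can filter bottom-up patch proposals, the toy sketch below keeps the candidate patches a classifier is most confident about; the tensors and the top-k selection rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch

# Hypothetical logits for 50 bottom-up candidate patches over 200 classes.
patch_logits = torch.randn(50, 200)

# Object-level attention: keep patches the classifier is confident about.
confidence = patch_logits.softmax(dim=-1).max(dim=-1).values
selected = confidence.topk(8).indices  # attend to the top-8 patches
print(selected)
```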

Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the computational patterns of GNNs into a few generalized sparse tensor operations suitable for extensive parallelization. By advocating graph as the central programming abstraction, DGL can perform optimizations transparently. By cautiously adopting a framework-neutral design, DGL allows users to easily...

10.48550/arxiv.1909.01315 preprint EN other-oa arXiv (Cornell University) 2019-01-01
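
The "generalized sparse operations" the abstract refers to can be pictured as sparse-dense matrix products. The sketch below uses SciPy rather than DGL itself, with a made-up toy graph, to show how one round of neighbor aggregation reduces to a single SpMM.

```python
import numpy as np
import scipy.sparse as sp

# Toy undirected graph on 4 nodes, stored as a sparse adjacency matrix.
rows = np.array([0, 1, 1, 2, 2, 3, 3, 0])
cols = np.array([1, 0, 2, 1, 3, 2, 0, 3])
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

H = np.random.rand(4, 8)  # 8-dimensional node features

# One message-passing round (sum over neighbors) is a single
# sparse-dense matmul -- the kind of operation DGL parallelizes.
H_new = A @ H
print(H_new.shape)  # (4, 8)
```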

Supervised learning using a deep convolutional neural network has shown its promise in large-scale image classification tasks. As a building block, it is now well positioned to be part of a larger system that tackles real-life multimedia tasks. An unresolved issue is that such a model is trained on a static snapshot of data. Instead, this paper positions the training as a continuous process as new classes of data arrive. A classifier with this capability is useful in practical scenarios, as it gradually expands its capacity to predict an increasing number...

10.1145/2647868.2654926 article EN 2014-10-31
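
A hedged PyTorch sketch of the "expanding capacity" idea: when new classes arrive, grow the output layer while preserving the weights learned for old classes. The helper name and layer sizes are hypothetical; the paper's error-driven training procedure involves more than this.

```python
import torch
import torch.nn as nn

def expand_classifier(old_fc: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Grow the output layer while keeping old-class weights intact."""
    new_fc = nn.Linear(old_fc.in_features,
                       old_fc.out_features + num_new_classes)
    with torch.no_grad():
        new_fc.weight[:old_fc.out_features] = old_fc.weight
        new_fc.bias[:old_fc.out_features] = old_fc.bias
    return new_fc

fc = nn.Linear(512, 10)        # classifier trained on 10 classes
fc = expand_classifier(fc, 5)  # 5 new classes arrive
print(fc.out_features)         # 15
```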

Even though convolutional neural networks (CNN) have achieved near-human performance in various computer vision tasks, their ability to tolerate scale variations is limited. The popular practice is making the model bigger first, and then training it with data augmentation using extensive scale-jittering. In this paper, we propose a scale-invariant convolutional neural network (SiCNN), a model designed to incorporate multi-scale feature extraction and classification into the network structure. SiCNN uses a multi-column architecture, with each column...

10.48550/arxiv.1411.6369 preprint EN other-oa arXiv (Cornell University) 2014-01-01
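
The multi-column idea can be sketched as applying shared filters to several rescaled copies of the input and pooling responses across columns. Note this is a simplified stand-in (SiCNN transforms the filter banks across columns rather than the inputs); all layer sizes here are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleColumns(nn.Module):
    """Toy multi-column block: shared filters at several input scales,
    responses max-pooled across columns."""

    def __init__(self, scales=(1.0, 0.75, 0.5)):
        super().__init__()
        self.scales = scales
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = []
        for s in self.scales:
            xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                               align_corners=False)
            y = F.relu(self.conv(xs))
            # Bring every column back to a common resolution.
            outs.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                      align_corners=False))
        return torch.stack(outs).max(dim=0).values  # pool across columns

x = torch.randn(1, 3, 64, 64)
print(MultiScaleColumns()(x).shape)  # torch.Size([1, 16, 64, 64])
```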

This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability...

10.48550/arxiv.2404.18930 preprint EN arXiv (Cornell University) 2024-04-29

Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with an open-world mask proposal network, encouraged to propose all potential instance class-agnostic masks...

10.1609/aaai.v39i3.32338 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11
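
Open-vocabulary classification of class-agnostic masks is commonly done by comparing mask embeddings to text embeddings. A minimal sketch, assuming precomputed features; in practice the text side would come from a vision-language model such as CLIP's text encoder.

```python
import torch
import torch.nn.functional as F

# Hypothetical precomputed embeddings (random stand-ins).
mask_feats = torch.randn(5, 512)   # 5 class-agnostic mask proposals
text_feats = torch.randn(3, 512)   # e.g. "cat", "dog", "skateboard"

# Cosine similarity between every mask and every category name.
sim = F.normalize(mask_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
labels = sim.argmax(dim=-1)        # assign each mask its closest category
print(sim.shape, labels)           # torch.Size([5, 3]) ...
```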

We propose a hierarchical graph neural network (GNN) model that learns how to cluster a set of images into an unknown number of identities, using a training set of images annotated with labels belonging to a disjoint set of identities. Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set. The resulting method, Hi-LANDER, achieves an average of 49%...

10.1109/iccv48922.2021.00345 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
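
One level of such a hierarchy can be sketched as: take the edges the GNN decided to keep, find connected components, and merge each component into a super-node for the next level. The toy edges and mean-pooled features below are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

# Suppose a GNN has predicted which candidate edges to keep (toy output).
kept = [(0, 1), (1, 2), (3, 4)]
rows, cols = zip(*kept)
A = sp.csr_matrix((np.ones(len(kept)), (rows, cols)), shape=(6, 6))

n_clusters, assign = connected_components(A, directed=False)
print(n_clusters, assign)   # e.g. 3 [0 0 0 1 1 2]

# Each component becomes one node at the next level of the hierarchy,
# with its feature taken here as the mean of its members' features.
feats = np.random.rand(6, 128)
next_feats = np.stack([feats[assign == c].mean(0) for c in range(n_clusters)])
print(next_feats.shape)     # (3, 128)
```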

Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner...

10.48550/arxiv.2209.14860 preprint EN other-oa arXiv (Cornell University) 2022-01-01
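
The key training signal, reconstructing self-supervised features rather than pixels, can be sketched in a few lines; the shapes below are arbitrary stand-ins for frozen encoder outputs and decoder reconstructions.

```python
import torch
import torch.nn.functional as F

# Stand-ins: 196 patch tokens from a frozen self-supervised encoder,
# and the reconstruction produced from object slots by a decoder.
target_feats = torch.randn(196, 768)                     # frozen, no grad
recon_feats = torch.randn(196, 768, requires_grad=True)  # decoder output

# Train by reconstructing features instead of raw pixels.
loss = F.mse_loss(recon_feats, target_feats)
loss.backward()
print(loss.item())
```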

Layout-to-image generation refers to the task of synthesizing photo-realistic images based on semantic layouts. In this paper, we propose LayoutDiffuse, which adapts a foundational diffusion model pretrained on large-scale image or text-image datasets for layout-to-image generation. By adopting a novel neural adaptor based on layout attention and task-aware prompts, our method trains efficiently, generates images with both high perceptual quality and layout alignment, and needs less data. Experiments on three datasets show significantly...

10.48550/arxiv.2302.08908 preprint EN other-oa arXiv (Cornell University) 2023-01-01
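
The efficiency argument rests on tuning small adaptor modules inside a frozen pretrained model. Below is a generic bottleneck-adapter sketch, not the paper's layout-attention design; all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter added to a frozen pretrained block."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

backbone = nn.Linear(256, 256)       # stands in for a pretrained layer
for p in backbone.parameters():
    p.requires_grad = False          # freeze the foundation model
adapter = Adapter(256)               # only the adapter is trained
print(sum(p.numel() for p in adapter.parameters() if p.requires_grad))
```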

Amodal object segmentation is a challenging task that involves segmenting both the visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However,...

10.1109/iccv51070.2023.00122 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
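
The move from pixel space to a vector-quantized latent space amounts to snapping encoder outputs to their nearest codebook entries; a minimal sketch with made-up sizes:

```python
import torch

# Toy vector quantization: map continuous latents to their nearest
# codebook entries, as in the VQ latent space the abstract mentions.
codebook = torch.randn(512, 64)   # 512 codes, 64-dim (sizes are made up)
z = torch.randn(16, 64)           # 16 latent vectors from an encoder

dists = torch.cdist(z, codebook)  # (16, 512) pairwise distances
idx = dists.argmin(dim=-1)        # nearest code per latent
z_q = codebook[idx]               # quantized latents
print(idx.shape, z_q.shape)       # torch.Size([16]) torch.Size([16, 64])
```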

This work targets the image retrieval task held by the MSR-Bing Grand Challenge. Image retrieval is considered a challenging task because of the gap between the low-level image representation and the high-level textual query representation. Recently, further developed deep neural networks have shed light on narrowing this gap by learning from raw pixels. In this paper, we propose a bag-of-words based deep neural network for the task, which learns to map images into a common space. The DNN model is trained on large-scale clickthrough data, with relevance measured by the cosine similarity between the query's...

10.1145/2647868.2656402 article EN 2014-10-31
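
The relevance measure described, cosine similarity between a query embedding and image embeddings, is a few lines of NumPy; the vectors below are random stand-ins for the DNN outputs.

```python
import numpy as np

def cosine_rank(query_vec, image_vecs):
    """Rank images by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    X = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = X @ q
    return np.argsort(-scores), scores

# Hypothetical embeddings standing in for the learned representations.
query = np.random.rand(256)
images = np.random.rand(100, 256)
order, scores = cosine_rank(query, images)
print(order[:5])  # indices of the five most relevant images
```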

Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association...

10.1109/iccv51070.2023.01522 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to the fine-grained classification task using a deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level attention...

10.48550/arxiv.1411.6447 preprint EN other-oa arXiv (Cornell University) 2014-01-01

Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose an OpenVIS framework called InstFormer that achieves powerful open-vocabulary capability through lightweight fine-tuning on a limited-category labeled dataset. Specifically, InstFormer consists of three steps: a) Open-world Mask Proposal: we utilize a query-based transformer, which is encouraged to propose all...

10.48550/arxiv.2305.16835 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Video amodal segmentation is a particularly challenging task in computer vision, which requires deducing the full shape of an object from the visible parts of it. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting. However, motion flow has a clear limitation from two factors: moving cameras and object deformation. This paper presents a rethinking of previous works. We leverage supervised signals with an object-centric representation...

10.1109/iccv51070.2023.00123 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved in an unsupervised way by reading localized semantic information from the pre-trained CLIP model. The resulting object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it effectively...

10.1109/iccv51070.2023.01264 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, and (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework of Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal mask of objects. The key intuition...

10.48550/arxiv.2210.12733 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also...

10.48550/arxiv.2406.09196 preprint EN arXiv (Cornell University) 2024-06-13
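
A stripped-down version of the slot-attention update the abstract describes, with the learned projections, GRU, and normalizations of the full method omitted for brevity; the fixed slot count in this sketch is exactly the predefined-number drawback being criticized.

```python
import torch

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention update: slots compete for input
    features via attention normalized over slots, then are refreshed
    by a weighted mean of the inputs."""
    attn = torch.softmax(slots @ inputs.T, dim=0)      # compete over slots
    attn = attn / (attn.sum(dim=1, keepdim=True) + eps)
    return attn @ inputs                               # weighted-mean update

inputs = torch.randn(100, 64)  # 100 low-level perceptual features
slots = torch.randn(7, 64)     # a predefined slot count
for _ in range(3):             # iterative refinement
    slots = slot_attention_step(slots, inputs)
print(slots.shape)             # torch.Size([7, 64])
```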

In the rapidly expanding domain of web video content, the task of text-video retrieval has become increasingly critical, bridging the semantic gap between textual queries and video data. This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems. Unlike traditional model-centric methods that focus on designing intricate cross-modal interaction mechanisms, GQE aims to expand the queries associated with videos...

10.48550/arxiv.2408.07249 preprint EN arXiv (Cornell University) 2024-08-13
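
A toy sketch of the data-centric idea: score videos against several expanded versions of a query and fuse the similarities. The embeddings and the mean-fusion rule are assumptions for illustration, not GQE's actual pipeline.

```python
import numpy as np

def expanded_retrieval_scores(query_vecs, video_vecs):
    """Score videos against a set of expanded queries (the original
    plus paraphrases) and fuse by taking the mean similarity."""
    Q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    V = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    return (Q @ V.T).mean(axis=0)   # one fused score per video

queries = np.random.rand(4, 512)    # original query + 3 expansions
videos = np.random.rand(50, 512)
scores = expanded_retrieval_scores(queries, videos)
print(scores.argmax())              # best-matching video index
```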

10.1109/cvpr52733.2024.02176 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

10.1109/cvpr52733.2024.01618 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16