- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Video Surveillance and Tracking Methods
- Brain Tumor Detection and Classification
- Human Pose and Action Recognition
- Advanced Graph Neural Networks
- Visual Attention and Saliency Detection
- Anomaly Detection Techniques and Applications
- Computer Graphics and Visualization Techniques
- Advanced Vision and Imaging
- Data Management and Algorithms
- Video Analysis and Summarization
- Parallel Computing and Optimization Techniques
- Natural Language Processing Techniques
- Topic Modeling
- Machine Learning and Data Classification
- Generative Adversarial Networks and Image Synthesis
- Advanced Database Systems and Queries
- IoT and Edge/Fog Computing
- Sparse and Compressive Sensing Techniques
- Health, Environment, Cognitive Aging
- Robotics and Automated Systems
- Traffic Prediction and Management Techniques
Amazon (United States)
2021-2024
Peking University
2014-2021
Nankai University
2011
MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how...
Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding the foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to the fine-grained classification task using a deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level...
Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the computational patterns of GNNs into a few generalized sparse tensor operations suitable for extensive parallelization. By advocating graph as the central programming abstraction, DGL can perform optimizations transparently. By cautiously adopting a framework-neutral design, DGL allows users to easily...
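As a rough illustration of what "a generalized sparse operation" can mean for GNN message passing, the sketch below sums neighbor features along an edge list, which is equivalent to multiplying a sparse adjacency matrix by a dense feature matrix. The function name `spmm_sum_aggregate` and the toy graph are assumptions made for this example; this is plain Python, not DGL's API or implementation.

```python
# Hypothetical helper, not part of DGL: sum-aggregation message passing
# written as a sparse-matrix times dense-matrix product (SpMM).

def spmm_sum_aggregate(num_nodes, edges, features):
    """For each destination node, sum the feature vectors of its source nodes.

    `edges` is a list of (src, dst) pairs, i.e. the nonzero entries of a
    sparse adjacency matrix A with A[dst][src] = 1. The result equals
    A @ features.
    """
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for src, dst in edges:
        for k in range(dim):
            out[dst][k] += features[src][k]
    return out


# Toy directed graph (illustrative data): 0 -> 1, 0 -> 2, 1 -> 2
edges = [(0, 1), (0, 2), (1, 2)]
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
aggregated = spmm_sum_aggregate(3, edges, features)
print(aggregated)
```

Because the loop touches only existing edges, the cost scales with the number of edges rather than with the square of the number of nodes, which is the usual motivation for phrasing aggregation as a sparse operation.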
Supervised learning using deep convolutional neural networks has shown its promise in large-scale image classification tasks. As a building block, it is now well positioned to be part of a larger system that tackles real-life multimedia tasks. An unresolved issue is that such a model is trained on a static snapshot of data. Instead, this paper positions the training as a continuous learning process as new classes of data arrive. A system with such capability is useful in practical scenarios, as it gradually expands its capacity to predict an increasing number...
Even though convolutional neural networks (CNNs) have achieved near-human performance in various computer vision tasks, their ability to tolerate scale variations is limited. The popular practice is to make the model bigger first, and then train it with data augmentation using extensive scale-jittering. In this paper, we propose a scale-invariant convolutional neural network (SiCNN), a model designed to incorporate multi-scale feature extraction and classification into the network structure. SiCNN uses a multi-column architecture, with each column...
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability...
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with the open-world mask proposal network, encouraged to propose all potential instance class-agnostic masks...
We propose a hierarchical graph neural network (GNN) model that learns how to cluster a set of images into an unknown number of identities using a training set of images annotated with labels belonging to a disjoint set of identities. Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set. The resulting method, Hi-LANDER, achieves an average of 49%...
Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner...
Layout-to-image generation refers to the task of synthesizing photo-realistic images based on semantic layouts. In this paper, we propose LayoutDiffuse, which adapts a foundational diffusion model pretrained on large-scale image or text-image datasets for layout-to-image generation. By adopting a novel neural adaptor based on layout attention and task-aware prompts, our method trains efficiently, generates images with both high perceptual quality and layout alignment, and needs less data. Experiments on three datasets show that our method significantly...
Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However,...
This work targets the image retrieval task held by the MSR-Bing Grand Challenge. Image retrieval is considered a challenging task because of the gap between low-level image representation and high-level textual query representation. Recently developed deep neural networks shed light on narrowing the gap by learning high-level image representation from raw pixels. In this paper, we propose a bag-of-words based deep neural network for the image retrieval task, which learns high-level image representation and maps images into bag-of-words space. The DNN model is trained on large-scale clickthrough data, and the relevance between query and image is measured by the cosine similarity of the query's...
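The relevance measure mentioned above can be sketched in a few lines: a query is turned into a bag-of-words vector and compared by cosine similarity against vectors that a model would predict for each image. The vocabulary, the helper names `bag_of_words` and `cosine_similarity`, and the image vectors are all made up for this example; the paper's actual model and data are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the whitespace-tokenized text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]


# Illustrative vocabulary and stand-in image vectors (assumptions, not real model output).
vocabulary = ["red", "car", "dog", "beach"]
query_vec = bag_of_words("red car", vocabulary)
image_vectors = {
    "img_a": [0.9, 0.8, 0.0, 0.1],
    "img_b": [0.0, 0.1, 0.9, 0.7],
}

# Rank images by similarity to the query, most relevant first.
ranked = sorted(
    image_vectors,
    key=lambda name: cosine_similarity(query_vec, image_vectors[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length, so an image whose predicted word weights are uniformly small can still rank highly if they point in the same direction as the query.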
Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association...
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose an OpenVIS framework called InstFormer that achieves powerful open-vocabulary capability through lightweight fine-tuning on a limited-category labeled dataset. Specifically, InstFormer comes in three steps: a) Open-world Mask Proposal: we utilize a query-based transformer, which is encouraged to propose all...
Video amodal segmentation is a particularly challenging task in computer vision, which requires deducing the full shape of an object from the visible parts of it. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting. However, motion flow has a clear limitation caused by two factors: moving cameras and object deformation. This paper presents a rethinking of previous works. We leverage the supervised signals with object-centric representation...
In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively...
Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework of Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal mask of objects. The key intuition...
Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also...
In the rapidly expanding domain of web video content, the task of text-video retrieval has become increasingly critical, bridging the semantic gap between textual queries and video data. This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems. Unlike traditional model-centric methods that focus on designing intricate cross-modal interaction mechanisms, GQE aims to expand the text queries associated with videos...