- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Remote-Sensing Image Classification
- Human Pose and Action Recognition
- Face and Expression Recognition
- Video Analysis and Summarization
- Bayesian Methods and Mixture Models
- Sparse and Compressive Sensing Techniques
- Text and Document Classification Technologies
- Advanced Clustering Algorithms Research
- Gaussian Processes and Bayesian Inference
- Cancer-Related Molecular Mechanisms Research
- Medical Image Segmentation Techniques
- Gait Recognition and Analysis
- Topic Modeling
- Gene Expression and Cancer Classification
- Neural Networks and Applications
- Target Tracking and Data Fusion in Sensor Networks
- Generative Adversarial Networks and Image Synthesis
- Image Processing Techniques and Applications
- Music and Audio Processing
- Machine Learning and Extreme Learning Machines (ELM)
- Advanced Image Processing Techniques
Renmin University of China
2014-2024
Beijing Institute of Big Data Research
2021-2023
Peking University
2005-2014
Ministry of Education of the People's Republic of China
2014
King University
2013
City University of Hong Kong
2009-2011
VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language. Recent debiasing methods have proposed to exclude the language prior during inference. However, they fail to disentangle the "good" language context from the "bad" language bias as a whole. In this paper, we investigate how to mitigate language bias in VQA. Motivated by causal effects, we propose a novel counterfactual inference framework, which enables us to capture language bias as the direct causal effect of questions on answers and to reduce the bias by subtracting the direct effect from the total effect. Experiments...
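To make the subtraction concrete, here is a minimal sketch of counterfactual-style debiasing at inference time, assuming the model exposes both full VQA logits and a question-only branch; the function and its `alpha` scaling factor are illustrative, not the paper's exact formulation:

```python
import torch

def debiased_inference(vqa_logits: torch.Tensor,
                       question_only_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Counterfactual-style debiasing sketch: treat the question-only branch
    as the direct effect of the question on the answer, and subtract it from
    the total effect captured by the full (vision + question) model.
    `alpha` is a hypothetical knob for how much of the prior to remove."""
    total_effect = vqa_logits                      # vision + question
    direct_language_effect = question_only_logits  # question alone
    return total_effect - alpha * direct_language_effect
```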
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only single-cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream cognitive tasks. To achieve this goal, we propose to pre-train our model by self-supervised learning with weak semantic...
A weakly supervised semantic segmentation (WSSS) method aims to learn a segmentation model from weak (image-level) rather than strong (pixel-level) labels. By avoiding the tedious pixel-level annotation process, it can exploit the unlimited supply of user-tagged images from media-sharing sites such as Flickr for large-scale applications. However, these 'free' tags/labels are often noisy, and few existing works address the problem of learning with both weak and noisy labels. In this work, we cast WSSS as a label noise reduction problem...
Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a set of concepts ranging from objects and scenes to abstract concepts, and 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting discriminative features capable of representing a wide range of visual concepts. Specifically, a two-branch deep neural...
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pretraining model...
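For reference, the instance-level alignment used by two-stream models such as CLIP and ALIGN is typically a symmetric InfoNCE loss over in-batch pairs. A minimal sketch, assuming precomputed image and text embeddings of matching batch size:

```python
import torch
import torch.nn.functional as F

def instance_level_alignment_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: each image is pulled toward its paired
    text and pushed away from all other texts in the batch, and vice versa.
    This is the instance-level alignment that two-stream models rely on."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```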
Due to the rapid technological development of various sensors, a huge volume of high spatial resolution (HSR) image data can now be acquired. How to efficiently recognize the scenes from such HSR data has become a critical task. Conventional approaches to remote-sensing scene classification only utilize the visual information of the images. Therefore, they always need a large amount of labeled data and cannot recognize images from an unseen class without any visual sample in the labeled data. To overcome this drawback, we propose a novel approach for recognizing...
Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances of recent decades in sequence alignment, threading, and alignment-free methods, homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space have demonstrated the importance of incorporating network information of the structure space. Yet, current methods merge such information into a single...
Zero-shot learning (ZSL) is made possible by learning a projection function between a feature space and a semantic space (e.g., an attribute space). Key to ZSL is thus to learn a projection that is robust against the often large domain gap between the seen and unseen class domains. In this work, this is achieved by data synthesis and robust projection function learning. Specifically, a novel data synthesis strategy is proposed, in which semantic class prototypes (e.g., attribute vectors) are used to simply perturb seen class data for generating unseen class ones. As in any data synthesis/hallucination approach, there are ambiguities and uncertainties about how well the synthesised data can capture...
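One plausible reading of "perturbing seen class data with class prototypes" is sketched below, assuming prototypes and features live in a common space after projection; the helper and its noise term are hypothetical, not the paper's exact synthesis rule:

```python
import numpy as np

def synthesise_unseen_features(seen_feats: np.ndarray,
                               seen_proto: np.ndarray,
                               unseen_proto: np.ndarray,
                               noise_std: float = 0.1) -> np.ndarray:
    """Illustrative prototype-based synthesis for ZSL: shift each seen-class
    feature by the offset between an unseen-class prototype and the seen-class
    prototype (plus small noise), so synthesised points inherit the seen
    class's intra-class variation while centring on the unseen class."""
    rng = np.random.default_rng(0)
    offset = unseen_proto - seen_proto          # move the seen cloud onto the unseen prototype
    noise = rng.normal(0.0, noise_std, seen_feats.shape)
    return seen_feats + offset + noise
```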
Although artificial intelligence (AI) has made significant progress in understanding molecules across a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities, including both intuitive diagrams and professional texts, to assist their understanding. Inspired by this, we propose a multimodal foundation model which is pretrained on molecular graphs and semantically related textual data (crawled...
Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable...
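A generic bottleneck adapter of the kind such methods insert into a frozen backbone looks roughly as follows; UniAdapter's actual placement, weight sharing across modalities, and parameter budget differ, so treat this as a sketch:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    residual connection. Only these small layers are tuned while the
    backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```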
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: existing methods often assume a strong semantic correlation between each text-image pair, and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: many of the latest works adopt a single-tower architecture with heavy detectors, which is inefficient during the inference stage because...
Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies for addressing the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose...
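As a point of reference, a common late-fusion baseline for composed retrieval simply adds the two query embeddings and ranks candidates by cosine similarity. The sketch below is that baseline, not the method proposed here:

```python
import torch
import torch.nn.functional as F

def composed_query_retrieval(ref_img_emb: torch.Tensor,
                             caption_emb: torch.Tensor,
                             candidate_embs: torch.Tensor,
                             topk: int = 5) -> torch.Tensor:
    """Late-fusion CIR baseline: combine the reference-image and
    relative-caption embeddings into one query vector, then rank all
    candidates by cosine similarity. Learned fusion modules replace
    the simple addition in real CIR models."""
    query = F.normalize(ref_img_emb + caption_emb, dim=-1)   # fused query
    cands = F.normalize(candidate_embs, dim=-1)              # (N, d)
    scores = cands @ query                                   # (N,) cosine scores
    return scores.topk(topk).indices
```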
This paper presents a new class of 2D string kernels, called spatial mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the image categorization problem. We first represent images as 2D sequences of the visual keywords obtained by clustering all the blocks that we divide images into on a regular grid. By decomposing each 2D sequence into two parallel 1D sequences (i.e., row-wise and column-wise ones), our kernels can then measure the similarity between two images based on the shared occurrences of k-length subsequences,...
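A simplified sketch of the idea, counting exact k-length subsequence matches over the row-wise and column-wise visual-word sequences (the paper's mismatch kernels additionally tolerate up to m mismatches, which is omitted here):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count contiguous k-length subsequences of a 1D visual-word sequence."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

def string_kernel_1d(seq_a, seq_b, k=2):
    """Exact-match simplification of a string kernel: the inner product
    of the two sequences' k-mer count vectors."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def spatial_kernel_2d(grid_a, grid_b, k=2):
    """2D extension: decompose each visual-word grid into row-wise and
    column-wise 1D sequences and accumulate the 1D kernel values."""
    rows = sum(string_kernel_1d(ra, rb, k) for ra, rb in zip(grid_a, grid_b))
    cols = sum(string_kernel_1d(ca, cb, k)
               for ca, cb in zip(zip(*grid_a), zip(*grid_b)))
    return rows + cols
```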
This paper presents a multi-modal constraint propagation approach to exploiting pairwise constraints for constrained clustering tasks on multi-modal datasets. Pairwise constraint propagation methods have previously been designed primarily for single-modality data and cannot be directly applied to a dataset with multiple modal representations. In this paper, we provide an effective solution to the problem by decomposing it into a set of independent multi-graph based two-class label propagation subproblems, which are then merged into a unified framework and solved by quadratic...
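Each two-class subproblem reduces to standard graph-based label propagation. A minimal sketch in the style of Zhou et al.'s method, with the multi-modal decomposition itself omitted; `W` is an affinity matrix and `y` holds +1/-1 for constrained nodes and 0 elsewhere:

```python
import numpy as np

def propagate_labels(W: np.ndarray, y: np.ndarray,
                     alpha: float = 0.9, iters: int = 50) -> np.ndarray:
    """Two-class label propagation: iterate F <- alpha * S @ F + (1-alpha) * y,
    where S = D^{-1/2} W D^{-1/2} is the symmetrically normalised affinity.
    Assumes every node has positive degree (no isolated nodes)."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # symmetric normalisation
    F = y.astype(float)
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * y
    return F                          # sign(F) gives the propagated constraint labels
```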
This paper presents a novel semi-supervised learning method which can make use of intra-image semantic context and inter-image cluster consistency for image categorization with less labeled data. The image representation is first formed from the visual keywords generated by clustering all the blocks that we divide images into. A 2D spatial Markov chain model is then proposed to capture the semantic context across these visual keywords within an image. To develop a graph-based approach to image categorization, we incorporate the semantic context into a kind of spatial kernel that can be used as the affinity...
This paper presents contextual kernel and spectral methods for learning the semantics of images, allowing us to automatically annotate an image with keywords. First, to exploit the context of visual words within images for automatic annotation, we define a novel spatial string kernel to quantify the similarity between images. Specifically, we represent each image as a 2-D sequence of visual words and measure the similarity between two such sequences using the shared occurrences of s-length 1-D...
This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporally continuous in nature, and a representation learned by detecting continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in experiments, this task turns out to be intractable. We propose Constrained Spatiotemporal Jigsaw (CSJ), whereby the 3D...
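The unconstrained 3D jigsaw construction that the paper starts from can be sketched as follows; the constraints that make CSJ tractable are not described in this excerpt and are omitted:

```python
import torch

def make_3d_jigsaw(clip: torch.Tensor, grid=(2, 2, 2), generator=None):
    """Build a 3D jigsaw puzzle from a video clip of shape (C, T, H, W):
    cut it into a grid of spatiotemporal cells, shuffle the cells, and
    return the shuffled clip plus the permutation as the pretext label.
    Assumes each of T, H, W is divisible by the matching grid size."""
    C, T, H, W = clip.shape
    t, h, w = grid
    cells = (clip
             .reshape(C, t, T // t, h, H // h, w, W // w)
             .permute(1, 3, 5, 0, 2, 4, 6)        # (t, h, w, C, T//t, H//h, W//w)
             .reshape(t * h * w, C, T // t, H // h, W // w))
    perm = torch.randperm(t * h * w, generator=generator)
    shuffled = cells[perm]
    # reassemble the shuffled cells back into a (C, T, H, W) clip
    out = (shuffled
           .reshape(t, h, w, C, T // t, H // h, W // w)
           .permute(3, 0, 4, 1, 5, 2, 6)
           .reshape(C, T, H, W))
    return out, perm
```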