Zhiwu Lu

ORCID: 0000-0003-0280-7724
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Image Retrieval and Classification Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Remote-Sensing Image Classification
  • Human Pose and Action Recognition
  • Face and Expression Recognition
  • Video Analysis and Summarization
  • Bayesian Methods and Mixture Models
  • Sparse and Compressive Sensing Techniques
  • Text and Document Classification Technologies
  • Advanced Clustering Algorithms Research
  • Gaussian Processes and Bayesian Inference
  • Cancer-related molecular mechanisms research
  • Medical Image Segmentation Techniques
  • Gait Recognition and Analysis
  • Topic Modeling
  • Gene expression and cancer classification
  • Neural Networks and Applications
  • Target Tracking and Data Fusion in Sensor Networks
  • Generative Adversarial Networks and Image Synthesis
  • Image Processing Techniques and Applications
  • Music and Audio Processing
  • Machine Learning and ELM
  • Advanced Image Processing Techniques

Renmin University of China
2014-2024

Beijing Institute of Big Data Research
2021-2023

Peking University
2005-2014

Ministry of Education of the People's Republic of China
2014

King University
2013

City University of Hong Kong
2009-2011

VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language. Recent debiasing methods propose to exclude the language prior during inference. However, they fail to disentangle the "good" language context from the "bad" language bias as a whole. In this paper, we investigate how to mitigate language bias in VQA. Motivated by causal effects, we propose a novel counterfactual inference framework, which enables us to capture language bias as the direct causal effect of questions on answers and reduce it by subtracting the direct language effect from the total effect. Experiments...

10.1109/cvpr46437.2021.01251 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
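The debiasing rule described in the abstract above can be illustrated with a minimal sketch: subtract a scaled language-only (direct) effect from the fused (total) effect before picking an answer. The function name, the toy logits, and the scaling factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def debiased_answer(fused_logits, question_only_logits, alpha=1.0):
    """Counterfactual-style debiasing sketch: subtract the direct
    (language-only) effect from the total effect before choosing an
    answer. `alpha` is a hypothetical scaling factor controlling how
    strongly the language prior is removed."""
    total_effect = np.asarray(fused_logits, dtype=float)
    direct_language_effect = np.asarray(question_only_logits, dtype=float)
    debiased = total_effect - alpha * direct_language_effect
    return int(np.argmax(debiased))

# Toy example: both the fused model and the question-only model favor
# answer 0 (a language prior); removing the language effect flips the
# decision to the more visually grounded answer 1.
fused = [2.0, 1.5, 0.1]
q_only = [1.8, 0.2, 0.1]
print(debiased_answer(fused, q_only))  # -> 1
```

With `alpha=0.0` no debiasing is applied and the biased answer 0 is returned, which shows the subtraction is doing the work.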

Abstract The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only single-cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream tasks. To achieve this goal, we propose to pre-train our model by self-supervised learning with weak semantic...

10.1038/s41467-022-30761-2 article EN cc-by Nature Communications 2022-06-02

A weakly supervised semantic segmentation (WSSS) method aims to learn a model from weak (image-level) labels as opposed to strong (pixel-level) labels. By avoiding the tedious pixel-level annotation process, it can exploit the unlimited supply of user-tagged images from media-sharing sites such as Flickr for large-scale applications. However, these `free' tags/labels are often noisy, and few existing works address the problem of learning with both weak and noisy labels. In this work, we cast WSSS into a label noise reduction problem....

10.1109/tpami.2016.2552172 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2016-04-09

Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a set of concepts ranging from object and scene to abstract concepts, and 2) how to annotate an image with the optimal number of labels. To address the first issue, we propose a novel multi-scale deep model for extracting discriminative features capable of representing a wide range of concepts. Specifically, a two-branch neural...

10.1109/tip.2018.2881928 article EN IEEE Transactions on Image Processing 2018-11-16

Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pretraining model...

10.1109/cvpr52688.2022.01524 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
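The instance-level alignment that two-stream models rely on, as mentioned in the abstract above, reduces retrieval to a similarity lookup between independently encoded embeddings. A minimal sketch (cosine similarity over L2-normalized vectors; not the COTS model itself):

```python
import numpy as np

def instance_alignment_scores(image_embs, text_embs):
    """Instance-level alignment sketch for a two-stream model: each
    stream encodes its modality independently, and cross-modal
    retrieval reduces to cosine similarity between L2-normalized
    embeddings."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return img @ txt.T  # (num_images, num_texts) similarity matrix

rng = np.random.default_rng(0)
sims = instance_alignment_scores(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
print(sims.shape)  # -> (3, 5)
```

Because the two encoders never attend to each other, all image and text embeddings can be precomputed offline, which is the source of the inference-efficiency advantage noted in the abstract.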

Due to the rapid technological development of various sensors, a huge volume of high spatial resolution (HSR) image data can now be acquired. How to efficiently recognize the scenes from such HSR data has become a critical task. Conventional approaches to remote sensing scene classification only utilize information from HSR images. Therefore, they always need a large amount of labeled data and cannot recognize images from an unseen class without any visual sample in the labeled data. To overcome this drawback, we propose a novel approach for recognizing...

10.1109/tgrs.2017.2689071 article EN IEEE Transactions on Geoscience and Remote Sensing 2017-04-17

Abstract Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding their functions. Despite the advances in recent decades on sequence alignment, threading, and alignment-free methods, homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the structure space into a single...

10.1093/bioinformatics/btw271 article EN cc-by-nc Bioinformatics 2016-06-11

Zero-shot learning (ZSL) is made possible by a projection function between a feature space and a semantic space (e.g., an attribute space). Key to ZSL is thus to learn a projection that is robust against the often large domain gap between the seen and unseen class domains. In this work, this is achieved by data synthesis and learning. Specifically, a novel data synthesis strategy is proposed, whereby semantic class prototypes (e.g., attribute vectors) are used to simply perturb seen-class data for generating unseen-class ones. As in any data synthesis/hallucination approach, there are ambiguities and uncertainties on how well the synthesised data can capture...

10.1109/tpami.2020.2965534 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2020-01-10
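The prototype-based synthesis idea in the abstract above can be sketched in a toy form: shift seen-class features by the offset between seen and unseen class prototypes to hallucinate pseudo samples for the unseen class. The additive-offset rule here is a simplified assumption for illustration, not the paper's exact perturbation strategy.

```python
import numpy as np

def synthesize_unseen(seen_features, seen_proto, unseen_proto):
    """Prototype-perturbation sketch: generate pseudo unseen-class
    samples by shifting seen-class features along the offset between
    the seen and unseen class prototypes (attribute vectors)."""
    offset = unseen_proto - seen_proto
    return seen_features + offset

# Two seen-class samples in a 2-D feature space
seen = np.array([[1.0, 2.0], [1.5, 2.5]])
synth = synthesize_unseen(seen, np.array([1.0, 2.0]), np.array([3.0, 0.0]))
print(synth)  # pseudo unseen-class samples: [[3. 0.] [3.5 0.5]]
```

The synthesised samples can then be used to train the projection function on both domains, which is what makes it robust to the seen/unseen domain gap.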

Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities, including both intuitive diagrams and professional texts, to assist their understanding. Inspired by this, we propose a multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data (crawled...

10.48550/arxiv.2209.05481 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable...

10.48550/arxiv.2302.06605 preprint EN other-oa arXiv (Cornell University) 2023-01-01
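The adapter idea behind parameter-efficient tuning, as described in the abstract above, can be sketched with a minimal bottleneck module: down-project, nonlinearity, up-project, residual connection, with only the small matrices tuned while the frozen backbone stays fixed. Dimensions, initialization, and class name are illustrative assumptions, not UniAdapter's actual design.

```python
import numpy as np

class BottleneckAdapter:
    """Minimal bottleneck-adapter sketch: hidden -> down-projection ->
    ReLU -> up-projection -> residual add. Only these two small
    matrices would be trained; the backbone they are inserted into
    stays frozen."""
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))  # zero-init: adapter starts as identity

    def __call__(self, hidden):
        h = np.maximum(hidden @ self.down, 0.0)  # ReLU bottleneck
        return hidden + h @ self.up              # residual connection

x = np.ones((2, 16))
adapter = BottleneckAdapter(dim=16, bottleneck=4)
out = adapter(x)
print(out.shape)  # -> (2, 16)
```

With `dim=16, bottleneck=4` the adapter adds 2 * 16 * 4 = 128 parameters per layer, versus 256 for a single full 16 x 16 weight matrix; the gap widens rapidly at realistic hidden sizes, which is where the storage savings come from.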

Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many latest works adopt a single-tower architecture with heavy detectors, which is inefficient during the inference stage because...

10.1007/s11633-022-1386-4 article EN Machine Intelligence Research 2023-05-02

Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies for addressing the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose...

10.1609/aaai.v39i7.32768 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

10.1016/j.patcog.2014.08.019 article EN Pattern Recognition 2014-08-29

This paper presents a new class of 2D string kernels, called spatial mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the image categorization problem. We first represent images as 2D sequences of visual keywords obtained by clustering all the blocks that we divide images into on a regular grid. Through decomposing each 2D sequence into two parallel 1D sequences (i.e. row-wise and column-wise ones), our kernels can then measure the similarity between images based on shared occurrences of k-length subsequences,...

10.1109/cvpr.2009.5206861 article EN 2009 IEEE Conference on Computer Vision and Pattern Recognition 2009-06-01
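The kernel computation over 1D keyword sequences described above can be sketched in its simplest exact-match form (a spectrum-style kernel over contiguous k-mers; the paper's mismatch tolerance between subsequences is omitted here):

```python
from collections import Counter

def kmer_kernel(seq_a, seq_b, k=2):
    """String-kernel sketch: similarity of two 1D visual-keyword
    sequences as the number of shared occurrences of contiguous
    k-length subsequences (counted as a product of per-sequence
    occurrence counts, i.e. an inner product of k-mer histograms)."""
    def kmers(seq):
        return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    ca, cb = kmers(seq_a), kmers(seq_b)
    return sum(ca[m] * cb[m] for m in ca)

# Row-wise 1D sequences of visual keywords (integer cluster ids)
print(kmer_kernel([1, 2, 3, 1, 2], [2, 3, 1, 2], k=2))  # -> 4
```

In the 2D setting, one such score would be computed for the row-wise sequences and another for the column-wise ones, and the results combined into a single image-to-image similarity usable as an SVM kernel.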

This paper presents a multi-modal constraint propagation approach to exploiting pairwise constraints for constrained clustering tasks on multi-modal datasets. Pairwise constraint propagation methods have previously been designed primarily for single-modality data and cannot be directly applied to a dataset with multiple representations. In this paper, we provide an effective solution to the problem by decomposing it into a set of independent multi-graph based two-class label propagation subproblems, which are then merged into a unified framework and solved by quadratic...

10.1145/2072298.2072318 article EN Proceedings of the 19th ACM International Conference on Multimedia 2011-11-28

This paper presents a novel semi-supervised learning method which can make use of intra-image semantic context and inter-image cluster consistency for image categorization with less labeled data. The image representation is first formed with the visual keywords generated by clustering all the blocks that we divide images into. A 2D spatial Markov chain model is then proposed to capture the semantic context across these visual keywords within an image. To develop a graph-based semi-supervised approach to image categorization, we incorporate the semantic context into a kind of kernel that can be used as affinity...

10.1109/cvpr.2009.5206851 article EN 2009 IEEE Conference on Computer Vision and Pattern Recognition 2009-06-01

This paper presents contextual kernel and spectral methods for learning the semantics of images that allow us to automatically annotate an image with keywords. First, to exploit the context of visual words within images for automatic annotation, we define a novel spatial string kernel to quantify the similarity between images. Specifically, we represent each image as a 2-D sequence of visual words and measure the similarity between two such sequences using the shared occurrences of s-length 1-D...

10.1109/tip.2010.2103082 article EN IEEE Transactions on Image Processing 2011-01-03

This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporally continuous in nature, and a representation learned by detecting continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in the experiments, this task turns out to be intractable. We propose Constrained Spatiotemporal Jigsaw (CSJ), whereby 3D...

10.24963/ijcai.2021/104 article EN 2021-08-01

10.1016/j.patrec.2009.09.003 article EN Pattern Recognition Letters 2009-09-09