Wei Li

ORCID: 0000-0001-8649-6120
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Video Analysis and Summarization
  • Human Pose and Action Recognition
  • Natural Language Processing Techniques
  • Advanced Image Fusion Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Image Enhancement Techniques
  • Speech and dialogue systems
  • Hand Gesture Recognition Systems
  • Cancer-related molecular mechanisms research
  • Topic Modeling
  • Anomaly Detection Techniques and Applications

Third World Newsreel
2022

Baidu (China)
2022

Chinese Academy of Sciences
2021

Shanghai Jiao Tong University
2017-2020

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an...

10.48550/arxiv.2106.05656 preprint EN cc-by arXiv (Cornell University) 2021-01-01
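
The MST entry above is built on masked self-supervised pre-training of a vision Transformer. Below is a minimal, generic sketch of masked-patch pre-training in PyTorch; the encoder, dimensions, and mask ratio are placeholders, and MST's attention-guided masking strategy is not reproduced.

```python
# Generic sketch of masked-patch self-supervised pre-training.
# `encoder` is assumed to map (B, N, embed_dim) token sequences to the same shape.
import torch
import torch.nn as nn

class MaskedPatchModel(nn.Module):
    def __init__(self, encoder, patch_dim, embed_dim, mask_ratio=0.5):
        super().__init__()
        self.encoder = encoder                       # e.g. a ViT backbone (assumed)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, patch_dim)  # reconstruct raw patch values
        self.mask_ratio = mask_ratio

    def forward(self, patch_tokens, patch_targets):
        # patch_tokens: (B, N, embed_dim), patch_targets: (B, N, patch_dim)
        B, N, _ = patch_tokens.shape
        mask = torch.rand(B, N, device=patch_tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1),
                             patch_tokens)
        recon = self.head(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked positions.
        return ((recon - patch_targets) ** 2)[mask].mean()
```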

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to image-conditioned text generation tasks has drawn increasing interest. Prior arts approach captioning by either utilizing existing large language models (e.g., GPT-2) or pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for captioning. We introduce a lightweight visual-aware decoder....

10.48550/arxiv.2303.03032 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot method...

10.48550/arxiv.2211.13445 preprint EN other-oa arXiv (Cornell University) 2022-01-01
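
The MCM entry above scores out-of-distribution inputs by matching an image against textual concept embeddings. Below is a minimal sketch of that zero-shot scoring idea, assuming the OpenAI CLIP package; the concept list, temperature, and threshold are illustrative, not the paper's exact settings.

```python
# Sketch of an MCM-style zero-shot OOD score using CLIP image-text similarities.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

id_classes = ["dog", "cat", "bird"]  # hypothetical in-distribution concept set
prompts = clip.tokenize([f"a photo of a {c}" for c in id_classes]).to(device)

def mcm_score(image_path, temperature=0.01):
    """Max softmax over temperature-scaled cosine similarities; low values suggest OOD."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)           # cosine similarities per concept
    probs = torch.softmax(sims / temperature, dim=-1)   # temperature-scaled softmax
    return probs.max().item()

# is_ood = mcm_score("example.jpg") < 0.5   # threshold chosen on held-out data
```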

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show...

10.48550/arxiv.2205.00949 preprint EN other-oa arXiv (Cornell University) 2022-01-01

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and the contrastive, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model, which is surprisingly effective for jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text...

10.48550/arxiv.2303.16839 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpus. We build a unified Transformer model to jointly...

10.18653/v1/2022.findings-acl.251 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include the text-conditional diffusion model and the cross-modal guided diffusion model, which are good at small-scene and complex-scene image generation respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify simple and complex scene image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates cross-modal guidance from a pretrained...

10.48550/arxiv.2210.16031 preprint EN other-oa arXiv (Cornell University) 2022-01-01
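
The UPainting entry above builds on text-conditional diffusion with guidance schedules. As a rough illustration of the guidance mechanism such models rely on, here is a sketch of a single classifier-free guidance step; `eps_model`, the embeddings, and the scale are placeholders, and UPainting's cross-modal guidance term is not shown.

```python
# Sketch of classifier-free guidance: blend conditional and unconditional
# noise predictions from a text-conditional diffusion model.
def guided_eps(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    eps_cond = eps_model(x_t, t, text_emb)    # noise prediction given the text prompt
    eps_uncond = eps_model(x_t, t, null_emb)  # noise prediction with a null/empty prompt
    # Push the prediction toward the text condition by the guidance scale.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```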

In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one architecture and achieve satisfactory performance. We show that a...

10.48550/arxiv.2401.10229 preprint EN other-oa arXiv (Cornell University) 2024-01-01

Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpus and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information...

10.48550/arxiv.2012.15409 preprint EN other-oa arXiv (Cornell University) 2020-01-01
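
The UNIMO entry above leverages cross-modal contrastive learning (CMCL) to align textual and visual information. Below is a minimal InfoNCE-style sketch of such an image-text contrastive objective; the feature extractors are assumed, and UNIMO's text-rewriting positives and negatives are not reproduced.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss.
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """img_feats, txt_feats: (B, D) features of paired images and texts."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Match each image to its own text and each text to its own image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```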

We propose the first mechanism to train object detection models from weak supervision in the form of captions at the image level. Language-based supervision for detection is appealing and inexpensive: many blogs with images and descriptive text written by human users exist. However, there is significant noise in this supervision: captions do not mention all objects that are shown, and may mention extraneous concepts. We propose a technique to determine which image-caption pairs provide a suitable signal for supervision. We further propose several complementary mechanisms to extract...

10.1109/tpami.2022.3187350 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-01-01
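
The caption-supervised detection work above must first turn free-form captions into image-level labels. Below is a toy sketch of the simplest such extraction, matching class synonyms against caption words; the synonym lists are illustrative, and the paper's learned text classifier is not reproduced.

```python
# Toy caption-to-label extraction by exact synonym matching.
import re

CLASS_SYNONYMS = {
    "person": {"person", "man", "woman", "people"},
    "bicycle": {"bicycle", "bike"},
    "dog": {"dog", "puppy"},
}

def labels_from_caption(caption):
    """Return the set of class labels whose synonyms appear in the caption."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return {cls for cls, syns in CLASS_SYNONYMS.items() if tokens & syns}

# labels_from_caption("A man rides his bike past a puppy.")
# -> {"person", "bicycle", "dog"}
```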

Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio would result in two serious problems: 1) the data are not efficiently exploited, which brings inefficient pre-training (e.g., 1600 epochs for MAE vs. 300 for supervised), and 2) the high uncertainty and inconsistency of the pre-trained model, i.e., the prediction of the same patch may be inconsistent under different rounds. To tackle...

10.48550/arxiv.2302.14431 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to the targeted region in LiDAR point clouds. Previous approaches for REC usually focus on the 2D or 3D-indoor domain, which is not suitable for accurately predicting the location of the queried region in an autonomous driving scene. In addition, the upper-bound limitation and the heavy computation cost motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed...

10.48550/arxiv.2305.15765 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised object detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly...

10.48550/arxiv.1907.10164 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.

10.48550/arxiv.2308.03340 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions. Analyzing attention maps offers us a perspective to find out the limitations of current VQA systems and an opportunity to further improve them. In this paper, we select two state-of-the-art VQA approaches with attention mechanisms and study their robustness and disadvantages by visualizing and analyzing their estimated attention maps. We find that both methods are sensitive to features, and simultaneously, they perform badly...

10.48550/arxiv.1810.03821 preprint EN other-oa arXiv (Cornell University) 2018-01-01
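
The VQA study above analyzes estimated attention maps by visualizing them over images. Below is a small sketch of one way to overlay such a map, assuming a hypothetical (H, W) weight array `attn` produced by a model for a given question.

```python
# Sketch of overlaying a model's attention map on the input image.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path, attn):
    """attn: (H, W) array of attention weights from a VQA model (assumed)."""
    image = np.asarray(Image.open(image_path))
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # normalize to [0, 1]
    heat = Image.fromarray((attn * 255).astype(np.uint8))
    heat = np.array(heat.resize((image.shape[1], image.shape[0])))  # match image size
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.5)  # semi-transparent heatmap overlay
    plt.axis("off")
    plt.show()
```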

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpus. We build a unified Transformer model to jointly...

10.48550/arxiv.2203.09067 preprint EN other-oa arXiv (Cornell University) 2022-01-01

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate requirements across tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without the need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable...

10.48550/arxiv.2203.17273 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained,...

10.48550/arxiv.2203.11876 preprint EN cc-by-nc-nd arXiv (Cornell University) 2022-01-01
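
The OV-DETR entry above conditions detection on either a class name or an exemplar image. As a rough sketch of that idea, the module below adds a CLIP-style conditioning embedding to learned DETR-style object queries; the dimensions and projection are placeholders rather than the paper's exact design.

```python
# Sketch of conditioning DETR-style object queries on a CLIP embedding
# (from a class-name prompt or an exemplar image).
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, num_queries=100, hidden_dim=256, clip_dim=512):
        super().__init__()
        self.queries = nn.Embedding(num_queries, hidden_dim)  # learned object queries
        self.project = nn.Linear(clip_dim, hidden_dim)        # CLIP embedding -> query space

    def forward(self, clip_embedding):
        # clip_embedding: (B, clip_dim) text or image embedding used as the condition.
        cond = self.project(clip_embedding).unsqueeze(1)      # (B, 1, hidden_dim)
        return self.queries.weight.unsqueeze(0) + cond        # (B, num_queries, hidden_dim)
```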