- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Natural Language Processing Techniques
- Advanced Image Fusion Techniques
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Speech and Dialogue Systems
- Hand Gesture Recognition Systems
- Cancer-Related Molecular Mechanisms Research
- Topic Modeling
- Anomaly Detection Techniques and Applications
Third World Newsreel (2022)
Baidu (China) (2022)
Chinese Academy of Sciences (2021)
Shanghai Jiao Tong University (2017-2020)
The Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored for visual self-supervised learning. Meanwhile, previous methods only consider high-level feature representations learned from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel masked self-supervised approach named MST, which can explicitly capture the context of an...
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability on many discriminative tasks. Their adaptation to image-conditioned text generation tasks has drawn increasing interest. Prior art approaches captioning either by utilizing existing large language models (e.g., GPT-2) or by pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for captioning. We introduce a lightweight visual-aware decoder...
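The abstract cuts off before the decoder is described. As a loose, assumption-laden sketch of how a decoder trained only on text embeddings could still be conditioned on an image at inference, one option is to project the CLIP image embedding onto a small memory of caption embeddings and feed the projected vector to the decoder; all names below are hypothetical, and this is not necessarily DeCap's actual mechanism.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb, text_memory, temperature=0.05):
    """Map a CLIP image embedding toward the text-embedding space (illustrative only).

    image_emb:   (D,) L2-normalized image embedding.
    text_memory: (N, D) L2-normalized embeddings of support captions seen in training.
    Returns a similarity-weighted average of the memory, re-normalized, which could be
    passed as a prefix to a decoder trained only on text embeddings.
    """
    weights = F.softmax(text_memory @ image_emb / temperature, dim=0)  # (N,)
    projected = weights @ text_memory                                   # (D,)
    return F.normalize(projected, dim=0)

# toy usage with random stand-ins for real CLIP features
mem = F.normalize(torch.randn(1000, 512), dim=1)
img = F.normalize(torch.randn(512), dim=0)
prefix = project_to_text_space(img, mem)  # hypothetical decoder prefix
```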
Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. In particular, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot method...
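The abstract is truncated, but the idea it names, matching an image against textual concept embeddings, can be illustrated with a small sketch. The snippet below assumes precomputed, L2-normalized CLIP-style image and class-name embeddings; the temperature value and function names are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def mcm_score(image_emb, concept_embs, temperature=1.0):
    """Zero-shot OOD score in the spirit of Maximum Concept Matching.

    image_emb:    (D,) L2-normalized image embedding from a vision-language model.
    concept_embs: (K, D) L2-normalized text embeddings of the K in-distribution class names.
    Returns the maximum softmax-scaled concept-matching score; low values suggest OOD.
    """
    sims = concept_embs @ image_emb              # (K,) cosine similarities
    probs = F.softmax(sims / temperature, dim=0) # softmax scaling over concepts
    return probs.max()

# toy usage with random embeddings standing in for real CLIP features
img = F.normalize(torch.randn(512), dim=0)
concepts = F.normalize(torch.randn(10, 512), dim=1)
score = mcm_score(img, concepts)  # threshold this score to flag OOD samples
```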
We present Answer-Me, a task-aware multi-task framework that unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show...
The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and the contrastive task, are nontrivial to accommodate in one architecture and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model, which is surprisingly effective for jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text...
Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both unaligned image-only and text-only corpora. We build a unified Transformer model to jointly...
Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include the text-conditional diffusion model and the cross-modal guided diffusion model, which are good at small-scene and complex-scene generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates guidance from pretrained...
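The abstract refers to combining a text-conditional diffusion model with cross-modal guidance under diverse guidance schedules. As a generic illustration only, not UPainting's actual schedule, the sketch below shows how a text-conditioning term and an extra cross-modal guidance term are typically mixed into the denoiser's noise prediction at each sampling step; all symbols and weights are placeholders.

```python
import torch

def guided_noise_prediction(eps_uncond, eps_text, grad_match, w_text=7.5, w_match=1.0):
    """Combine unconditional, text-conditional, and cross-modal guidance signals.

    eps_uncond: denoiser output without conditioning, shape (B, C, H, W).
    eps_text:   denoiser output conditioned on the text prompt, same shape.
    grad_match: gradient of an image-text matching score w.r.t. the noisy image
                (classifier-style guidance), same shape; sign/scale conventions vary.
    """
    # classifier-free guidance toward the text condition
    eps = eps_uncond + w_text * (eps_text - eps_uncond)
    # additional push along the image-text matching gradient
    return eps - w_match * grad_match

# toy usage with random tensors standing in for real model outputs
shape = (1, 4, 64, 64)
eps = guided_noise_prediction(torch.randn(shape), torch.randn(shape), torch.randn(shape))
```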
In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all of them, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show a...
Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited image-text pairs. In this work, we propose a unified-modal architecture, namely UNIMO, which handles both understanding and generation tasks. Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the information...
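Since the abstract names cross-modal contrastive learning (CMCL) as the alignment mechanism, a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss is shown below; it is a generic formulation rather than UNIMO's exact objective, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_embs, text_embs: (B, D) embeddings where row i of each tensor forms a pair.
    Pulls matched pairs together and pushes mismatched pairs apart in both directions.
    """
    image_embs = F.normalize(image_embs, dim=1)
    text_embs = F.normalize(text_embs, dim=1)
    logits = image_embs @ text_embs.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # diagonal entries are the positives
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```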
We propose the first mechanism to train object detection models from weak supervision in the form of captions at the image level. Language-based supervision for detection is appealing and inexpensive: many blogs with images and descriptive text written by human users exist. However, there is significant noise in this supervision: captions do not mention all objects that are shown and may include extraneous concepts. We describe a technique to determine which image-caption pairs provide a suitable supervision signal, and further introduce several complementary mechanisms to extract...
Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio results in two serious problems: 1) the data are not efficiently exploited, which leads to inefficient pre-training (e.g., 1600 epochs for MAE vs. 300 for supervised training), and 2) uncertainty and inconsistency of the pre-trained model, i.e., the prediction for the same patch may be inconsistent under different mask rounds. To tackle...
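For context on the random masking the abstract criticizes, a minimal sketch of uniform random patch masking, in the style used by MAE-like masked image modeling, is shown below; the 75% ratio and helper names are illustrative.

```python
import torch

def random_patch_mask(patch_tokens, mask_ratio=0.75):
    """Uniformly mask a fraction of patch tokens for masked image modeling.

    patch_tokens: (B, N, D) sequence of patch embeddings.
    Returns the visible tokens, the indices of the kept patches, and a boolean
    mask marking which patches were hidden.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)            # random score per patch
    shuffle = noise.argsort(dim=1)      # patches with the lowest scores are kept
    keep_idx = shuffle[:, :num_keep]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)   # False = visible, True = masked
    return visible, keep_idx, mask

# toy usage: 196 patches from a 14x14 grid, 75% of them masked
vis, idx, m = random_patch_mask(torch.randn(2, 196, 768))
```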
This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to a targeted region in LiDAR point clouds. Previous approaches for REC usually focus on the 2D or 3D-indoor domain, which is not suitable for accurately predicting the location of a queried region in a driving scene. In addition, the upper-bound limitation and heavy computation cost motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed...
Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised object detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly...
Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions. Analyzing attention maps offers us a perspective to find out the limitations of current VQA systems and an opportunity to further improve them. In this paper, we select two state-of-the-art VQA approaches with attention and study their robustness and disadvantages by visualizing and analyzing their estimated attention maps. We find that both methods are sensitive to features and, simultaneously, they perform badly...
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate requirements across tasks. In addition, we discover that a standard detector is surprisingly effective in unifying these tasks without the need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable...
Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a DETR-based detector -- hence the name OV-DETR -- which, once trained,...
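The abstract describes conditioning detection on either a language query or an exemplar image. One common way to realize this, sketched below under assumptions and not necessarily OV-DETR's exact design, is to embed the query with a CLIP-style encoder and add the projected embedding to every DETR object query so the decoder searches for that concept; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    """Condition DETR-style object queries on a text or exemplar-image embedding."""

    def __init__(self, num_queries=100, hidden_dim=256, clip_dim=512):
        super().__init__()
        self.queries = nn.Embedding(num_queries, hidden_dim)  # learnable base queries
        self.project = nn.Linear(clip_dim, hidden_dim)        # map CLIP space -> decoder space

    def forward(self, condition_emb):
        # condition_emb: (B, clip_dim) CLIP embedding of a class name or exemplar crop
        cond = self.project(condition_emb).unsqueeze(1)       # (B, 1, hidden_dim)
        return self.queries.weight.unsqueeze(0) + cond        # (B, num_queries, hidden_dim)

# toy usage: the same module accepts text or image embeddings since both live in CLIP space
module = ConditionalQueries()
queries = module(torch.randn(4, 512))
```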