- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Natural Language Processing Techniques
- Topic Modeling
- Big Data Technologies and Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Software-Defined Networks and 5G
- Digital Storytelling and Education
- Speech and Dialogue Systems
- Video Surveillance and Tracking Methods
- Music and Audio Processing
- Human Motion and Animation
- Cloud Computing and Resource Management
- Visual Attention and Saliency Detection
- Industrial Vision Systems and Defect Detection
- Interconnection Networks and Systems
- Network Security and Intrusion Detection
- Software Engineering Research
- IoT-based Smart Home Systems
- Advanced Vision and Imaging
- Data Stream Mining Techniques
- Online Learning and Analytics
Salesforce (United States)
2021-2023
China Southern Power Grid (China)
2023
National University of Defense Technology
2017-2020
Fudan University
2020
National University of Singapore
2016-2019
Fujian University of Technology
2018
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more...
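A minimal sketch of the image-text contrastive objective that such "align before fuse" approaches use to pull paired image and text embeddings together before cross-modal fusion. The tensor shapes, temperature value, and function name are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a symmetric image-text contrastive (ITC) loss.
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (batch, dim) unimodal embeddings."""
    # Normalize so the similarity is a cosine score.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity between every image and every text in the batch.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match images to texts and texts to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```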
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves...
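The following is a conceptual sketch, under assumed layer sizes, of how a small set of learnable queries can distill features from a frozen image encoder into soft prompts for a frozen LLM, which is the bridging role the abstract attributes to the Querying Transformer. The class name `QueryBridge` and all dimensions are hypothetical.

```python
# Conceptual sketch of a lightweight query-based bridge between a frozen
# image encoder and a frozen LLM; not the Q-Former implementation itself.
import torch
import torch.nn as nn

class QueryBridge(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=4096, num_heads=12):
        super().__init__()
        # The learnable queries (and the layers below) are the only new
        # parameters in this sketch; the image encoder and LLM stay frozen.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, llm_dim)

    def forward(self, frozen_image_feats):
        """frozen_image_feats: (batch, num_patches, dim) from a frozen encoder."""
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        # Queries attend to image features and distill them into a fixed-size set.
        attended, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        # Project into the LLM embedding space to act as soft visual prompts.
        return self.proj_to_llm(attended)
```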
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively...
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide...
The recent advances in instance-level detection tasks lay a strong foundation for automated visual scene understanding. However, the ability to fully comprehend a social scene still eludes us. In this work, we focus on detecting human-object interactions (HOIs) in images, an essential step towards deeper scene understanding. HOI detection aims to localize humans and objects, as well as to identify the complex interactions between them. As is innate to practical problems with a large label space, the HOI categories exhibit a long-tail distribution, i.e., there exist some...
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between the unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In...
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effective utilization of LLMs for visual question-answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and the VQA task. End-to-end training on multimodal data may bridge the disconnects, but is inflexible and computationally expensive. To address this issue, we propose Img2LLM, a plug-and-play module that provides LLM prompts to enable LLMs to perform zero-shot VQA tasks without...
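A rough sketch of the prompt-construction idea: image-derived captions and synthetic question-answer exemplars are concatenated into a plain-text prompt that a frozen, text-only LLM can answer from. The template and helper name below are assumptions for illustration, not the exact format used by Img2LLM.

```python
# Illustrative prompt builder for caption-mediated zero-shot VQA.
def build_vqa_prompt(captions, exemplar_qas, question):
    # Captions describe the image; exemplar QA pairs demonstrate the task.
    context = " ".join(captions)
    exemplars = "\n".join(
        f"Question: {q} Answer: {a}" for q, a in exemplar_qas
    )
    return (
        f"Context: {context}\n"
        f"{exemplars}\n"
        f"Question: {question} Answer:"
    )

prompt = build_vqa_prompt(
    captions=["a dog catching a frisbee in a park"],
    exemplar_qas=[("What animal is shown?", "a dog")],
    question="What is the dog doing?",
)
# `prompt` can now be sent to any frozen text-only LLM, with no extra training.
```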
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, consuming subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce...
The recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes. However, the ability to fully comprehend a social scene is still in its preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in images, which is demanding in terms of research and increasingly useful for practical applications. To interact with objects, humans direct their attention and move their body based on their intention. Based on this observation, we provide a unique...
Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) to the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative...
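A simplified sketch of such a modular, training-free pipeline: an off-the-shelf captioner turns the image into text, and an off-the-shelf reading-comprehension model answers the question from that text. The Hugging Face models named below are illustrative stand-ins, and the sketch omits the question-guided caption selection described in the abstract.

```python
# Training-free, modular VQA: caption the image, then read the captions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def zero_shot_vqa(image_path, question):
    # Step 1: describe the image in natural language (no VQA-specific training).
    captions = [c["generated_text"] for c in captioner(image_path)]
    # Step 2: treat the captions as a reading-comprehension context.
    answer = reader(question=question, context=" ".join(captions))
    return answer["answer"]

print(zero_shot_vqa("example.jpg", "What is the dog doing?"))
```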
Since the beginning of early civilizations, social relationships derived from each individual fundamentally form the basis of the social structure in our daily life. In the computer vision literature, much progress has been made in scene understanding, such as object detection and scene parsing. Recent research focuses on the relationships between objects based on their functionality and geometrical relations. In this work, we aim to study the problem of social relationship recognition in still images. We have proposed a dual-glance model for social relationship recognition, where the first glance...
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that makes recent advancements in the language-vision field accessible to researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language and video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal...
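A short usage sketch of the unified interface described above. The `load_model_and_preprocess` entry point and the `blip_caption` model name follow LAVIS's documented examples, but exact model identifiers may differ across releases.

```python
# Loading a pre-trained captioning model through LAVIS's unified interface.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# One call returns the model plus the matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption with the pre-trained image-language model.
print(model.generate({"image": image}))
```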
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effective utilization of LLMs for visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned disconnections, so...
Presentation is one of the most effective methods to disseminate information. Traditional ways to evaluate the quality of a presentation generally involve a human instructor, which is infeasible in many scenarios. Recent studies have focused on the automated assessment of presentations. A variety of systems have been developed that focus on analyzing various aspects of a presentation. However, those systems are mainly limited by their performance, as they mostly adopt hand-crafted features and ad-hoc algorithms. In this work, we propose a multi-stream deep...
Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address these challenges. First,...
Presentation has been an effective method for delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is usually evaluated through painstaking manual analysis by experts. Although expert feedback is effective in assisting users to improve their presentation skills, such evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel...
The amount of 360-degree panoramas shared online has been rapidly increasing due to the availability of affordable and compact omnidirectional cameras, which offers a huge amount of new information unavailable before. In this paper, we present the first work to exploit this unlabeled data for image representation learning. We propose middle-out, a self-supervised learning task, which leverages the spatial configuration of normal field-of-view images sampled from a panorama as the supervisory signal. We train a Siamese ConvNet model to identify the middle...
In the public cloud, the software security functions that tenants deploy in their virtual networks have limited performance. SmartNICs overcome these limitations by implementing security functions with hardware acceleration. However, the shared SmartNIC resources are not open to external users due to security considerations. Since the requirements of tenants are diverse, it is tedious for network operators to develop these functions from scratch using low-level APIs.