- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Natural Language Processing Techniques
- Topic Modeling
- Big Data Technologies and Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Software-Defined Networks and 5G
- Digital Storytelling and Education
- Speech and Dialogue Systems
- Video Surveillance and Tracking Methods
- Music and Audio Processing
- Human Motion and Animation
- Cloud Computing and Resource Management
- Visual Attention and Saliency Detection
- Industrial Vision Systems and Defect Detection
- Interconnection Networks and Systems
- Network Security and Intrusion Detection
- Software Engineering Research
- IoT-based Smart Home Systems
- Advanced Vision and Imaging
- Data Stream Mining Techniques
- Online Learning and Analytics
Salesforce (United States)
2021-2023
China Southern Power Grid (China)
2023
National University of Defense Technology
2017-2020
Fudan University
2020
National University of Singapore
2016-2019
Fujian University of Technology
2018
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more...
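A minimal sketch of the image-text contrastive objective that such "align before fuse" approaches use to pull paired image and text embeddings together before cross-modal fusion. The tensor shapes, temperature value, and function name are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a symmetric image-text contrastive (ITC) loss.
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (batch, dim) unimodal embeddings."""
    # Normalize so the similarity is a cosine score.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity between every image and every text in the batch.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match images to texts and texts to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```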
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves...
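The following is a conceptual sketch, under assumed layer sizes, of how a small set of learnable queries can distill features from a frozen image encoder into soft prompts for a frozen LLM, which is the bridging role the abstract attributes to the Querying Transformer. The class name `QueryBridge` and all dimensions are hypothetical.

```python
# Conceptual sketch of a lightweight query-based bridge between a frozen
# image encoder and a frozen LLM; not the Q-Former implementation itself.
import torch
import torch.nn as nn

class QueryBridge(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=4096, num_heads=12):
        super().__init__()
        # The learnable queries (and the layers below) are the only new
        # parameters in this sketch; the image encoder and LLM stay frozen.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, llm_dim)

    def forward(self, frozen_image_feats):
        """frozen_image_feats: (batch, num_patches, dim) from a frozen encoder."""
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        # Queries attend to image features and distill them into a fixed-size set.
        attended, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        # Project into the LLM embedding space to act as soft visual prompts.
        return self.proj_to_llm(attended)
```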
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively...
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide...
The recent advances in instance-level detection tasks lay a strong foundation for automated visual scene understanding. However, the ability to fully comprehend a social scene still eludes us. In this work, we focus on detecting human-object interactions (HOIs) in images, an essential step towards deeper scene understanding. HOI detection aims to localize humans and objects, as well as to identify the complex interactions between them. As is innate to practical problems with a large label space, the HOI categories exhibit a long-tail distribution, i.e., there exist some...
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between the unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In...
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effective utilization of LLMs for visual question-answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and the VQA task. End-to-end training on multimodal data may bridge the disconnects, but is inflexible and computationally expensive. To address this issue, we propose Img2LLM, a plug-and-play module that provides LLM prompts to enable LLMs to perform zero-shot VQA tasks without...
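A rough sketch of the prompt-construction idea: image-derived captions and synthetic question-answer exemplars are concatenated into a plain-text prompt that a frozen, text-only LLM can answer from. The template and helper name below are assumptions for illustration, not the exact format used by Img2LLM.

```python
# Illustrative prompt builder for caption-mediated zero-shot VQA.
def build_vqa_prompt(captions, exemplar_qas, question):
    # Captions describe the image; exemplar QA pairs demonstrate the task.
    context = " ".join(captions)
    exemplars = "\n".join(
        f"Question: {q} Answer: {a}" for q, a in exemplar_qas
    )
    return (
        f"Context: {context}\n"
        f"{exemplars}\n"
        f"Question: {question} Answer:"
    )

prompt = build_vqa_prompt(
    captions=["a dog catching a frisbee in a park"],
    exemplar_qas=[("What animal is shown?", "a dog")],
    question="What is the dog doing?",
)
# `prompt` can now be sent to any frozen text-only LLM, with no extra training.
```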
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, consuming subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce...
The recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes. However, the ability to fully comprehend a social scene is still in its preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in images, which is demanding in terms of research and increasingly useful for practical applications. To interact with objects, humans direct their attention and move their body based on their intention. Based on this observation, we provide a unique...
Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) to the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative...
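A simplified sketch of such a modular, training-free pipeline: an off-the-shelf captioner turns the image into text, and an off-the-shelf reading-comprehension model answers the question from that text. The Hugging Face models named below are illustrative stand-ins, and the sketch omits the question-guided caption selection described in the abstract.

```python
# Training-free, modular VQA: caption the image, then read the captions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def zero_shot_vqa(image_path, question):
    # Step 1: describe the image in natural language (no VQA-specific training).
    captions = [c["generated_text"] for c in captioner(image_path)]
    # Step 2: treat the captions as a reading-comprehension context.
    answer = reader(question=question, context=" ".join(captions))
    return answer["answer"]

print(zero_shot_vqa("example.jpg", "What is the dog doing?"))
```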
Since the beginning of early civilizations, social relationships derived from each individual fundamentally form the basis of the social structure in our daily life. In the computer vision literature, much progress has been made in scene understanding, such as object detection and scene parsing. Recent research focuses on the relationships between objects based on their functionality and geometrical relations. In this work, we aim to study the problem of social relationship recognition in still images. We have proposed a dual-glance model for social relationship recognition, where the first glance...
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that makes recent advancements in the language-vision field accessible to researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language and video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal...
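A short usage sketch of the unified interface described above. The `load_model_and_preprocess` entry point and the `blip_caption` model name follow LAVIS's documented examples, but exact model identifiers may differ across releases.

```python
# Loading a pre-trained captioning model through LAVIS's unified interface.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# One call returns the model plus the matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption with the pre-trained image-language model.
print(model.generate({"image": image}))
```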
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effective utilization of LLMs for visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned disconnections, so...
Presentation is one of the most effective methods to disseminate information. Traditional ways to evaluate the quality of a presentation generally involve a human instructor, which is infeasible in many scenarios. Recent studies have focused on the automated assessment of presentations. A variety of systems have been developed that focus on analyzing various aspects of a presentation. However, those systems are mainly limited by their performance, as they mostly adopt hand-crafted features and ad-hoc algorithms. In this work, we propose a multi-stream deep...
Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address these challenges. First,...
Presentation has been an effective method for delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is usually evaluated through painstaking manual analysis by experts. Although expert feedback is effective in assisting users to improve their presentation skills, such evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel...
The amount of 360-degree panoramas shared online has been rapidly increasing due to the availability of affordable and compact omnidirectional cameras, which offers a huge amount of new information unavailable before. In this paper, we present the first work to exploit this unlabeled data for image representation learning. We propose middle-out, a self-supervised learning task, which leverages the spatial configuration of normal field-of-view images sampled from a panorama as the supervisory signal. We train a Siamese ConvNet model to identify the middle...
In the public cloud, the software security functions that tenants deploy in their virtual networks have limited performance. SmartNICs overcome these limitations by implementing security functions with hardware acceleration. However, the shared SmartNIC resources are not open to external users due to security considerations. Since the requirements of tenants are diverse, it is tedious for network operators to develop these functions from scratch using low-level APIs.