- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Natural Language Processing Techniques
- Advanced Image Fusion Techniques
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Speech and Dialogue Systems
- Hand Gesture Recognition Systems
- Cancer-Related Molecular Mechanisms Research
- Topic Modeling
- Anomaly Detection Techniques and Applications
Third World Newsreel (2022)
Baidu (China) (2022)
Chinese Academy of Sciences (2021)
Shanghai Jiao Tong University (2017-2020)
The Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored for visual self-supervised learning. Meanwhile, previous methods only consider high-level feature representations learned from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel masked self-supervised approach named MST, which can explicitly capture the context of an...
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability on many discriminative tasks. Their adaptation to image-conditioned text generation tasks has drawn increasing interest. Prior art approaches captioning either by utilizing existing large language models (e.g., GPT-2) or by pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for captioning. We introduce a lightweight visual-aware decoder...
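The abstract cuts off before the decoder is described. As a loose, assumption-laden sketch of how a decoder trained only on text embeddings could still be conditioned on an image at inference, one option is to project the CLIP image embedding onto a small memory of caption embeddings and feed the projected vector to the decoder; all names below are hypothetical, and this is not necessarily DeCap's actual mechanism.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb, text_memory, temperature=0.05):
    """Map a CLIP image embedding toward the text-embedding space (illustrative only).

    image_emb:   (D,) L2-normalized image embedding.
    text_memory: (N, D) L2-normalized embeddings of support captions seen in training.
    Returns a similarity-weighted average of the memory, re-normalized, which could be
    passed as a prefix to a decoder trained only on text embeddings.
    """
    weights = F.softmax(text_memory @ image_emb / temperature, dim=0)  # (N,)
    projected = weights @ text_memory                                   # (D,)
    return F.normalize(projected, dim=0)

# toy usage with random stand-ins for real CLIP features
mem = F.normalize(torch.randn(1000, 512), dim=1)
img = F.normalize(torch.randn(512), dim=0)
prefix = project_to_text_space(img, mem)  # hypothetical decoder prefix
```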
Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. In particular, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot method...
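The abstract is truncated, but the idea it names, matching an image against textual concept embeddings, can be illustrated with a small sketch. The snippet below assumes precomputed, L2-normalized CLIP-style image and class-name embeddings; the temperature value and function names are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def mcm_score(image_emb, concept_embs, temperature=1.0):
    """Zero-shot OOD score in the spirit of Maximum Concept Matching.

    image_emb:    (D,) L2-normalized image embedding from a vision-language model.
    concept_embs: (K, D) L2-normalized text embeddings of the K in-distribution class names.
    Returns the maximum softmax-scaled concept-matching score; low values suggest OOD.
    """
    sims = concept_embs @ image_emb              # (K,) cosine similarities
    probs = F.softmax(sims / temperature, dim=0) # softmax scaling over concepts
    return probs.max()

# toy usage with random embeddings standing in for real CLIP features
img = F.normalize(torch.randn(512), dim=0)
concepts = F.normalize(torch.randn(10, 512), dim=1)
score = mcm_score(img, concepts)  # threshold this score to flag OOD samples
```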
We present Answer-Me, a task-aware multi-task framework that unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show...
The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and the contrastive task, are nontrivial to accommodate in one architecture and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model, which is surprisingly effective for jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text...
Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both unaligned image-only and text-only corpora. We build a unified Transformer model to jointly...
Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include the text-conditional diffusion model and the cross-modal guided diffusion model, which are good at small-scene and complex-scene generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates guidance from pretrained...
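The abstract refers to combining a text-conditional diffusion model with cross-modal guidance under diverse guidance schedules. As a generic illustration only, not UPainting's actual schedule, the sketch below shows how a text-conditioning term and an extra cross-modal guidance term are typically mixed into the denoiser's noise prediction at each sampling step; all symbols and weights are placeholders.

```python
import torch

def guided_noise_prediction(eps_uncond, eps_text, grad_match, w_text=7.5, w_match=1.0):
    """Combine unconditional, text-conditional, and cross-modal guidance signals.

    eps_uncond: denoiser output without conditioning, shape (B, C, H, W).
    eps_text:   denoiser output conditioned on the text prompt, same shape.
    grad_match: gradient of an image-text matching score w.r.t. the noisy image
                (classifier-style guidance), same shape; sign/scale conventions vary.
    """
    # classifier-free guidance toward the text condition
    eps = eps_uncond + w_text * (eps_text - eps_uncond)
    # additional push along the image-text matching gradient
    return eps - w_match * grad_match

# toy usage with random tensors standing in for real model outputs
shape = (1, 4, 64, 64)
eps = guided_noise_prediction(torch.randn(shape), torch.randn(shape), torch.randn(shape))
```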
In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all of them, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show a...
Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited image-text pairs. In this work, we propose a unified-modal architecture, namely UNIMO, which handles both understanding and generation tasks. Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the information...
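Since the abstract names cross-modal contrastive learning (CMCL) as the alignment mechanism, a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss is shown below; it is a generic formulation rather than UNIMO's exact objective, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_embs, text_embs: (B, D) embeddings where row i of each tensor forms a pair.
    Pulls matched pairs together and pushes mismatched pairs apart in both directions.
    """
    image_embs = F.normalize(image_embs, dim=1)
    text_embs = F.normalize(text_embs, dim=1)
    logits = image_embs @ text_embs.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # diagonal entries are the positives
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```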
We propose the first mechanism to train object detection models from weak supervision in the form of captions at the image level. Language-based supervision for detection is appealing and inexpensive: many blogs with images and descriptive text written by human users exist. However, there is significant noise in this supervision: captions do not mention all objects that are shown and may include extraneous concepts. We describe a technique to determine which image-caption pairs provide a suitable supervision signal, and further introduce several complementary mechanisms to extract...
Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio results in two serious problems: 1) the data are not efficiently exploited, which leads to inefficient pre-training (e.g., 1600 epochs for MAE vs. 300 for supervised training), and 2) uncertainty and inconsistency of the pre-trained model, i.e., the prediction for the same patch may be inconsistent under different mask rounds. To tackle...
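For context on the random masking the abstract criticizes, a minimal sketch of uniform random patch masking, in the style used by MAE-like masked image modeling, is shown below; the 75% ratio and helper names are illustrative.

```python
import torch

def random_patch_mask(patch_tokens, mask_ratio=0.75):
    """Uniformly mask a fraction of patch tokens for masked image modeling.

    patch_tokens: (B, N, D) sequence of patch embeddings.
    Returns the visible tokens, the indices of the kept patches, and a boolean
    mask marking which patches were hidden.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)            # random score per patch
    shuffle = noise.argsort(dim=1)      # patches with the lowest scores are kept
    keep_idx = shuffle[:, :num_keep]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)   # False = visible, True = masked
    return visible, keep_idx, mask

# toy usage: 196 patches from a 14x14 grid, 75% of them masked
vis, idx, m = random_patch_mask(torch.randn(2, 196, 768))
```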
This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to a targeted region in LiDAR point clouds. Previous approaches for REC usually focus on the 2D or 3D-indoor domain, which is not suitable for accurately predicting the location of a queried region in a driving scene. In addition, the upper-bound limitation and heavy computation cost motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed...
Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised object detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly...
Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions. Analyzing attention maps offers us a perspective to find out the limitations of current VQA systems and an opportunity to further improve them. In this paper, we select two state-of-the-art VQA approaches with attention and study their robustness and disadvantages by visualizing and analyzing their estimated attention maps. We find that both methods are sensitive to features and, simultaneously, they perform badly...
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate requirements across tasks. In addition, we discover that a standard detector is surprisingly effective in unifying these tasks without the need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable...
Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a DETR-based detector -- hence the name OV-DETR -- which, once trained,...
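The abstract describes conditioning detection on either a language query or an exemplar image. One common way to realize this, sketched below under assumptions and not necessarily OV-DETR's exact design, is to embed the query with a CLIP-style encoder and add the projected embedding to every DETR object query so the decoder searches for that concept; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    """Condition DETR-style object queries on a text or exemplar-image embedding."""

    def __init__(self, num_queries=100, hidden_dim=256, clip_dim=512):
        super().__init__()
        self.queries = nn.Embedding(num_queries, hidden_dim)  # learnable base queries
        self.project = nn.Linear(clip_dim, hidden_dim)        # map CLIP space -> decoder space

    def forward(self, condition_emb):
        # condition_emb: (B, clip_dim) CLIP embedding of a class name or exemplar crop
        cond = self.project(condition_emb).unsqueeze(1)       # (B, 1, hidden_dim)
        return self.queries.weight.unsqueeze(0) + cond        # (B, num_queries, hidden_dim)

# toy usage: the same module accepts text or image embeddings since both live in CLIP space
module = ConditionalQueries()
queries = module(torch.randn(4, 512))
```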