- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Adversarial Robustness in Machine Learning
- Video Analysis and Summarization
- Natural Language Processing Techniques
- Stochastic Gradient Optimization Techniques
- Advanced Mathematical Modeling in Engineering
- Machine Learning and Data Classification
- Sparse and Compressive Sensing Techniques
- Advanced Numerical Methods in Computational Mathematics
- Advanced Image Processing Techniques
- Gaussian Processes and Bayesian Inference
- Digital Storytelling and Education
- Neural Networks and Applications
- COVID-19 Diagnosis Using AI
- Bayesian Methods and Mixture Models
- Advanced Memory and Neural Computing
- Markov Chains and Monte Carlo Methods
- Composite Material Mechanics
- Face Recognition and Analysis
Microsoft Research (United Kingdom)
2018-2023
Microsoft (Finland)
2021-2022
Microsoft (United States)
2019-2021
Princeton University
2018
California Institute of Technology
2017-2018
Tianjin Normal University
2011-2013
Anhui Provincial Center for Disease Control and Prevention
2010-2011
Soochow University
2010
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize details at different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep multimodal similarity model is proposed to compute an image-text matching loss for training the generator. The AttnGAN significantly...
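As a rough illustration of the attention-driven refinement described above, the sketch below computes a word-context vector for each image sub-region; the tensor shapes and the single matrix-multiply similarity are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def word_region_attention(word_feats, region_feats):
    """word_feats: (B, T, D) word features from a text encoder.
    region_feats: (B, N, D) hidden features of N image sub-regions.
    Returns a word-context vector per sub-region, shape (B, N, D)."""
    # Similarity between every sub-region and every word.
    scores = torch.bmm(region_feats, word_feats.transpose(1, 2))  # (B, N, T)
    attn = F.softmax(scores, dim=-1)                              # attend over words
    # Each sub-region receives a weighted sum of word features, which
    # conditions the next refinement stage of the generator.
    return torch.bmm(attn, word_feats)                            # (B, N, D)

B, T, N, D = 2, 12, 64, 256
ctx = word_region_attention(torch.randn(B, T, D), torch.randn(B, N, D))
print(ctx.shape)  # torch.Size([2, 64, 256])
```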
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous research focuses mainly on the vision-language...
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train...
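The detection-as-grounding idea can be sketched as replacing fixed class logits with region-word alignment scores against the text prompt; the projection layers, dimensions, and prompt format below are assumptions for illustration, not GLIP's actual code.

```python
import torch
import torch.nn as nn

class RegionWordAlignment(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=768, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, region_feats, token_feats):
        """region_feats: (B, N, vis_dim) candidate box features.
        token_feats: (B, T, txt_dim) features of a prompt such as
        'person. bicycle. traffic light.' after tokenization.
        Returns alignment logits (B, N, T) used in place of class logits."""
        v = self.vis_proj(region_feats)
        t = self.txt_proj(token_feats)
        return torch.einsum("bnd,btd->bnt", v, t)

align = RegionWordAlignment()
logits = align(torch.randn(2, 100, 256), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 100, 16])
```

Because the "classes" are just words in the prompt, the same head serves detection (category names as the prompt) and phrase grounding (a caption as the prompt).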
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce...
In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow attention-driven, multi-stage refinement for synthesizing complex images from text descriptions. With a novel object-driven attentive generative network, the Obj-GAN can synthesize salient objects by paying attention to their most relevant words in the text descriptions and the pre-generated class labels. In addition, an object-wise discriminator based on the Fast R-CNN model is proposed to provide rich...
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called...
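A minimal sketch of region-level language-image matching, assuming generic pooled region features and prompt embeddings; the actual RegionCLIP teacher-student pretraining and losses are richer than this.

```python
import torch
import torch.nn.functional as F

def region_text_scores(region_feats, concept_embeds, temperature=0.01):
    """region_feats: (N, D) features pooled from candidate boxes.
    concept_embeds: (C, D) text embeddings of prompts such as 'a photo of a dog'.
    Returns (N, C) scores assigning each region to a concept."""
    r = F.normalize(region_feats, dim=-1)
    c = F.normalize(concept_embeds, dim=-1)
    return (r @ c.t() / temperature).softmax(dim=-1)

scores = region_text_scores(torch.randn(50, 512), torch.randn(20, 512))
print(scores.shape)  # torch.Size([50, 20])
```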
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of [12] for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer [3] originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A...
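The attention pattern behind the linear-complexity claim can be illustrated with a boolean mask that restricts each patch to a local 2D window plus a few global tokens; the window size and token counts below are made up for the example.

```python
import torch

def longformer_style_mask(h, w, window=2, num_global=1):
    """Boolean mask of shape (G + H*W, G + H*W); True = query may attend to key."""
    n = num_global + h * w
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_global, :] = True           # global tokens attend everywhere
    mask[:, :num_global] = True           # every token attends to global tokens
    for i in range(h):
        for j in range(w):
            q = num_global + i * w + j
            for di in range(-window, window + 1):
                for dj in range(-window, window + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        mask[q, num_global + ii * w + jj] = True
    return mask

m = longformer_style_mask(8, 8)
print(m.shape, m.float().mean().item())  # far fewer allowed pairs than full attention
```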
In this paper, we present a novel Dynamic DETR (Detection with Transformers) approach by introducing dynamic attentions into both the encoder and decoder stages of DETR to break its two limitations on small feature resolution and slow training convergence. To address the first limitation, which is due to the quadratic computational complexity of the self-attention module in Transformer encoders, we propose to approximate the encoder's attention mechanism using a convolution-based dynamic attention of various types. Such an encoder can dynamically adjust...
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success. But it also brings challenges due to the quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global...
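A simplified sketch of the focal idea, assuming a single coarse level obtained by average pooling; the actual focal self-attention uses windowed queries, several focal levels, and restricts fine-grained attention to a local window.

```python
import torch
import torch.nn.functional as F

def focal_like_attention(x, h, w, pool_size=4):
    """x: (B, H*W, D) token features laid out on an h-by-w grid."""
    B, N, D = x.shape
    grid = x.transpose(1, 2).reshape(B, D, h, w)
    coarse = F.avg_pool2d(grid, pool_size)              # coarse-grained summaries
    coarse = coarse.flatten(2).transpose(1, 2)          # (B, (h/p)*(w/p), D)
    # Keys/values combine fine-grained tokens with coarse-grained summaries.
    # (The real method further restricts the fine-grained part to a local window.)
    keys = torch.cat([x, coarse], dim=1)
    attn = torch.softmax(x @ keys.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ keys                                   # (B, N, D)

out = focal_like_attention(torch.randn(2, 64, 32), h=8, w=8)
print(out.shape)  # torch.Size([2, 64, 32])
```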
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present Meter, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple...
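One fusion design in the space such a study dissects is a co-attention block in which each modality cross-attends to the other; the block below is a generic sketch with placeholder sizes, not Meter's exact module.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Each stream uses the other modality as keys/values (cross-attention),
        # added residually to its own features.
        txt = txt + self.t2i(txt, img, img)[0]
        img = img + self.i2t(img, txt, txt)[0]
        return txt, img

blk = CoAttentionBlock()
t, v = blk(torch.randn(2, 20, 768), torch.randn(2, 197, 768))
print(t.shape, v.shape)
```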
Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with webly-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called...
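A sketch of a unified image-text-label contrastive objective: positives are defined by shared labels, so image-label data contributes many-to-many positives while web image-text pairs reduce to the usual one-to-one case when given unique labels. The temperature, shapes, and normalization scheme are placeholders.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings; labels: (B,) integer labels."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (B, B)
    pos = (labels[:, None] == labels[None, :]).float()           # label-defined positives
    pos_i2t = pos / pos.sum(dim=1, keepdim=True)
    pos_t2i = pos / pos.sum(dim=0, keepdim=True)
    loss_i2t = -(pos_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(pos_t2i * F.log_softmax(logits, dim=0)).sum(dim=0).mean()
    return 0.5 * (loss_i2t + loss_t2i)

loss = unified_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256),
                                torch.randint(0, 4, (8,)))
print(loss.item())
```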
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits...
Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We...
Recent works have shown the effectiveness of randomized smoothing as a scalable technique for building neural network-based classifiers that are provably robust to $\ell_2$-norm adversarial perturbations. In this paper, we employ adversarial training to improve the performance of randomized smoothing. We design an adapted attack for smoothed classifiers, and we show how this attack can be used in an adversarial training setting to boost the provable robustness of smoothed classifiers. We demonstrate through extensive experimentation that our method consistently outperforms all existing...
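For context, the certification step of randomized smoothing (which the adversarial-training recipe above builds on) can be sketched as follows; the confidence-bound step is simplified to a plain empirical estimate, so treat this only as an outline, not the paper's procedure.

```python
import torch
from scipy.stats import norm

@torch.no_grad()
def certify(model, x, sigma=0.25, n_samples=1000, batch=100):
    """x: (C, H, W) input; model maps a batch of inputs to logits.
    Returns (predicted class, certified L2 radius)."""
    counts, remaining = None, n_samples
    while remaining > 0:
        b = min(batch, remaining)
        noisy = x.unsqueeze(0) + sigma * torch.randn(b, *x.shape)  # Gaussian noise
        logits = model(noisy)
        preds = logits.argmax(dim=1)
        binc = torch.bincount(preds, minlength=logits.shape[1])
        counts = binc if counts is None else counts + binc
        remaining -= b
    top = counts.argmax().item()
    p_a = counts[top].item() / n_samples       # in practice, a lower confidence bound
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0
    return top, radius

# Toy usage with a hypothetical classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(certify(model, torch.randn(3, 32, 32)))
```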
Verification of neural networks enables us to gauge their robustness against adversarial attacks. Verification algorithms fall into two categories: exact verifiers that run in exponential time and relaxed verifiers that are efficient but incomplete. In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification. We further prove strong duality...
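A standard instance covered by such LP-relaxation frameworks is the "triangle" relaxation of a single ReLU $y = \max(0, x)$ with pre-activation bounds $l \le x \le u$ and $l < 0 < u$ (the notation here is illustrative, not taken from the paper):

\[
y \ge 0, \qquad y \ge x, \qquad y \le \frac{u}{u - l}\,(x - l).
\]

These three linear constraints describe the convex hull of the ReLU graph over $[l, u]$; propagating such per-neuron constraints layer by layer yields an efficient but incomplete (relaxed) verifier.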
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result...
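A hedged sketch of a region-matching objective in this spirit: each student-view region token is matched to its most similar teacher-view region token and their distributions are aligned. Projection heads, centering, and the momentum teacher are omitted, and all shapes and temperatures are illustrative.

```python
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions, temp_s=0.1, temp_t=0.04):
    """student_regions: (B, N, D); teacher_regions: (B, M, D) from another view."""
    s = F.normalize(student_regions, dim=-1)
    t = F.normalize(teacher_regions, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", s, t)          # cosine similarities
    best = sim.argmax(dim=-1)                          # best teacher region per student region
    matched_t = torch.gather(t, 1, best.unsqueeze(-1).expand(-1, -1, t.shape[-1]))
    p_t = F.softmax(matched_t / temp_t, dim=-1)        # teacher target (detached in practice)
    log_p_s = F.log_softmax(s / temp_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()         # cross-entropy between matched regions

loss = region_matching_loss(torch.randn(2, 49, 64), torch.randn(2, 49, 64))
print(loss.item())
```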
We begin with the hypothesis that a model must be able to understand individual objects and the relationships between objects in order to generate complex scenes with multiple objects well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object...
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate their transferability due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for...
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead...
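The "fusion in the backbone" idea can be sketched by inserting a gated cross-attention branch into an existing backbone block; the zero-initialized gate is an assumption chosen so the pre-trained block behaves unchanged at the start, and all sizes are placeholders rather than FIBER's configuration.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    def __init__(self, block, dim=768, heads=12):
        super().__init__()
        self.block = block                                   # original backbone block
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))             # starts as an identity branch

    def forward(self, x, other_modality):
        # Cross-attend to the other modality inside the backbone, then run the
        # original block; the learnable gate scales the fused signal.
        x = x + self.gate * self.cross(x, other_modality, other_modality)[0]
        return self.block(x)

backbone_block = nn.TransformerEncoderLayer(768, 12, batch_first=True)
fused = FusedBlock(backbone_block)
out = fused(torch.randn(2, 197, 768), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 197, 768])
```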
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels. In this paper, we propose...