- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Adversarial Robustness in Machine Learning
- Video Analysis and Summarization
- Natural Language Processing Techniques
- Stochastic Gradient Optimization Techniques
- Advanced Mathematical Modeling in Engineering
- Machine Learning and Data Classification
- Sparse and Compressive Sensing Techniques
- Advanced Numerical Methods in Computational Mathematics
- Advanced Image Processing Techniques
- Gaussian Processes and Bayesian Inference
- Digital Storytelling and Education
- Neural Networks and Applications
- COVID-19 Diagnosis Using AI
- Bayesian Methods and Mixture Models
- Advanced Memory and Neural Computing
- Markov Chains and Monte Carlo Methods
- Composite Material Mechanics
- Face Recognition and Analysis
Microsoft Research (United Kingdom)
2018-2023
Microsoft (Finland)
2021-2022
Microsoft (United States)
2019-2021
Princeton University
2018
California Institute of Technology
2017-2018
Tianjin Normal University
2011-2013
Anhui Provincial Center for Disease Control and Prevention
2010-2011
Soochow University
2010
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize details at different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep multimodal similarity model is proposed to compute an image-text matching loss for training the generator. The AttnGAN significantly...
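As a rough illustration of the attention-driven refinement described above, the sketch below computes a word-context vector for each image sub-region; the tensor shapes and the single matrix-multiply similarity are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def word_region_attention(word_feats, region_feats):
    """word_feats: (B, T, D) word features from a text encoder.
    region_feats: (B, N, D) hidden features of N image sub-regions.
    Returns a word-context vector per sub-region, shape (B, N, D)."""
    # Similarity between every sub-region and every word.
    scores = torch.bmm(region_feats, word_feats.transpose(1, 2))  # (B, N, T)
    attn = F.softmax(scores, dim=-1)                              # attend over words
    # Each sub-region receives a weighted sum of word features, which
    # conditions the next refinement stage of the generator.
    return torch.bmm(attn, word_feats)                            # (B, N, D)

B, T, N, D = 2, 12, 64, 256
ctx = word_region_attention(torch.randn(B, T, D), torch.randn(B, N, D))
print(ctx.shape)  # torch.Size([2, 64, 256])
```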
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous research focuses mainly on the vision-language...
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train...
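The detection-as-grounding idea can be sketched as replacing fixed class logits with region-word alignment scores against the text prompt; the projection layers, dimensions, and prompt format below are assumptions for illustration, not GLIP's actual code.

```python
import torch
import torch.nn as nn

class RegionWordAlignment(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=768, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, region_feats, token_feats):
        """region_feats: (B, N, vis_dim) candidate box features.
        token_feats: (B, T, txt_dim) features of a prompt such as
        'person. bicycle. traffic light.' after tokenization.
        Returns alignment logits (B, N, T) used in place of class logits."""
        v = self.vis_proj(region_feats)
        t = self.txt_proj(token_feats)
        return torch.einsum("bnd,btd->bnt", v, t)

align = RegionWordAlignment()
logits = align(torch.randn(2, 100, 256), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 100, 16])
```

Because the "classes" are just words in the prompt, the same head serves detection (category names as the prompt) and phrase grounding (a caption as the prompt).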
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce...
In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow attention-driven, multi-stage refinement for synthesizing complex images from text descriptions. With a novel object-driven attentive generative network, the Obj-GAN can synthesize salient objects by paying attention to their most relevant words in the text descriptions and the pre-generated class labels. In addition, an object-wise discriminator based on the Fast R-CNN model is proposed to provide rich...
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called...
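A minimal sketch of region-level language-image matching, assuming generic pooled region features and prompt embeddings; the actual RegionCLIP teacher-student pretraining and losses are richer than this.

```python
import torch
import torch.nn.functional as F

def region_text_scores(region_feats, concept_embeds, temperature=0.01):
    """region_feats: (N, D) features pooled from candidate boxes.
    concept_embeds: (C, D) text embeddings of prompts such as 'a photo of a dog'.
    Returns (N, C) scores assigning each region to a concept."""
    r = F.normalize(region_feats, dim=-1)
    c = F.normalize(concept_embeds, dim=-1)
    return (r @ c.t() / temperature).softmax(dim=-1)

scores = region_text_scores(torch.randn(50, 512), torch.randn(20, 512))
print(scores.shape)  # torch.Size([50, 20])
```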
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of [12] for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer [3] originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A...
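The attention pattern behind the linear-complexity claim can be illustrated with a boolean mask that restricts each patch to a local 2D window plus a few global tokens; the window size and token counts below are made up for the example.

```python
import torch

def longformer_style_mask(h, w, window=2, num_global=1):
    """Boolean mask of shape (G + H*W, G + H*W); True = query may attend to key."""
    n = num_global + h * w
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_global, :] = True           # global tokens attend everywhere
    mask[:, :num_global] = True           # every token attends to global tokens
    for i in range(h):
        for j in range(w):
            q = num_global + i * w + j
            for di in range(-window, window + 1):
                for dj in range(-window, window + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        mask[q, num_global + ii * w + jj] = True
    return mask

m = longformer_style_mask(8, 8)
print(m.shape, m.float().mean().item())  # far fewer allowed pairs than full attention
```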
In this paper, we present a novel Dynamic DETR (Detection with Transformers) approach by introducing dynamic attentions into both the encoder and decoder stages of DETR to break its two limitations on small feature resolution and slow training convergence. To address the first limitation, which is due to the quadratic computational complexity of the self-attention module in Transformer encoders, we propose to approximate the encoder's attention mechanism using a convolution-based dynamic attention of various types. Such an encoder can dynamically adjust...
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success. But it also brings challenges due to the quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global...
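A simplified sketch of the focal idea, assuming a single coarse level obtained by average pooling; the actual focal self-attention uses windowed queries, several focal levels, and restricts fine-grained attention to a local window.

```python
import torch
import torch.nn.functional as F

def focal_like_attention(x, h, w, pool_size=4):
    """x: (B, H*W, D) token features laid out on an h-by-w grid."""
    B, N, D = x.shape
    grid = x.transpose(1, 2).reshape(B, D, h, w)
    coarse = F.avg_pool2d(grid, pool_size)              # coarse-grained summaries
    coarse = coarse.flatten(2).transpose(1, 2)          # (B, (h/p)*(w/p), D)
    # Keys/values combine fine-grained tokens with coarse-grained summaries.
    # (The real method further restricts the fine-grained part to a local window.)
    keys = torch.cat([x, coarse], dim=1)
    attn = torch.softmax(x @ keys.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ keys                                   # (B, N, D)

out = focal_like_attention(torch.randn(2, 64, 32), h=8, w=8)
print(out.shape)  # torch.Size([2, 64, 32])
```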
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present Meter, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple...
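One fusion design in the space such a study dissects is a co-attention block in which each modality cross-attends to the other; the block below is a generic sketch with placeholder sizes, not Meter's exact module.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Each stream uses the other modality as keys/values (cross-attention),
        # added residually to its own features.
        txt = txt + self.t2i(txt, img, img)[0]
        img = img + self.i2t(img, txt, txt)[0]
        return txt, img

blk = CoAttentionBlock()
t, v = blk(torch.randn(2, 20, 768), torch.randn(2, 197, 768))
print(t.shape, v.shape)
```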
Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with webly-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called...
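A sketch of a unified image-text-label contrastive objective: positives are defined by shared labels, so image-label data contributes many-to-many positives while web image-text pairs reduce to the usual one-to-one case when given unique labels. The temperature, shapes, and normalization scheme are placeholders.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings; labels: (B,) integer labels."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (B, B)
    pos = (labels[:, None] == labels[None, :]).float()           # label-defined positives
    pos_i2t = pos / pos.sum(dim=1, keepdim=True)
    pos_t2i = pos / pos.sum(dim=0, keepdim=True)
    loss_i2t = -(pos_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(pos_t2i * F.log_softmax(logits, dim=0)).sum(dim=0).mean()
    return 0.5 * (loss_i2t + loss_t2i)

loss = unified_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256),
                                torch.randint(0, 4, (8,)))
print(loss.item())
```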
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits...
Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We...
Recent works have shown the effectiveness of randomized smoothing as a scalable technique for building neural network-based classifiers that are provably robust to $\ell_2$-norm adversarial perturbations. In this paper, we employ adversarial training to improve the performance of randomized smoothing. We design an adapted attack for smoothed classifiers, and we show how this attack can be used in an adversarial training setting to boost the provable robustness of smoothed classifiers. We demonstrate through extensive experimentation that our method consistently outperforms all existing...
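For context, the certification step of randomized smoothing (which the adversarial-training recipe above builds on) can be sketched as follows; the confidence-bound step is simplified to a plain empirical estimate, so treat this only as an outline, not the paper's procedure.

```python
import torch
from scipy.stats import norm

@torch.no_grad()
def certify(model, x, sigma=0.25, n_samples=1000, batch=100):
    """x: (C, H, W) input; model maps a batch of inputs to logits.
    Returns (predicted class, certified L2 radius)."""
    counts, remaining = None, n_samples
    while remaining > 0:
        b = min(batch, remaining)
        noisy = x.unsqueeze(0) + sigma * torch.randn(b, *x.shape)  # Gaussian noise
        logits = model(noisy)
        preds = logits.argmax(dim=1)
        binc = torch.bincount(preds, minlength=logits.shape[1])
        counts = binc if counts is None else counts + binc
        remaining -= b
    top = counts.argmax().item()
    p_a = counts[top].item() / n_samples       # in practice, a lower confidence bound
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0
    return top, radius

# Toy usage with a hypothetical classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(certify(model, torch.randn(3, 32, 32)))
```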
Verification of neural networks enables us to gauge their robustness against adversarial attacks. Verification algorithms fall into two categories: exact verifiers that run in exponential time and relaxed verifiers that are efficient but incomplete. In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification. We further prove strong duality...
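A standard instance covered by such LP-relaxation frameworks is the "triangle" relaxation of a single ReLU $y = \max(0, x)$ with pre-activation bounds $l \le x \le u$ and $l < 0 < u$ (the notation here is illustrative, not taken from the paper):

\[
y \ge 0, \qquad y \ge x, \qquad y \le \frac{u}{u - l}\,(x - l).
\]

These three linear constraints describe the convex hull of the ReLU graph over $[l, u]$; propagating such per-neuron constraints layer by layer yields an efficient but incomplete (relaxed) verifier.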
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result...
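A hedged sketch of a region-matching objective in this spirit: each student-view region token is matched to its most similar teacher-view region token and their distributions are aligned. Projection heads, centering, and the momentum teacher are omitted, and all shapes and temperatures are illustrative.

```python
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions, temp_s=0.1, temp_t=0.04):
    """student_regions: (B, N, D); teacher_regions: (B, M, D) from another view."""
    s = F.normalize(student_regions, dim=-1)
    t = F.normalize(teacher_regions, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", s, t)          # cosine similarities
    best = sim.argmax(dim=-1)                          # best teacher region per student region
    matched_t = torch.gather(t, 1, best.unsqueeze(-1).expand(-1, -1, t.shape[-1]))
    p_t = F.softmax(matched_t / temp_t, dim=-1)        # teacher target (detached in practice)
    log_p_s = F.log_softmax(s / temp_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()         # cross-entropy between matched regions

loss = region_matching_loss(torch.randn(2, 49, 64), torch.randn(2, 49, 64))
print(loss.item())
```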
We begin with the hypothesis that a model must be able to understand individual objects and the relationships between objects in order to generate complex scenes with multiple objects well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object...
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate their transferability due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for...
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead...
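The "fusion in the backbone" idea can be sketched by inserting a gated cross-attention branch into an existing backbone block; the zero-initialized gate is an assumption chosen so the pre-trained block behaves unchanged at the start, and all sizes are placeholders rather than FIBER's configuration.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    def __init__(self, block, dim=768, heads=12):
        super().__init__()
        self.block = block                                   # original backbone block
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))             # starts as an identity branch

    def forward(self, x, other_modality):
        # Cross-attend to the other modality inside the backbone, then run the
        # original block; the learnable gate scales the fused signal.
        x = x + self.gate * self.cross(x, other_modality, other_modality)[0]
        return self.block(x)

backbone_block = nn.TransformerEncoderLayer(768, 12, batch_first=True)
fused = FusedBlock(backbone_block)
out = fused(torch.randn(2, 197, 768), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 197, 768])
```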
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels. In this paper, we propose...