- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Advanced Image Processing Techniques
- Advanced Vision and Imaging
- Image Retrieval and Classification Techniques
- Video Surveillance and Tracking Methods
- Image Processing Techniques and Applications
- Topic Modeling
- Image and Signal Denoising Methods
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
- Image Enhancement Techniques
- Robot Manipulation and Learning
- Human Motion and Animation
- Anomaly Detection Techniques and Applications
- Face Recognition and Analysis
- Adversarial Robustness in Machine Learning
- CCD and CMOS Imaging Sensors
- Music and Audio Processing
- Natural Language Processing Techniques
Microsoft Research (United Kingdom)
2018-2025
Yantai University
2025
Microsoft Research Asia (China)
2015-2024
Lanzhou University
2024
Microsoft Research (India)
2024
Northwest Normal University
2024
Microsoft (United States)
2017-2023
Université de Bordeaux
2023
Laboratoire Bordelais de Recherche en Informatique
2023
Peking University
2023
Recognizing fine-grained categories (e.g., bird species) is highly challenging due to the difficulty of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually...
Recognizing fine-grained categories (e.g., bird species) highly relies on discriminative part localization and part-based fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that part localization (e.g., the head of a bird) and fine-grained feature learning (e.g., head shape) are mutually correlated. In this paper, we propose a novel part learning approach by a multi-attention convolutional neural network (MA-CNN), where part generation and feature learning can reinforce each other. MA-CNN consists of convolution, channel grouping...
We study image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution (HR) images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer HR textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated...
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates...
High-quality image inpainting requires filling missing regions in a damaged image with plausible content. Existing works either fill the regions by copying image patches or by generating semantically coherent patches from the region context, while neglecting the fact that both visual and semantic plausibility are highly demanded. In this paper, we propose a Pyramid-context Encoder Network (denoted as PEN-Net) for image inpainting by deep generative models. The proposed PEN-Net is built upon a U-Net structure with three tailored components,...
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, which often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals by a Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps...
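The trilinear attention idea above (attention maps derived from inter-channel relationships, roughly softmax(X·Xᵀ)·X over a flattened feature map) can be sketched as follows. This is a minimal illustration of the general mechanism, not TASN's exact formulation; the function names and the softmax normalization choice are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def trilinear_attention(x):
    """Sketch of a trilinear attention map.
    x: (C, HW) feature map flattened over spatial positions.
    Inter-channel relations (X X^T) redistribute each channel's
    spatial responses across correlated channels (... X)."""
    rel = softmax(x @ x.T, axis=-1)   # (C, C) inter-channel relations
    return rel @ x                    # (C, HW) attention maps
```

Each output row is a convex combination of channel response maps, so channels that co-activate on the same part reinforce one another.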
The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard and other popular methodologies for short-term tracking analysis, as well as a methodology for long-term tracking analysis. The challenge was composed of five challenges focusing on different domains: (i) VOT-ST2019 focused on short-term tracking in RGB, (ii)...
We address the problem of retrieving a specific moment from an untrimmed video by a query sentence. This is challenging because a target moment may take place in relation to other temporal moments in the video. Existing methods cannot tackle this challenge well since they consider moments individually and neglect their temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates its end time. This 2D map can cover diverse moments with different lengths, while representing their adjacent...
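The 2D temporal map described above can be sketched as a grid whose entry (i, j) represents the candidate moment spanning clips i through j. A minimal version, assuming mean pooling of clip features as the moment representation (the pooling choice is an illustrative assumption, not necessarily the paper's):

```python
import numpy as np

def build_2d_moment_map(clip_feats):
    """Build a 2D temporal map of candidate moments.
    clip_feats: (N, D) array of per-clip features.
    Entry (i, j) holds the moment spanning clips i..j (inclusive),
    represented by mean pooling; entries with j < i are invalid zeros."""
    n, d = clip_feats.shape
    # Prefix sums give O(1) mean pooling for every candidate span.
    prefix = np.concatenate([np.zeros((1, d)), np.cumsum(clip_feats, axis=0)])
    moment_map = np.zeros((n, n, d))
    for i in range(n):
        for j in range(i, n):
            moment_map[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return moment_map
```

The upper triangle enumerates all N(N+1)/2 moments of every length at once, so adjacency between moments becomes adjacency between cells of the map.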
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-language tasks. Our Pixel-BERT, which aligns semantic connections at the pixel level, solves the limitation of task-specific visual representation. It also relieves the cost of bounding box annotations and overcomes the unbalance between semantic labels...
Relative position encoding (RPE) is important for the transformer to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work equally as well as absolute position encoding? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE)...
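A simplified version of 2D relative position encoding can be sketched as a lookup of learned biases indexed by each query-key pair's 2D offset. This illustrates the general bucketing idea only; iRPE's actual piecewise index functions and directed/undirected variants differ, and the function names here are illustrative.

```python
import numpy as np

def relative_position_index(h, w):
    """Map each (query, key) pair on an h x w grid to a bucket index
    determined by their 2D offset. Offsets span (2h-1) x (2w-1)
    buckets, so one bias table is shared across all token pairs."""
    coords = np.stack(
        np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1
    ).reshape(-1, 2)                                # (h*w, 2) grid positions
    rel = coords[:, None, :] - coords[None, :, :]   # (hw, hw, 2) offsets
    rel[..., 0] += h - 1                            # shift offsets to >= 0
    rel[..., 1] += w - 1
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]  # flatten to bucket id

def rpe_attention_bias(h, w, bias_table):
    """Gather per-pair biases to add to the attention logits."""
    idx = relative_position_index(h, w)
    return bias_table[idx]                          # (hw, hw)
```

Because the index depends only on the offset, tokens with the same relative displacement share one learned parameter, which is what makes the encoding translation-invariant.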
Inspired by the recent success of text-based question answering, visual question answering (VQA) is proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process in the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To solve...
Recently, pure transformer-based models have shown great potential for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles...
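The weight entanglement idea that AutoFormer is built around can be sketched as candidate sub-networks sharing slices of one supernet weight, so training any subnet updates the shared parameters. The class below is an illustrative toy, not AutoFormer's implementation; the slicing convention (taking the leading rows/columns) is an assumption.

```python
import numpy as np

class EntangledLinear:
    """Weight-entanglement sketch: all candidate dimensions share one
    supernet weight matrix, and a subnet with output dimension k simply
    uses the first k columns (and first in_dim rows) of that matrix."""
    def __init__(self, max_in, max_out, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w = rng.standard_normal((max_in, max_out)) / np.sqrt(max_in)

    def forward(self, x, out_dim):
        in_dim = x.shape[-1]
        return x @ self.w[:in_dim, :out_dim]  # sliced shared weight
```

During one-shot search, different (depth, width, heads) choices then index different slices of the same tensors rather than maintaining separate weights per candidate.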
We study the joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics of paired natural languages. In this paper, we propose SOHO, to "See Out of tHe bOx", which...
Most current action localization methods follow an anchor-based pipeline: depicting action instances by pre-defined anchors, learning to select the anchors closest to the ground truth, and predicting anchor confidence with refinements. Pre-defined anchors set a prior about the location and duration of action instances, which facilitates localization for common instances but limits flexibility in tackling drastic varieties, especially extremely short or extremely long ones. To address this problem, this paper proposes a novel anchor-free module that assists action localization by temporal points...
Image inpainting that completes large free-form missing regions in images is a promising yet challenging task. State-of-the-art approaches have achieved significant progress by taking advantage of generative adversarial networks (GANs). However, these approaches can suffer from generating distorted structures and blurry textures at high resolution (e.g., 512×512). The challenges mainly derive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two...
Object tracking has achieved significant progress over the past few years. However, state-of-the-art trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications. In this work, we present LightTrack, which uses neural architecture search (NAS) to design more lightweight and efficient object trackers. Comprehensive experiments show that our LightTrack is effective. It can find trackers that achieve superior performance compared with handcrafted SOTA trackers, such...
We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations. Predominant works integrate region proposal mechanisms with convolutional neural networks (CNNs). Although a CNN is proficient at extracting discriminative local features, grand challenges still exist in measuring the likelihood of a bounding box containing a complete object (i.e., "objectness"). In this paper, we propose a novel WSOD framework with Objectness Distillation...
Dense crowd counting aims to predict thousands of human instances in an image by calculating the integral of a density map over image pixels. Existing approaches mainly suffer from extreme density variations. Such pattern shift poses challenges even for multi-scale model ensembling. In this paper, we propose a simple yet effective approach to tackle this problem. First, a patch-level density map is extracted by a density estimation model and further grouped into several density levels, which are determined over the full datasets. Second, each patch is automatically...
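The two basic operations in the abstract above, counting as the integral of a density map and grouping patches into density levels, can be sketched as follows. The level edges here are hypothetical thresholds standing in for the dataset-derived groupings the paper describes; the function names are illustrative.

```python
import numpy as np

def count_from_density(density_map):
    """Predicted crowd count is the integral (sum) of the density map,
    since each annotated head contributes unit mass to the map."""
    return float(density_map.sum())

def density_level(patch_count, level_edges):
    """Assign a patch to one of several density levels.
    level_edges: ascending thresholds (hypothetical here; in the paper
    they are determined over the full dataset)."""
    return int(np.searchsorted(level_edges, patch_count))
```

Binning patches by predicted count lets each level be normalized or modeled separately, which is how the approach copes with extreme density variation.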
Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, there remain grand challenges in effectively utilizing temporal dependency across entire video sequences. Existing approaches usually align and aggregate a limited number of adjacent frames (e.g., 5 or 7 frames), which prevents these approaches from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose...
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability. However, ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited memory. To alleviate this problem, we propose MiniViT, a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance. The central idea of MiniViT is to multiplex the weights of consecutive transformer blocks. More specifically, we make...
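The weight multiplexing idea above can be sketched as several consecutive blocks reusing one shared weight matrix, with only a small per-block transformation so the blocks are not identical. This is a toy illustration of the parameter-sharing principle, not MiniViT's actual architecture (which also uses distillation and per-block transformations richer than the diagonal scale assumed here).

```python
import numpy as np

class MultiplexedBlocks:
    """Weight-multiplexing sketch: L blocks reuse one shared d x d
    weight; each block adds only a per-block diagonal scale.
    Parameters: d*d shared + L*d per-block, vs. L*d*d unshared."""
    def __init__(self, d, num_blocks, rng=None):
        rng = rng or np.random.default_rng(0)
        self.shared_w = rng.standard_normal((d, d)) / np.sqrt(d)
        self.scales = [np.ones(d) for _ in range(num_blocks)]

    def forward(self, x):
        for s in self.scales:                      # consecutive blocks
            x = np.tanh(x @ (self.shared_w * s))   # per-block modulation
        return x

    def num_params(self):
        return self.shared_w.size + sum(s.size for s in self.scales)
```

With d = 8 and 4 blocks the sketch stores 96 parameters instead of 256, showing how the per-block cost collapses to the cheap modulation terms.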