- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Generative Adversarial Networks and Image Synthesis
- Video Analysis and Summarization
- Music and Audio Processing
- Gait Recognition and Analysis
- Cell Image Analysis Techniques
- Hand Gesture Recognition Systems
- Topic Modeling
- Autonomous Vehicle Technology and Safety
- CCD and CMOS Imaging Sensors
- Advanced Memory and Neural Computing
- Image Processing Techniques and Applications
- Human Motion and Animation
- Speech and Audio Processing
- Robotics and Sensor-Based Localization
- Natural Language Processing Techniques
- Neural Networks and Applications
- Image Processing and 3D Reconstruction
Meta (Israel)
2018-2022
Meta (United States)
2019-2021
Menlo School
2019-2021
Carnegie Mellon University
2021
Graz University of Technology
2013-2019
University of Oxford
2018
Austrian Academy of Sciences
2017
University of Graz
2014
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic backbone and demonstrating remarkable performance on a wide...
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions of our SlowFast concept. We report state-of-the-art accuracy on major...
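The dual-rate sampling at the heart of this abstract can be sketched in a few lines of NumPy. The strides (16 for Slow, 2 for Fast) are typical instantiations from the paper, but the clip shape and the helper below are illustrative, not the released implementation.

```python
import numpy as np

def slowfast_sample(video, slow_stride=16, fast_stride=2):
    """Sample one clip at two frame rates, as in SlowFast networks.

    video: array of shape (T, H, W, C). The Slow pathway sees few
    frames (spatial semantics); the Fast pathway sees many frames
    (fine temporal resolution) and, in the paper, uses a fraction
    (e.g. 1/8) of the channels to stay lightweight.
    """
    slow = video[::slow_stride]   # low frame rate input
    fast = video[::fast_stride]   # high frame rate input
    return slow, fast

# A hypothetical 64-frame clip of 224x224 RGB frames.
clip = np.zeros((64, 224, 224, 3), dtype=np.float32)
slow, fast = slowfast_sample(clip)
print(slow.shape[0], fast.shape[0])  # 4 slow frames, 32 fast frames
```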
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters,...
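Finding (i), fusion at a convolution layer, can be illustrated as channel stacking followed by a learned 1x1 convolution. This is a minimal NumPy sketch with hypothetical feature-map sizes; a 1x1 convolution over channels is just a matrix product per spatial position.

```python
import numpy as np

def conv_fusion(x_spatial, x_temporal, w):
    """Fuse appearance and motion feature maps at a conv layer.

    x_spatial, x_temporal: (H, W, C) maps from the two towers.
    w: (2C, C) weights of a 1x1 fusion convolution. Stacking the
    channels and learning the 1x1 conv lets the network learn
    correspondences between streams, unlike softmax-level fusion.
    """
    stacked = np.concatenate([x_spatial, x_temporal], axis=-1)  # (H, W, 2C)
    return stacked @ w                                          # (H, W, C)

rng = np.random.default_rng(0)
xs = rng.normal(size=(14, 14, 256))   # appearance features
xt = rng.normal(size=(14, 14, 256))   # motion features
w = rng.normal(size=(512, 256))       # 1x1 fusion weights
fused = conv_fusion(xs, xt, w)
```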
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses, and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position...
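The appeal of dilated temporal convolutions is that a few stacked layers cover a long window of frames. A sketch of the receptive-field arithmetic and a toy valid dilated 1D convolution (single output channel, hypothetical feature sizes; not the paper's model):

```python
import numpy as np

def receptive_field(kernel=3, dilations=(1, 3, 9, 27, 81)):
    """Temporal receptive field of stacked dilated 1D convolutions."""
    return 1 + (kernel - 1) * sum(dilations)

def dilated_conv1d(x, w, dilation):
    """Valid 1D convolution over time with dilation.

    x: (T, C) sequence of per-frame keypoint features.
    w: (kernel, C) filter producing one output channel.
    """
    k = w.shape[0]
    span = (k - 1) * dilation
    T_out = x.shape[0] - span
    return np.stack([
        sum(x[t + i * dilation] @ w[i] for i in range(k))
        for t in range(T_out)
    ])

# Five layers of kernel-3 convs with exponentially growing dilation
# already span 243 frames of context.
print(receptive_field())  # 243
```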
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves...
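The stepwise forward-expansion loop amounts to a greedy search: at each step, try growing each axis and keep the one with the best accuracy gain per unit of added complexity. The sketch below uses toy stand-in functions for "train and measure accuracy" and "count FLOPs"; all names and numbers are hypothetical.

```python
def x3d_expand(config, target_cost, axes, eval_fn, cost_fn):
    """Greedy stepwise expansion: grow exactly one axis per step.

    config: dict axis -> current value (e.g. frames, width, depth).
    axes: dict axis -> expansion factor to try.
    eval_fn / cost_fn are placeholders for accuracy and complexity.
    """
    while cost_fn(config) < target_cost:
        best = None
        for axis, factor in axes.items():
            trial = {**config, axis: config[axis] * factor}
            gain = eval_fn(trial) - eval_fn(config)
            added = cost_fn(trial) - cost_fn(config)
            if best is None or gain / added > best[0]:
                best = (gain / added, trial)
        config = best[1]   # keep the most cost-effective expansion
    return config

# Toy stand-ins for "accuracy" and "FLOPs" of a config (hypothetical).
acc = lambda c: c["frames"] + c["width"]
flops = lambda c: c["frames"] * c["width"]
final = x3d_expand({"frames": 4, "width": 16}, 256,
                   {"frames": 2, "width": 2}, acc, flops)
print(final)  # width wins each step here: {'frames': 4, 'width': 64}
```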
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex,...
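The channel-resolution schedule described here can be written down directly: each stage halves the spatial resolution while doubling the channel dimension. The starting numbers below (56x56 at 96 channels) are in the spirit of the paper's base model, but the helper itself is only an illustrative sketch.

```python
def mvit_stages(h, w, channels, num_stages=4):
    """Channel-resolution schedule of a multiscale transformer.

    Each stage halves spatial resolution (via pooled attention in
    MViT) and doubles the channel dimension, forming a pyramid:
    early stages high-resolution/low-channel, late stages the reverse.
    """
    schedule = []
    for _ in range(num_stages):
        schedule.append((h, w, channels))
        h, w, channels = h // 2, w // 2, channels * 2
    return schedule

print(mvit_stages(56, 56, 96))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```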
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies we also inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single...
The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object...
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during...
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets to the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform...
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is...
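One way to read "supportive information" is as attention from the current short clip's features over a bank of precomputed features spanning the whole video. This NumPy sketch shows that mechanism with hypothetical feature dimensions; it is an illustration of the idea, not the paper's exact operator.

```python
import numpy as np

def attend_over_bank(short_clip_feats, bank):
    """Relate a short clip to long-term context via attention.

    short_clip_feats: (N, D) features from the current 2-5s clip.
    bank: (M, D) features precomputed over the entire video (the
    long-term feature bank). Softmax attention lets each short-clip
    feature read supportive context from any point in the video.
    """
    logits = short_clip_feats @ bank.T / np.sqrt(bank.shape[1])
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ bank  # (N, D) context-augmented features

rng = np.random.default_rng(0)
ctx = attend_over_bank(rng.normal(size=(5, 128)),    # current clip
                       rng.normal(size=(100, 128)))  # whole-video bank
```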
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without...
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual...
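The prediction target here, HOG with local contrast normalization, is simple enough to sketch. Below is a minimal NumPy version (non-overlapping cells, no block structure) where the per-cell L2 normalization plays the role of the local contrast normalization the abstract highlights; it is a simplified illustration, not the descriptor used in the paper's codebase.

```python
import numpy as np

def hog_features(img, cell=8, bins=9, eps=1e-6):
    """Minimal Histograms of Oriented Gradients, per-cell L2-normalized.

    img: (H, W) grayscale image. For each cell, gradient magnitudes
    are binned by unsigned orientation, then the histogram is
    L2-normalized locally (the contrast normalization step).
    """
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation [0, pi)
    H, W = img.shape
    feats = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            hist = np.bincount(idx, weights=m, minlength=bins)
            feats.append(hist / (np.linalg.norm(hist) + eps))  # local norm
    return np.array(feats)

f = hog_features(np.random.default_rng(0).random((32, 32)))
```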
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations...
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (the only exception being patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images),...
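The spacetime-agnostic random masking at a 90% ratio is easy to make concrete: flatten all spacetime patches into one token list and keep a random 10%. The patch grid below (16 frames of 14x14 patches) is an illustrative choice, not necessarily the paper's configuration.

```python
import numpy as np

def random_spacetime_mask(num_patches, ratio=0.9, rng=None):
    """Spacetime-agnostic random masking for a video MAE.

    Returns sorted indices of the kept (visible) patches; with
    ratio=0.9 the encoder sees only 10% of the spacetime tokens,
    which is what makes pre-training efficient.
    """
    rng = rng or np.random.default_rng()
    num_keep = int(num_patches * (1 - ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep])

# e.g. 16 frames x 14 x 14 spatial patches = 3136 spacetime tokens
keep = random_spacetime_mask(16 * 14 * 14, ratio=0.9)
print(len(keep))  # 313 visible tokens
```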
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from the different learning dynamics of the two modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective...
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and to contrast more samples per iteration with a similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity...
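The compute trade-off behind this idea can be captured in one line of first-order accounting: dropping a fraction of patches cuts the per-image encoding cost roughly proportionally, so the freed budget buys a larger batch of image-text pairs. The function and numbers below are a hypothetical back-of-the-envelope sketch, not measurements.

```python
def flip_throughput(batch, mask_ratio):
    """Image-text pairs per fixed image-encoder budget under masking.

    Removing mask_ratio of the patches leaves only (1 - mask_ratio)
    of the encoding cost per image (attention overheads aside), so
    the same budget covers proportionally more pairs per iteration,
    i.e. more wall-clock speed and more contrastive negatives.
    """
    cost_per_image = 1.0 - mask_ratio      # fraction of patches encoded
    return int(batch / cost_per_image)     # pairs for the same budget

print(flip_throughput(256, 0.5))  # 512: 2x the pairs for the same compute
```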
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of video without hitting computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term...
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window...