- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Generative Adversarial Networks and Image Synthesis
- Video Analysis and Summarization
- Music and Audio Processing
- Gait Recognition and Analysis
- Cell Image Analysis Techniques
- Hand Gesture Recognition Systems
- Topic Modeling
- Autonomous Vehicle Technology and Safety
- CCD and CMOS Imaging Sensors
- Advanced Memory and Neural Computing
- Image Processing Techniques and Applications
- Human Motion and Animation
- Speech and Audio Processing
- Robotics and Sensor-Based Localization
- Natural Language Processing Techniques
- Neural Networks and Applications
- Image Processing and 3D Reconstruction
Meta (Israel)
2018-2022
Meta (United States)
2019-2021
Menlo School
2019-2021
Carnegie Mellon University
2021
Graz University of Technology
2013-2019
University of Oxford
2018
Austrian Academy of Sciences
2017
University of Graz
2014
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic backbone and demonstrating remarkable performance on a wide...
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions of our SlowFast concept. We report state-of-the-art accuracy on major...
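The dual-rate sampling at the heart of this abstract can be sketched in a few lines of NumPy. The strides (16 for Slow, 2 for Fast) are typical instantiations from the paper, but the clip shape and the helper below are illustrative, not the released implementation.

```python
import numpy as np

def slowfast_sample(video, slow_stride=16, fast_stride=2):
    """Sample one clip at two frame rates, as in SlowFast networks.

    video: array of shape (T, H, W, C). The Slow pathway sees few
    frames (spatial semantics); the Fast pathway sees many frames
    (fine temporal resolution) and, in the paper, uses a fraction
    (e.g. 1/8) of the channels to stay lightweight.
    """
    slow = video[::slow_stride]   # low frame rate input
    fast = video[::fast_stride]   # high frame rate input
    return slow, fast

# A hypothetical 64-frame clip of 224x224 RGB frames.
clip = np.zeros((64, 224, 224, 3), dtype=np.float32)
slow, fast = slowfast_sample(clip)
print(slow.shape[0], fast.shape[0])  # 4 slow frames, 32 fast frames
```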
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters,...
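Finding (i), fusion at a convolution layer, can be illustrated as channel stacking followed by a learned 1x1 convolution. This is a minimal NumPy sketch with hypothetical feature-map sizes; a 1x1 convolution over channels is just a matrix product per spatial position.

```python
import numpy as np

def conv_fusion(x_spatial, x_temporal, w):
    """Fuse appearance and motion feature maps at a conv layer.

    x_spatial, x_temporal: (H, W, C) maps from the two towers.
    w: (2C, C) weights of a 1x1 fusion convolution. Stacking the
    channels and learning the 1x1 conv lets the network learn
    correspondences between streams, unlike softmax-level fusion.
    """
    stacked = np.concatenate([x_spatial, x_temporal], axis=-1)  # (H, W, 2C)
    return stacked @ w                                          # (H, W, C)

rng = np.random.default_rng(0)
xs = rng.normal(size=(14, 14, 256))   # appearance features
xt = rng.normal(size=(14, 14, 256))   # motion features
w = rng.normal(size=(512, 256))       # 1x1 fusion weights
fused = conv_fusion(xs, xt, w)
```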
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses, and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position...
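The appeal of dilated temporal convolutions is that a few stacked layers cover a long window of frames. A sketch of the receptive-field arithmetic and a toy valid dilated 1D convolution (single output channel, hypothetical feature sizes; not the paper's model):

```python
import numpy as np

def receptive_field(kernel=3, dilations=(1, 3, 9, 27, 81)):
    """Temporal receptive field of stacked dilated 1D convolutions."""
    return 1 + (kernel - 1) * sum(dilations)

def dilated_conv1d(x, w, dilation):
    """Valid 1D convolution over time with dilation.

    x: (T, C) sequence of per-frame keypoint features.
    w: (kernel, C) filter producing one output channel.
    """
    k = w.shape[0]
    span = (k - 1) * dilation
    T_out = x.shape[0] - span
    return np.stack([
        sum(x[t + i * dilation] @ w[i] for i in range(k))
        for t in range(T_out)
    ])

# Five layers of kernel-3 convs with exponentially growing dilation
# already span 243 frames of context.
print(receptive_field())  # 243
```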
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves...
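The stepwise forward-expansion loop amounts to a greedy search: at each step, try growing each axis and keep the one with the best accuracy gain per unit of added complexity. The sketch below uses toy stand-in functions for "train and measure accuracy" and "count FLOPs"; all names and numbers are hypothetical.

```python
def x3d_expand(config, target_cost, axes, eval_fn, cost_fn):
    """Greedy stepwise expansion: grow exactly one axis per step.

    config: dict axis -> current value (e.g. frames, width, depth).
    axes: dict axis -> expansion factor to try.
    eval_fn / cost_fn are placeholders for accuracy and complexity.
    """
    while cost_fn(config) < target_cost:
        best = None
        for axis, factor in axes.items():
            trial = {**config, axis: config[axis] * factor}
            gain = eval_fn(trial) - eval_fn(config)
            added = cost_fn(trial) - cost_fn(config)
            if best is None or gain / added > best[0]:
                best = (gain / added, trial)
        config = best[1]   # keep the most cost-effective expansion
    return config

# Toy stand-ins for "accuracy" and "FLOPs" of a config (hypothetical).
acc = lambda c: c["frames"] + c["width"]
flops = lambda c: c["frames"] * c["width"]
final = x3d_expand({"frames": 4, "width": 16}, 256,
                   {"frames": 2, "width": 2}, acc, flops)
print(final)  # width wins each step here: {'frames': 4, 'width': 64}
```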
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex,...
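The channel-resolution schedule described here can be written down directly: each stage halves the spatial resolution while doubling the channel dimension. The starting numbers below (56x56 at 96 channels) are in the spirit of the paper's base model, but the helper itself is only an illustrative sketch.

```python
def mvit_stages(h, w, channels, num_stages=4):
    """Channel-resolution schedule of a multiscale transformer.

    Each stage halves spatial resolution (via pooled attention in
    MViT) and doubles the channel dimension, forming a pyramid:
    early stages high-resolution/low-channel, late stages the reverse.
    """
    schedule = []
    for _ in range(num_stages):
        schedule.append((h, w, channels))
        h, w, channels = h // 2, w // 2, channels * 2
    return schedule

print(mvit_stages(56, 56, 96))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```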
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies we also inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single...
The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object...
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during...
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets to the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform...
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is...
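One way to read "supportive information" is as attention from the current short clip's features over a bank of precomputed features spanning the whole video. This NumPy sketch shows that mechanism with hypothetical feature dimensions; it is an illustration of the idea, not the paper's exact operator.

```python
import numpy as np

def attend_over_bank(short_clip_feats, bank):
    """Relate a short clip to long-term context via attention.

    short_clip_feats: (N, D) features from the current 2-5s clip.
    bank: (M, D) features precomputed over the entire video (the
    long-term feature bank). Softmax attention lets each short-clip
    feature read supportive context from any point in the video.
    """
    logits = short_clip_feats @ bank.T / np.sqrt(bank.shape[1])
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ bank  # (N, D) context-augmented features

rng = np.random.default_rng(0)
ctx = attend_over_bank(rng.normal(size=(5, 128)),    # current clip
                       rng.normal(size=(100, 128)))  # whole-video bank
```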
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without...
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual...
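The prediction target here, HOG with local contrast normalization, is simple enough to sketch. Below is a minimal NumPy version (non-overlapping cells, no block structure) where the per-cell L2 normalization plays the role of the local contrast normalization the abstract highlights; it is a simplified illustration, not the descriptor used in the paper's codebase.

```python
import numpy as np

def hog_features(img, cell=8, bins=9, eps=1e-6):
    """Minimal Histograms of Oriented Gradients, per-cell L2-normalized.

    img: (H, W) grayscale image. For each cell, gradient magnitudes
    are binned by unsigned orientation, then the histogram is
    L2-normalized locally (the contrast normalization step).
    """
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation [0, pi)
    H, W = img.shape
    feats = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            hist = np.bincount(idx, weights=m, minlength=bins)
            feats.append(hist / (np.linalg.norm(hist) + eps))  # local norm
    return np.array(feats)

f = hog_features(np.random.default_rng(0).random((32, 32)))
```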
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations...
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (the only exception being patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images),...
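The spacetime-agnostic random masking at a 90% ratio is easy to make concrete: flatten all spacetime patches into one token list and keep a random 10%. The patch grid below (16 frames of 14x14 patches) is an illustrative choice, not necessarily the paper's configuration.

```python
import numpy as np

def random_spacetime_mask(num_patches, ratio=0.9, rng=None):
    """Spacetime-agnostic random masking for a video MAE.

    Returns sorted indices of the kept (visible) patches; with
    ratio=0.9 the encoder sees only 10% of the spacetime tokens,
    which is what makes pre-training efficient.
    """
    rng = rng or np.random.default_rng()
    num_keep = int(num_patches * (1 - ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep])

# e.g. 16 frames x 14 x 14 spatial patches = 3136 spacetime tokens
keep = random_spacetime_mask(16 * 14 * 14, ratio=0.9)
print(len(keep))  # 313 visible tokens
```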
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from the different learning dynamics of the two modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective...
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and to contrast more samples per iteration with a similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity...
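The compute trade-off behind this idea can be captured in one line of first-order accounting: dropping a fraction of patches cuts the per-image encoding cost roughly proportionally, so the freed budget buys a larger batch of image-text pairs. The function and numbers below are a hypothetical back-of-the-envelope sketch, not measurements.

```python
def flip_throughput(batch, mask_ratio):
    """Image-text pairs per fixed image-encoder budget under masking.

    Removing mask_ratio of the patches leaves only (1 - mask_ratio)
    of the encoding cost per image (attention overheads aside), so
    the same budget covers proportionally more pairs per iteration,
    i.e. more wall-clock speed and more contrastive negatives.
    """
    cost_per_image = 1.0 - mask_ratio      # fraction of patches encoded
    return int(batch / cost_per_image)     # pairs for the same budget

print(flip_throughput(256, 0.5))  # 512: 2x the pairs for the same compute
```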
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of video without hitting computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term...
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window...