- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Hand Gesture Recognition Systems
- Advanced Memory and Neural Computing
- Video Analysis and Summarization
- Advanced Vision and Imaging
- Advanced Image and Video Retrieval Techniques
- Neuroscience and Neural Engineering
- CCD and CMOS Imaging Sensors
- Image Processing Techniques and Applications
- Anomaly Detection Techniques and Applications
- Ferroelectric and Negative Capacitance Devices
- Music and Audio Processing
- Image Enhancement Techniques
- Diabetic Foot Ulcer Assessment and Management
- Video Surveillance and Tracking Methods
- Advanced Image Processing Techniques
- Cancer-related molecular mechanisms research
- Gaze Tracking and Assistive Technology
- Natural Language Processing Techniques
- Neural Networks and Reservoir Computing
- Advanced MRI Techniques and Applications
- Medical Image Segmentation Techniques
Google (United States)
2020-2024
University of Pennsylvania
2018-2019
California University of Pennsylvania
2019
In this work, we propose a novel framework for unsupervised learning for event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is used to attempt to remove any motion blur in the image. We then propose a loss function applied to the motion compensated event image that measures the motion blur in this image. We train two networks with this framework, one to predict optical flow, and one to predict egomotion and depths, and evaluate these networks on...
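As a rough illustration of the kind of discretized event volume described above, the sketch below bins events into a fixed number of temporal slices, splitting each event between its two nearest bins so the volume retains timing information. This is a minimal sketch under the assumption that events arrive as (x, y, t, polarity) tuples; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (x, y, t, polarity) into a discretized volume.

    Each event's polarity is distributed between the two nearest temporal
    bins with linear weights, so the volume preserves when events occurred.
    """
    volume = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]

    # Normalize timestamps to [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t_norm).astype(int)
    w1 = t_norm - t0          # weight for the upper bin
    w0 = 1.0 - w1             # weight for the lower bin

    np.add.at(volume, (t0, y, x), p * w0)
    np.add.at(volume, (np.minimum(t0 + 1, num_bins - 1), y, x), p * w1)
    return volume
```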
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic,...
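VATT is trained with multimodal contrastive losses; as a generic illustration (not the paper's exact objective, which uses NCE- and MIL-NCE-style losses across modality pairs), the sketch below computes a symmetric InfoNCE-style loss between two batches of embeddings from different modalities. All names are illustrative.

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE between two L2-normalized embedding batches.

    emb_a, emb_b: (batch, dim) arrays where row i of each batch forms a
    positive pair; all other rows in the batch serve as negatives.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    targets = np.arange(len(a))             # positives sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # Average the loss over both matching directions (a -> b and b -> a).
    return 0.5 * (xent(logits) + xent(logits.T))
```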
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to work on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate...
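Online, streaming inference of the kind MoViNets target generally requires carrying a small amount of temporal context between clips instead of buffering an entire video. The sketch below illustrates that general idea with a causal temporal convolution that caches its left context between calls; it is a simplified stand-in, not the stream-buffer design from the paper, and the weights here are random placeholders for learned parameters.

```python
import numpy as np

class StreamingTemporalConv:
    """Causal temporal convolution that caches its left context between calls,
    so a long video can be processed clip by clip with constant memory."""

    def __init__(self, kernel_size, channels):
        self.k = kernel_size
        self.weights = np.random.randn(kernel_size, channels) * 0.01  # placeholder
        self.buffer = np.zeros((kernel_size - 1, channels))           # cached frames

    def __call__(self, clip):
        # clip: (frames, channels). Prepend cached context, convolve causally.
        x = np.concatenate([self.buffer, clip], axis=0)
        out = np.stack([(x[t:t + self.k] * self.weights).sum(axis=0)
                        for t in range(clip.shape[0])])
        # Carry the last (k - 1) frames over to the next clip.
        self.buffer = x[len(x) - (self.k - 1):]
        return out
```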
Event-based cameras have shown great promise in a variety of situations where frame-based cameras suffer, such as high speed motions and high dynamic range scenes. However, developing algorithms for event measurements requires a new class of hand crafted algorithms. Deep learning has shown great success in providing model-free solutions to many problems in the vision community, but existing networks have been developed with frame-based images in mind, and there does not exist a wealth of labeled data for events for supervised training. To address these points, we present...
In this letter, we address the problem of providing human-assisted quadrotor navigation using a set of eye tracking glasses. The advent of these devices (i.e., eye tracking glasses, virtual reality tools, etc.) provides the opportunity to create new, noninvasive forms of interaction between humans and robots. We show how glasses equipped with a gaze tracker, a camera, and an inertial measurement unit (IMU) can be used to estimate the relative position of the human with respect to a quadrotor, and to decouple the gaze direction from the head orientation, which allows...
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5%...
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated...
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of the segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly mask transformers, which directly...
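Schematically, the two formulations can be contrasted as below; the notation is ours and purely illustrative, with N denoting the number of predicted masks.

```latex
% Per-pixel classification: each pixel q gets its own class posterior.
\hat{y}_{\text{pixel}}(q) = \arg\max_{c}\; p_\theta(c \mid q)

% Cluster (mask) prediction: the model emits N mask--class pairs (m_i, p_i),
% with soft masks m_i \in [0,1]^{H \times W}; a dense labeling is recovered
% by combining them, e.g.
\hat{y}_{\text{mask}}(q) = \arg\max_{c}\; \sum_{i=1}^{N} p_i(c)\, m_i(q)
```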
We propose a light-weight video frame interpolation algorithm. Our key innovation is an instance-level supervision that allows information to be learned from the high-resolution version of similar objects. Our experiments show that the proposed method can generate state-of-the-art results across different datasets, with fractional computation resources (time and memory) of competing methods. Given two image frames, a cascade network creates an intermediate frame: 1) a flow-warping module computes coarse bi-directional...
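For intuition on the flow-warping step, the sketch below synthesizes a crude intermediate frame by warping both inputs toward time t with linearly scaled flow and blending them. It uses nearest-neighbor sampling and the common linear-motion approximation; the actual module is considerably more sophisticated, and all names and flow conventions here are illustrative assumptions.

```python
import numpy as np

def backward_warp(image, flow):
    """Warp `image` (H, W, C) by sampling it at locations displaced by
    `flow` (H, W, 2); nearest-neighbor lookup for brevity (real
    interpolators use differentiable bilinear sampling)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    return image[src_y, src_x]

def naive_interpolate(frame0, frame1, flow_0to1, t=0.5):
    """Rough intermediate-frame synthesis from coarse flow, using the
    linear-motion approximation F_{t->0} ~ -t*F_{0->1}, F_{t->1} ~ (1-t)*F_{0->1}."""
    warped0 = backward_warp(frame0, -t * flow_0to1)
    warped1 = backward_warp(frame1, (1.0 - t) * flow_0to1)
    return (1.0 - t) * warped0 + t * warped1
```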
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization (CV-MIM), which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure disentanglement and smoothness of the learned representations. The resulting representations can be used for action recognition. To evaluate the power...
Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes suboptimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn such representations via self-supervision. We first design a region-based...
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a \textit{surrogate gap}, a measure equivalent to the dominant eigenvalue of the Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small. The surrogate gap is easy to compute and feasible...
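In the notation of the abstract, the perturbed loss and the surrogate gap can be written as follows, with rho denoting the neighborhood radius.

```latex
L_p(\theta) \;=\; \max_{\lVert \epsilon \rVert \le \rho} L(\theta + \epsilon),
\qquad
h(\theta) \;=\; L_p(\theta) - L(\theta).
```

A sharp minimum can still achieve a small perturbed loss, but only a flat one keeps both the perturbed loss and the gap small, which is why the gap is a useful quantity to control alongside L_p.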
We propose a demo of our work, Unsupervised Event-based Learning of Optical Flow, Depth and Egomotion, which will also appear at CVPR 2019. Our demo consists of a CNN which takes as input events from a DAVIS-346b event camera, represented as a discretized event volume, and predicts optical flow for each pixel in the image. Due to the generalization abilities of our network, we are able to predict accurate flow in a very wide range of scenes, including fast motions and challenging lighting.
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs)....
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted model performs well on a wide range of benchmarks. For instance, it surpasses the best...
We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantics of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos,...
We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows....
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings of the two clips, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different tasks may require...
We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts of the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls, which are based on explicit signals, $\delta$-Diffusion adopts the form of implicit...