- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Hand Gesture Recognition Systems
- Advanced Memory and Neural Computing
- Video Analysis and Summarization
- Advanced Vision and Imaging
- Advanced Image and Video Retrieval Techniques
- Neuroscience and Neural Engineering
- CCD and CMOS Imaging Sensors
- Image Processing Techniques and Applications
- Anomaly Detection Techniques and Applications
- Ferroelectric and Negative Capacitance Devices
- Music and Audio Processing
- Image Enhancement Techniques
- Diabetic Foot Ulcer Assessment and Management
- Video Surveillance and Tracking Methods
- Advanced Image Processing Techniques
- Cancer-related molecular mechanisms research
- Gaze Tracking and Assistive Technology
- Natural Language Processing Techniques
- Neural Networks and Reservoir Computing
- Advanced MRI Techniques and Applications
- Medical Image Segmentation Techniques
Google (United States)
2020-2024
University of Pennsylvania
2018-2019
California University of Pennsylvania
2019
In this work, we propose a novel framework for unsupervised learning for event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is used to attempt to remove any motion blur in the image. We then propose a loss function applied to the motion compensated event image that measures the motion blur in this image. We train two networks with this framework, one to predict optical flow, and one to predict egomotion and depths, and evaluate these networks on...
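As a rough illustration of the kind of discretized event volume described above, the sketch below bins events into a fixed number of temporal slices, splitting each event between its two nearest bins so the volume retains timing information. This is a minimal sketch under the assumption that events arrive as (x, y, t, polarity) tuples; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (x, y, t, polarity) into a discretized volume.

    Each event's polarity is distributed between the two nearest temporal
    bins with linear weights, so the volume preserves when events occurred.
    """
    volume = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]

    # Normalize timestamps to [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t_norm).astype(int)
    w1 = t_norm - t0          # weight for the upper bin
    w0 = 1.0 - w1             # weight for the lower bin

    np.add.at(volume, (t0, y, x), p * w0)
    np.add.at(volume, (np.minimum(t0 + 1, num_bins - 1), y, x), p * w1)
    return volume
```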
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic,...
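VATT is trained with multimodal contrastive losses; as a generic illustration (not the paper's exact objective, which uses NCE- and MIL-NCE-style losses across modality pairs), the sketch below computes a symmetric InfoNCE-style loss between two batches of embeddings from different modalities. All names are illustrative.

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE between two L2-normalized embedding batches.

    emb_a, emb_b: (batch, dim) arrays where row i of each batch forms a
    positive pair; all other rows in the batch serve as negatives.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    targets = np.arange(len(a))             # positives sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # Average the loss over both matching directions (a -> b and b -> a).
    return 0.5 * (xent(logits) + xent(logits.T))
```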
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to work on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate...
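Online, streaming inference of the kind MoViNets target generally requires carrying a small amount of temporal context between clips instead of buffering an entire video. The sketch below illustrates that general idea with a causal temporal convolution that caches its left context between calls; it is a simplified stand-in, not the stream-buffer design from the paper, and the weights here are random placeholders for learned parameters.

```python
import numpy as np

class StreamingTemporalConv:
    """Causal temporal convolution that caches its left context between calls,
    so a long video can be processed clip by clip with constant memory."""

    def __init__(self, kernel_size, channels):
        self.k = kernel_size
        self.weights = np.random.randn(kernel_size, channels) * 0.01  # placeholder
        self.buffer = np.zeros((kernel_size - 1, channels))           # cached frames

    def __call__(self, clip):
        # clip: (frames, channels). Prepend cached context, convolve causally.
        x = np.concatenate([self.buffer, clip], axis=0)
        out = np.stack([(x[t:t + self.k] * self.weights).sum(axis=0)
                        for t in range(clip.shape[0])])
        # Carry the last (k - 1) frames over to the next clip.
        self.buffer = x[len(x) - (self.k - 1):]
        return out
```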
Event-based cameras have shown great promise in a variety of situations where frame-based cameras suffer, such as high speed motions and high dynamic range scenes. However, developing algorithms for event measurements requires a new class of hand crafted algorithms. Deep learning has shown great success in providing model-free solutions to many problems in the vision community, but existing networks have been developed with frame-based images in mind, and there does not exist a wealth of labeled data for events for supervised training. To address these points, we present...
In this letter, we address the problem of providing human-assisted quadrotor navigation using a set of eye tracking glasses. The advent of these devices (i.e., eye tracking glasses, virtual reality tools, etc.) provides the opportunity to create new, noninvasive forms of interaction between humans and robots. We show how glasses equipped with a gaze tracker, a camera, and an inertial measurement unit (IMU) can be used to estimate the relative position of the human with respect to a quadrotor, and to decouple the gaze direction from the head orientation, which allows...
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5%...
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated...
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of the segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly mask transformers, which directly...
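Schematically, the two formulations can be contrasted as below; the notation is ours and purely illustrative, with N denoting the number of predicted masks.

```latex
% Per-pixel classification: each pixel q gets its own class posterior.
\hat{y}_{\text{pixel}}(q) = \arg\max_{c}\; p_\theta(c \mid q)

% Cluster (mask) prediction: the model emits N mask--class pairs (m_i, p_i),
% with soft masks m_i \in [0,1]^{H \times W}; a dense labeling is recovered
% by combining them, e.g.
\hat{y}_{\text{mask}}(q) = \arg\max_{c}\; \sum_{i=1}^{N} p_i(c)\, m_i(q)
```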
We propose a light-weight video frame interpolation algorithm. Our key innovation is an instance-level supervision that allows information to be learned from the high-resolution version of similar objects. Our experiments show that the proposed method can generate state-of-the-art results across different datasets, with fractional computation resources (time and memory) of competing methods. Given two image frames, a cascade network creates an intermediate frame: 1) a flow-warping module computes coarse bi-directional...
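For intuition on the flow-warping step, the sketch below synthesizes a crude intermediate frame by warping both inputs toward time t with linearly scaled flow and blending them. It uses nearest-neighbor sampling and the common linear-motion approximation; the actual module is considerably more sophisticated, and all names and flow conventions here are illustrative assumptions.

```python
import numpy as np

def backward_warp(image, flow):
    """Warp `image` (H, W, C) by sampling it at locations displaced by
    `flow` (H, W, 2); nearest-neighbor lookup for brevity (real
    interpolators use differentiable bilinear sampling)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    return image[src_y, src_x]

def naive_interpolate(frame0, frame1, flow_0to1, t=0.5):
    """Rough intermediate-frame synthesis from coarse flow, using the
    linear-motion approximation F_{t->0} ~ -t*F_{0->1}, F_{t->1} ~ (1-t)*F_{0->1}."""
    warped0 = backward_warp(frame0, -t * flow_0to1)
    warped1 = backward_warp(frame1, (1.0 - t) * flow_0to1)
    return (1.0 - t) * warped0 + t * warped1
```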
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization (CV-MIM), which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure disentanglement and smoothness of the learned representations. The resulting representations can be used for action recognition. To evaluate the power...
Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes suboptimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn such representations via self-supervision. We first design a region-based...
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a \textit{surrogate gap}, a measure equivalent to the dominant eigenvalue of the Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small. The surrogate gap is easy to compute and feasible...
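In the notation of the abstract, the perturbed loss and the surrogate gap can be written as follows, with rho denoting the neighborhood radius.

```latex
L_p(\theta) \;=\; \max_{\lVert \epsilon \rVert \le \rho} L(\theta + \epsilon),
\qquad
h(\theta) \;=\; L_p(\theta) - L(\theta).
```

A sharp minimum can still achieve a small perturbed loss, but only a flat one keeps both the perturbed loss and the gap small, which is why the gap is a useful quantity to control alongside L_p.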
We propose a demo of our work, Unsupervised Event-based Learning of Optical Flow, Depth and Egomotion, which will also appear at CVPR 2019. Our demo consists of a CNN which takes as input events from a DAVIS-346b event camera, represented as a discretized event volume, and predicts optical flow for each pixel in the image. Due to the generalization abilities of our network, we are able to predict accurate flow in a very wide range of scenes, including fast motions and challenging lighting.
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs)....
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted model performs well on a wide range of benchmarks. For instance, it surpasses the best...
We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantics of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos,...
We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows....
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings of the two clips, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different tasks may require...
We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts of the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls, which are based on explicit signals, $\delta$-Diffusion adopts the form of implicit...