- Video Surveillance and Tracking Methods
- Advanced Neural Network Applications
- Anomaly Detection Techniques and Applications
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Autonomous Vehicle Technology and Safety
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Topic Modeling
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
- Explainable Artificial Intelligence (XAI)
- Face Recognition and Analysis
- COVID-19 Diagnosis Using AI
- Human-Animal Interaction Studies
- Natural Language Processing Techniques
- Video Analysis and Summarization
- Machine Learning and Data Classification
- Remote Sensing and LiDAR Applications
- Air Quality Monitoring and Forecasting
- 3D Shape Modeling and Analysis
- Healthcare Technology and Patient Monitoring
- Neural Networks and Applications
Toyota Research Institute, 2023-2024
Toyota Industries (United States), 2024
Amazon (United States), 2022-2023
Amazon (Germany), 2023
Carnegie Mellon University, 2017-2022
Seattle University, 2022
Indian Institute of Technology Guwahati, 2021
University of California, Berkeley, 2014
We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger...
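A minimal sketch of the kind of comparison this abstract describes: score a classifier on a standard test set and on a naturally shifted one, and report the accuracy drop. The function and dataset names are hypothetical stand-ins, not the paper's testbed.

```python
# Hypothetical robustness comparison: standard vs. shifted test accuracy.
from typing import Callable, Iterable, Tuple

def accuracy(model: Callable, dataset: Iterable[Tuple[object, int]]) -> float:
    """Fraction of examples the model labels correctly."""
    correct = total = 0
    for image, label in dataset:
        correct += int(model(image) == label)
        total += 1
    return correct / max(total, 1)

def robustness_gap(model, standard_set, shifted_set):
    """Accuracy on each set, plus the drop under distribution shift."""
    std = accuracy(model, standard_set)
    shifted = accuracy(model, shifted_set)
    return std, shifted, std - shifted
```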
Detecting and segmenting individual objects, regardless of their category, is crucial for many applications such as action detection or robotic interaction. While this problem has been well-studied under the classic formulation of spatio-temporal grouping, state-of-the-art approaches do not make use of learning-based methods. To bridge this gap, we propose a simple learning-based approach for spatio-temporal grouping. Our approach leverages motion cues from optical flow as a bottom-up signal for separating objects from each other. Motion cues are then combined...
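As a toy illustration of the bottom-up motion signal (not the paper's learned method), one can threshold optical-flow magnitude and take connected components as class-agnostic object proposals:

```python
# Toy motion-based grouping: moving pixels -> connected components.
import numpy as np
from scipy import ndimage

def motion_proposals(flow: np.ndarray, mag_thresh: float = 1.0):
    """flow: HxWx2 optical flow field. Returns (label_map, num_segments)."""
    magnitude = np.linalg.norm(flow, axis=-1)   # per-pixel motion strength
    moving = magnitude > mag_thresh             # separate movers from background
    labels, num = ndimage.label(moving)         # connected components = proposals
    return labels, num

# Hypothetical usage with a random field standing in for real flow:
flow = np.random.randn(240, 320, 2)
segments, n = motion_proposals(flow, mag_thresh=2.0)
print(f"{n} motion-based segments")
```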
Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g., $\mathcal{J}\&\mathcal{F}$, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater...
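For reference, a compact sketch of the $\mathcal{J}\&\mathcal{F}$ metric named above: $\mathcal{J}$ is region IoU and $\mathcal{F}$ scores boundary agreement. The boundary term below is simplified (exact boundary pixels, no matching tolerance), so it only approximates the official evaluation.

```python
# Simplified J (region IoU) and F (boundary F1) for binary masks.
import numpy as np
from scipy.ndimage import binary_erosion

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """J measure: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def boundary_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified F measure: F1 over exact boundary pixels (no tolerance)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pb = pred & ~binary_erosion(pred)   # boundary = mask minus its erosion
    gb = gt & ~binary_erosion(gt)
    tp = (pb & gb).sum()
    prec = tp / pb.sum() if pb.sum() else 1.0
    rec = tp / gb.sum() if gb.sum() else 1.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
```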
Tracking and detecting any object, including ones never seen before during model training, is a crucial but elusive capability of autonomous systems. An agent that is blind to never-before-seen objects poses a safety hazard when operating in the real world, yet this is how almost all current systems work. One of the main obstacles to advancing the tracking-any-object task is that it is notoriously difficult to evaluate. A benchmark that would allow us to perform an apples-to-apples comparison of existing efforts is a crucial first step for this important research field. This...
While deep feature learning has revolutionized techniques for static-image understanding, the same does not quite hold for video processing. Architectures and optimization techniques used for video are largely based on those for static images, potentially underutilizing rich video information. In this work, we rethink both the underlying network architecture and the stochastic optimization paradigm for temporal data. To do so, we draw inspiration from classic theory on linear dynamic systems for modeling time series. By extending such models to include...
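For background, here is the textbook linear dynamical system the abstract cites as inspiration: a hidden state evolves linearly and emits observations. This is illustrative context only, not the paper's architecture.

```python
# Classic LDS: x_{t+1} = A x_t + w_t,  y_t = C x_t.
import numpy as np

def simulate_lds(A, C, x0, steps, noise=0.0, rng=None):
    """Simulate an LDS and return the observation sequence."""
    rng = rng or np.random.default_rng(0)
    x, ys = x0, []
    for _ in range(steps):
        ys.append(C @ x)                                  # emit observation
        x = A @ x + noise * rng.standard_normal(x.shape)  # linear state update
    return np.stack(ys)

A = np.array([[0.99, 0.1], [-0.1, 0.99]])   # slowly rotating hidden state
C = np.array([[1.0, 0.0]])                  # observe the first coordinate
y = simulate_lds(A, C, x0=np.array([1.0, 0.0]), steps=100, noise=0.01)
```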
Monocular object detection and tracking have improved drastically in recent years, but rely on a key assumption: that objects are visible to the camera. Many offline approaches reason about occluded objects post-hoc, by linking together tracklets after the object re-appears, making use of reidentification (ReID). However, online tracking in embodied robotic agents (such as a self-driving vehicle) fundamentally requires object permanence, which is the ability to reason about occluded objects before they re-appear. In this work, we re-purpose tracking benchmarks and propose new...
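A minimal constant-velocity sketch of the object-permanence idea: when a track loses its detection, coast its last known state forward instead of terminating it. This is purely illustrative; it is not the paper's model.

```python
# Coast occluded tracks under constant velocity rather than dropping them.
from dataclasses import dataclass

@dataclass
class Track:
    x: float
    y: float
    vx: float = 0.0   # last observed velocity
    vy: float = 0.0
    missed: int = 0   # consecutive frames without a detection

def step(track: Track, detection=None, max_missed: int = 10):
    """Advance one frame; returns the track, or None once it expires."""
    if detection is not None:
        track.vx, track.vy = detection[0] - track.x, detection[1] - track.y
        track.x, track.y = detection
        track.missed = 0
    else:
        # Occluded: propagate position with the last velocity estimate.
        track.x += track.vx
        track.y += track.vy
        track.missed += 1
    return None if track.missed > max_missed else track
```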
Emerging head-worn computing devices can enable interactions with smart objects in physical spaces. We present the iterative design and evaluation of HOBS -- a Head-Orientation Based Selection technique for interacting with these objects at a distance. We augment a commercial wearable device, Google Glass, with an infrared (IR) emitter to select targets equipped with IR receivers. Our first design shows that a naive implementation can outperform list selection, but has poor performance when refinement between multiple targets is needed. A...
By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On the one hand, this is desirable as it treats all categories equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, we find that under important conditions (i.e., large vocabulary, high instance counts) the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors. In...
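A compact sketch of the evaluation design under discussion: AP is computed from a per-category ranking and then averaged, so confidence scores are never compared across categories. The detections below are made up for illustration.

```python
# Per-category AP, then the cross-category mean (mAP).
import numpy as np

def average_precision(scores, labels, num_gt):
    """AP for one category from detection scores and 0/1 match labels."""
    order = np.argsort(-np.asarray(scores))      # rank by confidence, descending
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    precision = tp / (tp + fp)
    # Unsmoothed area under the precision-recall curve, for brevity.
    return float(np.sum(precision * labels) / num_gt)

per_category = {
    "cat": average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=2),
    "dog": average_precision([0.6, 0.2], [1, 1], num_gt=2),
}
mean_ap = sum(per_category.values()) / len(per_category)
```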
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a DCLM baseline, we conduct...
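A toy sketch of two of the curation steps named above, deduplication and filtering; real DCLM pipelines are far more involved (e.g., fuzzy deduplication and learned quality filters), so treat the thresholds here as placeholders.

```python
# Toy curation: crude length filter + exact dedup via content hashing.
import hashlib

def curate(documents, min_words=50):
    seen, kept = set(), []
    for doc in documents:
        if len(doc.split()) < min_words:    # quality filter (crude proxy)
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:                  # exact deduplication
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```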
Drug-related errors are a leading cause of preventable patient harm in the clinical setting. We present the first wearable camera system to automatically detect potential errors, prior to medication delivery. We demonstrate that, using deep learning algorithms, our system can detect and classify drug labels on syringes and vials in drug preparation events recorded in real-world operating rooms. We created a first-of-its-kind large-scale video dataset from head-mounted cameras comprising 4K footage across 13 anesthesiology providers, 2...
Vision models notoriously flicker when applied to videos: they correctly recognize objects in some frames, but fail on perceptually similar, nearby frames. In this work, we systematically analyze the robustness of image classifiers to such temporal perturbations in videos. To do so, we construct two new datasets, ImageNet-Vid-Robust and YTBB-Robust, containing a total of 57,897 images grouped into 3,139 sets of perceptually similar images. Our datasets were derived from ImageNet-Vid and Youtube-BB, respectively, and thoroughly...
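One way to operationalize this kind of temporal robustness check, sketched under the assumption that frames are pre-grouped into sets of perceptually similar neighbors: a prediction only counts as robustly correct if the model is right on every frame in the set.

```python
# Robust accuracy over sets of perceptually similar frames.
def robust_accuracy(model, frame_sets):
    """frame_sets: iterable of lists of (image, label) for nearby frames."""
    robust = 0
    for frames in frame_sets:
        # Correct on the whole neighborhood, or not at all.
        if all(model(img) == lbl for img, lbl in frames):
            robust += 1
    return robust / max(len(frame_sets), 1)
```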
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v)...
This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these...
Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., the "Chinchilla optimal" regime); in practice, however, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, whereas models are ultimately compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create...
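A minimal sketch of the standard scaling-law machinery this line of work builds on: fit a saturating power law, loss(C) = a * C**(-b) + e, to (compute, loss) pairs from small runs, then extrapolate. The data points below are made up.

```python
# Fit and extrapolate a saturating power law over (compute, loss) pairs.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, e):
    """Saturating power law: loss = a * c**(-b) + e."""
    return a * c ** (-b) + e

compute = np.array([1.0, 10.0, 100.0, 1000.0])   # training compute (relative units)
loss = np.array([3.9, 3.3, 2.9, 2.6])            # hypothetical eval losses
(a, b, e), _ = curve_fit(power_law, compute, loss, p0=(1.0, 0.3, 2.0))
print(f"extrapolated loss at 10x more compute: {power_law(1e4, a, b, e):.2f}")
```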
Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on less than 50 pre-existing labeled images. Our framework avoids the practical...
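A schematic sketch in the spirit of this family of methods (not HandsOff itself, which relies on GAN inversion of real labeled images): learn a lightweight per-pixel classifier on generator features from a few labeled examples, after which every newly generated image comes with labels for free. The "generator" below is a random stand-in.

```python
# Few-shot label head over (fake) generator features -> labeled synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 8 * 8 * 16))   # frozen stand-in "generator" weights

def fake_generator(z):
    """Stand-in for a GAN generator: latent z -> per-pixel features (8, 8, 16)."""
    return np.tanh(z @ W).reshape(8, 8, 16)

# "Few labeled images": per-pixel features plus per-pixel segmentation labels.
feats = [fake_generator(rng.standard_normal(32)) for _ in range(10)]
labels = [rng.integers(0, 2, size=(8, 8)) for _ in range(10)]

X = np.concatenate([f.reshape(-1, 16) for f in feats])
y = np.concatenate([l.reshape(-1) for l in labels])
head = LogisticRegression(max_iter=1000).fit(X, y)

# Unlimited synthetic (image, label) pairs: generate, then predict labels.
new_feats = fake_generator(rng.standard_normal(32))
new_label_map = head.predict(new_feats.reshape(-1, 16)).reshape(8, 8)
```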
Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of \textit{modal} annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal,...