- Human Pose and Action Recognition
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Advanced Vision and Imaging
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Anomaly Detection Techniques and Applications
- Autonomous Vehicle Technology and Safety
- Robotics and Sensor-Based Localization
- Computer Graphics and Visualization Techniques
- Generative Adversarial Networks and Image Synthesis
- 3D Shape Modeling and Analysis
- Visual Attention and Saliency Detection
- Remote Sensing and LiDAR Applications
- Face Recognition and Analysis
- Video Analysis and Summarization
- Machine Learning and Data Classification
- Adversarial Robustness in Machine Learning
- Hand Gesture Recognition Systems
- Face and Expression Recognition
- 3D Surveying and Cultural Heritage
- COVID-19 Diagnosis Using AI
- Advanced Image Processing Techniques
- Image Retrieval and Classification Techniques
Carnegie Mellon University
2015-2024
Perrigo (United States)
2019-2020
University of California, Irvine
2009-2017
UC Irvine Health
2008-2014
University of California System
2013
Toyota Technological Institute at Chicago
2005-2007
University of California, Berkeley
2003-2005
University of Delaware
2002
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call...
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies on new methods for discriminative training. We combine...
We present a unified model for face detection, pose estimation, and landmark estimation in real-world, cluttered images. Our model is based on a mixture of trees with a shared pool of parts; we model every facial landmark as a part and use global mixtures to capture topological changes due to viewpoint. We show that tree-structured models are surprisingly effective at capturing global elastic deformation, while being easy to optimize unlike dense graph structures. We present extensive results on standard face benchmarks, as well as a new "in the wild" annotated dataset,...
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd...
We present Argoverse, a dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting. Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami, as well as 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps. The sensor data consists of 360 degree images from 7 cameras with overlapping fields of view, forward-facing stereo imagery, 3D point clouds from long range LiDAR, and 6-DOF pose. Our 290km of mapped lanes contain geometric...
We describe a method for human pose estimation in static images based on a novel representation of part models. Notably, we do not use articulated limb parts, but rather capture orientation with a mixture of templates for each part. We describe a general, flexible mixture model that captures contextual co-occurrence relations between parts, augmenting standard spring models that encode spatial relations. We show that such relations can capture notions of local rigidity. When co-occurrence and spatial relations are tree-structured, our model can be efficiently optimized with dynamic programming. We present...
We describe a method for articulated human detection and pose estimation in static images based on a new representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened) templates, we use a mixture of small, nonoriented parts. We describe a general, flexible mixture model that jointly captures spatial relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models that encode just spatial relations. Our models have several notable properties: 1) They...
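The tree-structured inference behind these part models can be illustrated with a minimal NumPy sketch, here over a chain of parts rather than a full tree, with a shared candidate grid and a quadratic spring penalty. The function name, scores, and spring weight are illustrative assumptions, not the published implementation:

```python
import numpy as np

def chain_pose_dp(unary, positions, spring=1.0):
    """Max-sum dynamic programming over a chain of parts.

    unary:     (K, L) local appearance scores for K parts at L candidate locations.
    positions: (L, 2) candidate (x, y) locations shared by all parts.
    spring:    weight of the quadratic deformation penalty between adjacent parts.
    Returns the best total score and the chosen location index per part.
    """
    K, L = unary.shape
    diff = positions[:, None, :] - positions[None, :, :]
    pair_cost = spring * (diff ** 2).sum(-1)       # (L, L) spring penalty

    score = unary[0].copy()                        # best score ending at part 0
    back = np.zeros((K, L), dtype=int)
    for k in range(1, K):
        # cand[j, i]: best score so far if part k-1 sits at i and part k at j.
        cand = score[None, :] - pair_cost
        back[k] = cand.argmax(axis=1)
        score = cand.max(axis=1) + unary[k]

    # Backtrack the argmax configuration.
    idx = [int(score.argmax())]
    for k in range(K - 1, 0, -1):
        idx.append(int(back[k][idx[-1]]))
    return float(score.max()), idx[::-1]
```

With the spring weight at zero each part independently picks its best location; increasing it trades appearance score for spatial coherence. A real tree just replaces the chain with a child-to-parent message-passing order.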
We analyze the computational problem of multi-object tracking in video sequences. We formulate the problem using a cost function that requires estimating the number of tracks, as well as their birth and death states. We show that the global solution can be obtained with a greedy algorithm that sequentially instantiates tracks using shortest path computations on a flow network. Greedy algorithms allow one to embed pre-processing steps, such as nonmax suppression, within the tracking algorithm. Furthermore, we give a near-optimal algorithm based on dynamic programming which...
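A toy version of the greedy instantiation idea, assuming 1-D detections and an invented motion penalty. The paper works on a flow network and reuses computation between rounds; this sketch simply re-runs a best-track dynamic program and suppresses used detections:

```python
import numpy as np

def best_track(frames, lam=1.0):
    """DP for the single best track: one detection per frame,
    score = sum of detection scores minus lam * |x motion|.
    frames: list of [(x, score), ...] per frame."""
    prev = np.array([s for x, s in frames[0]])
    back = []
    for t in range(1, len(frames)):
        bk, cur = [], []
        for x, s in frames[t]:
            cand = prev - lam * np.array([abs(x - px) for px, _ in frames[t - 1]])
            j = int(cand.argmax())
            bk.append(j)
            cur.append(cand[j] + s)
        back.append(bk)
        prev = np.array(cur)
    path = [int(prev.argmax())]
    for bk in reversed(back):
        path.append(bk[path[-1]])
    return float(prev.max()), path[::-1]

def greedy_tracks(frames, lam=1.0):
    """Sequentially instantiate tracks while they improve the objective,
    marking used detections after each round."""
    frames = [list(f) for f in frames]
    tracks = []
    while True:
        score, path = best_track(frames, lam)
        if score <= 0:
            break
        tracks.append(path)
        for t, j in enumerate(path):
            x, _ = frames[t][j]
            frames[t][j] = (x, -1e9)   # detection consumed by this track
    return tracks
```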
Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task...
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring...
We present a novel dataset and novel algorithms for the problem of detecting activities of daily living (ADL) in first-person camera views. We have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object tracks, hand positions, and interaction events. ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop...
We explore 3D human pose estimation from a single RGB image. While many approaches try to directly predict 3D pose from image measurements, we explore a simple architecture that reasons through intermediate 2D pose predictions. Our approach is based on two key observations: (1) deep neural nets have revolutionized 2D pose estimation, producing accurate 2D predictions even for poses with self-occlusions; (2) big datasets of 3D mocap data are now readily available, making it tempting to "lift" predicted 2D poses to 3D through simple memorization (e.g., nearest...
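The nearest-neighbor "lifting" the abstract alludes to can be sketched as follows, under a deliberately crude orthographic camera and translation-only alignment. The paper additionally searches over camera parameters; names and shapes here are illustrative:

```python
import numpy as np

def lift_2d_to_3d(pred_2d, mocap_3d):
    """Nearest-neighbor lifting: return the library 3D pose whose
    orthographic projection best matches the predicted 2D keypoints.

    pred_2d:  (J, 2) predicted 2D joint positions.
    mocap_3d: (N, J, 3) library of exemplar 3D poses.
    Poses are centered before matching, so the match is translation-invariant.
    """
    p = pred_2d - pred_2d.mean(0)
    proj = mocap_3d[..., :2]                          # orthographic projection: drop z
    proj = proj - proj.mean(1, keepdims=True)         # center each exemplar
    err = np.linalg.norm(proj - p, axis=-1).mean(1)   # mean 2D joint error per exemplar
    return mocap_3d[int(err.argmin())]
```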
A commonly observed failure mode of Neural Radiance Fields (NeRF) is fitting incorrect geometries when given an insufficient number of input views. One potential reason is that standard volumetric rendering does not enforce the constraint that most of a scene's geometry consists of empty space and opaque surfaces. We formalize the above assumption through DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning radiance fields that takes advantage of readily-available depth supervision. We leverage the fact that current NeRF pipelines...
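A simplified stand-in for depth supervision on a single ray, assuming the standard volumetric rendering weights and penalizing the squared error between the expected termination depth and a sensed depth. DS-NeRF's actual loss is a KL term on the ray termination distribution; this is only a hedged approximation with illustrative names:

```python
import numpy as np

def render_weights(sigmas, deltas):
    """Standard volumetric rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the accumulated transmittance up to sample i."""
    alpha = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def depth_loss(sigmas, deltas, z_vals, z_gt):
    """Squared error between the expected ray termination depth and a
    sensed depth z_gt (simplified stand-in for DS-NeRF's ray loss)."""
    w = render_weights(sigmas, deltas)
    z_hat = (w * z_vals).sum() / max(w.sum(), 1e-8)   # expected depth along the ray
    return (z_hat - z_gt) ** 2
```

Minimizing this pushes the density to concentrate near the supervised depth, which is the constraint the abstract describes: mostly empty space terminated by an opaque surface.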
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly...
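The aggregation step can be sketched as soft-assignment VLAD pooling over a bag of local features, a plain NumPy version without the end-to-end learned parameters; `alpha` and the anchor points are illustrative assumptions:

```python
import numpy as np

def vlad_pool(feats, centers, alpha=10.0):
    """Soft-assignment VLAD pooling: softly assign each local feature to
    anchor points, accumulate residuals per anchor, and L2-normalize.

    feats:   (N, D) local features pooled from all frames/locations.
    centers: (K, D) anchor points.
    Returns a (K*D,) video-level descriptor.
    """
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    a = np.exp(-alpha * d2)
    a = a / a.sum(1, keepdims=True)                   # soft assignments
    resid = feats[:, None, :] - centers[None, :, :]   # (N, K, D) residuals
    v = (a[..., None] * resid).sum(0)                 # (K, D) aggregated residuals
    v = v / (np.linalg.norm(v) + 1e-12)               # global L2 normalization
    return v.ravel()
```

In the actual architecture the assignments and anchors are learned layers, so gradients flow through the pooling into the two-stream backbone.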
In this paper, we propose the first higher frame rate video dataset (called Need for Speed - NfS) and benchmark for visual object tracking. The dataset consists of 100 videos (380K frames) captured with now commonly available higher frame rate (240 FPS) cameras from real world scenarios. All frames are annotated with axis aligned bounding boxes, and all sequences are manually labelled with nine visual attributes, such as occlusion, fast motion, background clutter, etc. Our benchmark provides an extensive evaluation of many recent state-of-the-art trackers on...
While feedforward deep convolutional neural networks (CNNs) have been a great success in computer vision, it is important to note that the human visual cortex generally contains more feedback than feedforward connections. In this paper, we will briefly introduce the background of feedbacks in the human visual cortex, which motivates us to develop a computational feedback mechanism in deep neural networks. In addition to the feedforward inference in traditional neural networks, a feedback loop is introduced to infer the activation status of hidden layer neurons according to the "goal" of the network, e.g., high-level...
We address the problem of long-term object tracking, where the object may become occluded or leave the field of view. In this setting, we show that an accurate appearance model is considerably more effective than a strong motion model. We develop simple but effective algorithms that alternate between tracking and learning a good appearance model given a track. We show that it is crucial to learn from the "right" frames, and use the formalism of self-paced curriculum learning to automatically select such frames. We leverage techniques from object detection for learning accurate appearance-based templates, demonstrating...
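A toy rendition of the self-paced selection loop, with a mean template standing in for the discriminative detector the paper actually learns. All names and the fixed keep fraction are illustrative assumptions:

```python
import numpy as np

def self_paced_template(patches, init_idx=0, rounds=3, keep_frac=0.5):
    """Self-paced template learning sketch: start from one trusted frame,
    then alternately (i) score all frames against the current template and
    (ii) retrain on only the easiest (highest-scoring) fraction of frames.

    patches: (T, D) one vectorized image patch per frame.
    Returns the learned template and the sorted indices of selected frames.
    """
    template = patches[init_idx].astype(float)
    chosen = [init_idx]
    for _ in range(rounds):
        scores = -np.linalg.norm(patches - template, axis=1)  # easy = close to template
        k = max(1, int(keep_frac * len(patches)))
        chosen = list(np.argsort(scores)[-k:])                # keep the easiest frames
        template = patches[chosen].mean(0)                    # re-learn from them
    return template, sorted(int(i) for i in chosen)
```

The point of the curriculum is visible even in this toy: outlier frames (occlusions, drift) never enter the training set, so they cannot corrupt the appearance model.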
Few-shot learning, i.e., learning novel concepts from very few examples, is fundamental to practical visual recognition systems. While most existing work has focused on few-shot classification, we make a step towards few-shot object detection, a more challenging yet under-explored task. We develop a conceptually simple but powerful meta-learning based framework that simultaneously tackles few-shot classification and few-shot localization in a unified, coherent way. This framework leverages meta-level knowledge about "model parameter...
Many state-of-the-art approaches for object recognition reduce the problem to a 0-1 classification task. Such reductions allow one to leverage sophisticated classifiers for learning. These models are typically trained independently for each class using positive and negative examples cropped from images. At test-time, various post-processing heuristics such as non-maxima suppression (NMS) are required to reconcile multiple detections within and between different classes for each image. Though crucial to good performance on...
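The NMS heuristic mentioned above, in its standard greedy form (a generic sketch, not this paper's pipeline; the paper argues for replacing such heuristics with a learned layout model):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maxima suppression: repeatedly keep the highest-scoring
    box and drop remaining boxes whose IoU with it exceeds iou_thresh.

    boxes: (N, 4) array of (x1, y1, x2, y2). Returns kept indices.
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]       # suppress heavy overlaps
    return keep
```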