- Human Pose and Action Recognition
- Video Analysis and Summarization
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Image Retrieval and Classification Techniques
- Advanced Vision and Imaging
- Generative Adversarial Networks and Image Synthesis
- Advanced Neural Network Applications
- Face Recognition and Analysis
- Domain Adaptation and Few-Shot Learning
- Gait Recognition and Analysis
- Visual Attention and Saliency Detection
- Robotics and Sensor-Based Localization
- Venous Thromboembolism Diagnosis and Management
- Machine Learning and Data Classification
- Computer Graphics and Visualization Techniques
- Industrial Vision Systems and Defect Detection
- Acute Ischemic Stroke Management
- 3D Shape Modeling and Analysis
- Adversarial Robustness in Machine Learning
- Infrared Target Detection Methodologies
- Natural Language Processing Techniques
- Image Enhancement Techniques
- Amazon (United States), 2024
- Walt Disney (United States), 2023
- JDA Software (United States), 2022
- JDSU (United States), 2020-2022
- Wuhan University of Technology, 2009-2020
- SRI International, 2012-2017
- Rutgers Sexual and Reproductive Health and Rights, 2015
- Fujian Blood Center, 2014
- Princeton University, 2013
- University of Michigan, 2010-2011
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing actions in such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, scale, etc. The main challenge is how to extract reliable and informative features from such videos. We extract both motion and static features. Since the raw features of both types are dense yet...
In this paper we explore the idea of using high-level semantic concepts, also called attributes, to represent human actions from videos, and argue that attributes enable the construction of more descriptive models for action recognition. We propose a unified framework wherein manually specified attributes are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) coherently integrated with data-driven attributes to make the attribute set more descriptive. Data-driven attributes are automatically inferred from the training...
In this paper, we present a novel approach for automatically learning a compact and yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features called video-words, obtained by quantizing the extracted 3D interest points (cuboids) from the videos. Our proposed approach is able to discover the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster cuboids into words based on their...
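As a concrete sketch of the bag-of-video-words representation described above, the snippet below quantizes toy cuboid descriptors against a fixed vocabulary and builds a normalized histogram. It uses a plain nearest-centroid assignment in place of the paper's MMI-based clustering; all names and values here are illustrative.

```python
from collections import Counter

def assign_video_words(descriptors, centroids):
    """Assign each cuboid descriptor to its nearest video-word centroid
    (squared Euclidean distance); a stand-in for a learned vocabulary."""
    def nearest(d):
        return min(range(len(centroids)),
                   key=lambda k: sum((x - c) ** 2 for x, c in zip(d, centroids[k])))
    return [nearest(d) for d in descriptors]

def bow_histogram(word_ids, vocab_size):
    """Normalized bag-of-video-words histogram for one video."""
    counts = Counter(word_ids)
    total = len(word_ids)
    return [counts[k] / total for k in range(vocab_size)]

# Toy 2-D descriptors and a 2-word vocabulary (purely illustrative).
centroids = [(0.0, 0.0), (10.0, 10.0)]
descs = [(0.1, 0.2), (9.8, 10.1), (0.3, -0.1), (10.2, 9.9)]
ids = assign_video_words(descs, centroids)
hist = bow_histogram(ids, len(centroids))
```

The histogram is what a downstream classifier would consume; the paper's contribution is choosing the number and grouping of the words, not this quantization step itself.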
In this paper, we present a novel approach to recognizing human actions from different views by view knowledge transfer. An action is originally modelled as a bag of visual-words (BoVW), which is sensitive to view changes. We argue that, as opposed to visual words, there exist some higher-level features which can be shared across views and enable the connection of action models for different views. To discover these features, we use a bipartite graph to model two view-dependent vocabularies, then apply bipartite graph partitioning to co-cluster the vocabularies into...
In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically generate a network architecture for semantic image segmentation. The generated network consists of a sequence of stacked computation cells. A cell is represented as a directed acyclic graph, in which each node is a hidden representation (i.e., feature map) and each edge is associated with an operation (e.g., convolution or pooling) that transforms the data to a new layer. During training, the CAS algorithm explores the search space for an optimized computation cell to build...
In this paper, we propose a framework that fuses multiple features for improved action recognition in videos. The fusion of multiple features is important for recognizing actions, as often a single feature-based representation is not enough to capture the imaging variations (view-point, illumination, etc.) and attributes of individuals (size, age, gender, etc.). Hence, we use two types of features: i) a quantized vocabulary of local spatio-temporal (ST) volumes (or cuboids), and ii) a quantized vocabulary of spin-images, which aims to capture the shape deformation of the actor by...
In this paper, we propose a novel approach for learning a generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of its pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the features into...
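A minimal sketch of the PMI representation mentioned above: each feature (row of a co-occurrence matrix) is mapped to its vector of pointwise mutual information scores, PMI(w, c) = log p(w, c) / (p(w) p(c)). Mapping zero co-occurrences to 0.0 is a simplifying assumption here, not from the paper.

```python
import math

def pmi_vectors(cooc):
    """Represent each midlevel feature (row) by its vector of pointwise
    mutual information with each context (column).
    Zero counts are mapped to 0.0 (an assumption for this sketch)."""
    total = sum(sum(row) for row in cooc)
    row_p = [sum(row) / total for row in cooc]
    col_p = [sum(cooc[i][j] for i in range(len(cooc))) / total
             for j in range(len(cooc[0]))]
    out = []
    for i, row in enumerate(cooc):
        vec = []
        for j, n in enumerate(row):
            # PMI is positive when the pair co-occurs more than chance predicts.
            vec.append(math.log((n / total) / (row_p[i] * col_p[j])) if n else 0.0)
        out.append(vec)
    return out

cooc = [[4, 0], [0, 4]]  # two features, two contexts, perfectly associated
vecs = pmi_vectors(cooc)
```

With the toy matrix above, each feature gets PMI of log 2 with its own context and 0 elsewhere; diffusion maps would then operate on distances between these PMI vectors.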
The success of deep neural networks generally requires a vast amount of training data to be labeled, which is expensive and unfeasible at scale, especially for video collections. To alleviate this problem, in this paper, we propose 3DRotNet: a fully self-supervised approach to learn spatiotemporal features from unlabeled videos. A set of rotations is applied to all videos, and a pretext task is defined as the prediction of these rotations. In accomplishing this task, 3DRotNet is actually trained to understand the semantic concepts...
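The pretext task above can be sketched in a few lines: apply one of four rotations (0/90/180/270 degrees) to every frame of a clip and use the rotation index as the training label. The frame/clip representation and helper names below are illustrative, not from the paper.

```python
import random

def rotate90(frame):
    """Rotate a 2-D frame (list of rows) 90 degrees clockwise."""
    return [list(col) for col in zip(*frame[::-1])]

def make_rotation_sample(clip, k=None):
    """Build one self-supervised sample in the spirit of 3DRotNet:
    the same rotation is applied to every frame of the clip, and the
    rotation index k in {0, 1, 2, 3} serves as the pretext label."""
    if k is None:
        k = random.randrange(4)
    rotated = clip
    for _ in range(k):
        rotated = [rotate90(f) for f in rotated]
    return rotated, k  # (transformed clip, pretext label)

clip = [[[1, 2], [3, 4]]]  # a "video" of one 2x2 frame
rotated_clip, label = make_rotation_sample(clip, k=1)
```

A network trained to predict `label` from `rotated_clip` never needs human annotation, which is the point of the pretext task.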
Low-level appearance as well as spatio-temporal features, appropriately quantized and aggregated into Bag-of-Words (BoW) descriptors, have been shown to be effective in many detection and recognition tasks. However, their efficacy for complex event recognition in unconstrained videos has not been systematically evaluated. In this paper, we use the NIST TRECVID Multimedia Event Detection (MED11 [1]) open source dataset, containing annotated data for 15 high-level events, as the standardized test bed for evaluating the low-level features. This...
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training. However, without frame-level annotations, it is challenging to achieve localization completeness and relieve background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which can mitigate the above two challenges by learning an action unit memory bank. In the proposed AUMN, attention modules are designed to update the memory bank adaptively...
The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 focused on short-term tracking in RGB, (ii) VOT-RT2021 focused on "real-time" short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance...
In this paper, we propose a novel approach for scene modeling. The proposed method is able to automatically discover the intermediate semantic concepts. We utilize Maximization of Mutual Information (MMI) co-clustering to discover clusters of semantic concepts, which we call intermediate concepts. Each intermediate concept corresponds to a cluster of visterms in the bag of visterms (BOV) paradigm for scene classification. MMI co-clustering results in fewer but more meaningful clusters. Unlike k-means, which is used in BOV to cluster image patches based on their appearances, MMI co-clustering can group visterms that are highly correlated...
Action recognition methods suffer from many drawbacks in practice, which include (1) the inability to cope with incremental recognition problems; (2) the requirement of an intensive training stage to obtain good performance; (3) the inability to recognize simultaneous multiple actions; and (4) difficulty in performing recognition frame by frame. In order to overcome all these drawbacks using a single method, we propose a novel framework involving a feature-tree to index large scale motion features using a Sphere/Rectangle-tree (SR-tree). The framework consists of the following...
We propose to use action, scene and object concepts as semantic attributes for the classification of video events in InTheWild content, such as YouTube videos. We model events using a variety of complementary attribute features developed in a semantic concept space. Our contribution is to systematically demonstrate the advantages of this concept-based event representation (CBER) in applications of video event understanding. Specifically, CBER has better generalization capability, which enables it to recognize events with few training examples. In addition,...
Contrastive learning, which aims at minimizing the distance between positive pairs while maximizing that of negative ones, has been widely and successfully applied in unsupervised feature learning, where the design of positive/negative (pos/neg) pairs is one of its keys. In this paper, we attempt to devise a feature-level data manipulation, differing from data augmentation, to enhance generic contrastive self-supervised learning. To this end, we first design a visualization scheme for pos/neg score...
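To make the pos/neg objective concrete, here is a minimal InfoNCE-style loss for a single anchor, written over raw similarity scores. This is the standard contrastive formulation, not the paper's specific manipulation; the temperature value and scores are illustrative.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor: the negative
    log-probability of the positive under a softmax over all pairs.
    Minimizing it pulls the positive score up and pushes negatives down."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_denom)

easy = info_nce(0.9, [-0.5, -0.6])  # positive well separated -> small loss
hard = info_nce(0.1, [0.2, 0.3])    # negatives score higher -> large loss
```

The gap between `easy` and `hard` is exactly what a feature-level manipulation of the pos/neg pairs would be trying to control.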
In this paper, we propose a simple yet effective video super-resolution method that aims at generating high-fidelity high-resolution (HR) videos from low-resolution (LR) ones. Previous methods predominantly leverage temporally neighboring frames to assist the super-resolution of the current frame. Those methods achieve limited performance as they suffer from challenges in spatial frame alignment and the lack of useful information from similar LR frames. In contrast, we devise a cross-frame non-local attention mechanism that allows video super-resolution without...
We propose a novel method for automatically discovering key motion patterns happening in a scene by observing the scene over an extended period. Our method does not rely on object detection and tracking, and uses low level features, the direction of pixel-wise optical flow. We first divide the video into clips and estimate a sequence of flow-fields for each clip. Each moving pixel is quantized based on its location and direction. This gives essentially a bag of words representation for the clips. Once this is obtained, we proceed to the screening stage, using a measure called `conditional...
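The location-plus-direction quantization described above can be sketched as follows: each moving pixel is binned into a spatial grid cell and an angular bin of its flow vector, and a clip becomes a histogram over the resulting codewords. Grid size, bin counts, and frame dimensions below are illustrative assumptions.

```python
import math
from collections import Counter

def flow_word(x, y, u, v, grid=4, n_dirs=4, width=64, height=64):
    """Quantize one moving pixel into a codeword from its location
    (a grid x grid spatial bin) and flow direction (n_dirs angular bins)."""
    gx = min(int(x * grid / width), grid - 1)
    gy = min(int(y * grid / height), grid - 1)
    angle = math.atan2(v, u) % (2 * math.pi)   # flow direction in [0, 2*pi)
    d = int(angle / (2 * math.pi / n_dirs)) % n_dirs
    return (gy * grid + gx) * n_dirs + d

def clip_bow(flows, **kw):
    """Bag-of-words histogram over all moving pixels in one clip."""
    return Counter(flow_word(*f, **kw) for f in flows)

# (x, y, u, v): two nearby pixels moving right, one pixel moving up.
flows = [(5, 5, 1.0, 0.0), (6, 5, 1.0, 0.1), (40, 40, 0.0, 1.0)]
bow = clip_bow(flows)
```

The two rightward-moving neighbors fall into the same codeword, which is how coherent motion patterns become dominant histogram bins.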
We present a method whereby an embodied agent using visual perception can efficiently create a model of a local indoor environment from its experience of moving within it. Our method uses motion cues to compute the likelihoods of structure hypotheses, based on simple, generic geometric knowledge about points, lines, planes, and motion. Unlike single-image analysis, we do not attempt to identify a single accurate model, but propose a set of plausible hypotheses from the initial frame. We then use data from subsequent frames to update a Bayesian...
Region sampling or weighting is significantly important to the success of modern region-based object detectors. Unlike some previous works, which only focus on "hard" samples when optimizing the objective function, we argue that sample weighting should be data-dependent and task-dependent. The importance of a sample for the objective function optimization is determined by its uncertainties in both the classification and bounding box regression tasks. To this end, we devise a general loss function to cover most region-based detectors with various sampling strategies, and then based on it...
Virtual try-on methods aim to generate images of fashion models wearing arbitrary combinations of garments. This is a challenging task because the generated image must appear realistic and accurately display the interaction between garments. Prior works produce images that are filled with artifacts and fail to capture the important visual details necessary for commercial applications. We propose Outfit Visualization Net (OVNet) to capture these important details (e.g. buttons, shading, textures, hemlines, and interactions between garments) and produce high quality...
We propose a novel statistical manifold modeling approach that is capable of classifying poses of object categories from video sequences by simultaneously minimizing the intra-class variability and maximizing the inter-pose distance. Following the intuition that an object part based representation with a suitable selection process may help achieve our purpose, we formulate the problem from a perspective that treats it as adjusting the (parameterized pose) representation by means of "alignment" and "expansion" operations. We show that alignment and expansion are equivalent to...
Multimedia event detection has drawn a lot of attention in recent years. Given a recognized event, in this paper, we conduct a pilot study of the multimedia event recounting problem, which answers the question of why a video is recognized as a certain event, i.e. what evidences the decision is made on. In order to provide semantic evidence, we adopt a concept-based video representation for learning a discriminative event model. Then, we present an approach that exactly recovers the contribution of each piece of evidence to the classification decision. This approach can be applied to any additive classifier. The...
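For an additive (e.g. linear) classifier the decomposition above is exact: the score is a sum of per-concept terms, so each concept's contribution is simply its term in the sum. The concept names, weights, and scores below are hypothetical illustrations, not data from the paper.

```python
def evidence_contributions(weights, features, names):
    """For an additive classifier, score = sum_i w_i * x_i, so each
    concept's contribution to the decision is exactly w_i * x_i.
    Returns concepts sorted by contribution, most supportive first."""
    contrib = {n: w * x for n, w, x in zip(names, weights, features)}
    return sorted(contrib.items(), key=lambda kv: -kv[1])

# Hypothetical concept detectors for a "birthday party" event.
names = ["cake", "singing", "outdoor", "vehicle"]
weights = [2.0, 1.5, 0.2, -1.0]
scores = [0.9, 0.6, 0.1, 0.4]
ranked = evidence_contributions(weights, scores, names)
```

Reading off the top of `ranked` yields the recounting: the concepts whose terms pushed the classifier's score over the decision threshold.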