- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Gait Recognition and Analysis
- Hand Gesture Recognition Systems
- Domain Adaptation and Few-Shot Learning
- Music and Audio Processing
- Natural Language Processing Techniques
- Advanced Neural Network Applications
- Speech and Dialogue Systems
- Advanced Vision and Imaging
- Topic Modeling
- Face Recognition and Analysis
- Fire Detection and Safety Systems
- Face and Expression Recognition
- Text and Document Classification Technologies
- Autonomous Vehicle Technology and Safety
- Human Motion and Animation
- Speech and Audio Processing
- Machine Learning and Algorithms
- Generative Adversarial Networks and Image Synthesis
Carnegie Mellon University
2016-2025
Meta (Israel)
2021
Google (United States)
2020
Association for Computing Machinery
2019
MSIGHT Technologies (China)
2017
Microsoft Research Asia (China)
2012
Laboratoire d'Informatique de Paris-Nord
1988
The robust detection of small targets is one of the key techniques in infrared search and tracking applications. A novel small target detection method in a single infrared image is proposed in this paper. Initially, the traditional infrared image model is generalized to a new infrared patch-image model using local patch construction. Then, because of the non-local self-correlation property of the infrared background image, small target detection based on the new model is formulated as an optimization problem of recovering low-rank and sparse matrices, which is effectively solved using stable principal component pursuit. Finally, a simple...
Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy while neglecting class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes the Contrastive Adaptation Network (CAN), which optimizes a new metric that explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner....
Based on keypoints extracted as salient image patches, an image can be described as a "bag of visual words", and this representation has been used in scene classification. The choice of dimension, selection, and weighting of visual words in this representation is crucial to the classification performance but has not been thoroughly studied in previous work. Given the analogy between this representation and the bag-of-words representation of text documents, we apply techniques used in text categorization, including term weighting, stop word removal, and feature selection, to generate image representations that differ in the dimension, selection, and weighting of visual words. The impact of these...
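The term-weighting idea borrowed from text categorization can be illustrated with a minimal sketch. It assumes each image is already summarized as a vector of visual-word counts and applies a plain TF-IDF weighting; the function name `tf_idf` and the toy counts are hypothetical, and the abstract does not say which weighting variants the paper actually evaluates.

```python
import math

def tf_idf(histograms):
    """Apply TF-IDF weighting to per-image visual-word count vectors."""
    n_images = len(histograms)
    n_words = len(histograms[0])
    # Document frequency: number of images in which each visual word occurs.
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h) or 1  # guard against images with no keypoints
        row = []
        for w in range(n_words):
            tf = h[w] / total
            idf = math.log(n_images / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        weighted.append(row)
    return weighted

# Toy data (hypothetical): 3 images, vocabulary of 3 visual words.
counts = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
weights = tf_idf(counts)
```

In this toy example the first visual word occurs in every image, so its inverse document frequency is zero and it is effectively treated like a stop word.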
Many multimedia applications can benefit from techniques for adapting existing classifiers to data with different distributions. One example is cross-domain video concept detection, which aims to adapt concept classifiers across various video domains. In this paper, we explore two key problems for classifier adaptation: (1) how to transform existing classifier(s) into an effective classifier for a new dataset that only has a limited number of labeled examples, and (2) how to select the best existing classifier(s) for adaptation. For the first problem, we propose Adaptive Support Vector...
Curriculum learning (CL) or self-paced learning (SPL) represents a recently proposed learning regime inspired by the learning process of humans and animals that gradually proceeds from easy to more complex samples in training. The two methods share a similar conceptual learning paradigm, but differ in specific learning schemes. In CL, the curriculum is predetermined by prior knowledge and remains fixed thereafter. Therefore, this type of method heavily relies on the quality of prior knowledge while ignoring feedback about the learner. In SPL, the curriculum is dynamically determined to adjust...
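How a self-paced curriculum adapts to the learner can be sketched with the common hard-threshold weighting, in which a sample is included only when its current loss falls below an "age" parameter that grows during training. This is an assumption for illustration, not necessarily the scheme proposed in this paper; `self_paced_weights`, `lam`, and the loss values are hypothetical.

```python
def self_paced_weights(losses, lam):
    """Select samples whose current loss is below the threshold lam."""
    return [1 if loss < lam else 0 for loss in losses]

# Hypothetical per-sample losses reported by the current model.
losses = [0.1, 0.8, 0.3, 2.0]

# Raising lam over training admits progressively harder samples.
schedule = [self_paced_weights(losses, lam) for lam in (0.2, 0.5, 1.0)]
```

A predetermined curriculum would instead fix the sample order up front from prior knowledge, independent of these losses.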
In this paper, we propose a discriminative video representation for event detection over a large-scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame-level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating...
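The two standard aggregation baselines named here can be sketched as follows, assuming each frame is represented by a fixed-length descriptor (a plain list of floats). The function names and toy descriptors are hypothetical; the alternative aggregation that the paper goes on to propose is not shown, because the abstract is truncated before describing it.

```python
def average_pool(frames):
    """Element-wise mean of frame-level descriptors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def max_pool(frames):
    """Element-wise maximum of frame-level descriptors."""
    return [max(col) for col in zip(*frames)]

# Two hypothetical 3-dimensional frame descriptors from one video.
frames = [[0.0, 2.0, 1.0], [4.0, 0.0, 1.0]]
video_avg = average_pool(frames)
video_max = max_pool(frames)
```

Both operations collapse a variable number of frames into one video-level vector of the same dimension as a single frame descriptor.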
In real-world crowd counting applications, the crowd densities vary greatly in spatial and temporal domains. A detection-based counting method will estimate crowds accurately in low-density scenes, while its reliability in congested areas is downgraded. A regression-based approach, on the other hand, captures the general density information in crowded regions. Without knowing the location of each person, it tends to overestimate the count in low-density areas. Thus, exclusively using either one of them is not sufficient to handle all kinds of scenes with varying densities....
In this paper, we focus on complex event detection in internet videos while also providing the key evidence for the detection results. Convolutional Neural Networks (CNNs) have achieved promising performance in image classification and action recognition tasks. However, it remains an open problem how to use CNNs for video event detection and recounting, mainly due to the complexity and diversity of video events. In this work, we propose a flexible deep CNN infrastructure, namely Deep Event Network (DevNet), that simultaneously detects pre-defined events...
Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus, semisupervised feature selection is gaining increasing attention for its efficiency and comprehensibility. Most previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter tuning problem underlying the construction of the graph, the affinity...
The goal of this paper is to build robust human action recognition for real-world surveillance videos. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Current approaches tend to extend spatial descriptions by adding a temporal component to the appearance descriptor, which only implicitly captures motion information. We propose an algorithm called MoSIFT, which detects interest points and encodes not only their local appearance but also explicitly models...
Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low-frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature-enhancing technique that encodes scale-invariant characteristics into the feature space in an attempt to deal with this attenuation. However, at the core of the Gaussian Pyramid is a convolutional smoothing operation, which makes it incapable of generating...
Multimedia event detection has been one of the major endeavors in video event analysis. A variety of approaches have been proposed recently to tackle this problem. Among others, using semantic representation has been accredited for its promising performance and desirable ability for human-understandable reasoning. To generate semantic representation, we usually utilize several external image/video archives and apply the concept detectors trained on them to the event videos. Due to the intrinsic difference of these archives, the resulting representation is presumed to have different...
Feature selection is one of the most important dimension reduction techniques for its efficiency and interpretation. Since practical data in large scale are usually collected without labels, and labeling these data is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset, which preserves the intrinsic structure accurately. In this paper,...
Fault diagnosis and remaining useful life (RUL) prediction are always two major issues in modern industrial systems. They are usually regarded as two separate tasks to make the problem easier, which ignores the fact that certain information of these two tasks can be shared to improve the performance. Therefore, to capture common features between different relative problems, a joint-loss convolutional neural network (JL-CNN) architecture is proposed in this paper, which can implement bearing fault recognition and RUL prediction in parallel by sharing...
The aim of crowd counting is to estimate the number of people in images by leveraging the annotation of center positions for pedestrians' heads. Promising progress has been made with the prevalence of deep Convolutional Neural Networks. Existing methods widely employ the Euclidean distance (i.e., L2 loss) to optimize the model, which, however, has two main drawbacks: (1) the loss has difficulty in learning the spatial awareness...
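For reference, the Euclidean (L2) objective mentioned here can be sketched on toy density maps. The sketch assumes density maps stored as nested lists whose entries sum to the person count; the function names and toy maps are hypothetical, and it illustrates only the pixel-wise nature of the loss, not the alternative the paper proposes.

```python
def count_from_density(density):
    """The estimated count is the sum of all density-map entries."""
    return sum(sum(row) for row in density)

def l2_loss(pred, target):
    """Sum of squared pixel-wise differences between two density maps."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )

# Hypothetical 2x2 maps: one person, with the prediction shifted by a pixel.
target = [[0.0, 1.0], [0.0, 0.0]]
shifted = [[1.0, 0.0], [0.0, 0.0]]

loss = l2_loss(shifted, target)
counts = (count_from_density(shifted), count_from_density(target))
```

Here both maps yield the same count, yet the loss is nonzero because each pixel is compared independently of its neighbors.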
Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found that too strict pixel-level spatial invariance would cause overfitting to noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. The purpose of this is to allow the feature extraction process to potentially stimulate the density map generation process to overcome...
Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, Jianfeng Gao. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation...
Multimedia data are usually represented by multiple features. In this paper, we propose a new algorithm, namely Multi-feature Learning via Hierarchical Regression, for multimedia semantics understanding, where two issues are considered. First, labeling a large amount of training data is labor-intensive. It is meaningful to effectively leverage unlabeled data to facilitate multimedia semantics understanding. Second, given that multimedia data can be represented by multiple features, it is advantageous to develop an algorithm which combines evidence obtained from different features...
In this paper we propose a unified action recognition framework fusing local descriptors and holistic features. The motivation is that the local descriptors and holistic features emphasize different aspects of actions and are suitable for different types of action databases. The proposed unified framework is based on frame differencing, bag-of-words, and feature fusion. We extract two kinds of local descriptors, i.e. 2D and 3D SIFT feature descriptors, both based on 2D SIFT interest points. We apply Zernike moments to extract two kinds of holistic features, one based on single frames and the other based on the motion energy image. We perform experiments on the KTH and Weizmann databases, using...
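Frame differencing and the motion energy image can be sketched as follows, assuming grayscale frames stored as nested lists and the usual definition of the motion energy image as the union of thresholded frame differences over a clip. The threshold value, function names, and toy frames are hypothetical, and the paper's exact preprocessing may differ.

```python
def frame_difference(prev, curr, threshold):
    """Binary mask of pixels whose intensity changed by more than threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]

def motion_energy_image(frames, threshold=10):
    """Union (logical OR) of frame-difference masks across the whole clip."""
    height, width = len(frames[0]), len(frames[0][0])
    mei = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        diff = frame_difference(prev, curr, threshold)
        mei = [[m | d for m, d in zip(m_row, d_row)] for m_row, d_row in zip(mei, diff)]
    return mei

# Hypothetical 2x3 clip in which a bright region grows from left to right.
frames = [
    [[0, 0, 0], [0, 0, 0]],
    [[50, 0, 0], [0, 0, 0]],
    [[50, 50, 0], [0, 0, 0]],
]
mei = motion_energy_image(frames)
```

The resulting binary image marks every pixel that moved at some point in the clip, which is the kind of holistic input a shape descriptor such as Zernike moments could then summarize.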
In order to exploit the abundant potential information of the unlabeled data and contribute to analyzing the correlation among heterogeneous data, we propose a semi-supervised model named adaptive semi-supervised feature selection for cross-modal retrieval. First, we utilize semantic regression to strengthen the neighboring relationship between data with the same semantics, and the correlation between heterogeneous data can be optimized by keeping the pairwise closeness when learning the common latent space. Second, we adopt the graph-based constraint to predict accurate labels for unlabeled data, and it can also keep...