Alexander G. Hauptmann

ORCID: 0000-0003-2123-0684
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Video Surveillance and Tracking Methods
  • Anomaly Detection Techniques and Applications
  • Video Analysis and Summarization
  • Advanced Image and Video Retrieval Techniques
  • Image Retrieval and Classification Techniques
  • Gait Recognition and Analysis
  • Hand Gesture Recognition Systems
  • Domain Adaptation and Few-Shot Learning
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Advanced Neural Network Applications
  • Speech and dialogue systems
  • Advanced Vision and Imaging
  • Topic Modeling
  • Face recognition and analysis
  • Fire Detection and Safety Systems
  • Face and Expression Recognition
  • Text and Document Classification Technologies
  • Autonomous Vehicle Technology and Safety
  • Human Motion and Animation
  • Speech and Audio Processing
  • Machine Learning and Algorithms
  • Generative Adversarial Networks and Image Synthesis

Carnegie Mellon University
2016-2025

Meta (Israel)
2021

Google (United States)
2020

Association for Computing Machinery
2019

MSIGHT Technologies (China)
2017

Microsoft Research Asia (China)
2012

Laboratoire d'Informatique de Paris-Nord
1988

The robust detection of small targets is one of the key techniques in infrared search and tracking applications. A novel small target detection method in a single infrared image is proposed in this paper. Initially, the traditional infrared image model is generalized to a new infrared patch-image model using local patch construction. Then, because of the non-local self-correlation property of the infrared background image, based on the new model, small target detection is formulated as an optimization problem of recovering low-rank and sparse matrices, which is effectively solved using stable principal component pursuit. Finally, a simple...

10.1109/tip.2013.2281420 article EN IEEE Transactions on Image Processing 2013-09-11

Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy while neglecting the class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes the Contrastive Adaptation Network (CAN), optimizing a new metric which explicitly models the intra-class and inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner....

10.1109/cvpr.2019.00503 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Based on keypoints extracted as salient image patches, an image can be described as a "bag of visual words", and this representation has been used in scene classification. The choice of dimension, selection, and weighting of visual words in this representation is crucial to the classification performance but has not been thoroughly studied in previous work. Given the analogy between this representation and the bag-of-words representation of text documents, we apply techniques used in text categorization, including term weighting, stop word removal, and feature selection, to generate image representations that differ in the dimension, selection, and weighting of visual words. The impact of these...

10.1145/1290082.1290111 article EN 2007-09-24
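
The term weighting mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each image has already been reduced to a list of visual-word ids (keypoint extraction and vocabulary construction are not shown), and it uses a plain tf-idf weight borrowed from text retrieval; the function name `tf_idf` and the toy data are hypothetical, and this is not claimed to be the weighting scheme the paper evaluates.

```python
import math
from collections import Counter


def tf_idf(image_words):
    """Weight visual words per image with a plain tf-idf scheme.

    image_words: list of lists, one list of visual-word ids per image
    (hypothetical input; quantizing keypoints into words is assumed done).
    Returns one {word_id: weight} dict per image.
    """
    n_images = len(image_words)
    # Document frequency: in how many images each visual word occurs.
    doc_freq = Counter()
    for words in image_words:
        doc_freq.update(set(words))
    vectors = []
    for words in image_words:
        term_freq = Counter(words)
        vectors.append(
            {w: term_freq[w] * math.log(n_images / doc_freq[w]) for w in term_freq}
        )
    return vectors


# Toy example: two images over a three-word vocabulary.
vectors = tf_idf([[0, 0, 1], [1, 2]])
```

Here word 1 occurs in every image, so its weight drops to zero, while words that are specific to one image keep a positive weight.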

Many multimedia applications can benefit from techniques for adapting existing classifiers to data with different distributions. One example is cross-domain video concept detection, which aims to adapt concept classifiers across various video domains. In this paper, we explore two key problems for classifier adaptation: (1) how to transform existing classifier(s) into an effective classifier for a new dataset that only has a limited number of labeled examples, and (2) how to select the best existing classifier(s) for adaptation. For the first problem, we propose Adaptive Support Vector...

10.1145/1291233.1291276 article EN Proceedings of the 15th ACM International Conference on Multimedia 2007-09-29

Curriculum learning (CL) or self-paced learning (SPL) represents a recently proposed learning regime inspired by the learning process of humans and animals that gradually proceeds from easy to more complex samples in training. The two methods share a similar conceptual learning paradigm, but differ in specific learning schemes. In CL, the curriculum is predetermined by prior knowledge and remains fixed thereafter. Therefore, this type of method heavily relies on the quality of prior knowledge while ignoring feedback about the learner. In SPL, the curriculum is dynamically determined to adjust...

10.1609/aaai.v29i1.9608 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2015-02-21
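
The self-paced idea described in the abstract above (easy samples first, harder ones later) can be sketched as follows. This is the classic hard-threshold form of self-paced learning, in which a sample is selected when its current loss falls below a growing threshold; it is an assumption made for illustration, not the method this paper proposes. The function names, the threshold schedule, and the fixed loss values are all hypothetical; in practice the losses would be recomputed after each round of training.

```python
def self_paced_weights(losses, threshold):
    """Hard self-paced weights: 1.0 if the sample's loss is below the threshold, else 0.0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def self_paced_schedule(losses, start_threshold, growth, rounds):
    """Return the indices of selected samples for each round as the threshold grows.

    Losses are held fixed here for illustration only; a real learner would
    retrain on the selected samples and recompute the losses every round.
    """
    selected_per_round = []
    threshold = start_threshold
    for _ in range(rounds):
        weights = self_paced_weights(losses, threshold)
        selected_per_round.append([i for i, w in enumerate(weights) if w > 0.0])
        threshold *= growth  # admit harder samples in the next round
    return selected_per_round


losses = [0.1, 0.9, 0.4, 2.0]
schedule = self_paced_schedule(losses, start_threshold=0.5, growth=2.5, rounds=3)
```

With these toy values the selected set grows from the two easiest samples to all four over three rounds.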

In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating...

10.1109/cvpr.2015.7298789 preprint EN 2015-06-01
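
For reference, the two standard aggregation baselines named in the abstract above, average pooling and max pooling, can be sketched in a few lines. The sketch assumes each frame is already described by a fixed-length list of floats (the CNN feature extraction itself is not shown), and the function names and toy descriptors are hypothetical. It illustrates only the baselines, not the representation the paper proposes.

```python
def average_pool(frame_descriptors):
    """Average each descriptor dimension over all frames of a video."""
    n_frames = len(frame_descriptors)
    return [sum(column) / n_frames for column in zip(*frame_descriptors)]


def max_pool(frame_descriptors):
    """Take the maximum of each descriptor dimension over all frames."""
    return [max(column) for column in zip(*frame_descriptors)]


# Toy video: two frames, each with a three-dimensional descriptor.
frames = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
video_avg = average_pool(frames)
video_max = max_pool(frames)
```

Both functions collapse a variable number of frames into one fixed-length video descriptor, which is what lets a standard classifier be applied at the video level.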

In real-world crowd counting applications, the crowd densities vary greatly in spatial and temporal domains. A detection based counting method will estimate crowds accurately in low density scenes, while its reliability in congested areas is downgraded. A regression based approach, on the other hand, captures the general density information in crowded regions. Without knowing the location of each person, it tends to overestimate the count in low density areas. Thus, exclusively using either one of them is not sufficient to handle all kinds of scenes with varying densities....

10.1109/cvpr.2018.00545 preprint EN 2018-06-01

In this paper, we focus on complex event detection in internet videos while also providing the key evidence for the detection results. Convolutional Neural Networks (CNNs) have achieved promising performance in image classification and action recognition tasks. However, it remains an open problem how to use CNNs for video event detection and recounting, mainly due to the complexity and diversity of video events. In this work, we propose a flexible deep CNN infrastructure, namely Deep Event Network (DevNet), that simultaneously detects pre-defined events...

10.1109/cvpr.2015.7298872 article EN 2015-06-01

Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus, semisupervised feature selection has gained increasing attention for its efficiency and comprehensibility. Most previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter tuning problem underlying the construction of the graph, the affinity...

10.1109/tcyb.2017.2647904 article EN IEEE Transactions on Cybernetics 2017-02-20

The goal of this paper is to build robust human action recognition for real world surveillance videos. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Current approaches tend to extend spatial descriptions by adding a temporal component to the appearance descriptor, which only implicitly captures motion information. We propose an algorithm called MoSIFT, which detects interest points and encodes not only their local appearance but also explicitly models...

10.1184/r1/6607523.v1 article EN 2009-01-01

Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature enhancing technique that encodes scale-invariant characteristics into the feature space in an attempt to deal with this attenuation. However, at the core of the Gaussian Pyramid is a convolutional smoothing operation, which makes it incapable of generating...

10.1109/cvpr.2015.7298616 article EN 2015-06-01

Multimedia event detection has been one of the major endeavors in video event analysis. A variety of approaches have been proposed recently to tackle this problem. Among others, using semantic representation has been accredited for its promising performance and desirable ability for human-understandable reasoning. To generate semantic representation, we usually utilize several external image/video archives and apply the concept detectors trained on them to the event videos. Due to the intrinsic difference of these archives, the resulting representation presumably has different...

10.1109/tcyb.2016.2539546 article EN publisher-specific-oa IEEE Transactions on Cybernetics 2016-03-28

Feature selection is one of the most important dimension reduction techniques for its efficiency and interpretation. Since practical data in large scale are usually collected without labels, and labeling these data is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometry structure of the original feature space and produce a faithful feature subset, which preserves the intrinsic structure accurately. In this paper,...

10.1109/tnnls.2017.2650978 article EN IEEE Transactions on Neural Networks and Learning Systems 2017-01-27

10.1007/s11263-017-1033-7 article EN International Journal of Computer Vision 2017-07-13

Fault diagnosis and remaining useful life (RUL) prediction are always two major issues in modern industrial systems, which are usually regarded as two separated tasks to make the problem easier but ignore the fact that there is certain information of these two tasks that can be shared to improve the performance. Therefore, to capture common features between different relative problems, a joint-loss convolutional neural network (JL-CNN) architecture is proposed in this paper, which can implement bearing fault recognition and RUL prediction in parallel by sharing...

10.1109/tii.2019.2915536 article EN IEEE Transactions on Industrial Informatics 2019-05-08

The aim of crowd counting is to estimate the number of people in images by leveraging the annotation of center positions for pedestrians' heads. Promising progress has been made with the prevalence of deep Convolutional Neural Networks. Existing methods widely employ the Euclidean distance (i.e., L2 loss) to optimize the model, which, however, has two main drawbacks: (1) the loss has difficulty in learning the spatial awareness...

10.1109/iccv.2019.00625 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
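
The Euclidean (L2) loss that the abstract above describes as the common baseline can be sketched as a sum of squared pixel differences between a predicted and a ground-truth density map, with the crowd count read off as the sum of a map. The sketch assumes density maps are plain nested lists of floats; the function names and toy maps are hypothetical, and it shows only the baseline loss, not the loss the paper proposes.

```python
def l2_loss(predicted, target):
    """Sum of squared per-pixel differences between two density maps."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    )


def estimated_count(density_map):
    """The crowd count is the integral (here, the sum) of the density map."""
    return sum(sum(row) for row in density_map)


# Toy 2x2 maps: same total count, but density placed in different pixels.
predicted = [[0.5, 0.5], [0.0, 1.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
loss = l2_loss(predicted, target)
```

In this toy case both maps sum to a count of 2.0 while the pixel-wise loss is non-zero, because the loss compares each pixel independently of its neighbors.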

Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found that too strict pixel-level spatial invariance would cause overfitting to noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. The purpose of this is to allow the feature extraction process to potentially stimulate the density map generation process to overcome...

10.1109/cvpr52688.2022.01902 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, Jianfeng Gao. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.70 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation...

10.1109/cvpr52729.2023.01008 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Multimedia data are usually represented by multiple features. In this paper, we propose a new algorithm, namely Multi-feature Learning via Hierarchical Regression, for multimedia semantics understanding, where two issues are considered. First, labeling a large amount of training data is labor-intensive. It is meaningful to effectively leverage unlabeled data to facilitate multimedia semantics understanding. Second, given that multimedia data can be represented by multiple features, it is advantageous to develop an algorithm which combines evidence obtained from different features...

10.1109/tmm.2012.2234731 article EN IEEE Transactions on Multimedia 2013-03-13

In this paper we propose a unified action recognition framework fusing local descriptors and holistic features. The motivation is that the local descriptors and holistic features emphasize different aspects of actions and are suitable for different types of action databases. The proposed unified framework is based on frame differencing, bag-of-words and feature fusion. We extract two kinds of local descriptors, i.e. 2D and 3D SIFT feature descriptors, both based on 2D SIFT interest points. We apply Zernike moments to extract two kinds of holistic features, one based on single frames and the other based on the motion energy image. We perform experiments on the KTH and Weizmann databases, using...

10.1109/cvprw.2009.5204255 article EN IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops 2009-06-01

In order to exploit the abundant potential information of the unlabeled data and contribute to analyzing the correlation among heterogeneous data, we propose a semi-supervised model named adaptive semi-supervised feature selection for cross-modal retrieval. First, we utilize semantic regression to strengthen the neighboring relationship between data with the same semantics. The correlation between heterogeneous data can be optimized by keeping the pairwise closeness when learning the common latent space. Second, we adopt a graph-based constraint to predict accurate labels for the unlabeled data, and it can also keep...

10.1109/tmm.2018.2877127 article EN IEEE Transactions on Multimedia 2018-10-22