- Human Pose and Action Recognition
- Anomaly Detection Techniques and Applications
- Hand Gesture Recognition Systems
- Video Analysis and Summarization
- Gait Recognition and Analysis
- Video Surveillance and Tracking Methods
- Advanced Neural Network Applications
- Personality Traits and Psychology
- Mental Health Research Topics
- Generative Adversarial Networks and Image Synthesis
- Advanced Image and Video Retrieval Techniques
- Digital Mental Health Interventions
- Emotion and Mood Recognition
- Domain Adaptation and Few-Shot Learning
- Underwater Acoustics Research
- Hearing Impairment and Communication
- Human Motion and Animation
- Media Influence and Health
- Multimodal Machine Learning Applications
- Communication in Education and Healthcare
- Sports Analytics and Performance
- Context-Aware Activity Recognition Systems
- Speech and Dialogue Systems
- Robotics and Sensor-Based Localization
- Action Observation and Synchronization
Universitat de Barcelona
2015-2024
Barcelona Supercomputing Center
2023
Computer Vision Center
2012-2023
Aalborg University
2021-2023
Umbo Computer Vision (United Kingdom)
2023
Universitat Autònoma de Barcelona
2013-2022
Centre de Recerca Matemàtica
2013-2015
Interest in action and gesture recognition has grown considerably in recent years. In this paper, we present a survey of current deep learning methodologies for image sequences. We introduce a taxonomy that summarizes important aspects of approaching both tasks. The review details proposed architectures, fusion strategies, the main datasets, and competitions. We summarize and discuss the works published so far, with particular attention to how they treat the temporal dimension of the data, discussing their distinctive features and identifying opportunities...
Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends...
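The quadratic scaling with input length mentioned above can be made concrete with a toy calculation (the frame and patch counts below are illustrative, not taken from the survey):

```python
def attention_matrix_size(num_frames, tokens_per_frame):
    """Naive self-attention over all spatio-temporal tokens:
    for n tokens the attention map has n x n entries, i.e. O(n^2)."""
    n = num_frames * tokens_per_frame
    return n * n

# Doubling the temporal extent quadruples the attention cost,
# which is why video-specific designs try to tame this term.
short = attention_matrix_size(8, 196)   # 8 frames of 14x14 patches
long = attention_matrix_size(16, 196)   # 16 frames, same patch grid
print(long // short)  # -> 4
```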
This paper provides an overview of the Joint Contest on Multimedia Challenges Beyond Visual Analysis. We organized an academic competition focused on four problems that require effective processing of multimodal information in order to be solved. Two tracks were devoted to gesture spotting and recognition from RGB-D video, two fundamental problems for human-computer interaction. Another track was a second round of the first impressions challenge, whose goal was to develop methods to recognize personality traits from short video clips....
Effective conservation actions require effective population monitoring. However, accurately counting animals in the wild to inform decision-making is difficult. Monitoring populations through image sampling has made data collection cheaper, wide-reaching, and less intrusive, but has created a need to process and analyse this data efficiently. Counting animals from such images is challenging, particularly when they are densely packed in noisy images. Attempting this manually is slow and expensive, while traditional computer vision methods are limited...
Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language into text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign language translation (SLT). However, the scarcity of gloss-annotated data, combined with the information bottleneck in the mid-level representation, has hindered further development of the SLT task. To address this challenge, we propose...
Person re-identification is about recognizing people who have passed by a sensor earlier. Previous work is mainly based on RGB data, but in this work we for the first time present a system that combines RGB, depth, and thermal data for re-identification purposes. First, from each of the three modalities, we obtain some particular features: from RGB we model color information from different regions of the body, from depth we compute soft body biometrics, and from thermal we extract local structural information. Then, the three types of features are combined in a joined classifier. The tri-modal system is evaluated on a new...
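As a loose illustration of the feature-level fusion described above, the sketch below concatenates hypothetical per-modality descriptors into one joint vector for a single classifier to consume; all dimensions and descriptor names are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality descriptors for one person detection.
rgb_feat = rng.random(32)      # e.g. region-based color statistics
depth_feat = rng.random(8)     # e.g. soft body biometrics
thermal_feat = rng.random(16)  # e.g. local structural descriptors

# Feature-level fusion: one joint vector fed to a single classifier.
joint = np.concatenate([rgb_feat, depth_feat, thermal_feat])
print(joint.shape)  # -> (56,)
```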
Real age estimation in still images of faces is an active area of research in the computer vision community. However, very few works have attempted to analyse apparent age as perceived by observers. Apparent age estimation is a subjective task, which is affected by many factors present in the image as well as by the observer's characteristics. In this work, we enhance the APPA-REAL dataset, containing around 8K images with real and apparent ages, with new annotated attributes, namely gender, ethnicity, makeup, and expression. Age and gender annotations from a subset of guessers are also provided. We show...
This paper introduces UDIVA, a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different levels of behavior elicitation and cognitive workload. The dataset consists of 90.5 hours of interactions among 147 participants distributed in 188 sessions, recorded using multiple audiovisual and physiological sensors. Currently, it includes sociodemographic, self- and peer-reported personality, internal state, and relationship profiling from the participants. As an...
The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in a replay, (3) pitch localization, detecting line and goal part elements, (4) camera calibration, dedicated to retrieving intrinsic and extrinsic camera parameters, (5) player re-identification, retrieving the same players across multiple views, and (6) object tracking, tracking...
Personality computing has become an emerging topic in computer vision, due to the wide range of applications it can be used for. However, most works on the topic have focused on analyzing the individual, even when applied to interaction scenarios, and only for short periods of time. To address these limitations, we present the Dyadformer, a novel multi-modal multi-subject Transformer architecture that models individual and interpersonal features in dyadic interactions using variable time windows, thus allowing the capture of long-term...
The performance of different action recognition techniques has recently been studied by several computer vision researchers. However, the potential improvement in classification through classifier fusion by ensemble-based methods has remained unattended. In this work, we evaluate the performance of an ensemble of action learning techniques, each performing the recognition task from a different perspective. The underlying idea is that instead of aiming at a very sophisticated and powerful representation/learning technique, we can learn action categories using a set of relatively...
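The classifier-fusion idea can be sketched with majority voting, one common ensemble rule (a simplification for illustration; the paper's actual fusion scheme may differ):

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse the outputs of several independently trained action
    classifiers by keeping the most frequent predicted label."""
    return Counter(predictions).most_common(1)[0][0]

# Three weak classifiers disagree; the ensemble keeps the majority.
print(majority_vote(["run", "walk", "run"]))  # -> run
```

The appeal is that each weak classifier only needs to be informative from its own perspective; the fusion step absorbs individual mistakes.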
Action recognition is a challenging task that plays an important role in many robotic systems, which highly depend on visual input feeds. However, due to privacy concerns, it is important to find a method that can recognise actions without using a video feed. In this paper, we propose a concept for detecting actions while preserving the test subject's privacy. Our proposed method relies only on recording the temporal evolution of light pulses scattered back from the scene. Such a data trace recording one action contains a sequence of one-dimensional arrays...
In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent to the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, the non-visibility of certain actions, and label noise. To do so, it incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and produce precise predictions, (b) a balanced mixup strategy to handle the distribution...
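For context, standard mixup combines pairs of samples and their soft labels with a Beta-distributed coefficient; a minimal sketch follows (a class-balanced variant in ASTRA's spirit would additionally bias the second sample toward rare classes, but those details are assumed, not taken from the paper):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Standard mixup: convex-combine two inputs and their
    one-hot labels with a Beta(alpha, alpha) coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x, y = mixup(np.zeros(4), np.array([1.0, 0.0]),
             np.ones(4), np.array([0.0, 1.0]))
print(y.sum())  # -> 1.0 (the soft label remains a distribution)
```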
This paper summarizes the ChaLearn Looking at People 2020 Challenge on Identity-preserved Human Detection (IPHD). For this purpose, we released a large novel dataset containing more than 112K pairs of spatiotemporally aligned depth and thermal frames (and 175K instances of humans) sampled from 780 sequences. The sequences contain hundreds of non-identifiable people appearing in a mix of in-the-wild and scripted scenarios recorded in public and private places. The competition was divided into three tracks depending...
Automated skill assessment in sports using video-based analysis holds great potential for revolutionizing coaching methodologies. This paper focuses on the problem of skill determination of golfers by leveraging deep learning models applied to a large database of video recordings of golf swings. We investigate different regression-, ranking-, and classification-based methods and compare them against a simple baseline approach. The performance is evaluated using the mean squared error (MSE) as well as by computing the percentages of correctly ranked...
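The two reported quantities can be sketched generically as follows; this is not the paper's evaluation code, and the pairwise definition of "correctly ranked" is an assumption:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted skill scores."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def correctly_ranked_pct(y_true, y_pred):
    """Percentage of pairs (i, j) whose predicted ordering matches
    the ground-truth ordering; ties carry no ordering information."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    correct = total = 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue
            total += 1
            correct += (y_true[i] < y_true[j]) == (y_pred[i] < y_pred[j])
    return 100.0 * correct / total

print(mse([1, 2, 3], [1, 2, 5]))                        # one error of 2 -> 4/3
print(correctly_ranked_pct([1, 2, 3], [1.1, 3.0, 2.9])) # 2 of 3 pairs ordered
```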