- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Generative Adversarial Networks and Image Synthesis
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Video Analysis and Summarization
- Video Surveillance and Tracking Methods
- Advanced Neural Network Applications
- Face Recognition and Analysis
- Evacuation and Crowd Dynamics
- Image Retrieval and Classification Techniques
- Human Motion and Animation
- Anomaly Detection Techniques and Applications
- Data Visualization and Analytics
- Quantum-Dot Cellular Automata
- Music and Audio Processing
- Image Enhancement Techniques
- Slime Mold and Myxomycetes Research
- Traffic Control and Management
- Advanced Image Processing Techniques
- Gait Recognition and Analysis
- Humor Studies and Applications
- Advanced Memory and Neural Computing
- Cell Image Analysis Techniques
- Visual Attention and Saliency Detection
École Polytechnique
2021-2024
Laboratoire d'Informatique de l'École Polytechnique
2021-2024
Centre National de la Recherche Scientifique
2021-2024
University of Oxford
2019-2021
Oxford Research Group
2020
Democritus University of Thrace
2010-2019
University of Edinburgh
2015-2017
Université Grenoble Alpes
2015-2017
Laboratoire Jean Kuntzmann
2015
Institut national de recherche en informatique et en automatique
2015
Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector), which takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. In the same way that state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon SSD...
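The tubelet abstraction above can be sketched in a few lines: an anchor cuboid (one box replicated over the frames) is regressed into a per-frame box sequence and scored by averaging per-frame action scores. The data layout, the delta encoding, and all names below are illustrative assumptions, not the actual ACT-detector implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tubelet:
    """A sequence of per-frame boxes (x1, y1, x2, y2) with one action score."""
    boxes: List[Tuple[float, float, float, float]]
    score: float

def score_tubelet(anchor_box, frame_deltas, frame_scores):
    """Turn an anchor cuboid into a tubelet by applying per-frame
    regression deltas, and score it as the mean of per-frame scores."""
    x1, y1, x2, y2 = anchor_box
    boxes = [(x1 + dx1, y1 + dy1, x2 + dx2, y2 + dy2)
             for dx1, dy1, dx2, dy2 in frame_deltas]
    return Tubelet(boxes=boxes, score=sum(frame_scores) / len(frame_scores))
```

For instance, a two-frame cuboid anchored at `(0, 0, 10, 10)` with small per-frame shifts yields two shifted boxes sharing one averaged score.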
While most existing approaches for detection in videos focus on objects or human actions separately, we aim at jointly detecting objects performing actions, such as a cat eating or a dog jumping. We introduce an end-to-end multitask objective that learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting objects and actions in videos, and show that both tasks of object and action detection benefit from this joint learning. Moreover, the proposed architecture can be used for zero-shot...
In this article, the problem of real-time robot exploration and map building (active SLAM) is considered. A single stereo vision camera is exploited by a fully autonomous robot to navigate, localize itself, map its surroundings, and avoid any possible obstacle, with the aim of maximizing the mapped region while following an optimal route. A modified version of the so-called cognitive-based adaptive optimization algorithm is introduced to successfully complete these tasks in real time without local minima entrapment. The method's effectiveness...
Object detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding-boxes from still images. Recently, video has been used as an alternative source of data. Yet, for a given test domain (image or video), the performance of the detector depends on the domain it was trained on. In this paper, we examine the reasons behind this performance gap. We define and evaluate different domain shift factors: spatial location accuracy, appearance diversity, image quality and aspect distribution. We study the impact of these factors by...
Capturing the 'mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net takes spatio-temporal tracks as input and reasons about the whole track. It consists of three branches, one for each character's tracked head and one for their relative position. Moreover,...
Reinforcement Learning is an area of Machine Learning focused on how agents can be trained to make sequential decisions and achieve a particular goal within an arbitrary environment. While learning, they repeatedly take actions based on their observation of the environment and receive appropriate rewards, which define the objective. This experience is then used to progressively improve the policy controlling the agent's behavior, typically represented by a neural network. This trained module can be reused for similar problems, which makes this...
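The act-observe-reward loop described above can be illustrated with a minimal tabular Q-learning sketch on a toy chain environment, where a lookup table stands in for the neural network. The environment, hyperparameters, and all names are assumptions chosen for illustration only:

```python
import random

def train_q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0 moves
    left; taking action 1 in the last state yields reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    for _ in range(episodes):
        s = rng.randrange(n_states)  # exploring starts: random initial state
        for _ in range(50):          # step limit per episode
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] >= q[s][1] else 1
            if a == 1 and s == n_states - 1:
                r, s2, done = 1.0, s, True              # goal reached
            else:
                r = 0.0
                s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
                done = False
            # temporal-difference update toward the bootstrapped target
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q
```

After training, the greedy policy (argmax over each row of `q`) moves right everywhere, i.e. the learned table encodes the behavior that maximizes cumulative reward.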
Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most prior works have addressed the problem by relying on style transfer techniques, where the source images are stylized to have the appearance of the target domain. Departing from the common notion of transferring only the "texture" information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate synthetic...
Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground truth in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames, as they contain visual information indispensable...
In all living organisms, self-preservation behaviour is almost universal. Even the most simple of organisms, like slime mould, are typically under intense selective pressure to evolve a response that ensures their evolution and safety in the best possible way. On the other hand, the evacuation of a place can easily be characterized as one of the most stressful situations for the individuals taking part in it. Taking inspiration from slime mould behaviour, we introduce a computational bio-inspired crowd evacuation model. Cellular Automata (CA) were...
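A minimal CA-style evacuation sketch, assuming a static floor field (BFS distance to the exits) and a greedy update in which each pedestrian steps to the free neighbour closest to an exit. This is an illustrative toy under those assumptions, not the bio-inspired model proposed in the paper:

```python
from collections import deque

def distance_field(grid, exits):
    """BFS distance from every free cell to the nearest exit (4-neighbourhood).
    grid: 2D list where 0 = free cell, 1 = wall."""
    h, w = len(grid), len(grid[0])
    dist = [[float("inf")] * w for _ in range(h)]
    dq = deque()
    for r, c in exits:
        dist[r][c] = 0
        dq.append((r, c))
    while dq:
        r, c = dq.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] == 0 \
               and dist[nr][nc] > dist[r][c] + 1:
                dist[nr][nc] = dist[r][c] + 1
                dq.append((nr, nc))
    return dist

def step(pedestrians, dist, exits):
    """One CA update: pedestrians (processed nearest-exit first) move to the
    unoccupied neighbour with the smallest exit distance; pedestrians standing
    on an exit leave the grid."""
    occupied = set(pedestrians)
    for r, c in sorted(pedestrians, key=lambda p: dist[p[0]][p[1]]):
        occupied.discard((r, c))
        if (r, c) in exits:
            continue  # evacuated
        best = (r, c)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(dist) and 0 <= nc < len(dist[0]) \
               and (nr, nc) not in occupied \
               and dist[nr][nc] < dist[best[0]][best[1]]:
                best = (nr, nc)
        occupied.add(best)
    return occupied
```

On a 1x5 corridor with an exit at the right end, three pedestrians queued at the left file out one per step until the grid is empty.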
Gait recognition systems typically rely solely on silhouettes for extracting gait signatures. Nevertheless, these approaches struggle with changes in body shape and dynamic backgrounds; a problem that can be alleviated by learning from multiple modalities. However, in many real-life systems some modalities can be missing, and therefore most existing multimodal frameworks fail to cope with missing modalities. To tackle this problem, in this work we propose UGaitNet, a unifying framework for gait recognition that is robust to missing modalities. UGaitNet handles and mingles various...
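One simple way to make a multimodal pipeline tolerate missing modalities, in the spirit described above, is to fuse only the embeddings that are actually present. The `fuse_modalities` helper and the averaging rule below are illustrative assumptions, not the actual UGaitNet mingling mechanism:

```python
import numpy as np

def fuse_modalities(embeddings):
    """Fuse per-modality embedding vectors into one gait signature,
    skipping modalities marked as missing (None). Averaging the available
    embeddings keeps the output dimensionality fixed regardless of which
    modalities are present."""
    available = [e for e in embeddings.values() if e is not None]
    if not available:
        raise ValueError("at least one modality is required")
    return np.mean(available, axis=0)
```

With silhouette and optical-flow embeddings present but depth missing, the signature is simply the mean of the two available vectors.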
Image style transfer has attracted widespread attention in the past few years. Despite its remarkable results, it requires additional style images available as references, making it less flexible and inconvenient. Using text is the most natural way to describe a style. More importantly, text can describe implicit abstract styles, like the styles of specific artists or art movements. In this paper, we propose a text-driven image style transfer (TxST) method that leverages advanced image-text encoders to control arbitrary style transfer. We introduce...
Quantum-dot fabrication and characterization is a well-established technology, which is used in photonics, quantum optics, and nanoelectronics. Four quantum-dots placed at the corners of a square form a unit cell, which can hold a bit of information and serve as the basis for quantum-dot cellular automata (QCA) nanoelectronic circuits. Although several basic QCA circuits have been designed, fabricated, and tested, proving that QCA can deliver functional, fast and low-power circuits, QCA nanoelectronics still remains in its infancy. One of the reasons for this...
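The logic behaviour of the basic QCA circuits mentioned above is conventionally built from the three-input majority gate: fixing one input to a constant 0 or 1 reduces it to AND or OR. A behavioural sketch of that truth-table logic (plain Boolean functions, not a physical cell simulation):

```python
def majority(a, b, c):
    """Three-input majority vote: 1 iff at least two inputs are 1.
    This is the fundamental logic gate of QCA circuit design."""
    return (a & b) | (b & c) | (a & c)

def qca_and(a, b):
    """Fixing one majority input to 0 yields a two-input AND."""
    return majority(a, b, 0)

def qca_or(a, b):
    """Fixing one majority input to 1 yields a two-input OR."""
    return majority(a, b, 1)
```

Together with an inverter, the majority gate is functionally complete, which is why it anchors QCA circuit design.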
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de-facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each frame. This annotation process, however, is tedious and time-consuming. To reduce this cost, in this paper, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce...
The objective of this work is person-clustering in videos – grouping characters according to their identity. Previous methods focus on the narrower task of face-clustering, and for the most part ignore other cues such as the person's voice, their overall appearance (hair, clothes, posture), and the editing structure of the videos. Similarly, current datasets evaluate only face-clustering, rather than person-clustering. This limits their applicability to downstream applications such as story understanding, which require person-level, rather than face-level, reasoning. In...
Modern works on style transfer focus on transferring the style from a single image. Recently, some approaches study multiple style transfer; these, however, are either too slow or fail to mix multiple styles. We propose ST-VAE, a Variational AutoEncoder for latent space-based style transfer. It performs multiple style transfer by projecting nonlinear styles onto a linear latent space, enabling us to merge styles via linear interpolation before transferring the new style to the content image. To evaluate ST-VAE, we experiment on COCO and also present a case study revealing that ST-VAE outperforms other methods while being faster,...
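The linear-latent-space idea above can be illustrated by merging style codes with a convex combination before decoding. `mix_styles` is a hypothetical helper operating on plain vectors, not the ST-VAE API:

```python
import numpy as np

def mix_styles(latents, weights):
    """Merge style codes by convex combination in a linear latent space.
    Weights are normalized to sum to 1, so the result stays an
    interpolation of the input codes."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * zi for wi, zi in zip(w, np.asarray(latents)))
```

Mixing two style codes with equal weights lands exactly halfway between them, which is what makes interpolation-based style blending well defined in such a space.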
Image style transfer has attracted widespread attention in the past years. Despite its remarkable results, it requires additional style images available as references, making it less flexible and inconvenient. Using text is the most natural way to describe a style. Text can describe implicit abstract styles, like the styles of specific artists or art movements. In this work, we propose a text-driven style transfer (TxST) method that leverages advanced image-text encoders to control arbitrary style transfer. We introduce a contrastive training strategy...
Capturing the 'mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net++ takes spatio-temporal tracks as input and reasons about the whole...