- Generative Adversarial Networks and Image Synthesis
- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Domain Adaptation and Few-Shot Learning
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Face Recognition and Analysis
- Anomaly Detection Techniques and Applications
- Face and Expression Recognition
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Digital Media Forensic Detection
- Visual Attention and Saliency Detection
- Learning Styles and Cognitive Differences
- Advanced Neural Network Applications
- Open Education and E-Learning
- Robotics and Sensor-Based Localization
- Remote-Sensing Image Classification
- Image and Object Detection Techniques
- Video Analysis and Summarization
- Computer Graphics and Visualization Techniques
- Semantic Web and Ontologies
- Advanced Image Processing Techniques
- Intelligent Tutoring Systems and Adaptive Learning
- Gaze Tracking and Assistive Technology
University of Modena and Reggio Emilia
2022-2024
University of Trento
2014-2022
Italian Institute of Technology
2011-2013
Sapienza University of Rome
1997-2011
Centro di Ricerca in Matematica Pura ed Applicata
2002-2006
University of Salerno
2006
Roma Tre University
2003-2006
In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1...
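A minimal NumPy sketch of the nearest-neighbour idea mentioned above (not the paper's implementation; the single-channel images, neighbourhood size `k`, and edge padding are illustrative assumptions): instead of a strict pixel-to-pixel L1 comparison, each generated pixel is compared against the best-matching pixel in a small neighbourhood of the target, which tolerates small spatial misalignments.

```python
import numpy as np

def nn_loss(generated, target, k=1):
    """Toy nearest-neighbour loss: for every pixel of `generated`, take the
    minimum absolute difference over a (2k+1)x(2k+1) neighbourhood of
    `target`, then average. A plain L1 loss is the special case k=0."""
    h, w = generated.shape
    padded = np.pad(target, k, mode="edge")
    best = np.full((h, w), np.inf)
    for dy in range(2 * k + 1):          # scan all offsets in the window
        for dx in range(2 * k + 1):
            shifted = padded[dy:dy + h, dx:dx + w]
            best = np.minimum(best, np.abs(generated - shifted))
    return best.mean()
```

With `k=1`, an image that is a one-pixel translation of the target incurs zero loss, whereas a plain L1 loss would penalise every misaligned pixel.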
In this paper we address the abnormality detection problem in crowded scenes. We propose to use Generative Adversarial Nets (GANs), which are trained using normal frames and the corresponding optical-flow images in order to learn an internal representation of the scene normality. Since our GANs are trained with only normal data, they are not able to generate abnormal events. At testing time the real data are compared with both the appearance and the motion representations reconstructed by our GANs, and abnormal areas are detected by computing local differences. Experimental results on...
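The detection step above can be sketched in a few lines of NumPy (a toy stand-in, assuming single-channel frames, a fixed patch size, and a hand-picked threshold; the real method reconstructs both appearance and optical flow with trained GANs): patches that the normality model fails to reconstruct get a high local error and are flagged as abnormal.

```python
import numpy as np

def local_difference_map(real, reconstructed, patch=4, thresh=0.5):
    """Score abnormality by patch-wise squared differences between a real
    frame and its reconstruction; a generator trained only on normal data
    is assumed to reconstruct normal regions well, so high-error patches
    are flagged as abnormal."""
    h, w = real.shape
    err = (real - reconstructed) ** 2
    # Average the per-pixel error inside each non-overlapping patch.
    scores = err.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return scores, scores > thresh
```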
Most of the crowd abnormal event detection methods rely on complex hand-crafted features to represent the crowd motion and appearance. Convolutional Neural Networks (CNNs) have shown to be a powerful instrument with excellent representational capacities, which can alleviate the need for hand-crafted features. In this paper, we show that keeping track of the changes in the CNN features across time can be used to effectively detect local anomalies. Specifically, we propose to measure abnormality by combining semantic information (inherited from...
Abnormal crowd behaviour detection attracts a large interest due to its importance in video surveillance scenarios. However, the ambiguity and the lack of sufficient abnormal ground truth data make end-to-end training of deep networks hard in this domain. In this paper we propose to use Generative Adversarial Nets (GANs), which are trained to generate only the normal distribution of the data. During the adversarial GAN training, the discriminator (D) is used as a supervisor for the generator network (G) and vice versa. At testing time, D...
A classifier trained on a given dataset seldom works on other datasets obtained under different conditions, due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions. Additionally, we leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, ...
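The feature-whitening ingredient above can be illustrated with a minimal ZCA-style sketch in NumPy (an assumption-laden stand-in, not the paper's domain-alignment layers, which operate per domain inside the network): each batch of features is centred and transformed so that its covariance becomes approximately the identity, so source and target batches end up with matching second-order statistics.

```python
import numpy as np

def whiten(features, eps=1e-5):
    """ZCA-style batch whitening sketch: centre the batch, then apply the
    inverse square root of its covariance so the output covariance is
    (approximately) the identity. `eps` guards against tiny eigenvalues."""
    x = features - features.mean(axis=0)
    cov = x.T @ x / (x.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)               # cov is symmetric PSD
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return x @ w
```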
Previous works on facial expression analysis have shown that person-specific models are advantageous with respect to generic ones for recognizing the expressions of new users added to the gallery set. This finding is not surprising, due to the often significant inter-individual variability: different persons have different morphological aspects and express their emotions in different ways. However, acquiring person-specific labeled data for learning is a very time consuming process. In this work we propose a transfer learning method to compute...
In a weakly-supervised scenario, object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative, Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each image, which are treated as pseudo-ground truth in the next training iteration. However, the errors of an immature classifier can make the process drift, usually introducing many false positives into the training dataset. To...
In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and "foil" captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake ("foil word"). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word...
In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging the prior knowledge of a labeled set containing different, but related, classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss terms for the labeled and the unlabeled samples respectively, and often requiring auxiliary regularization terms. In this paper we depart from this traditional scheme and introduce a UNified Objective function (UNO) for discovering...
Hashing methods have been recently found very effective in the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. The traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions and obtain binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in this paper we introduce a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; 2)...
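The retrieval mechanics behind any such hashing scheme can be sketched in NumPy (a toy stand-in: a random linear projection plays the role of the learned hash functions, and both function names are illustrative): features are mapped to short binary codes, and search reduces to ranking by Hamming distance, which is why hashing-based retrieval is fast.

```python
import numpy as np

def hash_codes(features, projection):
    """Binarize features via the sign of a linear projection; here the
    projection is random, standing in for learned hash functions."""
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists
```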
Facial expression and gesture recognition algorithms are key enabling technologies for human-computer interaction (HCI) systems. State of the art approaches for automatic detection of body movements and for analyzing emotions from facial features heavily rely on advanced machine learning algorithms. Most of these methods are designed for the average user, but the "one-size-fits-all" assumption ignores diversity in cultural background, gender, ethnicity, and personal behavior, and limits their applicability in real-world scenarios. A...
Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives"). For the learning to be effective, many negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for SSL, which is based on the whitening of the latent-space features. The whitening operation has...
In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image x_a of a person and a target pose P(x_b), extracted from a different image x_b, we synthesize a new image of that person in the pose P(x_b) while preserving the visual details of x_a. In order to deal with pixel-to-pixel misalignments caused by the pose differences between P(x_a) and P(x_b), we introduce deformable...
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design should instead be learned from samples. In this paper, we empirically...
Figure 1: Our method generates smooth interpolations within and across domains in various image-to-image translation tasks. Here, we show gender, age and smile translations from CelebA-HQ [20] and animal translations from AFHQ [10].
The way in which human beings express emotions depends on their specific personality and cultural background. As a consequence, person-independent facial expression classifiers usually fail to accurately recognize expressions that vary between different individuals. On the other hand, training a person-specific classifier for each new user is a time consuming activity which involves collecting hundreds of labeled samples. In this paper we present a personalization approach in which only unlabeled target-specific data are...
We present an approach to the automatic localization of facial feature points which deals with pose, expression, and identity variations by combining 3D shape models with local image patch classification. The latter is performed by means of densely extracted SURF-like features, which we call DU-SURF, while the former is based on a multiclass version of the Hausdorff distance to address classification errors and nonvisible points. The final system is able to localize facial feature points in real-world scenarios, dealing with out-of-plane head rotations, expression...
Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose...
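The source of the train/test discrepancy is visible in the standard DDPM ancestral sampling loop, sketched below in NumPy on a toy scalar case (the sigma_t^2 = beta_t noise schedule and the stub noise predictor are illustrative assumptions, not the paper's method): at every step the network is fed the previously generated x_t, whereas during training it only ever saw ground-truth noisy samples, so per-step errors can compound along the chain.

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """Standard DDPM ancestral sampling loop. `eps_model(x, t)` predicts
    the noise in x at step t. Note: each iteration conditions the model
    on the previously *generated* x, which is where exposure bias enters."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.normal(size=shape)                     # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        z = rng.normal(size=shape) if t > 0 else 0.0
        eps = eps_model(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t]) \
            + np.sqrt(betas[t]) * z                # sigma_t^2 = beta_t choice
    return x
```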
Owing to the power of vision-language foundation models, e.g., CLIP, the area of image synthesis has seen recent important advances. Particularly, for style transfer, CLIP enables transferring more general and abstract styles without collecting style images in advance, as the style can be efficiently described with natural language, and the result is optimized by minimizing the CLIP similarity between the text description and the stylized image. However, directly using CLIP to guide style transfer leads to undesirable artifacts (mainly written words and unrelated...
We address the problem of the automatic extraction of foreground objects from videos. The goal is to provide a method for the unsupervised collection of samples which can be further used for object detection training without any human intervention. We use the well known Selective Search approach to produce an initial still-image based segmentation of the video frames. This set of proposals is pruned and temporally extended using optical flow and transductive learning. Specifically, we propose to use Dense Trajectories in order to robustly match...