Mattia Soldan

ORCID: 0000-0003-0413-8165
Research Areas
  • Video Analysis and Summarization
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Anomaly Detection Techniques and Applications
  • Video Surveillance and Tracking Methods
  • Gait Recognition and Analysis
  • Natural Language Processing Techniques
  • Advanced Image and Video Retrieval Techniques
  • Multimedia Communication and Technology
  • Music and Audio Processing
  • Wind and Air Flow Studies
  • Combustion and flame dynamics
  • Diabetic Foot Ulcer Assessment and Management
  • Cancer-related molecular mechanisms research
  • Advanced Data Storage Technologies
  • Topic Modeling
  • Subtitles and Audiovisual Media
  • Image and Video Quality Assessment
  • Fluid Dynamics and Turbulent Flows

King Abdullah University of Science and Technology
2019-2024

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from...

10.1109/cvpr52688.2022.00497 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Video-Language Pretraining (VLP), which aims to learn transferable representations to advance a wide range of video-text downstream tasks, has recently received increasing attention. The best-performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a 1st-person video-text pretraining dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a large variety of human...

10.48550/arxiv.2206.01670 preprint EN other-oa arXiv (Cornell University) 2022-01-01
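
The pretraining objective behind this line of work is typically a symmetric video-text contrastive loss. Below is a minimal InfoNCE-style sketch of that idea; the embedding dimension, temperature, and the use of random features in place of real encoders are illustrative assumptions, not the paper's exact EgoNCE objective.

```python
import torch
import torch.nn.functional as F

def clip_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip/text embeddings.

    video_emb, text_emb: (B, D) tensors; row i of each is a positive pair.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)     # match each clip to its text
    loss_t2v = F.cross_entropy(logits.T, targets)   # and each text to its clip
    return 0.5 * (loss_v2t + loss_t2v)

# toy usage: random features standing in for encoder outputs
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(clip_text_contrastive_loss(video_emb, text_emb))
```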

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding the videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their alignment...

10.1109/iccvw54120.2021.00361 article EN 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 2021-10-01
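
To make the graph-matching idea concrete, here is a minimal sketch of a single graph-convolution layer mixing features across a joint video-snippet/query-token graph. The mean-aggregation layer, the fully connected cross-modal edges, and all sizes are assumptions for illustration; the paper's actual graph construction and matching layers are richer.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution layer: aggregate neighbor features, then project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, D) node features; adj: (N, N) adjacency with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        agg = (adj @ x) / deg               # mean aggregation over neighbors
        return torch.relu(self.proj(agg))

# toy cross-modal graph: nodes are video snippets plus query tokens, and
# cross-modal edges let information flow between the two modalities
snippets, tokens, dim = 16, 8, 64
x = torch.randn(snippets + tokens, dim)
adj = torch.eye(snippets + tokens)
adj[:snippets, snippets:] = 1.0             # connect every snippet to every token
adj[snippets:, :snippets] = 1.0
layer = GraphConv(dim)
fused = layer(x, adj)                       # (24, 64) cross-modal node features
print(fused.shape)
```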

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model...

10.1109/iccv51070.2023.01257 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
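
A schematic of the guided framework described above, under illustrative assumptions: a lightweight guidance scorer rates each short window as describable or not, and only the surviving windows are passed to an arbitrary base grounding model. The function names, threshold, and stub models below are hypothetical.

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start, end) in seconds

def guided_grounding(
    windows: List[Window],
    guidance_score: Callable[[Window], float],             # describability score
    base_model: Callable[[Window], Tuple[Window, float]],  # per-window grounding
    keep_threshold: float = 0.5,
) -> Window:
    """Prune non-describable windows, then ground the query on the survivors."""
    survivors = [w for w in windows if guidance_score(w) >= keep_threshold]
    # fall back to all windows if the guidance model prunes everything
    candidates = survivors or windows
    predictions = [base_model(w) for w in candidates]
    best_span, _ = max(predictions, key=lambda p: p[1])  # highest-confidence span
    return best_span

# toy usage: a 2-hour video split into 60 s windows, with stub models
windows = [(float(t), t + 60.0) for t in range(0, 7200, 60)]
span = guided_grounding(
    windows,
    guidance_score=lambda w: 1.0 if 3000 <= w[0] < 3300 else 0.1,
    base_model=lambda w: (w, 0.9 if w[0] == 3060 else 0.2),
)
print(span)
```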

Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the temporal boundaries of events is subjective, which may confuse the model. To alleviate...

10.48550/arxiv.2304.02934 preprint EN cc-by-nc-nd arXiv (Cornell University) 2023-01-01

10.1109/cvpr52733.2024.00711 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Smartphones and wearable devices are fast-growing technologies that, in conjunction with advances in wireless sensor hardware, are enabling ubiquitous sensing applications. Wearables are suitable for indoor and outdoor scenarios, can be placed on many parts of the human body, and integrate a large number of sensors capable of gathering physiological and behavioral biometric information. Here, we are concerned with gait analysis systems that extract meaningful information from a user's movements to identify anomalies and changes in their...

10.48550/arxiv.1911.08608 preprint EN other-oa arXiv (Cornell University) 2019-01-01
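
As a rough illustration of the kind of processing such gait analysis systems perform, the sketch below slices an accelerometer-magnitude signal into overlapping windows and flags windows whose simple statistics deviate from a healthy-gait baseline. The features, window sizes, and synthetic signal are assumptions, not the paper's pipeline.

```python
import numpy as np

def gait_windows(accel_mag, fs=50, win_s=2.0, hop_s=1.0):
    """Slice an accelerometer-magnitude signal into overlapping windows."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    return np.stack([accel_mag[i:i + win]
                     for i in range(0, len(accel_mag) - win + 1, hop)])

def anomaly_scores(windows, baseline_mean, baseline_std):
    """z-score of simple per-window statistics against a healthy-gait baseline."""
    feats = np.stack([windows.mean(axis=1), windows.std(axis=1)], axis=1)
    return np.abs((feats - baseline_mean) / baseline_std).max(axis=1)

# toy usage: 60 s of synthetic walking (~2 Hz steps) with a perturbed segment
fs = 50
t = np.arange(0, 60, 1 / fs)
sig = 1.0 + 0.3 * np.sin(2 * np.pi * 2.0 * t)
sig[1500:1750] += 0.8                      # simulated gait anomaly
w = gait_windows(sig, fs)
scores = anomaly_scores(w, baseline_mean=np.array([1.0, 0.21]),
                        baseline_std=np.array([0.05, 0.05]))
print(scores.argmax())                     # window index flagged as most anomalous
```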

In the context of large eddy simulation of turbulent reacting flows, flamelet-based models are key to affordable simulations of complex systems. However, as the complexity of the problem increases, higher-dimensional look-up tables are required, rendering the conventional tabulation procedure too demanding. This work focuses on accelerating the estimation of flamelet-based data for the flamelet/progress variable model via an artificial neural network. The network hyper-parameters are defined by a Bayesian optimization and two different...

10.2514/6.2021-0412 article EN AIAA SCITECH 2021 Forum 2021-01-04
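
A minimal sketch of the core idea, assuming the network maps flamelet-table coordinates (e.g. mixture fraction and progress variable) to tabulated thermochemical quantities so that per-cell table interpolation can be replaced by a forward pass. The layer sizes, activations, and toy targets are illustrative, and the Bayesian hyper-parameter optimization is omitted.

```python
import torch
import torch.nn as nn

# Surrogate for a flamelet/progress-variable look-up table: inputs are the
# table coordinates, outputs are the tabulated quantities (e.g. source term, T).
table_surrogate = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 2),
)

optimizer = torch.optim.Adam(table_surrogate.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# toy training data standing in for samples drawn from a precomputed table
coords = torch.rand(1024, 2)                    # (mixture fraction, progress var)
targets = torch.stack([torch.sin(coords[:, 0] * 3.14),
                       coords.prod(dim=1)], dim=1)

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(table_surrogate(coords), targets)
    loss.backward()
    optimizer.step()

print(float(loss))  # the surrogate now replaces per-cell table interpolation
```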

Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation techniques and models movies and trailers as sequences of shots, thus formulating the trailer generation problem as a sequence-to-sequence task. We introduce Trailer Generation...

10.48550/arxiv.2404.03477 preprint EN arXiv (Cornell University) 2024-04-04
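
To illustrate the sequence-to-sequence framing, the sketch below treats a movie as a sequence of shot embeddings and decodes a trailer as a much shorter shot sequence with a standard transformer encoder-decoder. The dimensions, the off-the-shelf nn.Transformer, and the next-shot scoring are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# A movie is a sequence of shot embeddings; a trailer is a (much shorter)
# sequence of shot embeddings decoded from it, seq2seq style.
dim = 128
model = nn.Transformer(d_model=dim, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

movie_shots = torch.randn(1, 300, dim)     # ~300 shots in the full movie
trailer_so_far = torch.randn(1, 20, dim)   # trailer shots selected so far

# the decoder output at the last position scores what the next shot should be
decoded = model(movie_shots, trailer_so_far)          # (1, 20, dim)
next_shot_scores = decoded[:, -1] @ movie_shots[0].T  # similarity to movie shots
next_shot = int(next_shot_scores.argmax())
print(next_shot)
```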

This study investigates whether Compressed-Language Models (CLMs), i.e. language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression, such as entropy coding and run-length encoding. We test CLMs by probing their capabilities to perform operations along three axes: recognition of inherent file properties, handling of files with anomalies,...

10.48550/arxiv.2405.17146 preprint EN arXiv (Cornell University) 2024-05-27
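
The premise is that any compressed file is just a byte stream, so it can be fed to a language model over a 256-symbol vocabulary. A minimal sketch of that framing follows; the toy buffer and the marker-based well-formedness probe are illustrative stand-ins for the paper's probing tasks.

```python
import io
import torch

# Byte-level "tokenization" of a compressed file: each byte is one token in a
# 256-symbol vocabulary, so any compressed file format fits a language model.
def bytes_to_tokens(path_or_buf) -> torch.Tensor:
    if hasattr(path_or_buf, "read"):
        data = path_or_buf.read()
    else:
        with open(path_or_buf, "rb") as f:
            data = f.read()
    return torch.tensor(list(data), dtype=torch.long)  # values in [0, 255]

# toy "JPEG-like" buffer: SOI marker, payload bytes, EOI marker
buf = io.BytesIO(bytes([0xFF, 0xD8]) + bytes(range(32)) + bytes([0xFF, 0xD9]))
tokens = bytes_to_tokens(buf)

# a probe in the spirit of "recognition of inherent file properties": does the
# stream carry valid start/end markers?
is_wellformed = bool(tokens[0] == 0xFF) and bool(tokens[1] == 0xD8) \
    and bool(tokens[-2] == 0xFF) and bool(tokens[-1] == 0xD9)
print(tokens.shape, is_wellformed)
```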

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from...

10.48550/arxiv.2112.00431 preprint EN cc-by arXiv (Cornell University) 2021-01-01

We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges, as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents a candidate moment as a set of regions within a series of short video clips and aligns a natural language query to the moment's regions. The alignment cost compares variable-length features using symmetric squared...

10.48550/arxiv.1907.12763 preprint EN cc-by-nc-sa arXiv (Cornell University) 2019-01-01
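
The abstract truncates before fully naming the alignment cost; as an illustration only, the sketch below implements one plausible symmetric squared-distance (Chamfer-style) cost between variable-length language and region feature sets. It should not be read as the paper's exact definition.

```python
import torch

def symmetric_squared_alignment(query_feats, region_feats):
    """Chamfer-style symmetric cost between variable-length feature sets.

    query_feats: (Q, D) language token features; region_feats: (R, D) region
    features. Each item is matched to its nearest counterpart in the other set.
    """
    d2 = torch.cdist(query_feats, region_feats).pow(2)  # (Q, R) squared dists
    q_to_r = d2.min(dim=1).values.mean()   # each word to its closest region
    r_to_q = d2.min(dim=0).values.mean()   # each region to its closest word
    return q_to_r + r_to_q

# toy usage: 7 query tokens vs. 12 spatio-temporal regions of a candidate moment
cost = symmetric_squared_alignment(torch.randn(7, 64), torch.randn(12, 64))
print(float(cost))  # lower cost -> better-aligned candidate moment
```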

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model...

10.48550/arxiv.2302.13372 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding the videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their alignment...

10.48550/arxiv.2011.10132 preprint EN other-oa arXiv (Cornell University) 2020-01-01

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Queries (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Especially, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from the perspectives of pretraining dataset, pretraining objective, and development set. Based on the above three designs, we develop a pretrained video-language model that is able to transfer its egocentric...

10.48550/arxiv.2207.01622 preprint EN other-oa arXiv (Cornell University) 2022-01-01