Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities

DOI: 10.1609/aaai.v38i16.29718
Publication Date: 2024-03-25T11:55:15Z
ABSTRACT
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to at many semantic levels. For example, the abstract event "war" manifests at a lower level through subevents such as "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events, which is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only the unannotated video-article pairs of MultiHiEve. A thorough evaluation of the proposed method demonstrates improved performance on this task and highlights opportunities for future research. Data: https://github.com/hayyubi/multihieve
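
To make the cross-modal hierarchy concrete, below is a minimal Python sketch of how such parent-subevent relations could be represented, mirroring the "war" example from the abstract. The Event schema and parent_child_pairs helper are hypothetical illustrations, not the actual MultiHiEve annotation format.

    # Hypothetical schema for a cross-modal event hierarchy; this is
    # NOT the MultiHiEve annotation format, only an illustration of
    # the "war" example from the abstract.
    from dataclasses import dataclass, field
    from typing import Iterator, List, Tuple


    @dataclass
    class Event:
        """An event mention, grounded in either the video or text modality."""
        mention: str                 # surface form, e.g. "tanks firing"
        modality: str                # "video" or "text"
        subevents: List["Event"] = field(default_factory=list)


    # The abstract event "war" (from the article text) manifests at a
    # lower semantic level through subevents in both modalities.
    war = Event(
        mention="war",
        modality="text",
        subevents=[
            Event(mention="tanks firing", modality="video"),
            Event(mention="airplane shot down", modality="text"),
        ],
    )


    def parent_child_pairs(event: Event) -> Iterator[Tuple[str, str]]:
        """Yield (parent, subevent) mention pairs, i.e. the kind of
        hierarchical relations the proposed task extracts."""
        for sub in event.subevents:
            yield (event.mention, sub.mention)
            yield from parent_child_pairs(sub)


    if __name__ == "__main__":
        for parent, child in parent_child_pairs(war):
            print(f"{parent} -> {child}")

Running the sketch prints the two parent-subevent relations ("war -> tanks firing", "war -> airplane shot down"), which grounding alone, being restricted to same-level identity, would not recover.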