- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Video Analysis and Summarization
- Visual Attention and Saliency Detection
- Human Pose and Action Recognition
- Adversarial Robustness in Machine Learning
- Image Retrieval and Classification Techniques
- Anomaly Detection Techniques and Applications
- Image and Video Quality Assessment
- Generative Adversarial Networks and Image Synthesis
- Video Surveillance and Tracking Methods
- Topic Modeling
- Biomedical Text Mining and Ontologies
- Language, Metaphor, and Cognition
- Advanced Data Compression Techniques
- Target Tracking and Data Fusion in Sensor Networks
Google (United Kingdom)
2024
DeepMind (United Kingdom)
2024
LMU Klinikum
2023
Ludwig-Maximilians-Universität München
2019-2023
Technical University of Munich
2017
Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with the various objects that are present in the image, it is an ambitious task that requires techniques from both computer vision and natural language processing. We propose a novel method that approaches this task by performing context-driven, sequential reasoning based on the semantic and spatial relationships of the objects in the scene. As a first step, we derive a scene graph which describes the objects in the image as well as their attributes...
Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel...
A serious problem in image classification is that a trained model might perform well for input data that originates from the same distribution as the data available during training, but performs much worse on out-of-distribution (OOD) samples. In real-world safety-critical applications, in particular, it is important to be aware if a new data point is OOD. To date, OOD detection is typically addressed using either confidence scores, auto-encoder based reconstruction, or contrastive learning. However, the global image context has not yet...
Visual relation detection methods rely on object information extracted from RGB images, such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations such as standing behind, but also non-spatial relations such as holding. In this work, we study the effect of using different object features, with a focus on depth maps. To enable this study, we release a new synthetic dataset named VG-Depth, an extension of Visual Genome (VG). We note that given...
The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances, such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the interaction among objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model node-to-node...
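The scene-graph representation described above can be illustrated with a minimal sketch: detected objects become nodes and pairwise relationships become directed, labeled edges. The class names, predicate labels, and the `SceneGraph` helper below are made up for the example and are not part of the proposed model.

```python
# Minimal, illustrative scene-graph container: objects as nodes,
# relationships as directed (subject, predicate, object) edges.
class SceneGraph:
    def __init__(self):
        self.nodes = {}   # node id -> object class label
        self.edges = []   # (subject id, predicate, object id)

    def add_object(self, node_id, obj_class):
        self.nodes[node_id] = obj_class

    def add_relation(self, subj, predicate, obj):
        # Only connect objects that actually exist in the graph.
        if subj in self.nodes and obj in self.nodes:
            self.edges.append((subj, predicate, obj))

    def triples(self):
        """Human-readable (subject, predicate, object) triples."""
        return [(self.nodes[s], p, self.nodes[o]) for s, p, o in self.edges]

g = SceneGraph()
g.add_object(0, "person")
g.add_object(1, "horse")
g.add_relation(0, "riding", 1)
# g.triples() -> [("person", "riding", "horse")]
```

A relation-prediction model would populate such a structure from detector outputs; the sketch only shows the target data structure, not the transformer itself.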
The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model, and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether these models understand each other. To study this question, we propose a reconstruction task where one model generates...
Identifying objects in an image and their mutual relationships in the form of a scene graph leads to a deep understanding of image content. Despite the recent advancement of deep learning, the detection and labeling of visual object relationships remain a challenging task. This work proposes a novel local-context aware architecture named relation transformer, which exploits complex global object and edge (relation) interactions. Our hierarchical multi-head attention-based approach efficiently captures contextual dependencies between objects and predicts their relationships....
While most filtering approaches based on random finite sets have focused on improving performance, in this paper, we argue that computation times are very important in order to enable real-time applications such as pedestrian detection. Towards this goal, this paper investigates the use of OpenCL to accelerate random finite set-based Bayesian filtering on a heterogeneous system. In detail, we developed an efficient and fully-functional pedestrian-tracking system implementation, which can run under real-time constraints while offering decent...
Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation in online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions, but at the cost of quadratic memory attention. Moreover, such methods are susceptible to degraded instance features due to the above-mentioned challenges and suffer from cascading effects. The...
Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry-grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this sparsity to reduce ViT inference cost. LookupViT provides a novel...
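The general idea of exploiting token sparsity can be sketched as follows. This is a hedged illustration, not LookupViT's actual mechanism: it simply keeps the top-k tokens ranked by a naive saliency score (the L2 norm of each token embedding) so that subsequent attention runs on fewer tokens. The `prune_tokens` helper and the sample token values are invented for the example.

```python
# Illustrative token pruning: attention cost grows quadratically with
# the token count, so dropping low-saliency tokens reduces compute.
def prune_tokens(tokens, k):
    """Keep the k tokens with the largest L2 norm, preserving order."""
    def norm(t):
        return sum(v * v for v in t) ** 0.5
    ranked = sorted(range(len(tokens)), key=lambda i: norm(tokens[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore original spatial order
    return [tokens[i] for i in keep]

# Four toy 2-d token embeddings; two are near-zero (redundant).
tokens = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
kept = prune_tokens(tokens, 2)  # attention now runs on 2 tokens, not 4
```

With k tokens instead of n, the self-attention cost drops from O(n^2) to O(k^2), which is the motivation for sparsity-exploiting designs in this space.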
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of...
A comprehensive representation of an image requires understanding objects and their mutual relationships, especially in image-to-graph generation, e.g., road network extraction, blood-vessel network extraction, or scene graph generation. Traditionally, image-to-graph generation is addressed with a two-stage approach consisting of object detection followed by a separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely...
It is essential for safety-critical applications of deep neural networks to determine when new inputs are significantly different from the training distribution. In this paper, we explore the out-of-distribution (OOD) detection problem for image classification using clusters of semantically similar embeddings of the training data, and exploit the differences in distance relationships to these clusters between in-distribution and OOD data. We study the structure and separation of the embedding space and find that supervised contrastive learning leads to well-separated clusters, while...
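A minimal sketch of this kind of distance-based OOD scoring, under simplifying assumptions (it is not the paper's exact method): build one centroid per class from training embeddings, then score a test embedding by its distance to the nearest centroid, so that larger distances indicate likely OOD inputs. All embedding values and class names below are illustrative.

```python
# Toy distance-to-nearest-centroid OOD scoring over class-labeled embeddings.
def centroids(embeddings, labels):
    """Mean embedding per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc]
            for lab, acc in sums.items()}

def ood_score(x, cents):
    """Euclidean distance to the closest centroid; larger = more likely OOD."""
    def dist(a, b):
        return sum((u - w) ** 2 for u, w in zip(a, b)) ** 0.5
    return min(dist(x, c) for c in cents.values())

# Two well-separated training clusters (as supervised contrastive
# learning tends to produce), then one in-distribution and one far test point.
train = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = ["cat", "cat", "dog", "dog"]
cents = centroids(train, labels)
in_dist = ood_score([0.1, 0.0], cents)   # near the "cat" cluster
ood = ood_score([10.0, -10.0], cents)    # far from both clusters
```

Thresholding such a score gives a simple OOD detector; the better separated the clusters, the more reliable the distance gap between in- and out-of-distribution inputs.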
Bounding box supervision provides a balanced compromise between labeling effort and result quality for image segmentation. However, there exists no such work explicitly tailored to videos. Applying box-supervised image segmentation methods directly to videos produces sub-optimal solutions because they do not exploit temporal information. In this work, we propose a box-supervised video segmentation proposal network. We take advantage of intrinsic video properties by introducing a novel box-guided motion calculation pipeline...