Rajat Koner

ORCID: 0000-0003-3441-8192
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Video Analysis and Summarization
  • Visual Attention and Saliency Detection
  • Human Pose and Action Recognition
  • Adversarial Robustness in Machine Learning
  • Image Retrieval and Classification Techniques
  • Anomaly Detection Techniques and Applications
  • Image and Video Quality Assessment
  • Generative Adversarial Networks and Image Synthesis
  • Video Surveillance and Tracking Methods
  • Topic Modeling
  • Biomedical Text Mining and Ontologies
  • Language, Metaphor, and Cognition
  • Advanced Data Compression Techniques
  • Target Tracking and Data Fusion in Sensor Networks

Google (United Kingdom)
2024

DeepMind (United Kingdom)
2024

LMU Klinikum
2023

Ludwig-Maximilians-Universität München
2019-2023

Technical University of Munich
2017

Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with the various objects that are present in the image, it is an ambitious task that requires techniques from both computer vision and natural language processing. We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships in the scene. As a first step, we derive a scene graph which describes the objects as well as their attributes...

10.48550/arxiv.2007.01072 preprint EN other-oa arXiv (Cornell University) 2020-01-01
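The sequential reasoning over a scene graph described above can be illustrated with a minimal sketch. The graph layout, question form, and function names here are illustrative assumptions, not the paper's actual pipeline:

```python
# Toy scene graph: objects with attributes, plus directed relation triples.
objects = {
    "cup":   {"color": "red"},
    "table": {"color": "brown"},
    "book":  {"color": "blue"},
}
relations = [("cup", "on", "table"), ("book", "on", "table")]

def answer_what_is(relation, target):
    """Return all subjects standing in `relation` to `target`."""
    return [subj for (subj, rel, obj) in relations
            if rel == relation and obj == target]

# "What is on the table?" resolves to a traversal of the graph edges.
print(sorted(answer_what_is("on", "table")))  # ['book', 'cup']
```

Once the image is abstracted into such a graph, answering becomes a sequence of symbolic lookups rather than raw pixel reasoning.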

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel...

10.1609/aaai.v37i1.25201 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

A serious problem in image classification is that a trained model might perform well for input data that originates from the same distribution as the data available during training, but performs much worse on out-of-distribution (OOD) samples. In real-world safety-critical applications, in particular, it is important to be aware if a new data point is OOD. To date, OOD detection is typically addressed using either confidence scores, auto-encoder based reconstruction, or contrastive learning. However, the global image context has not yet...

10.48550/arxiv.2107.08976 preprint EN cc-by arXiv (Cornell University) 2021-01-01
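Of the baselines the abstract mentions, the confidence-score approach is the simplest to sketch: flag an input as OOD when the classifier's maximum softmax probability is low. This is a generic maximum-softmax-probability baseline, not the paper's own method:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def msp_score(logits):
    """Maximum softmax probability: low values flag likely-OOD inputs."""
    return max(softmax(logits))

confident = msp_score([8.0, 0.5, 0.2])   # peaked logits: in-distribution-like
uncertain = msp_score([1.1, 1.0, 0.9])   # near-uniform logits: OOD-like
```

Thresholding this score gives a detector; the abstract's point is that such local confidence cues ignore global image context.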

Visual relation detection methods rely on object information extracted from RGB images such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information about object relations, e.g. helping to detect not only spatial relations such as standing behind, but also non-spatial relations such as holding. In this work, we study the effect of using different object features, with a focus on depth maps. To enable this study, we release a new synthetic dataset, VG-Depth, as an extension to Visual Genome (VG). We note that given...

10.1109/icpr48806.2021.9412945 article EN 2020 25th International Conference on Pattern Recognition (ICPR) 2021-01-10
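A common way to let a relation classifier use both modalities, and a plausible reading of "using different features" here, is late fusion by concatenation. This is a generic sketch under that assumption, with made-up feature values and a toy linear scorer:

```python
def fuse(rgb_feat, depth_feat):
    """Late fusion by concatenation, so a downstream relation classifier
    sees both appearance (RGB) and geometry (depth) cues."""
    return rgb_feat + depth_feat  # list concatenation = channel concat

def relation_score(fused, weights):
    """Toy linear scorer over the fused feature vector."""
    return sum(f * w for f, w in zip(fused, weights))

rgb   = [0.2, 0.7, 0.1]   # pooled RGB features for an object pair (toy)
depth = [0.9, 0.3]        # pooled depth-map features for the same pair (toy)
fused = fuse(rgb, depth)
```

The depth channels give the scorer geometric evidence (e.g. relative distance to camera) that RGB features alone may not carry.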

The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the mutual interaction among the objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model node-to-node...

10.48550/arxiv.2004.06193 preprint EN cc-by arXiv (Cornell University) 2020-01-01
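The node-to-node interaction the abstract refers to is, at its core, self-attention over object embeddings. A minimal single-head sketch (toy dimensions, not the paper's architecture):

```python
import math

def node_attention(nodes):
    """Single-head scaled dot-product self-attention over node embeddings:
    every node attends to every node, yielding context-aware features."""
    d = len(nodes[0])
    out = []
    for q in nodes:
        # similarity of this node to every node, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in nodes]
        m = max(scores)                      # stabilise the softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        # output = softmax-weighted mixture of all node embeddings
        out.append([sum((wi / z) * nodes[i][j] for i, wi in enumerate(w))
                    for j in range(d)])
    return out

nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings for 3 objects
ctx = node_attention(nodes)
```

Each output row is a convex combination of all node embeddings, so every object's representation is conditioned on the rest of the scene before relations are classified.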

The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model, and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether these models understand each other. To study this question, we propose a reconstruction task where one model generates...

10.1109/iccv51070.2023.00191 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Identifying objects in an image and their mutual relationships as a scene graph leads to a deep understanding of image content. Despite the recent advancement in deep learning, the detection and labeling of visual object relationships remain a challenging task. This work proposes a novel local-context aware architecture named the relation transformer, which exploits complex global object-to-edge (relation) interactions. Our hierarchical multi-head attention-based approach efficiently captures contextual dependencies between objects and predicts their relationships....

10.48550/arxiv.2107.05448 preprint EN cc-by arXiv (Cornell University) 2021-01-01

While most filtering approaches based on random finite sets have focused on improving performance, in this paper, we argue that computation times are very important in order to enable real-time applications such as pedestrian tracking. Towards this goal, this paper investigates the use of OpenCL to accelerate a random finite set-based Bayesian filter on a heterogeneous system. In detail, we developed an efficient and fully-functional pedestrian-tracking system implementation, which can run under real-time constraints, meanwhile offering decent...

10.3390/s17040843 article EN cc-by Sensors 2017-04-12

Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation of online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions at the cost of quadratic memory attention. Moreover, they are susceptible to degraded instance features due to the above-mentioned challenges and suffer from cascading effects. The...

10.48550/arxiv.2305.17096 preprint EN cc-by arXiv (Cornell University) 2023-01-01
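A natural remedy for noisy propagation, and a plausible reading of the direction this abstract points at, is a gated residual update of each instance query: trust the current frame when it is reliable, fall back on the propagated query when it is not. The function and values below are an illustrative sketch, not the paper's exact formulation:

```python
def gated_update(prev_query, cur_query, gate):
    """Gated residual update of an instance query. A low gate (e.g. during
    occlusion, when the current frame is unreliable) keeps the propagated
    query; a high gate trusts the current frame's features."""
    return [gate * c + (1.0 - gate) * p
            for p, c in zip(prev_query, cur_query)]

prev = [0.5, 0.5]   # query propagated from earlier frames (toy values)
cur  = [1.0, 0.0]   # query decoded from the current frame (toy values)
mixed = gated_update(prev, cur, 0.5)
```

The gate blocks noisy frames from overwriting a well-established instance representation, which limits the cascading errors the abstract describes.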

Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this sparsity to reduce ViT inference cost. LookupViT provides a novel...

10.48550/arxiv.2407.12753 preprint EN arXiv (Cornell University) 2024-07-17
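The cost argument behind compressing tokens can be made concrete with a back-of-the-envelope count of attention scores per layer. The numbers are illustrative assumptions (a 14x14 ViT grid, 16 compressed tokens), not figures from the paper:

```python
# Attention-score counts for one layer (illustrative, order-of-magnitude only).
M = 196   # full token count, e.g. a 14x14 patch grid
K = 16    # small set of compressed "lookup" tokens

full_self_attention = M * M       # quadratic in M: every token vs. every token
lookup_style = K * M + K * K      # cross-attend K -> M, then process K tokens
speedup = full_self_attention / lookup_style
```

Because the heavy per-layer computation runs only on the K compressed tokens, the score count drops by roughly an order of magnitude in this toy setting, while the K tokens can still gather information from all M via cross-attention.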

The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model, and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether these models understand each other. To study this question, we propose a reconstruction task where one model generates...

10.48550/arxiv.2212.12249 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by effective alignment between visual data and the language space of...

10.48550/arxiv.2412.19304 preprint EN arXiv (Cornell University) 2024-12-26

A comprehensive representation of an image requires understanding objects and their mutual relationships, especially in image-to-graph generation tasks, e.g., road network extraction, blood-vessel network extraction, or scene graph generation. Traditionally, graph generation is addressed with a two-stage approach consisting of object detection followed by separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely...

10.48550/arxiv.2203.10202 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

It is essential for safety-critical applications of deep neural networks to determine when new inputs are significantly different from the training distribution. In this paper, we explore the out-of-distribution (OOD) detection problem for image classification using clusters of semantically similar embeddings of the training data, and exploit the differences in distance relationships to these clusters between in- and out-of-distribution data. We study the structure and separation of the embedding space and find that supervised contrastive learning leads to well-separated clusters while...

10.48550/arxiv.2203.08549 preprint EN cc-by arXiv (Cornell University) 2022-01-01
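The distance-to-cluster idea above can be sketched as a nearest-centroid score in embedding space. The embeddings and class names here are invented for illustration; the actual method builds clusters from contrastively learned features:

```python
import math

def centroid(points):
    """Mean of a list of equal-length embedding vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def ood_score(x, centroids):
    """Distance to the nearest class centroid in embedding space;
    a large score suggests an out-of-distribution input."""
    return min(math.dist(x, c) for c in centroids)

cats = [[0.0, 0.1], [0.1, 0.0]]   # toy in-distribution embeddings, class A
dogs = [[1.0, 0.9], [0.9, 1.0]]   # toy in-distribution embeddings, class B
cents = [centroid(cats), centroid(dogs)]

near = ood_score([0.05, 0.05], cents)   # close to the class-A cluster
far  = ood_score([5.0, 5.0], cents)     # far from every cluster
```

The better separated the clusters are, which is what the abstract reports for supervised contrastive learning, the sharper the gap between in- and out-of-distribution scores.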

Bounding box supervision provides a balanced compromise between labeling effort and result quality for image segmentation. However, there exists no such work explicitly tailored to videos. Applying the image segmentation methods directly to videos produces sub-optimal solutions because they do not exploit temporal information. In this work, we propose a box-supervised video object proposal network. We take advantage of intrinsic video properties by introducing a novel box-guided motion calculation pipeline...

10.56541/azwk8552 article EN 2022-08-29

Visual relation detection methods rely on object information extracted from RGB images such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information about object relations, e.g. helping to detect not only spatial relations such as standing behind, but also non-spatial relations such as holding. In this work, we study the effect of using different object features, with a focus on depth maps. To enable this study, we release a new synthetic dataset, VG-Depth, as an extension to Visual Genome (VG). We note that given...

10.48550/arxiv.1905.00966 preprint EN other-oa arXiv (Cornell University) 2019-01-01