- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Video Analysis and Summarization
- Visual Attention and Saliency Detection
- Human Pose and Action Recognition
- Adversarial Robustness in Machine Learning
- Image Retrieval and Classification Techniques
- Anomaly Detection Techniques and Applications
- Image and Video Quality Assessment
- Generative Adversarial Networks and Image Synthesis
- Video Surveillance and Tracking Methods
- Topic Modeling
- Biomedical Text Mining and Ontologies
- Language, Metaphor, and Cognition
- Advanced Data Compression Techniques
- Target Tracking and Data Fusion in Sensor Networks
Google (United Kingdom)
2024
DeepMind (United Kingdom)
2024
LMU Klinikum
2023
Ludwig-Maximilians-Universität München
2019-2023
Technical University of Munich
2017
Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with the various objects that are present in the image, it is an ambitious task that requires techniques from both computer vision and natural language processing. We propose a novel method that approaches this task by performing context-driven, sequential reasoning based on the semantic and spatial relationships of the objects in the scene. As a first step, we derive a scene graph which describes the objects in the image as well as their attributes...
Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel...
A serious problem in image classification is that a trained model might perform well for input data that originates from the same distribution as the data available during training, but performs much worse on out-of-distribution (OOD) samples. In real-world safety-critical applications, in particular, it is important to be aware if a new data point is OOD. To date, OOD detection is typically addressed using either confidence scores, auto-encoder based reconstruction, or contrastive learning. However, the global image context has not yet...
Visual relation detection methods rely on object information extracted from RGB images, such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations such as standing behind, but also non-spatial relations such as holding. In this work, we study the effect of using different object features, with a focus on depth maps. To enable this study, we release a new synthetic dataset named VG-Depth, an extension of Visual Genome (VG). We note that given...
The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances, such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the interaction among objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model node-to-node...
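The scene-graph representation described above can be illustrated with a minimal sketch: detected objects become nodes and pairwise relationships become directed, labeled edges. The class names, predicate labels, and the `SceneGraph` helper below are made up for the example and are not part of the proposed model.

```python
# Minimal, illustrative scene-graph container: objects as nodes,
# relationships as directed (subject, predicate, object) edges.
class SceneGraph:
    def __init__(self):
        self.nodes = {}   # node id -> object class label
        self.edges = []   # (subject id, predicate, object id)

    def add_object(self, node_id, obj_class):
        self.nodes[node_id] = obj_class

    def add_relation(self, subj, predicate, obj):
        # Only connect objects that actually exist in the graph.
        if subj in self.nodes and obj in self.nodes:
            self.edges.append((subj, predicate, obj))

    def triples(self):
        """Human-readable (subject, predicate, object) triples."""
        return [(self.nodes[s], p, self.nodes[o]) for s, p, o in self.edges]

g = SceneGraph()
g.add_object(0, "person")
g.add_object(1, "horse")
g.add_relation(0, "riding", 1)
# g.triples() -> [("person", "riding", "horse")]
```

A relation-prediction model would populate such a structure from detector outputs; the sketch only shows the target data structure, not the transformer itself.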
The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model, and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether these models understand each other. To study this question, we propose a reconstruction task where one model generates...
Identifying objects in an image and their mutual relationships in the form of a scene graph leads to a deep understanding of image content. Despite the recent advancement of deep learning, the detection and labeling of visual object relationships remain a challenging task. This work proposes a novel local-context aware architecture named relation transformer, which exploits complex global object and edge (relation) interactions. Our hierarchical multi-head attention-based approach efficiently captures contextual dependencies between objects and predicts their relationships....
While most filtering approaches based on random finite sets have focused on improving performance, in this paper, we argue that computation times are very important in order to enable real-time applications such as pedestrian detection. Towards this goal, this paper investigates the use of OpenCL to accelerate random finite set-based Bayesian filtering on a heterogeneous system. In detail, we developed an efficient and fully-functional pedestrian-tracking system implementation, which can run under real-time constraints while offering decent...
Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation in online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions, but at the cost of quadratic memory attention. Moreover, such methods are susceptible to degraded instance features due to the above-mentioned challenges and suffer from cascading effects. The...
Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry-grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this sparsity to reduce ViT inference cost. LookupViT provides a novel...
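The general idea of exploiting token sparsity can be sketched as follows. This is a hedged illustration, not LookupViT's actual mechanism: it simply keeps the top-k tokens ranked by a naive saliency score (the L2 norm of each token embedding) so that subsequent attention runs on fewer tokens. The `prune_tokens` helper and the sample token values are invented for the example.

```python
# Illustrative token pruning: attention cost grows quadratically with
# the token count, so dropping low-saliency tokens reduces compute.
def prune_tokens(tokens, k):
    """Keep the k tokens with the largest L2 norm, preserving order."""
    def norm(t):
        return sum(v * v for v in t) ** 0.5
    ranked = sorted(range(len(tokens)), key=lambda i: norm(tokens[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore original spatial order
    return [tokens[i] for i in keep]

# Four toy 2-d token embeddings; two are near-zero (redundant).
tokens = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
kept = prune_tokens(tokens, 2)  # attention now runs on 2 tokens, not 4
```

With k tokens instead of n, the self-attention cost drops from O(n^2) to O(k^2), which is the motivation for sparsity-exploiting designs in this space.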
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of...
A comprehensive representation of an image requires understanding objects and their mutual relationships, especially in image-to-graph generation, e.g., road network extraction, blood-vessel network extraction, or scene graph generation. Traditionally, image-to-graph generation is addressed with a two-stage approach consisting of object detection followed by a separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely...
It is essential for safety-critical applications of deep neural networks to determine when new inputs are significantly different from the training distribution. In this paper, we explore the out-of-distribution (OOD) detection problem for image classification using clusters of semantically similar embeddings of the training data, and exploit the differences in distance relationships to these clusters between in-distribution and OOD data. We study the structure and separation of the embedding space and find that supervised contrastive learning leads to well-separated clusters, while...
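A minimal sketch of this kind of distance-based OOD scoring, under simplifying assumptions (it is not the paper's exact method): build one centroid per class from training embeddings, then score a test embedding by its distance to the nearest centroid, so that larger distances indicate likely OOD inputs. All embedding values and class names below are illustrative.

```python
# Toy distance-to-nearest-centroid OOD scoring over class-labeled embeddings.
def centroids(embeddings, labels):
    """Mean embedding per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc]
            for lab, acc in sums.items()}

def ood_score(x, cents):
    """Euclidean distance to the closest centroid; larger = more likely OOD."""
    def dist(a, b):
        return sum((u - w) ** 2 for u, w in zip(a, b)) ** 0.5
    return min(dist(x, c) for c in cents.values())

# Two well-separated training clusters (as supervised contrastive
# learning tends to produce), then one in-distribution and one far test point.
train = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = ["cat", "cat", "dog", "dog"]
cents = centroids(train, labels)
in_dist = ood_score([0.1, 0.0], cents)   # near the "cat" cluster
ood = ood_score([10.0, -10.0], cents)    # far from both clusters
```

Thresholding such a score gives a simple OOD detector; the better separated the clusters, the more reliable the distance gap between in- and out-of-distribution inputs.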
Bounding box supervision provides a balanced compromise between labeling effort and result quality for image segmentation. However, there exists no such work explicitly tailored to videos. Applying box-supervised image segmentation methods directly to videos produces sub-optimal solutions because they do not exploit temporal information. In this work, we propose a box-supervised video segmentation proposal network. We take advantage of intrinsic video properties by introducing a novel box-guided motion calculation pipeline...