- Robotics and Sensor-Based Localization
- Advanced Vision and Imaging
- 3D Shape Modeling and Analysis
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- 3D Surveying and Cultural Heritage
- Computer Graphics and Visualization Techniques
- Multimodal Machine Learning Applications
- Video Surveillance and Tracking Methods
- Generative Adversarial Networks and Image Synthesis
- Domain Adaptation and Few-Shot Learning
- Robot Manipulation and Learning
- Anomaly Detection Techniques and Applications
- Image Processing and 3D Reconstruction
- Optical Measurement and Interference Techniques
- Image Processing Techniques and Applications
- Image and Object Detection Techniques
- Adversarial Robustness in Machine Learning
- Medical Image Segmentation Techniques
- Autonomous Vehicle Technology and Safety
- Image Retrieval and Classification Techniques
- Advanced Image Processing Techniques
- Remote Sensing and LiDAR Applications
- Surgical Simulation and Training
Google (Switzerland)
2019-2025
Technical University of Munich
2015-2024
Google (United States)
2019-2024
University of Bologna
2009-2022
University of Catania
2022
National Research Council
2022
Universidad de Las Palmas de Gran Canaria
2022
Menlo School
2022
Institut national de recherche en informatique et en automatique
2022
Amazon (United States)
2022
This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed...
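The reverse Huber (berHu) loss named in the abstract above can be illustrated with a minimal sketch. The piecewise definition (absolute error below a threshold `c`, scaled squared error above it) is the standard form of this loss. The function names and the choice of `c` as 20% of the largest absolute residual are assumptions made here for illustration, since the abstract does not state them.

```python
def berhu(residual, c):
    """Reverse Huber penalty for one residual.

    L1 for |x| <= c, and (x^2 + c^2) / (2c) for |x| > c.
    The two branches agree at |x| == c, so the penalty is continuous.
    """
    a = abs(residual)
    if a <= c:
        return a
    return (a * a + c * c) / (2.0 * c)


def berhu_loss(pred, target, ratio=0.2):
    """Mean berHu penalty over paired predictions and targets.

    `ratio` is an assumed hyperparameter: the threshold c is set to
    ratio * max|residual| over the given values.
    """
    residuals = [p - t for p, t in zip(pred, target)]
    c = ratio * max(abs(r) for r in residuals)
    if c == 0.0:
        # All residuals are zero, so the loss is zero as well.
        return 0.0
    return sum(berhu(r, c) for r in residuals) / len(residuals)


if __name__ == "__main__":
    predicted_depth = [1.0, 2.5, 4.0]
    true_depth = [1.1, 2.0, 6.0]
    print(berhu_loss(predicted_depth, true_depth))
```

Compared with a plain L2 loss, this form keeps a constant gradient for small residuals while still penalising large residuals quadratically.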
We present a novel method for detecting 3D model instances and estimating their 6D poses from RGB data in a single shot. To this end, we extend the popular SSD paradigm to cover the full 6D pose space and train on synthetic model data only. Our approach competes with or surpasses current state-of-the-art methods that leverage RGBD data on multiple challenging datasets. Furthermore, our method produces these results at around 10Hz, which is many times faster than the related methods. For the sake of reproducibility, we make our trained networks...
Given the recent advances in depth prediction from Convolutional Neural Networks (CNNs), this paper investigates how predicted depth maps from a deep neural network can be deployed for the goal of accurate and dense monocular reconstruction. We propose a method where CNN-predicted dense depth maps are naturally fused together with depth measurements obtained from direct monocular SLAM, based on a scheme that privileges depth prediction in image locations where monocular SLAM approaches tend to fail, e.g. along low-textured regions, and vice-versa. We demonstrate the use of depth prediction to estimate the absolute scale...
In this paper, we propose 3D point-capsule networks, an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data. 3D capsule networks arise as a direct consequence of our unified formulation of the common 3D auto-encoders. The dynamic routing scheme and the peculiar 2D latent space deployed by our capsule networks bring in improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation, as substantiated by our extensive evaluations. Moreover, it...
With the advent of new-generation depth sensors, the use of three-dimensional (3-D) data is becoming increasingly popular. As these sensors are commodity hardware and sold at low cost, a rapidly growing group of people can acquire 3-D data cheaply and in real time.
The use of robust feature descriptors is now key for many 3D tasks such as 3D object recognition and surface alignment. Many descriptors have been proposed in the literature which are based on a non-unique local Reference Frame and hence require the computation of multiple descriptions at each feature point. In this paper we show how to deploy a unique local Reference Frame to improve the accuracy and reduce the memory footprint of the well-known 3D Shape Context descriptor. We validate our proposal by means of an experimental analysis carried out on a large dataset of 3D scenes...
Motivated by the increasing availability of 3D sensors capable of delivering both shape and texture information, this paper presents a novel descriptor for feature matching in 3D data enriched with texture. The proposed approach stems from the theory of a recently proposed descriptor for 3D data which relies on shape only, and represents its generalization to the case of multiple cues associated with a 3D mesh. The proposed descriptor, dubbed CSHOT, is demonstrated to notably improve the accuracy of feature matching in challenging object recognition scenarios characterized by the presence of clutter and occlusions.
Recent advances in machine learning have led to increased interest in solving visual computing problems using methods that employ coordinate-based neural networks. These methods, which we call neural fields, parameterize physical properties of scenes or objects across space and time. They have seen widespread success in problems such as 3D shape and image synthesis, animation of human bodies, 3D reconstruction, and pose estimation. Rapid progress has led to numerous papers, but a consolidation of the discovered knowledge has not yet...
Registration is an important step when processing three-dimensional (3-D) point clouds. Applications for registration range from object modeling and tracking to simultaneous localization and mapping (SLAM). This article presents the open-source point cloud library (PCL) and the tools available for point cloud registration. The PCL incorporates methods for the initial alignment of point clouds using a variety of local shape feature descriptors, as well as methods for refining initial alignments using different variants of the well-known iterative closest point (ICP) algorithm....
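The refinement step mentioned in the abstract above, iterative closest point (ICP), can be illustrated with a minimal point-to-point sketch. To stay short and dependency-free it works on 2D points with brute-force nearest-neighbour matching, which is a simplification assumed here. It is not the PCL API, and all function names are hypothetical.

```python
import math


def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centred source point
        bx, by = qx - dx, qy - dy  # centred destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the destination centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty


def transform(points, theta, tx, ty):
    """Rotate points by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(source, target, iterations=20):
    """Alternate nearest-neighbour matching and rigid re-alignment."""
    current = list(source)
    for _ in range(iterations):
        # Match every current point to its closest target point (brute force).
        matched = [
            min(target, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            for p in current
        ]
        theta, tx, ty = best_rigid_2d(current, matched)
        current = transform(current, theta, tx, ty)
    return current


if __name__ == "__main__":
    target = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0), (-1.0, 1.0)]
    source = transform(target, 0.1, 0.1, -0.05)  # slightly misaligned copy
    print(icp_2d(source, target))
```

Like any ICP variant, this sketch only converges to the correct pose when the starting misalignment is small, which is why a descriptor-based initial alignment typically comes first.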
6D pose estimation from a single RGB image is a fundamental task in computer vision. The current top-performing deep learning-based methods rely on an indirect strategy, i.e., first establishing 2D-3D correspondences between the coordinates in the image plane and the object coordinate system, and then applying a variant of the PnP/RANSAC algorithm. However, this two-stage pipeline is not end-to-end trainable, and is thus hard to employ for many tasks requiring differentiable poses. On the other hand, methods based on direct regression are...
Many prediction tasks contain uncertainty. In some cases, uncertainty is inherent in the task itself. In future prediction, for example, many distinct outcomes are equally valid. In other cases, uncertainty arises from the way data is labeled. For example, in object detection, many objects of interest often go unlabeled, and in human pose estimation, occluded joints are often labeled with ambiguous values. In this work we focus on a principled approach for handling such scenarios. In particular, we propose a framework for reformulating existing single-prediction models...
Scene understanding has been of high interest in computer vision. It encompasses not only identifying objects in a scene, but also their relationships within the given context. With this goal, a recent line of works tackles 3D semantic segmentation and scene layout prediction. In our work we focus on scene graphs, a data structure that organizes the entities of a scene in a graph, where objects are nodes and their relationships are modeled as edges. We leverage inference on scene graphs as a way to carry out 3D scene understanding, mapping objects and their relationships. In particular, we propose a learned...
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through...
Directly regressing all 6 degrees-of-freedom (6DoF) for the object pose (i.e. the 3D rotation and translation) in a cluttered environment from a single RGB image is a challenging problem. While end-to-end methods have recently demonstrated promising results at high efficiency, they are still inferior when compared with elaborate PnP/RANSAC-based approaches in terms of pose accuracy. In this work, we address this shortcoming by means of a novel reasoning about self-occlusion, in order to establish a two-layer...
In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of observed visual primitives: states (e.g. old, cute) and objects (e.g. car, dog) in the training set. This is challenging because the same state can, for example, alter the visual appearance of a dog drastically differently from a car. As a solution, we propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features, compositional classifiers and latent representations of visual primitives in an end-to-end manner. The key to our approach is exploiting...
Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous-driving systems. Existing image- and video-based driving datasets, however, fall short of capturing the mutable nature of the real world. In this paper, we introduce the largest multi-task synthetic dataset for autonomous driving, SHIFT. It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density. Featuring a comprehensive sensor suite...
Establishing correspondences from image to 3D has been a key task of 6DoF object pose estimation for a long time. To predict pose more accurately, deeply learned dense maps replaced sparse templates. Dense methods also improved pose estimation in the presence of occlusion. More recently, researchers have shown improvements by learning object fragments as segmentation. In this work, we present a discrete descriptor, which can represent the object surface densely. By incorporating a hierarchical binary grouping, we can encode the object surface very efficiently....