Alexey Dosovitskiy

ORCID: 0000-0003-1851-0976
Research Areas
  • Advanced Vision and Imaging
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image Processing Techniques
  • Robotics and Sensor-Based Localization
  • Multimodal Machine Learning Applications
  • Autonomous Vehicle Technology and Safety
  • Reinforcement Learning in Robotics
  • Computer Graphics and Visualization Techniques
  • 3D Shape Modeling and Analysis
  • Robotic Path Planning Algorithms
  • Human Pose and Action Recognition
  • Cell Image Analysis Techniques
  • Video Surveillance and Tracking Methods
  • Robot Manipulation and Learning
  • Image Enhancement Techniques
  • Retinal Imaging and Analysis
  • Handwritten Text Recognition Techniques
  • Neural dynamics and brain function
  • Traffic control and management
  • Image Retrieval and Classification Techniques
  • Robotic Locomotion and Control
  • Sparse and Compressive Sensing Techniques

Affiliations

Google (United States)
2020-2022

Brain (Germany)
2021

Intel (Germany)
2018-2019

Intel (United States)
2017-2019

University of Freiburg
2013-2018

Laboratoire d'Informatique de Paris-Nord
2014-2015

Publications

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained...

10.48550/arxiv.2010.11929 preprint EN other-oa arXiv (Cornell University) 2020-01-01
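The patch-based recipe this abstract describes fits in a short sketch. Everything below (patch size, width, depth) is an illustrative configuration in PyTorch, not the paper's exact model:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing the image into
        # non-overlapping patches and applying one shared linear projection.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.to_patches(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```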

Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground...

10.1109/iccv.2015.316 article EN 2015-12-01
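The correlation layer mentioned here can be sketched compactly: take dot products between feature vectors of one map and spatially displaced feature vectors of the other. A naive, loop-based version with an illustrative displacement range:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    """f1, f2: (B, C, H, W) feature maps; returns (B, (2*max_disp+1)**2, H, W)."""
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)  # zero-pad so every displacement is valid
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            # dot product over the channel dimension at each spatial location
            out.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)

cost_volume = correlation(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```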

The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to quality has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second,...

10.1109/cvpr.2017.179 article EN 2017-07-01
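One ingredient of the stacked refinement this paper introduces is warping the second image by the current flow estimate, so that subsequent networks only estimate a residual. A minimal backward-warping helper (hypothetical, for illustration; not the paper's code):

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img, flow):
    """img: (B, C, H, W); flow: (B, 2, H, W) in pixels. Backward-warp img by flow."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow  # sample coords
    # normalize coordinates to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(img, torch.stack([grid_x, grid_y], dim=-1), align_corners=True)

warped = warp_by_flow(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))  # zero flow: identity
```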

We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions. We use CARLA to study the performance of three approaches to driving: a classic modular...

10.48550/arxiv.1711.03938 preprint EN other-oa arXiv (Cornell University) 2017-01-01
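The simulator is driven through a Python client API. A minimal usage sketch, assuming a CARLA server is already running on localhost:2000 (API details vary across CARLA versions):

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn one vehicle from the blueprint library at a predefined spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)
vehicle.set_autopilot(True)  # hand control to the built-in traffic manager
```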

Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple...

10.48550/arxiv.1412.6806 preprint EN other-oa arXiv (Cornell University) 2014-01-01
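The paper's central substitution is easy to state in code: a max-pooling stage and a stride-2 convolution yield the same downsampling factor. Layer sizes below are illustrative:

```python
import torch.nn as nn

# conventional block: convolution followed by max-pooling
pooled = nn.Sequential(nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

# all-convolutional replacement: the same convolution with increased stride
allconv = nn.Sequential(nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.ReLU())
```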

Recent work has shown that optical flow estimation can be formulated as a supervised learning task and successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides...

10.1109/cvpr.2016.438 preprint EN 2016-06-01

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing"...

10.48550/arxiv.2105.01601 preprint EN other-oa arXiv (Cornell University) 2021-01-01
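The two layer types named in the abstract, token-mixing across patches and channel-mixing per patch, can be sketched directly; all dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(   # mixes spatial information across patches
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.channel_mlp = nn.Sequential( # mixes per-location features
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):  # x: (B, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock()(torch.randn(2, 196, 512))
```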

Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots has mainly been limited to simulation, and only few and comparably simple examples have been deployed on real systems. The primary reason is that training with real robots,...

10.1126/scirobotics.aau5872 article EN Science Robotics 2019-01-17

We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multi-layer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings, it is incapable of modeling many ubiquitous, real-world phenomena in uncontrolled images, such as variable illumination or...

10.1109/cvpr46437.2021.00713 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
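The NeRF backbone referenced here reduces to an MLP from 3D coordinates to density and color. A caricature of that mapping (view-direction input, positional encoding, and NeRF-W's appearance embeddings omitted; sizes illustrative):

```python
import torch
import torch.nn as nn

nerf_mlp = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),  # outputs per point: (density, r, g, b) before activations
)

density_rgb = nerf_mlp(torch.rand(1024, 3))  # one query per sampled 3D point
```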

Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate...

10.1109/icra.2018.8460487 article EN 2018-05-01
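One common realization of command-conditioning in this line of work is a branched head: the high-level command selects which output branch produces the action on top of a shared perception backbone. A sketch with illustrative sizes, not the paper's exact network:

```python
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    COMMANDS = ["follow_lane", "turn_left", "turn_right", "go_straight"]

    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # one action branch per command; each predicts (steer, throttle, brake)
        self.branches = nn.ModuleList(nn.Linear(feat_dim, 3) for _ in self.COMMANDS)

    def forward(self, image, command_idx):
        feats = self.backbone(image)
        return self.branches[command_idx](feats)  # command routes the computation

action = BranchedPolicy()(torch.randn(1, 3, 88, 200), command_idx=1)  # "turn_left"
```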

We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on regular voxel grids, the architecture does not have cubic complexity. This allows representing much higher resolution outputs with a limited memory budget....

10.1109/iccv.2017.230 article EN 2017-10-01

In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images, and confidence of the matching. A crucial component of the approach is a training loss based on...

10.1109/cvpr.2017.596 preprint EN 2017-07-01

We train a generative convolutional neural network which is able to generate images of objects given object type, viewpoint, and color. We train the network in a supervised manner on a dataset of rendered 3D chair models. Our experiments show that the network does not merely learn all images by heart, but rather finds a meaningful representation of a 3D chair model allowing it to assess the similarity of different chairs, interpolate between given viewpoints to generate missing ones, or invent new chair styles by interpolating between chairs from the training set. Such a network can be used to find correspondences...

10.1109/cvpr.2015.7298761 article EN 2015-06-01

Deep convolutional networks have proven to be very successful in learning task specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges when approaching a new task. In this paper, we aim for generic feature learning and present an approach for training a convolutional network using only unlabeled data. To this end, we train...

10.1109/tpami.2015.2496141 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2015-10-29
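The abstract is cut off just before the mechanism; the paper's approach trains the network to discriminate between surrogate classes, each formed by augmenting a single unlabeled exemplar. A sketch of surrogate-class batch construction, with illustrative augmentation parameters:

```python
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.5, 1.0)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomHorizontalFlip(),
])

def surrogate_batch(exemplars, copies=8):
    """exemplars: list of PIL images. Returns augmented views plus surrogate labels."""
    views, labels = [], []
    for label, img in enumerate(exemplars):
        for _ in range(copies):
            views.append(TF.to_tensor(augment(img)))
            labels.append(label)  # surrogate class = index of the source exemplar
    return torch.stack(views), torch.tensor(labels)

views, labels = surrogate_batch([Image.new("RGB", (64, 64)) for _ in range(4)])
```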

Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations, our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features. Inverting...

10.1109/cvpr.2016.522 article EN 2016-06-01
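The training setup this describes is: fix a feature extractor Φ and train an up-convolutional decoder D to minimize ||D(Φ(x)) − x||. A skeletal version, with a tiny stand-in decoder and random tensors in place of real features and images:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(  # "up-convolutional": upsampling via transposed convolutions
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

features = torch.randn(8, 256, 8, 8)  # stand-in for precomputed features Φ(x)
images = torch.randn(8, 3, 64, 64)    # stand-in for the corresponding source images
loss = nn.functional.mse_loss(decoder(features), images)
loss.backward()  # trains the decoder only; the feature extractor stays fixed
```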

Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem...

10.48550/arxiv.1807.06757 preprint EN cc-by arXiv (Cornell University) 2018-01-01
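The truncated abstract stops before the report's concrete recommendations; one of them is the Success weighted by Path Length (SPL) metric, SPL = (1/N) Σᵢ Sᵢ · lᵢ / max(pᵢ, lᵢ), where Sᵢ marks episode success, lᵢ is the shortest-path length, and pᵢ is the length of the agent's path. A direct implementation:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    terms = [
        s * l / max(p, l)  # full credit only for succeeding via the shortest path
        for s, l, p in zip(successes, shortest_lengths, taken_lengths)
    ]
    return sum(terms) / len(terms)

print(spl([1, 1, 0], [10.0, 5.0, 8.0], [12.0, 5.0, 20.0]))  # ≈ 0.611
```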

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. [37] showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at...

10.1109/cvpr.2017.374 article EN 2017-07-01
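The gradient-ascent procedure this paper builds on is easy to sketch: ascend on a latent code z to maximize a chosen class neuron of a classifier applied to G(z). Below, G and the classifier are random stand-in networks, and the paper's added prior on the latent code is omitted:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 3 * 32 * 32), nn.Tanh())             # toy generator
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier

z = torch.randn(1, 100, requires_grad=True)
opt = torch.optim.SGD([z], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    image = G(z).view(1, 3, 32, 32)
    activation = classifier(image)[0, 7]  # neuron for the target class
    (-activation).backward()              # ascend by minimizing the negative
    opt.step()
```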

Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification...

10.48550/arxiv.1602.02644 preprint EN other-oa arXiv (Cornell University) 2016-01-01
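The core idea is straightforward to sketch: compare deep features rather than pixels. The choice of a torchvision VGG16 slice as the comparator below is illustrative, not the paper's exact feature network:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# frozen pretrained comparator: VGG16 features up to relu3_3 (weights are downloaded)
features = vgg16(weights="DEFAULT").features[:16].eval()
for p in features.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated, target):
    # distance in feature space instead of pixel space
    return F.mse_loss(features(generated), features(target))

loss = perceptual_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```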

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking...

10.48550/arxiv.2108.08810 preprint EN other-oa arXiv (Cornell University) 2021-01-01
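Comparisons of this kind rest on a representation-similarity measure; the one used in this line of work is centered kernel alignment (CKA). A minimal linear-CKA implementation in NumPy, for orientation:

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activations for the same n examples."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

print(linear_cka(np.random.randn(100, 64), np.random.randn(100, 32)))
```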