Alexey Dosovitskiy

ORCID: 0000-0003-1851-0976
Research Areas
  • Advanced Vision and Imaging
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image Processing Techniques
  • Robotics and Sensor-Based Localization
  • Multimodal Machine Learning Applications
  • Autonomous Vehicle Technology and Safety
  • Reinforcement Learning in Robotics
  • Computer Graphics and Visualization Techniques
  • 3D Shape Modeling and Analysis
  • Robotic Path Planning Algorithms
  • Human Pose and Action Recognition
  • Cell Image Analysis Techniques
  • Video Surveillance and Tracking Methods
  • Robot Manipulation and Learning
  • Image Enhancement Techniques
  • Retinal Imaging and Analysis
  • Handwritten Text Recognition Techniques
  • Neural dynamics and brain function
  • Traffic control and management
  • Image Retrieval and Classification Techniques
  • Robotic Locomotion and Control
  • Sparse and Compressive Sensing Techniques

Affiliations

Google (United States)
2020-2022

Brain (Germany)
2021

Intel (Germany)
2018-2019

Intel (United States)
2017-2019

University of Freiburg
2013-2018

Laboratoire d'Informatique de Paris-Nord
2014-2015

Publications

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained...

10.48550/arxiv.2010.11929 preprint EN other-oa arXiv (Cornell University) 2020-01-01
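The patch-based recipe this abstract describes fits in a short sketch. Everything below (patch size, width, depth) is an illustrative configuration in PyTorch, not the paper's exact model:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing the image into
        # non-overlapping patches and applying one shared linear projection.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.to_patches(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```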

Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground...

10.1109/iccv.2015.316 article EN 2015-12-01
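The correlation layer mentioned here can be sketched compactly: take dot products between feature vectors of one map and spatially displaced feature vectors of the other. A naive, loop-based version with an illustrative displacement range:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    """f1, f2: (B, C, H, W) feature maps; returns (B, (2*max_disp+1)**2, H, W)."""
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)  # zero-pad so every displacement is valid
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            # dot product over the channel dimension at each spatial location
            out.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)

cost_volume = correlation(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```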

The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to quality has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second,...

10.1109/cvpr.2017.179 article EN 2017-07-01
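One ingredient of the stacked refinement this paper introduces is warping the second image by the current flow estimate, so that subsequent networks only estimate a residual. A minimal backward-warping helper (hypothetical, for illustration; not the paper's code):

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img, flow):
    """img: (B, C, H, W); flow: (B, 2, H, W) in pixels. Backward-warp img by flow."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow  # sample coords
    # normalize coordinates to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(img, torch.stack([grid_x, grid_y], dim=-1), align_corners=True)

warped = warp_by_flow(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))  # zero flow: identity
```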

We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions. We use CARLA to study the performance of three approaches to driving: a classic modular...

10.48550/arxiv.1711.03938 preprint EN other-oa arXiv (Cornell University) 2017-01-01
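The simulator is driven through a Python client API. A minimal usage sketch, assuming a CARLA server is already running on localhost:2000 (API details vary across CARLA versions):

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn one vehicle from the blueprint library at a predefined spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)
vehicle.set_autopilot(True)  # hand control to the built-in traffic manager
```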

Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple...

10.48550/arxiv.1412.6806 preprint EN other-oa arXiv (Cornell University) 2014-01-01
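The paper's central substitution is easy to state in code: a max-pooling stage and a stride-2 convolution yield the same downsampling factor. Layer sizes below are illustrative:

```python
import torch.nn as nn

# conventional block: convolution followed by max-pooling
pooled = nn.Sequential(nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

# all-convolutional replacement: the same convolution with increased stride
allconv = nn.Sequential(nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.ReLU())
```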

Recent work has shown that optical flow estimation can be formulated as a supervised learning task and successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides...

10.1109/cvpr.2016.438 preprint EN 2016-06-01

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing"...

10.48550/arxiv.2105.01601 preprint EN other-oa arXiv (Cornell University) 2021-01-01
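The two layer types named in the abstract, token-mixing across patches and channel-mixing per patch, can be sketched directly; all dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(   # mixes spatial information across patches
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.channel_mlp = nn.Sequential( # mixes per-location features
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):  # x: (B, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock()(torch.randn(2, 196, 512))
```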

Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots has mainly been limited to simulation, and only few and comparably simple examples have been deployed on real systems. The primary reason is that training with real robots,...

10.1126/scirobotics.aau5872 article EN Science Robotics 2019-01-17

We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multi-layer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings, it is incapable of modeling many ubiquitous, real-world phenomena in uncontrolled images, such as variable illumination or...

10.1109/cvpr46437.2021.00713 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
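The NeRF backbone referenced here reduces to an MLP from 3D coordinates to density and color. A caricature of that mapping (view-direction input, positional encoding, and NeRF-W's appearance embeddings omitted; sizes illustrative):

```python
import torch
import torch.nn as nn

nerf_mlp = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),  # outputs per point: (density, r, g, b) before activations
)

density_rgb = nerf_mlp(torch.rand(1024, 3))  # one query per sampled 3D point
```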

Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate...

10.1109/icra.2018.8460487 article EN 2018-05-01
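One common realization of command-conditioning in this line of work is a branched head: the high-level command selects which output branch produces the action on top of a shared perception backbone. A sketch with illustrative sizes, not the paper's exact network:

```python
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    COMMANDS = ["follow_lane", "turn_left", "turn_right", "go_straight"]

    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # one action branch per command; each predicts (steer, throttle, brake)
        self.branches = nn.ModuleList(nn.Linear(feat_dim, 3) for _ in self.COMMANDS)

    def forward(self, image, command_idx):
        feats = self.backbone(image)
        return self.branches[command_idx](feats)  # command routes the computation

action = BranchedPolicy()(torch.randn(1, 3, 88, 200), command_idx=1)  # "turn_left"
```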

We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on regular voxel grids, the architecture does not have cubic complexity. This allows representing much higher resolution outputs with a limited memory budget....

10.1109/iccv.2017.230 article EN 2017-10-01

In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images, and confidence of the matching. A crucial component of the approach is a training loss based on...

10.1109/cvpr.2017.596 preprint EN 2017-07-01

We train a generative convolutional neural network which is able to generate images of objects given object type, viewpoint, and color. We train the network in a supervised manner on a dataset of rendered 3D chair models. Our experiments show that the network does not merely learn all images by heart, but rather finds a meaningful representation of a 3D chair model allowing it to assess the similarity of different chairs, interpolate between given viewpoints to generate missing ones, or invent new chair styles by interpolating between chairs from the training set. Such a network can be used to find correspondences...

10.1109/cvpr.2015.7298761 article EN 2015-06-01

Deep convolutional networks have proven to be very successful in learning task specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges when approaching a new task. In this paper, we aim for generic feature learning and present an approach for training a convolutional network using only unlabeled data. To this end, we train...

10.1109/tpami.2015.2496141 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2015-10-29
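The abstract is cut off just before the mechanism; the paper's approach trains the network to discriminate between surrogate classes, each formed by augmenting a single unlabeled exemplar. A sketch of surrogate-class batch construction, with illustrative augmentation parameters:

```python
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.5, 1.0)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomHorizontalFlip(),
])

def surrogate_batch(exemplars, copies=8):
    """exemplars: list of PIL images. Returns augmented views plus surrogate labels."""
    views, labels = [], []
    for label, img in enumerate(exemplars):
        for _ in range(copies):
            views.append(TF.to_tensor(augment(img)))
            labels.append(label)  # surrogate class = index of the source exemplar
    return torch.stack(views), torch.tensor(labels)

views, labels = surrogate_batch([Image.new("RGB", (64, 64)) for _ in range(4)])
```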

Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations, our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features. Inverting...

10.1109/cvpr.2016.522 article EN 2016-06-01
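The training setup this describes is: fix a feature extractor Φ and train an up-convolutional decoder D to minimize ||D(Φ(x)) − x||. A skeletal version, with a tiny stand-in decoder and random tensors in place of real features and images:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(  # "up-convolutional": upsampling via transposed convolutions
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

features = torch.randn(8, 256, 8, 8)  # stand-in for precomputed features Φ(x)
images = torch.randn(8, 3, 64, 64)    # stand-in for the corresponding source images
loss = nn.functional.mse_loss(decoder(features), images)
loss.backward()  # trains the decoder only; the feature extractor stays fixed
```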

Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem...

10.48550/arxiv.1807.06757 preprint EN cc-by arXiv (Cornell University) 2018-01-01
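The truncated abstract stops before the report's concrete recommendations; one of them is the Success weighted by Path Length (SPL) metric, SPL = (1/N) Σᵢ Sᵢ · lᵢ / max(pᵢ, lᵢ), where Sᵢ marks episode success, lᵢ is the shortest-path length, and pᵢ is the length of the agent's path. A direct implementation:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    terms = [
        s * l / max(p, l)  # full credit only for succeeding via the shortest path
        for s, l, p in zip(successes, shortest_lengths, taken_lengths)
    ]
    return sum(terms) / len(terms)

print(spl([1, 1, 0], [10.0, 5.0, 8.0], [12.0, 5.0, 20.0]))  # ≈ 0.611
```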

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. [37] showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at...

10.1109/cvpr.2017.374 article EN 2017-07-01
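The gradient-ascent procedure this paper builds on is easy to sketch: ascend on a latent code z to maximize a chosen class neuron of a classifier applied to G(z). Below, G and the classifier are random stand-in networks, and the paper's added prior on the latent code is omitted:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 3 * 32 * 32), nn.Tanh())             # toy generator
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier

z = torch.randn(1, 100, requires_grad=True)
opt = torch.optim.SGD([z], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    image = G(z).view(1, 3, 32, 32)
    activation = classifier(image)[0, 7]  # neuron for the target class
    (-activation).backward()              # ascend by minimizing the negative
    opt.step()
```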

Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification...

10.48550/arxiv.1602.02644 preprint EN other-oa arXiv (Cornell University) 2016-01-01
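The core idea is straightforward to sketch: compare deep features rather than pixels. The choice of a torchvision VGG16 slice as the comparator below is illustrative, not the paper's exact feature network:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# frozen pretrained comparator: VGG16 features up to relu3_3 (weights are downloaded)
features = vgg16(weights="DEFAULT").features[:16].eval()
for p in features.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated, target):
    # distance in feature space instead of pixel space
    return F.mse_loss(features(generated), features(target))

loss = perceptual_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```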

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking...

10.48550/arxiv.2108.08810 preprint EN other-oa arXiv (Cornell University) 2021-01-01
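Comparisons of this kind rest on a representation-similarity measure; the one used in this line of work is centered kernel alignment (CKA). A minimal linear-CKA implementation in NumPy, for orientation:

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activations for the same n examples."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

print(linear_cka(np.random.randn(100, 64), np.random.randn(100, 32)))
```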