Hugo Touvron

ORCID: 0000-0003-1678-392X
Research Areas
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Cell Image Analysis Techniques
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • CCD and CMOS Imaging Sensors
  • Image Processing Techniques and Applications
  • Generative Adversarial Networks and Image Synthesis
  • Brain Tumor Detection and Classification
  • COVID-19 diagnosis using AI
  • Advanced Image Processing Techniques
  • Digital Imaging for Blood Diseases
  • Visual Attention and Saliency Detection
  • Topic Modeling
  • Anomaly Detection Techniques and Applications
  • Currency Recognition and Detection
  • Image and Object Detection Techniques
  • Medical Image Segmentation Techniques
  • Natural Language Processing Techniques
  • Speech Recognition and Synthesis
  • Machine Learning in Healthcare
  • Human Pose and Action Recognition
  • Visual and Cognitive Learning Processes
  • Advanced Memory and Neural Computing
  • Model-Driven Software Engineering Techniques

Hong Kong Polytechnic University
2023

University of the Basque Country
2023

Nokia (United Kingdom)
2023

Sorbonne Université
2021-2023

Bangalore University
2023

Sorbonne University Abu Dhabi
2021-2023

Meta (Israel)
2020-2021

Université Paris Cité
2021

Université de Strasbourg
2019

Centre National de la Recherche Scientifique
2019

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3%...

10.1109/iccv48922.2021.00951 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
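
The k-NN evaluation mentioned above can be illustrated with a short sketch: frozen features from the pretrained backbone are compared by cosine similarity, and the nearest neighbours vote, weighted by their similarity. This is a minimal illustration written from the abstract; the feature tensors, k = 20, and the similarity-weighted voting are assumptions, not the paper's exact protocol.

```python
# Minimal sketch of weighted k-NN classification on frozen features.
# Tensors, k, and the weighting scheme are illustrative assumptions.
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20, num_classes=1000):
    # Cosine similarity between test and train features.
    train_feats = torch.nn.functional.normalize(train_feats, dim=1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()         # (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=1)   # k nearest neighbours
    topk_labels = train_labels[topk_idx]        # (n_test, k)
    # Weighted vote: each neighbour contributes its similarity to its class.
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, topk_sims)
    return votes.argmax(dim=1)
```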

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

10.48550/arxiv.2302.13971 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work...

10.48550/arxiv.2307.09288 preprint EN other-oa arXiv (Cornell University) 2023-01-01
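
As a hedged usage sketch (not part of the paper itself), a Llama 2-Chat checkpoint can be loaded through the Hugging Face transformers API; the model identifier below is illustrative and assumes you have been granted access to the released weights.

```python
# Illustrative loading/generation sketch; the checkpoint name is an
# assumption and requires access approval on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Explain residual networks in one sentence.",
                   return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```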

Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose...

10.1109/iccv48922.2021.00010 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
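
One of the two architecture changes this paper introduces is LayerScale, a learnable per-channel scaling of each residual branch initialised to a small value. The sketch below is written from memory of the paper's description; the class name and the 1e-4 initialisation are illustrative, not verbatim from the paper.

```python
# Hedged sketch of LayerScale: each residual branch (attention or MLP)
# is multiplied by a small learnable per-channel scale. Names and the
# init value are assumptions.
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    def __init__(self, dim, sublayer, init_value=1e-4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer  # e.g. self-attention or an MLP
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        # Residual branch scaled channel-wise by gamma.
        return x + self.gamma * self.sublayer(self.norm(x))
```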

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information...

10.1109/iccv48922.2021.01204 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
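
The attention bias mentioned above replaces explicit positional embeddings with a learned per-head scalar, indexed by the relative offset of each query/key pair and added to the attention logits. The following is a simplified reading of that idea; grid size, the use of absolute offsets, and zero initialisation are assumptions.

```python
# Simplified sketch of a learned per-head attention bias indexed by
# relative position. Grid size and initialisation are assumptions.
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    def __init__(self, num_heads, resolution):
        super().__init__()
        points = [(i, j) for i in range(resolution) for j in range(resolution)]
        offsets, idxs = {}, []
        for p in points:
            row = []
            for q in points:
                off = (abs(p[0] - q[0]), abs(p[1] - q[1]))
                if off not in offsets:
                    offsets[off] = len(offsets)
                row.append(offsets[off])
            idxs.append(row)
        self.register_buffer("idxs", torch.tensor(idxs))  # (N, N)
        self.bias = nn.Parameter(torch.zeros(num_heads, len(offsets)))

    def forward(self, attn_logits):
        # attn_logits: (batch, heads, N, N); add the learned bias per head.
        return attn_logits + self.bias[:, self.idxs]
```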

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...

10.1109/tpami.2022.3206148 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-09-12
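
The alternating structure described above maps directly to code: one linear layer mixing patches, one two-layer MLP mixing channels, each wrapped in a residual connection. The sketch below simplifies away the paper's Affine normalisation; dimensions are illustrative.

```python
# Compact sketch of a ResMLP block: cross-patch linear layer plus
# per-patch two-layer MLP, each residual. Normalisation is simplified.
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    def __init__(self, num_patches, dim, hidden_dim):
        super().__init__()
        self.cross_patch = nn.Linear(num_patches, num_patches)  # patches interact
        self.cross_channel = nn.Sequential(                     # channels interact
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        x = x + self.cross_patch(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.cross_channel(x)
        return x
```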

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following...

10.1088/1742-5468/ac9830 article EN Journal of Statistical Mechanics Theory and Experiment 2022-11-01

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...

10.48550/arxiv.2106.09681 preprint EN cc-by arXiv (Cornell University) 2021-01-01
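
The "transposed" attention can be sketched as follows: queries and keys are L2-normalised and attention is computed between feature channels rather than tokens, giving a d×d attention map whose cost is linear in sequence length. The paper's learned temperature is omitted here, so treat this as a simplified reading of the abstract.

```python
# Simplified cross-covariance ("transposed") attention: the attention map
# lives in channel space, so cost is linear in the number of tokens.
# The learned temperature is omitted as a simplification.
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v):
    # q, k, v: (batch, heads, tokens, head_dim)
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    # Channel-by-channel attention map: (batch, heads, head_dim, head_dim).
    attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)
    # Mix the channels of the values; the token count is untouched.
    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)
```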

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements...

10.48550/arxiv.2308.12950 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The influential Residual Networks designed by He et al. remain the gold-standard architecture in numerous scientific publications. They typically serve as the default architecture in studies, or as baselines when new architectures are proposed. Yet there has been significant progress on best practices for training neural networks since the inception of the ResNet architecture in 2015. Novel optimization and data-augmentation procedures have increased the effectiveness of the training recipes. In this paper, we re-evaluate the performance of the vanilla ResNet-50 when trained with a...

10.48550/arxiv.2110.00476 preprint EN other-oa arXiv (Cornell University) 2021-01-01
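
A generic sketch of the kind of modernised ingredients such recipes combine is shown below: label smoothing, mixup-style augmentation, and a cosine learning-rate schedule. All hyperparameters are illustrative, not the paper's tuned values.

```python
# Generic modernised-recipe ingredients for a vanilla ResNet-50.
# Hyperparameters are illustrative assumptions, not the paper's recipe.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9,
                            weight_decay=2e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

def mixup(images, labels, alpha=0.2, num_classes=1000):
    # Blend random pairs of images and their one-hot labels.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    return mixed, lam * onehot + (1 - lam) * onehot[perm]

images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
mixed, soft_targets = mixup(images, labels)
loss = criterion(model(mixed), soft_targets)  # soft targets are supported
```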

Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the typical size of the objects seen by the classifier at train and test time. We experimentally validate that, for a target test resolution, using a lower train resolution offers better classification at test time. We then propose a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ. It involves only a computationally cheap fine-tuning...

10.48550/arxiv.1906.06423 preprint EN other-oa arXiv (Cornell University) 2019-01-01
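
The computationally cheap fine-tuning mentioned above can be sketched as follows: train at a lower resolution, then briefly fine-tune only the last layers at the test resolution. The model, resolutions, and the choice to unfreeze only the classifier are illustrative assumptions.

```python
# Hedged sketch of test-resolution fine-tuning: freeze the backbone,
# fine-tune the classifier at the (higher) test resolution. Model,
# resolutions, and unfrozen layers are assumptions.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None)  # assume trained at 224px

for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():       # re-enable just the classifier
    p.requires_grad = True

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)
images = torch.randn(8, 3, 384, 384)  # batch at the test resolution
labels = torch.randint(0, 1000, (8,))
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```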

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1%...

10.48550/arxiv.2012.12877 preprint EN other-oa arXiv (Cornell University) 2020-01-01
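
This paper is best known for its token-based distillation procedure; a simplified sketch of hard-label distillation in that spirit follows. The dedicated distillation token itself is omitted, and the equal 0.5/0.5 weighting is an assumption for illustration.

```python
# Simplified hard-label distillation objective: half the loss follows the
# ground-truth labels, half follows the teacher's hard predictions. The
# distillation token and exact weighting are simplified away.
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_logits, distill_logits,
                           teacher_logits, labels):
    ce = F.cross_entropy(student_logits, labels)          # true labels
    teacher_labels = teacher_logits.argmax(dim=1)         # teacher decisions
    kd = F.cross_entropy(distill_logits, teacher_labels)  # imitate teacher
    return 0.5 * ce + 0.5 * kd
```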

Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than Imagenet. Our study shows that denoising autoencoders, such as the BEiT variant we introduce in this paper, are more robust to the type...

10.48550/arxiv.2112.10740 preprint EN cc-by arXiv (Cornell University) 2021-01-01
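
A generic sketch of denoising-autoencoder pre-training on patches follows: a random subset of patch embeddings is masked, and the model is trained to reconstruct them. This illustrates the general idea only; the paper's BEiT variant differs in its prediction target and masking strategy.

```python
# Generic masked-patch denoising autoencoder: corrupt a random subset of
# patch embeddings and reconstruct them. An illustration of the idea,
# not the paper's exact variant.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=4)
decoder = nn.Linear(192, 192)            # predict the original embedding
mask_token = nn.Parameter(torch.zeros(192))

patches = torch.randn(8, 196, 192)       # embedded image patches
mask = torch.rand(8, 196) < 0.5          # mask half of the patches
corrupted = torch.where(mask.unsqueeze(-1), mask_token, patches)
pred = decoder(encoder(corrupted))
loss = ((pred - patches)[mask] ** 2).mean()  # reconstruct masked patches
```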

This paper tackles the problem of learning a finer representation than the one provided by training labels. This enables fine-grained category retrieval of images in a collection annotated with coarse labels only. Our network is learned with a nearest-neighbor classifier objective, and an instance loss inspired by self-supervised learning. By jointly leveraging the coarse labels and the underlying fine-grained latent space, it significantly improves the accuracy of category-level retrieval methods. Our strategy outperforms all competing methods for retrieving or classifying...

10.1109/iccv48922.2021.00091 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information...

10.48550/arxiv.2104.01136 preprint EN other-oa arXiv (Cornell University) 2021-01-01

This paper provides an extensive analysis of the performance of the EfficientNet image classifiers with several recent training procedures, in particular one that corrects the discrepancy between train and test images. The resulting network, called FixEfficientNet, significantly outperforms the initial architecture with the same number of parameters. For instance, our FixEfficientNet-B0 trained without additional data achieves 79.3% top-1 accuracy on ImageNet with 5.3M parameters, which is a +0.5% absolute improvement over the Noisy student...

10.48550/arxiv.2003.08237 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Accurately quantifying cellular morphology at scale could substantially empower existing single-cell approaches. However, measuring cell morphology remains an active field of research, which has inspired multiple computer vision algorithms over the years. Here, we show that DINO, a vision-transformer based, self-supervised algorithm, has a remarkable ability for learning rich representations of cellular morphology without manual annotations or any other type of supervision. We evaluate DINO on a wide variety of tasks across three...

10.1101/2023.06.16.545359 preprint EN cc-by-nd bioRxiv (Cold Spring Harbor Laboratory) 2023-06-18

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all the layers. It yields surprisingly competitive trade-offs...

10.48550/arxiv.2112.13692 preprint EN other-oa arXiv (Cornell University) 2021-01-01
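
The attention-based aggregation layer can be sketched as a single learned query attending over the patch features, so that the attention weights expose which patches drive the classification decision. Dimensions and the single-head choice below are illustrative assumptions.

```python
# Sketch of attention-based pooling replacing global average pooling:
# one learned query attends over patch features; the returned weights
# indicate patch importance. Dimensions and head count are assumptions.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):  # feats: (batch, num_patches, dim)
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, weights = self.attn(q, feats, feats)  # weights: patch importance
        return self.head(pooled.squeeze(1)), weights
```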