- Domain Adaptation and Few-Shot Learning
- Social Sciences and Governance
- Health, Medicine and Society
- Healthcare Systems and Practices
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Social Policies and Family
- Cell Image Analysis Techniques
- Education, sociology, and vocational training
- Aging, Elder Care, and Social Issues
- Digital Economy and Work Transformation
- Legal and Labor Studies
- COVID-19 diagnosis using AI
- French Urban and Social Studies
- Human Pose and Action Recognition
- Colorectal Cancer Screening and Detection
- Image Retrieval and Classification Techniques
- Video Surveillance and Tracking Methods
- Workplace Health and Well-being
- Natural Language Processing Techniques
- Wikis in Education and Collaboration
- International Labor and Employment Law
- Information Technology and Learning
- Occupational Health and Safety Research
- Google (United States): 2023-2024
- Université de Lille: 2012-2024
- Adrian College: 2023
- Directorate of Medicinal and Aromatic Plants Research: 2023
- Brain (Germany): 2023
- Centre de Théorie et Analyse du Droit: 2018-2022
- Centre de Recherche Droits et Perspectives du droit: 2018-2022
- Weatherford College: 2021
- Meta (Israel): 2019-2021
- Université de Bordeaux: 2014
In this paper, we question if self-supervised learning provides new properties to Vision Transformers (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3%...
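A minimal sketch of the kind of k-NN evaluation the abstract refers to, assuming a cosine-similarity weighted vote over the k nearest frozen training features; function name, the temperature, and all hyper-parameters are illustrative, not the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats,
                 k=20, num_classes=1000, temperature=0.07):
    # L2-normalise so the dot product is cosine similarity.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T                # (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=1)        # k nearest neighbours
    topk_labels = train_labels[topk_idx]             # (n_test, k)
    weights = (topk_sims / temperature).exp()        # similarity-weighted vote
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)      # accumulate votes per class
    return votes.argmax(dim=1)
```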
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency...
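The cluster-and-enforce-consistency idea can be sketched as a swapped-prediction loss: codes computed for one view are predicted from the other. This is a simplified reading of the abstract; sinkhorn() stands in for the equipartition step and the prototype matrix and hyper-parameters are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # Iteratively normalise rows and columns so clusters are used equally.
    q = torch.exp(scores / eps).T                    # (n_prototypes, batch)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= q.shape[0]   # rows
        q /= q.sum(dim=0, keepdim=True); q /= q.shape[1]   # columns
    return (q * q.shape[1]).T                        # (batch, n_prototypes)

def swav_loss(z1, z2, prototypes, temp=0.1):
    # z1, z2: L2-normalised embeddings of two views, (batch, dim);
    # prototypes: (n_prototypes, dim), also L2-normalised.
    s1, s2 = z1 @ prototypes.T, z2 @ prototypes.T
    q1, q2 = sinkhorn(s1), sinkhorn(s2)              # soft cluster assignments
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    # Swapped prediction: view 1 predicts view 2's code, and vice versa.
    return -0.5 * ((q2 * p1).sum(dim=1) + (q1 * p2).sum(dim=1)).mean()
```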
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...
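A rough PyTorch rendering of the residual block the abstract describes: one linear layer mixing patches, then a two-layer MLP mixing channels; the affine pre-normalisation and all dimensions here are a sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    # Simple learnable rescale-and-shift in place of normalisation.
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, n_patches, dim, expansion=4):
        super().__init__()
        self.norm1, self.norm2 = Affine(dim), Affine(dim)
        self.patch_mix = nn.Linear(n_patches, n_patches)   # (i) cross-patch linear
        self.channel_mlp = nn.Sequential(                  # (ii) per-patch MLP
            nn.Linear(dim, expansion * dim), nn.GELU(),
            nn.Linear(expansion * dim, dim))
    def forward(self, x):                                  # x: (B, n_patches, dim)
        x = x + self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```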
Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using uncurated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that effect, we propose...
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...
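An illustrative sketch of the "transposed" attention idea: the attention map is computed between feature channels rather than between tokens, so its size is (channels x channels) and the cost is linear in sequence length; shapes and the temperature are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v: (batch, heads, head_dim, n_tokens), i.e. already transposed.
    q = F.normalize(q, dim=-1)                       # normalise along tokens
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature   # (B, H, d, d): channel map
    attn = attn.softmax(dim=-1)
    return attn @ v                                  # (B, H, d, n_tokens)
```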
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, that is, the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives up to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B...
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated...
Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, where they are outperformed by unsupervised term-frequency methods such as BM25. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it...
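A hedged sketch of contrastive retriever training with in-batch negatives, a standard setup consistent with the abstract: each query is pulled toward its paired passage and pushed away from the other passages in the batch; the encoders producing the embeddings are left out, and the temperature is illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    # query_emb, passage_emb: (batch, dim); row i of each is a positive pair.
    scores = query_emb @ passage_emb.T / temperature        # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)  # diagonal
    return F.cross_entropy(scores, targets)
```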
This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and the labeled representations is used to provide a weighting over class labels, which we interpret as a soft pseudo-label. By non-parametrically...
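A minimal sketch of the non-parametric pseudo-labelling step described above, assuming cosine similarities to a labeled support set are softmaxed into weights over the support labels; names, the temperature, and the symmetric consistency term are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_pseudo_label(view_emb, support_emb, support_labels, temperature=0.1):
    # view_emb: (batch, dim); support_emb: (n_support, dim);
    # support_labels: (n_support, n_classes) one-hot.
    view_emb = F.normalize(view_emb, dim=1)
    support_emb = F.normalize(support_emb, dim=1)
    sims = view_emb @ support_emb.T / temperature
    weights = sims.softmax(dim=1)               # (batch, n_support)
    return weights @ support_labels             # soft label distribution

def consistency_loss(p1, p2):
    # Symmetric cross-entropy between the soft pseudo-labels of two views.
    return -0.5 * ((p2 * p1.clamp_min(1e-8).log()).sum(1)
                   + (p1 * p2.clamp_min(1e-8).log()).sum(1)).mean()
```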
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively...
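An illustrative simplification of patch-size randomisation: at each step a patch size is sampled and the patch-embedding weights are resized to match; plain bilinear interpolation stands in here for the paper's more careful weight-resizing scheme, and the size set is an assumption:

```python
import random
import torch
import torch.nn.functional as F

def flexible_patch_embed(images, weight, bias, patch_sizes=(8, 12, 16, 24, 32)):
    # images: (B, 3, H, W) with H, W divisible by the sampled patch size;
    # weight: (dim, 3, p0, p0) base patch-embedding kernel.
    p = random.choice(patch_sizes)
    w = F.interpolate(weight, size=(p, p), mode="bilinear", align_corners=False)
    return F.conv2d(images, w, bias, stride=p)   # (B, dim, H/p, W/p)
```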
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main...
Discriminative self-supervised learning allows training models on any random group of internet images, possibly recovering salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we question if, using this ability, we can learn more representative information present in a diverse, unbounded set of images from across the globe. To do so, we train on billions of images without any data pre-processing or...
Accurately quantifying cellular morphology at scale could substantially empower existing single-cell approaches. However, measuring cell morphology remains an active field of research, which has inspired multiple computer vision algorithms over the years. Here, we show that DINO, a vision-transformer based, self-supervised algorithm, has a remarkable ability for learning rich representations of cellular morphology without manual annotations or any other type of supervision. We evaluate DINO on a wide variety of tasks across three...
Convolutional neural networks trained without supervision come close to matching the performance of supervised pre-training, but sometimes at the cost of an even higher number of parameters. Extracting subnetworks from these large unsupervised convnets while preserving performance is of particular interest to make them less computationally intensive. Typical pruning methods operate during training on a task while trying to maintain the performance of the pruned network on the same task. However, in self-supervised feature learning, the training objective is agnostic...
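For reference, a generic sketch of the kind of magnitude pruning the passage alludes to: weights below a global magnitude threshold are masked out. This is standard pruning, not the paper's specific subnetwork-extraction procedure:

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity=0.9):
    # Gather all weight magnitudes to pick one global threshold.
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * weights.numel()))
    threshold = weights.kthvalue(k).values
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:
                masks[name] = (p.abs() > threshold).float()
                p.mul_(masks[name])      # zero out pruned weights
    return masks                          # reapply masks to keep weights at zero
```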
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved...
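A hedged sketch of sequence packing: variable-length token sequences from images of different resolutions are concatenated into fixed-length rows, with an id tensor recording which tokens belong to which image so attention can later be masked accordingly. This is simplified first-fit logic, not the paper's exact algorithm:

```python
import torch

def pack_sequences(seqs, max_len):
    # seqs: list of (n_tokens_i, dim) tensors with n_tokens_i <= max_len.
    dim = seqs[0].size(1)
    rows, ids, used = [], [], []
    for i, s in enumerate(seqs):
        for r in range(len(rows)):               # first row the sequence fits into
            if used[r] + s.size(0) <= max_len:
                rows[r][used[r]:used[r] + s.size(0)] = s
                ids[r][used[r]:used[r] + s.size(0)] = i
                used[r] += s.size(0)
                break
        else:                                    # open a new padded row
            row = torch.zeros(max_len, dim)
            idr = torch.full((max_len,), -1)     # -1 marks padding tokens
            row[:s.size(0)] = s
            idr[:s.size(0)] = i
            rows.append(row); ids.append(idr); used.append(s.size(0))
    return torch.stack(rows), torch.stack(ids)   # (n_rows, max_len, dim), (n_rows, max_len)
```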
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding the knowledge directly into the model's parameters:...
In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g. CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER)...
Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent pretraining algorithms for neural networks use image-level objectives, e.g. image classification, image-text alignment à la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which might be sub-optimal when finetuning on downstream tasks with spatial reasoning. In this work, we pretrain location-aware...