- Domain Adaptation and Few-Shot Learning
- Social Sciences and Governance
- Health, Medicine and Society
- Healthcare Systems and Practices
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Social Policies and Family
- Cell Image Analysis Techniques
- Education, sociology, and vocational training
- Aging, Elder Care, and Social Issues
- Digital Economy and Work Transformation
- Legal and Labor Studies
- COVID-19 diagnosis using AI
- French Urban and Social Studies
- Human Pose and Action Recognition
- Colorectal Cancer Screening and Detection
- Image Retrieval and Classification Techniques
- Video Surveillance and Tracking Methods
- Workplace Health and Well-being
- Natural Language Processing Techniques
- Wikis in Education and Collaboration
- International Labor and Employment Law
- Information Technology and Learning
- Occupational Health and Safety Research
- Google (United States): 2023-2024
- Université de Lille: 2012-2024
- Adrian College: 2023
- Directorate of Medicinal and Aromatic Plants Research: 2023
- Brain (Germany): 2023
- Centre de Théorie et Analyse du Droit: 2018-2022
- Centre de Recherche Droits et Perspectives du droit: 2018-2022
- Weatherford College: 2021
- Meta (Israel): 2019-2021
- Université de Bordeaux: 2014
In this paper, we question if self-supervised learning provides new properties to Vision Transformers (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3%...
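A minimal sketch of the kind of k-NN evaluation the abstract refers to, assuming a cosine-similarity weighted vote over the k nearest frozen training features; function name, the temperature, and all hyper-parameters are illustrative, not the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats,
                 k=20, num_classes=1000, temperature=0.07):
    # L2-normalise so the dot product is cosine similarity.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T                # (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=1)        # k nearest neighbours
    topk_labels = train_labels[topk_idx]             # (n_test, k)
    weights = (topk_sims / temperature).exp()        # similarity-weighted vote
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)      # accumulate votes per class
    return votes.argmax(dim=1)
```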
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency...
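The cluster-and-enforce-consistency idea can be sketched as a swapped-prediction loss: codes computed for one view are predicted from the other. This is a simplified reading of the abstract; sinkhorn() stands in for the equipartition step and the prototype matrix and hyper-parameters are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # Iteratively normalise rows and columns so clusters are used equally.
    q = torch.exp(scores / eps).T                    # (n_prototypes, batch)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= q.shape[0]   # rows
        q /= q.sum(dim=0, keepdim=True); q /= q.shape[1]   # columns
    return (q * q.shape[1]).T                        # (batch, n_prototypes)

def swav_loss(z1, z2, prototypes, temp=0.1):
    # z1, z2: L2-normalised embeddings of two views, (batch, dim);
    # prototypes: (n_prototypes, dim), also L2-normalised.
    s1, s2 = z1 @ prototypes.T, z2 @ prototypes.T
    q1, q2 = sinkhorn(s1), sinkhorn(s2)              # soft cluster assignments
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    # Swapped prediction: view 1 predicts view 2's code, and vice versa.
    return -0.5 * ((q2 * p1).sum(dim=1) + (q1 * p2).sum(dim=1)).mean()
```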
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models...
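A rough PyTorch rendering of the residual block the abstract describes: one linear layer mixing patches, then a two-layer MLP mixing channels; the affine pre-normalisation and all dimensions here are a sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    # Simple learnable rescale-and-shift in place of normalisation.
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, n_patches, dim, expansion=4):
        super().__init__()
        self.norm1, self.norm2 = Affine(dim), Affine(dim)
        self.patch_mix = nn.Linear(n_patches, n_patches)   # (i) cross-patch linear
        self.channel_mlp = nn.Sequential(                  # (ii) per-patch MLP
            nn.Linear(dim, expansion * dim), nn.GELU(),
            nn.Linear(expansion * dim, dim))
    def forward(self, x):                                  # x: (B, n_patches, dim)
        x = x + self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```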
Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using uncurated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that effect, we propose...
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates...
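An illustrative sketch of the "transposed" attention idea: the attention map is computed between feature channels rather than between tokens, so its size is (channels x channels) and the cost is linear in sequence length; shapes and the temperature are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v: (batch, heads, head_dim, n_tokens), i.e. already transposed.
    q = F.normalize(q, dim=-1)                       # normalise along tokens
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature   # (B, H, d, d): channel map
    attn = attn.softmax(dim=-1)
    return attn @ v                                  # (B, H, d, n_tokens)
```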
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, that is, the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives up to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B...
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated...
Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, where they are outperformed by unsupervised term-frequency methods such as BM25. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it...
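A hedged sketch of contrastive retriever training with in-batch negatives, a standard setup consistent with the abstract: each query is pulled toward its paired passage and pushed away from the other passages in the batch; the encoders producing the embeddings are left out, and the temperature is illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    # query_emb, passage_emb: (batch, dim); row i of each is a positive pair.
    scores = query_emb @ passage_emb.T / temperature        # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)  # diagonal
    return F.cross_entropy(scores, targets)
```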
This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and the labeled representations is used to provide a weighting over class labels, which we interpret as a soft pseudo-label. By non-parametrically...
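A minimal sketch of the non-parametric pseudo-labelling step described above, assuming cosine similarities to a labeled support set are softmaxed into weights over the support labels; names, the temperature, and the symmetric consistency term are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_pseudo_label(view_emb, support_emb, support_labels, temperature=0.1):
    # view_emb: (batch, dim); support_emb: (n_support, dim);
    # support_labels: (n_support, n_classes) one-hot.
    view_emb = F.normalize(view_emb, dim=1)
    support_emb = F.normalize(support_emb, dim=1)
    sims = view_emb @ support_emb.T / temperature
    weights = sims.softmax(dim=1)               # (batch, n_support)
    return weights @ support_labels             # soft label distribution

def consistency_loss(p1, p2):
    # Symmetric cross-entropy between the soft pseudo-labels of two views.
    return -0.5 * ((p2 * p1.clamp_min(1e-8).log()).sum(1)
                   + (p1 * p2.clamp_min(1e-8).log()).sum(1)).mean()
```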
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively...
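An illustrative simplification of patch-size randomisation: at each step a patch size is sampled and the patch-embedding weights are resized to match; plain bilinear interpolation stands in here for the paper's more careful weight-resizing scheme, and the size set is an assumption:

```python
import random
import torch
import torch.nn.functional as F

def flexible_patch_embed(images, weight, bias, patch_sizes=(8, 12, 16, 24, 32)):
    # images: (B, 3, H, W) with H, W divisible by the sampled patch size;
    # weight: (dim, 3, p0, p0) base patch-embedding kernel.
    p = random.choice(patch_sizes)
    w = F.interpolate(weight, size=(p, p), mode="bilinear", align_corners=False)
    return F.conv2d(images, w, bias, stride=p)   # (B, dim, H/p, W/p)
```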
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main...
Discriminative self-supervised learning allows training models on any random group of internet images, possibly recovering salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we question if, using this ability, we can learn more representative information present in a diverse, unbounded set of images from across the globe. To do so, we train on billions of images without any data pre-processing or...
Accurately quantifying cellular morphology at scale could substantially empower existing single-cell approaches. However, measuring cell morphology remains an active field of research, which has inspired multiple computer vision algorithms over the years. Here, we show that DINO, a vision-transformer based, self-supervised algorithm, has a remarkable ability for learning rich representations of cellular morphology without manual annotations or any other type of supervision. We evaluate DINO on a wide variety of tasks across three...
Convolutional neural networks trained without supervision come close to matching the performance of supervised pre-training, but sometimes at the cost of an even higher number of parameters. Extracting subnetworks from these large unsupervised convnets while preserving performance is of particular interest to make them less computationally intensive. Typical pruning methods operate during training on a task while trying to maintain the performance of the pruned network on the same task. However, in self-supervised feature learning, the training objective is agnostic...
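For reference, a generic sketch of the kind of magnitude pruning the passage alludes to: weights below a global magnitude threshold are masked out. This is standard pruning, not the paper's specific subnetwork-extraction procedure:

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity=0.9):
    # Gather all weight magnitudes to pick one global threshold.
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * weights.numel()))
    threshold = weights.kthvalue(k).values
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:
                masks[name] = (p.abs() > threshold).float()
                p.mul_(masks[name])      # zero out pruned weights
    return masks                          # reapply masks to keep weights at zero
```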
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved...
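A hedged sketch of sequence packing: variable-length token sequences from images of different resolutions are concatenated into fixed-length rows, with an id tensor recording which tokens belong to which image so attention can later be masked accordingly. This is simplified first-fit logic, not the paper's exact algorithm:

```python
import torch

def pack_sequences(seqs, max_len):
    # seqs: list of (n_tokens_i, dim) tensors with n_tokens_i <= max_len.
    dim = seqs[0].size(1)
    rows, ids, used = [], [], []
    for i, s in enumerate(seqs):
        for r in range(len(rows)):               # first row the sequence fits into
            if used[r] + s.size(0) <= max_len:
                rows[r][used[r]:used[r] + s.size(0)] = s
                ids[r][used[r]:used[r] + s.size(0)] = i
                used[r] += s.size(0)
                break
        else:                                    # open a new padded row
            row = torch.zeros(max_len, dim)
            idr = torch.full((max_len,), -1)     # -1 marks padding tokens
            row[:s.size(0)] = s
            idr[:s.size(0)] = i
            rows.append(row); ids.append(idr); used.append(s.size(0))
    return torch.stack(rows), torch.stack(ids)   # (n_rows, max_len, dim), (n_rows, max_len)
```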
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding the knowledge directly into the model's parameters:...
In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g. CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER)...
Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent pretraining algorithms for neural networks use image-level objectives, e.g. image classification, image-text alignment à la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which might be sub-optimal when finetuning on downstream tasks with spatial reasoning. In this work, we pretrain location-aware...