- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- RNA and Protein Synthesis Mechanisms
- Generative Adversarial Networks and Image Synthesis
- Algorithms and Data Compression
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Music and Audio Processing
- Fractional Differential Equations Solutions
- Genomics and Phylogenetic Studies
- RNA Research and Splicing
- Bacterial Genetics and Biotechnology
- Machine Learning and Algorithms
- Model Reduction and Neural Networks
- Topic Modeling
- Machine Learning and Data Classification
- Advanced Neuroimaging Techniques and Applications
- Insurance, Mortality, Demography, Risk Management
- Image Enhancement Techniques
- Explainable Artificial Intelligence (XAI)
- Video Surveillance and Tracking Methods
Google (United States)
2020-2024
DeepMind (United Kingdom)
2023
Delft University of Technology
2012-2019
Cancer Genomics Centre
2015-2017
Identifying the IRESs of humans and viruses. Most proteins result from the translation of 5′-capped RNA transcripts. In a subset of human genes, transcripts with internal ribosome entry sites (IRESs) are uncapped. Weingarten-Gabbay et al. systematically surveyed the presence of IRESs in human protein-coding transcripts, as well as in those of viruses (see the Perspective by Gebauer and Hentze). Large-scale mutagenesis profiling identified two classes of IRESs: those having a functional element localized to one small region of the IRES and those with important elements...
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a text-to-video model, including design decisions such as the choice of fully-convolutional models at certain resolutions and the v-parameterization of the diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video setting. Finally,...
Generating temporally coherent high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate long, higher-resolution videos we introduce a new conditional sampling technique...
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated...
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection...
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation...
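The pseudo-annotation step described above can be illustrated with a minimal, generic self-training filter. This is a sketch of the general idea only, not the paper's actual recipe: the `filter_pseudo_boxes` helper, its score threshold, and the caption-matching rule are all illustrative assumptions.

```python
def filter_pseudo_boxes(detections, caption, score_threshold=0.3):
    """Keep pseudo-box annotations whose predicted label appears in the
    paired caption and whose detector score clears a threshold (a generic
    self-training filter; the real pipeline is more involved)."""
    caption = caption.lower()
    return [d for d in detections
            if d["score"] >= score_threshold and d["label"].lower() in caption]

# Hypothetical detector outputs for one Web image-text pair:
dets = [
    {"label": "dog", "score": 0.8, "box": (10, 10, 50, 60)},
    {"label": "cat", "score": 0.9, "box": (0, 0, 20, 20)},   # not in caption
    {"label": "dog", "score": 0.1, "box": (5, 5, 8, 8)},     # below threshold
]
kept = filter_pseudo_boxes(dets, "A dog playing in the park")
print([d["label"] for d in kept])  # ['dog']
```

Caption-based label filtering is what ties the weak text supervision to box-level pseudo-labels; the threshold trades pseudo-label precision against recall.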
Scenic is an open-source JAX library (https://github.com/google-research/scenic) with a focus on transformer-based models for computer vision research and beyond. The goal of this toolkit is to facilitate rapid experimentation and prototyping of new architectures and models. Scenic supports a diverse range of tasks (e.g., classification, segmentation, detection) and facilitates working on multi-modal problems, along...
Abstract. Motivation: The increasing availability of second-generation high-throughput sequencing (HTS) technologies has sparked a growing interest in de novo genome sequencing. This in turn has fueled the need for reliable means of obtaining high-quality draft genomes from short-read data. The millions of reads usually involved in HTS experiments are first assembled into longer fragments called contigs, which are then scaffolded, i.e. ordered and oriented using additional information, to produce even longer sequences...
Translation of RNA to protein is a core process for any living organism. While some steps of this process and their effect on protein production are understood, a holistic understanding of translation still remains elusive. In silico modelling is a promising approach for elucidating protein synthesis. Although a number of computational models have been proposed, their application is limited by the assumptions they make. Ribosome profiling (RP), a relatively new sequencing-based technique capable of recording snapshots of the locations of actively translating...
The world of empirical machine learning (ML) strongly relies on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of a "benchmark lottery" that describes the overall fragility of the ML benchmarking process. The lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the community, we show that the relative performance of algorithms may be altered significantly simply by...
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision...
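The translation-invariance property that STRING preserves can be seen in a minimal sketch of plain rotary encodings in one dimension (this shows the RoPE mechanism STRING generalizes, not the STRING construction itself; the `rotate` helper and the angle scale `theta` are illustrative):

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotary position encoding: rotate each (even, odd) feature pair
    by an angle proportional to the token position."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = c * x - s * y
    out[1::2] = s * x + c * y
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# The attention score depends only on the relative offset: (3 - 7) == (8 - 12),
# so the two scores are identical -- exact translation invariance.
s1 = rotate(q, 7) @ rotate(k, 3)
s2 = rotate(q, 12) @ rotate(k, 8)
print(np.isclose(s1, s2))  # True
```

The invariance follows from R(a)ᵀR(b) = R(b − a) for 2D rotations; STRING's contribution is keeping this exact for coordinates of arbitrary dimensionality.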
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities,...
In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate that this proof does not hold due to the embedding of data with finite support into...
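The invertible integer transformations mentioned above can be sketched with a single additive coupling layer; rounding the (stand-in) network output before adding it keeps the map exactly invertible over the integers, which is what makes these flows compatible with entropy coders. The `nn` function here is a toy placeholder, not a learned network.

```python
import numpy as np

def nn(x):
    # Stand-in for a learned network: any deterministic map works,
    # because its output is rounded before being added.
    return np.sin(x) * 3.0

def forward(x):
    """Integer additive coupling: z2 = x2 + round(t(x1)).
    Exactly invertible over the integers, so no information is lost."""
    x1, x2 = x[: len(x) // 2], x[len(x) // 2 :]
    z2 = x2 + np.round(nn(x1)).astype(np.int64)
    return np.concatenate([x1, z2])

def inverse(z):
    z1, z2 = z[: len(z) // 2], z[len(z) // 2 :]
    x2 = z2 - np.round(nn(z1)).astype(np.int64)
    return np.concatenate([z1, x2])

x = np.array([3, -1, 7, 2], dtype=np.int64)
z = forward(x)
print(np.array_equal(inverse(z), x))  # True: round-trip is exact
```

Because the first half passes through unchanged, the inverse can recompute the identical rounded shift and subtract it, so the round-trip is bit-exact.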
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved...
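The sequence-packing idea can be sketched with a simple greedy first-fit scheme: patch sequences from images of different resolutions are placed into fixed-capacity batch rows so that little padding is wasted. This is an illustrative assumption about the packing step, not NaViT's exact algorithm.

```python
def pack_sequences(lengths, capacity):
    """Greedy first-fit packing: place each variable-length patch sequence
    into the first bin with enough remaining room (one bin = one padded
    batch row). Returns a list of bins, each a list of sequence indices."""
    bins, remaining = [], []
    for idx, n in enumerate(lengths):
        if n > capacity:
            raise ValueError("sequence longer than bin capacity")
        for b, room in enumerate(remaining):
            if n <= room:
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:  # no bin had room: open a new one
            bins.append([idx])
            remaining.append(capacity - n)
    return bins

# Patch counts for images of different resolutions / aspect ratios:
lengths = [196, 49, 256, 64, 100]
bins = pack_sequences(lengths, capacity=300)
print(bins)  # [[0, 1], [2], [3, 4]]
```

Attention masks then keep tokens from different images in the same row from attending to each other; that masking step is omitted here.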
Translation of mRNAs through Internal Ribosome Entry Sites (IRESs) has emerged as a prominent mechanism of cellular and viral translation initiation. It supports cap-independent translation of select genes under normal conditions, and in conditions when cap-dependent translation is inhibited. IRES structure and sequence are believed to be involved in this process. However, due to the small number of IRESs known, there have been no systematic investigations of the determinants of IRES activity. With the recent discovery of thousands of novel IRESs in humans and viruses, the next...
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new method that allows us to train highly parallel models of speech, without...
We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective, similar to that of modern probabilistic diffusion models, that scales favourably to highly-dimensional data. At...
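The order-agnostic training setup can be sketched as follows: sample a random generation order and a random step, treat the already-generated positions as observed context, and predict all remaining positions in parallel. The `oa_masks` helper is an illustrative sketch of this mask construction only; the model and loss weighting are omitted.

```python
import numpy as np

def oa_masks(seq_len, rng):
    """Sample the order-agnostic training setup: a random generation order
    sigma and a random step t. Positions sigma[:t] are observed context;
    positions sigma[t:] are targets, predicted in parallel (no causal mask)."""
    sigma = rng.permutation(seq_len)
    t = int(rng.integers(0, seq_len))
    observed = np.zeros(seq_len, dtype=bool)
    observed[sigma[:t]] = True
    return observed, ~observed

rng = np.random.default_rng(0)
obs, targets = oa_masks(8, rng)
print(int(obs.sum() + targets.sum()))  # 8: every position is context or target
```

Averaging this objective over random orders and steps is what makes the model order-agnostic, and it is the sense in which absorbing discrete diffusion appears as a special case.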
We present an architecture and a training recipe that adapts pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific...
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute, for each token, a language-grouped vision embedding as the weighted average of patches. The embeddings are then contrasted through a sequence-wise...
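The grouping step can be sketched in numpy: compute token-patch similarities, zero out small entries for sparsity, renormalise, and average patches with the resulting weights to get one vision embedding per caption token. The thresholding rule and function name here are illustrative assumptions, not SPARC's exact formulation.

```python
import numpy as np

def grouped_vision_embeddings(patches, tokens, threshold=0.0):
    """For each caption token, build a language-grouped vision embedding:
    similarities below a threshold are zeroed (sparsity), the survivors are
    renormalised, and patches are averaged with those weights."""
    sim = tokens @ patches.T                    # (n_tokens, n_patches)
    sim = np.where(sim > threshold, sim, 0.0)   # sparsify
    norm = sim.sum(axis=1, keepdims=True)
    weights = np.divide(sim, norm, out=np.zeros_like(sim), where=norm > 0)
    return weights @ patches                    # (n_tokens, dim)

rng = np.random.default_rng(0)
patches = rng.normal(size=(9, 8))   # 9 image patches, embedding dim 8
tokens = rng.normal(size=(4, 8))    # 4 caption tokens
grouped = grouped_vision_embeddings(patches, tokens)
print(grouped.shape)  # (4, 8)
```

These per-token vision embeddings are then what gets contrasted against the token embeddings, giving a finer-grained signal than a single global image-text pair.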