- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- RNA and Protein Synthesis Mechanisms
- Generative Adversarial Networks and Image Synthesis
- Algorithms and Data Compression
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Music and Audio Processing
- Fractional Differential Equations Solutions
- Genomics and Phylogenetic Studies
- RNA Research and Splicing
- Bacterial Genetics and Biotechnology
- Machine Learning and Algorithms
- Model Reduction and Neural Networks
- Topic Modeling
- Machine Learning and Data Classification
- Advanced Neuroimaging Techniques and Applications
- Insurance, Mortality, Demography, Risk Management
- Image Enhancement Techniques
- Explainable Artificial Intelligence (XAI)
- Video Surveillance and Tracking Methods
Google (United States)
2020-2024
DeepMind (United Kingdom)
2023
Delft University of Technology
2012-2019
Cancer Genomics Centre
2015-2017
Identifying the IRESs of humans and viruses. Most proteins result from the translation of 5′-capped RNA transcripts. In a subset of human genes, transcripts with internal ribosome entry sites (IRESs) are uncapped. Weingarten-Gabbay et al. systematically surveyed the presence of IRESs in human protein-coding transcripts, as well as in those of viruses (see the Perspective by Gebauer and Hentze). Large-scale mutagenesis profiling identified two classes of IRESs: those having a functional element localized to one small region of the IRES and those with important elements...
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a text-to-video model, including design decisions such as the choice of fully-convolutional models at certain resolutions and the v-parameterization of the diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video setting. Finally,...
Generating temporally coherent high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate long, higher-resolution videos we introduce a new conditional sampling technique...
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated...
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection...
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation...
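The pseudo-annotation step described above can be illustrated with a minimal, generic self-training filter. This is a sketch of the general idea only, not the paper's actual recipe: the `filter_pseudo_boxes` helper, its score threshold, and the caption-matching rule are all illustrative assumptions.

```python
def filter_pseudo_boxes(detections, caption, score_threshold=0.3):
    """Keep pseudo-box annotations whose predicted label appears in the
    paired caption and whose detector score clears a threshold (a generic
    self-training filter; the real pipeline is more involved)."""
    caption = caption.lower()
    return [d for d in detections
            if d["score"] >= score_threshold and d["label"].lower() in caption]

# Hypothetical detector outputs for one Web image-text pair:
dets = [
    {"label": "dog", "score": 0.8, "box": (10, 10, 50, 60)},
    {"label": "cat", "score": 0.9, "box": (0, 0, 20, 20)},   # not in caption
    {"label": "dog", "score": 0.1, "box": (5, 5, 8, 8)},     # below threshold
]
kept = filter_pseudo_boxes(dets, "A dog playing in the park")
print([d["label"] for d in kept])  # ['dog']
```

Caption-based label filtering is what ties the weak text supervision to box-level pseudo-labels; the threshold trades pseudo-label precision against recall.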
Scenic is an open-source JAX library (https://github.com/google-research/scenic) with a focus on transformer-based models for computer vision research and beyond. The goal of this toolkit is to facilitate rapid experimentation and prototyping of new architectures and models. Scenic supports a diverse range of tasks (e.g., classification, segmentation, detection) and facilitates working on multi-modal problems, along...
Abstract. Motivation: The increasing availability of second-generation high-throughput sequencing (HTS) technologies has sparked a growing interest in de novo genome sequencing. This in turn has fueled the need for reliable means of obtaining high-quality draft genomes from short-read data. The millions of reads usually involved in HTS experiments are first assembled into longer fragments called contigs, which are then scaffolded, i.e. ordered and oriented using additional information, to produce even longer sequences...
Translation of RNA to protein is a core process for any living organism. While some steps of this process and their effect on protein production are understood, a holistic understanding of translation still remains elusive. In silico modelling is a promising approach for elucidating protein synthesis. Although a number of computational models have been proposed, their application is limited by the assumptions they make. Ribosome profiling (RP), a relatively new sequencing-based technique capable of recording snapshots of the locations of actively translating...
The world of empirical machine learning (ML) strongly relies on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of a "benchmark lottery" that describes the overall fragility of the ML benchmarking process. The lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the community, we show that the relative performance of algorithms may be altered significantly simply by...
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision...
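The translation-invariance property that STRING preserves can be seen in a minimal sketch of plain rotary encodings in one dimension (this shows the RoPE mechanism STRING generalizes, not the STRING construction itself; the `rotate` helper and the angle scale `theta` are illustrative):

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotary position encoding: rotate each (even, odd) feature pair
    by an angle proportional to the token position."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = c * x - s * y
    out[1::2] = s * x + c * y
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# The attention score depends only on the relative offset: (3 - 7) == (8 - 12),
# so the two scores are identical -- exact translation invariance.
s1 = rotate(q, 7) @ rotate(k, 3)
s2 = rotate(q, 12) @ rotate(k, 8)
print(np.isclose(s1, s2))  # True
```

The invariance follows from R(a)ᵀR(b) = R(b − a) for 2D rotations; STRING's contribution is keeping this exact for coordinates of arbitrary dimensionality.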
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities,...
In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate that this proof does not hold due to the embedding of data with finite support into...
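The invertible integer transformations mentioned above can be sketched with a single additive coupling layer; rounding the (stand-in) network output before adding it keeps the map exactly invertible over the integers, which is what makes these flows compatible with entropy coders. The `nn` function here is a toy placeholder, not a learned network.

```python
import numpy as np

def nn(x):
    # Stand-in for a learned network: any deterministic map works,
    # because its output is rounded before being added.
    return np.sin(x) * 3.0

def forward(x):
    """Integer additive coupling: z2 = x2 + round(t(x1)).
    Exactly invertible over the integers, so no information is lost."""
    x1, x2 = x[: len(x) // 2], x[len(x) // 2 :]
    z2 = x2 + np.round(nn(x1)).astype(np.int64)
    return np.concatenate([x1, z2])

def inverse(z):
    z1, z2 = z[: len(z) // 2], z[len(z) // 2 :]
    x2 = z2 - np.round(nn(z1)).astype(np.int64)
    return np.concatenate([z1, x2])

x = np.array([3, -1, 7, 2], dtype=np.int64)
z = forward(x)
print(np.array_equal(inverse(z), x))  # True: round-trip is exact
```

Because the first half passes through unchanged, the inverse can recompute the identical rounded shift and subtract it, so the round-trip is bit-exact.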
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved...
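The sequence-packing idea can be sketched with a simple greedy first-fit scheme: patch sequences from images of different resolutions are placed into fixed-capacity batch rows so that little padding is wasted. This is an illustrative assumption about the packing step, not NaViT's exact algorithm.

```python
def pack_sequences(lengths, capacity):
    """Greedy first-fit packing: place each variable-length patch sequence
    into the first bin with enough remaining room (one bin = one padded
    batch row). Returns a list of bins, each a list of sequence indices."""
    bins, remaining = [], []
    for idx, n in enumerate(lengths):
        if n > capacity:
            raise ValueError("sequence longer than bin capacity")
        for b, room in enumerate(remaining):
            if n <= room:
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:  # no bin had room: open a new one
            bins.append([idx])
            remaining.append(capacity - n)
    return bins

# Patch counts for images of different resolutions / aspect ratios:
lengths = [196, 49, 256, 64, 100]
bins = pack_sequences(lengths, capacity=300)
print(bins)  # [[0, 1], [2], [3, 4]]
```

Attention masks then keep tokens from different images in the same row from attending to each other; that masking step is omitted here.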
Translation of mRNAs through Internal Ribosome Entry Sites (IRESs) has emerged as a prominent mechanism of cellular and viral translation initiation. It supports cap-independent translation of select genes under normal conditions, and in conditions when cap-dependent translation is inhibited. IRES structure and sequence are believed to be involved in this process. However, due to the small number of IRESs known, there have been no systematic investigations of the determinants of IRES activity. With the recent discovery of thousands of novel IRESs in humans and viruses, the next...
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new method that allows us to train highly parallel models of speech, without...
We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective, similar to that of modern probabilistic diffusion models, that scales favourably to highly-dimensional data. At...
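The order-agnostic training setup can be sketched as follows: sample a random generation order and a random step, treat the already-generated positions as observed context, and predict all remaining positions in parallel. The `oa_masks` helper is an illustrative sketch of this mask construction only; the model and loss weighting are omitted.

```python
import numpy as np

def oa_masks(seq_len, rng):
    """Sample the order-agnostic training setup: a random generation order
    sigma and a random step t. Positions sigma[:t] are observed context;
    positions sigma[t:] are targets, predicted in parallel (no causal mask)."""
    sigma = rng.permutation(seq_len)
    t = int(rng.integers(0, seq_len))
    observed = np.zeros(seq_len, dtype=bool)
    observed[sigma[:t]] = True
    return observed, ~observed

rng = np.random.default_rng(0)
obs, targets = oa_masks(8, rng)
print(int(obs.sum() + targets.sum()))  # 8: every position is context or target
```

Averaging this objective over random orders and steps is what makes the model order-agnostic, and it is the sense in which absorbing discrete diffusion appears as a special case.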
We present an architecture and a training recipe that adapts pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific...
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute, for each token, a language-grouped vision embedding as the weighted average of patches. The embeddings are then contrasted through a sequence-wise...
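The grouping step can be sketched in numpy: compute token-patch similarities, zero out small entries for sparsity, renormalise, and average patches with the resulting weights to get one vision embedding per caption token. The thresholding rule and function name here are illustrative assumptions, not SPARC's exact formulation.

```python
import numpy as np

def grouped_vision_embeddings(patches, tokens, threshold=0.0):
    """For each caption token, build a language-grouped vision embedding:
    similarities below a threshold are zeroed (sparsity), the survivors are
    renormalised, and patches are averaged with those weights."""
    sim = tokens @ patches.T                    # (n_tokens, n_patches)
    sim = np.where(sim > threshold, sim, 0.0)   # sparsify
    norm = sim.sum(axis=1, keepdims=True)
    weights = np.divide(sim, norm, out=np.zeros_like(sim), where=norm > 0)
    return weights @ patches                    # (n_tokens, dim)

rng = np.random.default_rng(0)
patches = rng.normal(size=(9, 8))   # 9 image patches, embedding dim 8
tokens = rng.normal(size=(4, 8))    # 4 caption tokens
grouped = grouped_vision_embeddings(patches, tokens)
print(grouped.shape)  # (4, 8)
```

These per-token vision embeddings are then what gets contrasted against the token embeddings, giving a finer-grained signal than a single global image-text pair.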