- Genomics and Chromatin Dynamics
- RNA and protein synthesis mechanisms
- Genomics and Phylogenetic Studies
- Biomedical Text Mining and Ontologies
- Chromatin Remodeling and Cancer
- RNA Research and Splicing
- Protein Degradation and Inhibitors
- RNA modifications and cancer
- Chromosomal and Genetic Variations
- CRISPR and Genetic Engineering
- Genetic diversity and population structure
- Machine Learning in Bioinformatics
- Geography and Environmental Studies in Latin America
- SARS-CoV-2 and COVID-19 Research
- Phytoplasmas and Hemiptera pathogens
- Genomic variations and chromosomal abnormalities
- COVID-19 Clinical Research Studies
- COVID-19 diagnosis using AI
- Cancer, Hypoxia, and Metabolism
- Plant Taxonomy and Phylogenetics
- Viral gastroenteritis research and epidemiology
- Single-cell and spatial transcriptomics
- Evolution and Genetic Dynamics
- Gene expression and cancer classification
- Bioinformatics and Genomic Networks
Vavilov Institute of General Genetics
2019-2025
Institute of Protein Research
2020-2025
Lomonosov Moscow State University
2018-2024
Russian Academy of Sciences
2024
Pirogov Russian National Research Medical University
2022-2024
Moscow State University
2023
Gamalei Institute of Epidemiology and Microbiology
2023
Moscow Institute of Physics and Technology
2019-2020
The present outbreak of a coronavirus-associated acute respiratory disease called coronavirus disease 19 (COVID-19) is the third documented spillover of an animal coronavirus to humans in only two decades that has resulted in a major epidemic. The Coronaviridae Study Group (CSG) of the International Committee on Taxonomy of Viruses, which is responsible for developing the classification of viruses and taxon nomenclature of the family Coronaviridae, has assessed the placement of the human pathogen, tentatively named 2019-nCoV, within the Coronaviridae. Based on phylogeny,...
The present outbreak of lower respiratory tract infections, including respiratory distress syndrome, is the third spillover, in only two decades, of an animal coronavirus to humans resulting in a major epidemic. Here, the Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses, which is responsible for developing the official classification of viruses and taxa naming (taxonomy) of the Coronaviridae family, assessed the novelty of the human pathogen tentatively named 2019-nCoV. Based on phylogeny, taxonomy and established...
Sequence variants in gene regulatory regions alter gene expression and contribute to phenotypes of individual cells and the whole organism, including disease susceptibility and progression. Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Differential transcription factor binding in heterozygous genomic loci provides a natural source of information on such regulatory variants. We present a novel approach to call the allele-specific transcription factor binding events at single-nucleotide variants in ChIP-Seq data, taking into account the joint...
Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language...
Background: Transposons are selfish genetic elements that self-reproduce in host DNA. They were active during evolutionary history and now occupy almost half of mammalian genomes. Close insertions of transposons reshaped the structure and regulation of many genes considerably. Co-evolution of transposons and host DNA frequently results in the formation of new regulatory regions. Previously we published a concept that the proportion of functional features held by transposons positively correlates with the rate of evolution of the respective genes. Methods: We ranked human...
Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA-LM, a suite...
The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar. Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using...
Prediction of RNA structure from sequence remains an unsolved problem, and progress has been slowed by a paucity of experimental data. Here, we present Ribonanza, a dataset of chemical mapping measurements on two million diverse RNA sequences collected through Eterna and other crowdsourced initiatives. Ribonanza enabled solicitation, training, and prospective evaluation of deep neural networks through a Kaggle challenge, followed by distillation into a single, self-contained model called RibonanzaNet. When fine tuned on auxiliary...
The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states. However, we lack a functional understanding of the sequence features that control the activity of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11), and found that 41.7%...
Background: Positional weight matrix (PWM) is a de facto standard model to describe transcription factor (TF) DNA binding specificities. PWMs inferred from in vivo or in vitro data are stored in many databases and used in a plethora of biological applications. This calls for comprehensive benchmarking of public PWM models with large experimental reference sets. Results: Here we report the results of all-against-all benchmarking of PWM models for DNA binding sites of human TFs on a compilation of in vitro (HT-SELEX, PBM) and in vivo (ChIP-seq) binding data. We observe that the best...
The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated enhancers and nine promoters. A...
Endogenous retroviruses and retrotransposons, also termed retroelements (REs), are mobile genetic elements that were active until recently in human genome evolution. REs regulate gene expression by actively reshaping chromatin structure or by directly providing transcription factor binding sites (TFBS). We aimed to identify molecular processes most deeply impacted by the REs in human cells at the level of TFBS regulation. Using ENCODE data, we identified ~2 million TFBS overlapping with putatively regulation-competent...
The advent of advanced sequencing technologies has significantly reduced the cost and increased the feasibility of assembling high-quality genomes. Yet, the annotation of genomic elements remains a complex challenge. Even for species with comprehensively annotated reference genomes, the functional assessment of individual genetic variants is not straightforward. In response to these challenges, recent breakthroughs in machine learning have led to the development of DNA language models. These transformer-based...
Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to...
Background: Retroelements (REs) are transposable elements occupying ~40% of the human genome that can regulate genes by providing transcription factor binding sites (TFBS). The RE-linked TFBS profile can serve as a marker of gene transcriptional regulation evolution. This approach allows for interrogating the regulatory evolution of organisms with RE-rich genomes. We aimed to characterize the regulatory evolution of human genes and molecular pathways using the RE-linked TFBS accumulation metric. Methods: We characterized genes and molecular pathways either enriched or deficient in RE-linked TFBS regulation. We used...
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect...
A DNA sequence pattern, or “motif”, is an essential representation of the DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and the computational motif discovery algorithm. As part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten...
Polyunsaturated fatty acid (PUFA) metabolism is currently a focus in cancer research due to PUFAs functioning as structural components of the membrane matrix, as fuel sources for energy production, and as sources of secondary messengers, so-called oxylipins, which are important players in inflammatory processes. Although breast cancer (BC) is the leading cause of cancer death among women worldwide, no systematic study of PUFA metabolism as a system of interrelated processes in this disease has been carried out. Here, we implemented a Boruta-based feature selection...
Many problems of modern genetics and functional genomics require the assessment of the functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here we...
Phylogenetic inference based on protein sequence alignment is a widely used procedure. Numerous phylogenetic algorithms have been developed, most of which have many parameters and options. Choosing a program, options, and parameters can be a nontrivial task. No benchmark for comparison of phylogenetic programs on real protein sequences was publicly available. We have developed PhyloBench, a benchmark for evaluating the quality of phylogenetic inference, and used it to test a number of popular phylogenetic programs. PhyloBench is based on natural, not simulated, protein sequences of orthologous evolutionary domains. The measure of accuracy...
mRNA delivery offers new opportunities for disease treatment by directing cells to produce therapeutic proteins. However, designing highly stable mRNAs with programmable cell type-specificity remains a challenge. To address this, we measured the regulatory activity of 60,000 5’ and 3’ untranslated regions (UTRs) across six cell types and developed PARADE (Prediction And RAtional DEsign of UTRs), a generative AI framework to engineer RNA tailored for cell type-specific activity. We validated PARADE by testing 15,800 de...