- Algorithms and Data Compression
- Genomics and Phylogenetic Studies
- Network Packet Processing and Optimization
- DNA and Biological Computing
- RNA and protein synthesis mechanisms
- Genome Rearrangement Algorithms
- semigroups and automata theory
- Natural Language Processing Techniques
- Advanced Image and Video Retrieval Techniques
- Chromosomal and Genetic Variations
- Cellular Automata and Applications
- Machine Learning in Bioinformatics
- Gene expression and cancer classification
- Graph Theory and Algorithms
- Molecular Biology Techniques and Applications
- Music and Audio Processing
- Machine Learning and Algorithms
- Genomics and Rare Diseases
- Cancer Genomics and Diagnostics
- Genetics, Bioinformatics, and Biomedical Research
- Genomics and Chromatin Dynamics
- Genomic variations and chromosomal abnormalities
- Advanced Data Compression Techniques
- Image Retrieval and Classification Techniques
- RNA modifications and cancer
University of Helsinki
2015-2024
Helsinki Institute for Information Technology
2012-2021
Helsinki Institute of Physics
2009-2016
Aalto University
2016
Max Planck Institute for Informatics
2015
Saarland University
2015
Centrum Wiskunde & Informatica
2015
Bielefeld University
2005-2015
HAN University of Applied Sciences
2014
Institute of Bioinformatics
2003-2008
Given a sequence S = s 1 2 … n of integers smaller than r O (polylog( )), we show how can be represented using nH 0 ( ) + o bits, so that know any q , as well answer rank and select queries on in constant time. H is the zero-order empirical entropy provides an information-theoretic lower bound to bit storage via fixed encoding its symbols. This extends previous results binary sequences, improves general sequences where those are answered (log For larger still represent log bits /log Another...
We propose a generic approach to replace the canonical sequence representation of genomes with graph representations, and study several applications such extensions. extend Burrows-Wheeler transform (BWT) strings acyclic directed labeled graphs, support path queries as an extension substring searching. develop, apply, tailor this technique a) read alignment on extended BWT index representing pan-genome, i.e., reference genome known variants it; b) split-read splicing graph. Other possible...
Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report genome (393 Mb) Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a recognized model species metapopulation biology and eco-evolutionary research, which putative ancestral karyotype n=31. Using phylogenetic analyses Nymphalidae other Lepidoptera, combined with orthologue-level comparisons chromosomes,...
A repetitive sequence collection is a set of sequences which are small variations each other. prominent example genome individuals the same or close species, where differences can be expressed by short lists basic edit operations. Flexible and efficient data analysis on such typically huge plausible using suffix trees. However, tree occupies much space, very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce space to, essentially, that compressed sequences, while...
We give new solutions to the Searchable Partial Sums with Indels problem. Given a sequence of n k -bit numbers, we present structure taking kn + o ( ) bits space, able performing operations sum , search insert and delete all in O (log worst-case time, for any = ). This extends previous results by Hon et al. [2003c] achieving same space /log log time complexities queries, yet offering that are amortized worse than ours, supported only (1). Our result matches an existing lower bound large...
Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms), depending on the individual, tissue cell is in, or in response to some stimuli. Recent RNA-Seq technology allows for new high-throughput ways isoform identification quantification based short reads, various methods have been put forward this non-trivial problem.In paper we propose novel radically method minimum-cost network flows. This has two-fold advantage: one hand, it...
Abstract Motivation: Assembling genomes from short read data has become increasingly popular, but the problem remains computationally challenging especially for larger genomes. We study scaffolding phase of sequence assembly where preassembled contigs are ordered based on mate pair data. Results: present MIP Scaffolder that divides into smaller subproblems and solves these with mixed integer programming. The can be represented as a graph biconnected components this solved independently....
Abstract Many disciplines, from human genetics and oncology to plant breeding, microbiology virology, commonly face the challenge of analyzing rapidly increasing numbers genomes. In case Homo sapiens , number sequenced genomes will approach hundreds thousands in next few years. Simply scaling up established bioinformatics pipelines not be sufficient for leveraging full potential such rich genomic datasets. Instead, novel, qualitatively different computational methods paradigms are needed. We...
Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs genome assemblies, such as string and de Bruijn graphs, pan-genome hence genetic variation present population. Being able align sequencing reads is key step for many analyses its applications assembly, read error correction variant calling with respect graph.We generalize two linear...
Sweetpotato (Ipomoea batatas) plants become infected with over 30 RNA or DNA viruses in different parts of the world but little is known about infecting sweetpotato crops Central America, center domestication. Small-RNA deep-sequencing (SRDS) analysis was used to detect Honduras and Guatemala, which detected Sweet potato feathery mottle virus strain RC C (Potyvirus spp.), chlorotic stunt WA (SPCSV-WA; Crinivirus sp.), leaf curl Georgia (Begomovirus pakakuy B (synonym: badnavirus B). Results...
Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses dependent on read alignments. However, error rate length of reads create new challenges for accurately aligning them, particularly around small exons.We present an alignment method uLTRA long based a two-pass collinear chaining algorithm. We show that produces higher accuracy over state-of-the-art aligners with substantially exons simulated synthetic...
Exact string matching in labeled graphs is the problem of searching paths a graph G=(V, E) such that concatenation their node labels equal to given pattern P [1. m ]. This basic can be found at heart more complex operations on variation computational biology, query databases, and analysis heterogeneous networks. We prove conditional lower bound stating that, for any constant ε > 0, an O (| E | 1 - ) time, or )time algorithm exact graphs, with drawn from binary alphabet, cannot achieved...