- Algorithms and Data Compression
- Genomics and Phylogenetic Studies
- Semigroups and automata theory
- DNA and Biological Computing
- Natural Language Processing Techniques
- RNA and protein synthesis mechanisms
- Machine Learning in Bioinformatics
- Gene expression and cancer classification
- Network Packet Processing and Optimization
- Machine Learning and Algorithms
- Chromosomal and Genetic Variations
- Logic, programming, and type systems
- Microbial Natural Products and Biosynthesis
- Genetics, Bioinformatics, and Biomedical Research
- Alzheimer's disease research and treatments
- Bacterial Identification and Susceptibility Testing
- Advanced biosensing and bioanalysis techniques
- Antibiotic Resistance in Bacteria
- Fractal and DNA sequence analysis
- Biochemical and Structural Characterization
- Genome Rearrangement Algorithms
- Caching and Content Delivery
- Coding theory and cryptography
- Computability, Logic, AI Algorithms
- Genomics and Rare Diseases
Laboratoire d'Informatique Gaspard-Monge
2016-2025
Université Gustave Eiffel
2014-2025
Centre National de la Recherche Scientifique
2015-2025
Skolkovo Institute of Science and Technology
2018-2022
Paris-Est Sup
2015-2019
Université Paris Cité
2011-2019
Ben-Gurion University of the Negev
2011-2016
Institut national de recherche en informatique et en automatique
2000-2011
Centre de recherche Inria Lille - Nord Europe
2008-2011
Laboratoire d'Informatique Fondamentale de Lille
2006-2011
YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface ( http://www.loria.fr/projects/YASS/ ) is available to upload input sequences in fasta format, query the program and visualize the results obtained in several forms (dot-plot, tabular...
A repetition in a word w is a subword with the period of at most half of the subword length. We study maximal repetitions occurring in w, that is those for which any extended subword of w has a bigger period. The set of such repetitions represents in a compact way all repetitions in w. We first prove a combinatorial result asserting that the sum of exponents of all maximal repetitions of a word of length n is bounded by a linear function in n. This implies, in particular, that there is only a linear number of maximal repetitions in a word. This allows us to construct a linear-time algorithm for finding all maximal repetitions. Some consequences and applications of these results are discussed, as well...
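The definition of a maximal repetition above can be illustrated with a short brute-force sketch. This is not the linear-time algorithm the abstract refers to: it is a quadratic-time enumeration written only to make the definition concrete, and the function name `maximal_repetitions` is hypothetical.

```python
def maximal_repetitions(w):
    """Return sorted (start, end, period) triples, with end inclusive.

    Brute-force illustration (quadratic time), NOT the linear-time
    algorithm from the abstract. For each candidate period p, scan for
    maximal stretches where w[k] == w[k + p]; such a stretch gives a
    subword with period p that cannot be extended left or right.
    """
    n = len(w)
    runs = {}
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if w[k] != w[k + p]:
                k += 1
                continue
            start = k
            while k + p < n and w[k] == w[k + p]:
                k += 1
            end = k + p - 1  # last index of the subword with period p
            # Keep it only if the period is at most half the length.
            # Periods are tried in increasing order, so an interval seen
            # before was already recorded with its smallest period.
            if end - start + 1 >= 2 * p and (start, end) not in runs:
                runs[(start, end)] = p
    return sorted((s, e, p) for (s, e), p in runs.items())


print(maximal_repetitions("mississippi"))
```

On "mississippi" the sketch reports the three squares "ss", "ss", "pp" with period 1 and the subword "ississi" with period 3.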
Norine is the first database entirely dedicated to nonribosomal peptides (NRPs). In bacteria and fungi, in addition to the traditional ribosomal proteic biosynthesis, an alternative ribosome-independent pathway called NRP synthesis allows peptide production. It is performed by huge protein complexes called nonribosomal peptide synthetases (NRPSs). The molecules synthesized by NRPS contain a high proportion of nonproteogenic amino acids. The primary structure of these peptides is not always linear but often more complex and may contain cycles and branchings. In recent...
Nonribosomal peptides (NRPs) are molecules produced by microorganisms that have a broad spectrum of biological activities and pharmaceutical applications (e.g., antibiotic, immunomodulating, and antitumor activities). One particularity of the NRPs is the biodiversity of their monomers, extending far beyond the 20 proteogenic amino acid residues. Norine, a comprehensive database of NRPs, allowed us to review for the first time the main characteristics of the NRPs and especially their monomer biodiversity. Our analysis highlighted a significant...
Metagenomics is a powerful approach to study the genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement of classification accuracy as opposed to traditional contiguous k-mers. We support this thesis...
Surveillance of drug-resistant bacteria is essential for healthcare providers to deliver effective empirical antibiotic therapy. However, traditional molecular epidemiology does not typically occur on a timescale that could affect patient treatment and outcomes. Here, we present a method called ‘genomic neighbour typing’ for inferring the phenotype of a bacterial sample by identifying its closest relatives in a database of genomes with metadata. We show that this technique can infer susceptibility...
De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the data structure of Chikhi and Rizk (WABI'12) that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to their method, with insignificant impact on construction time. At the same time,...
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model) that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing...
De Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index,...
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves...
The hit criterion is a key component of heuristic local alignment algorithms. It specifies the class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion combining the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds that extend spaced seeds by the possibility of distinguishing transition and transversion mismatches. We
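To make the idea of a transition-constrained seed concrete, here is a minimal sketch of how such a seed might be matched against a gapless alignment of two DNA strings. The three-symbol notation ('#' for a required match, '@' for a match or a transition, '-' for a don't-care position) and the function name `seed_hits` are assumptions made for this illustration, not details taken from the abstract.

```python
# Transitions are the purine-purine (A<->G) and pyrimidine-pyrimidine
# (C<->T) substitutions; every other mismatch is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def position_ok(symbol, a, b):
    """Check one seed symbol against one aligned pair of letters."""
    if symbol == "#":          # exact match required
        return a == b
    if symbol == "@":          # match or transition tolerated
        return a == b or (a, b) in TRANSITIONS
    return True                # '-' accepts anything (assumed notation)


def seed_hits(seed, x, y):
    """Offsets where the seed matches the gapless alignment of x and y."""
    assert len(x) == len(y), "sketch assumes equal-length, gapless input"
    span = len(seed)
    return [
        i
        for i in range(len(x) - span + 1)
        if all(position_ok(s, x[i + j], y[i + j]) for j, s in enumerate(seed))
    ]


x = "ACGTACGT"
print(seed_hits("#@#", x, "ACATACGT"))  # G->A at index 2 is a transition
print(seed_hits("#@#", x, "ACCTACGT"))  # G->C at index 2 is a transversion
```

In this toy example the seed "#@#" gains one extra hit (offset 1) when the mismatch is a transition, and none when it is a transversion.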
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
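The notion of lossless filtration with a seed family can be sketched by brute force. The code below is an illustration only, not one of the algorithms from the paper: it assumes the usual formulation in which a family is lossless for parameters (m, k) when every gapless alignment of length m with k mismatches is hit by at least one seed, and it enumerates all mismatch placements, which is feasible only for small m. The seed notation ('#' match, '-' don't care) and the names `detects` and `is_lossless` are hypothetical.

```python
from itertools import combinations


def detects(seed, mismatches, m):
    """True if the seed fits at some offset with no '#' on a mismatch."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + r not in mismatches for r in required)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(seeds, m, k):
    """Brute-force check (assumes k <= m): does the family catch every
    placement of k mismatches in an alignment of length m?
    Fewer than k mismatches are never harder, so exactly k suffices."""
    for positions in combinations(range(m), k):
        mismatches = set(positions)
        if not any(detects(s, mismatches, m) for s in seeds):
            return False
    return True


print(is_lossless(["###"], 5, 1))          # single contiguous seed
print(is_lossless(["##-#"], 5, 1))         # single spaced seed
print(is_lossless(["###", "##-#"], 5, 1))  # the two seeds used together
```

For m = 5 and k = 1, neither seed is lossless on its own, but the two-seed family is, which is a small instance of the benefit of using several seeds simultaneously.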
Although modern high-throughput biomolecular technologies produce various types of data, biosequence data remain at the core of bioinformatic analyses. However, computational techniques for dealing with this data evolved dramatically. In this bird's-eye review, we overview the evolution of main algorithmic techniques for comparing and searching biological sequences. We highlight key algorithmic ideas emerged in response to several interconnected factors: shifts of analytical paradigm, advent of new sequencing technologies, and a substantial increase in size...
Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally...