Tobias Marschall
- Genomics and Phylogenetic Studies
- Chromosomal and Genetic Variations
- Genomics and Rare Diseases
- Genomic variations and chromosomal abnormalities
- RNA and protein synthesis mechanisms
- Algorithms and Data Compression
- Gene expression and cancer classification
- Cancer Genomics and Diagnostics
- DNA and Biological Computing
- Machine Learning in Bioinformatics
- Single-cell and spatial transcriptomics
- Genetic Mapping and Diversity in Plants and Animals
- Genetic Associations and Epidemiology
- Genomics and Chromatin Dynamics
- Molecular Biology Techniques and Applications
- CRISPR and Genetic Engineering
- semigroups and automata theory
- Genetics, Bioinformatics, and Biomedical Research
- Genetic diversity and population structure
- Bioinformatics and Genomic Networks
- Genome Rearrangement Algorithms
- Evolution and Genetic Dynamics
- Cell Image Analysis Techniques
- RNA Research and Splicing
- Epigenetics and DNA Methylation
Heinrich Heine University Düsseldorf
2020-2025
Düsseldorf University Hospital
2023-2025
Max Planck Institute for Informatics
2015-2021
Saarland University
2015-2020
Centrum Wiskunde & Informatica
2010-2019
Institute of Bioinformatics
2019
Helsinki Institute for Information Technology
2015
Bielefeld University
2006-2015
Max Planck Society
2013-2015
Brown University
2013
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be...
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome, and 58 of the inversions intersect...
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short...
Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not...
De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid assembly, we present Shasta, a de novo assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION sequencer and our toolkit, we assembled 11 highly contiguous genomes in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced...
The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes...
Read-based phasing makes it possible to reconstruct the haplotypes of a sample purely from sequencing reads. While phasing is an important step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, there has been a lack of accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore...
Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline, NanoSV. We demonstrate that long reads are superior to short reads with regard to the detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine...
Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate...
Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved...
Typical genotyping workflows map reads to a reference genome before identifying genetic variants. Generating such alignments introduces reference biases and comes with substantial computational burden. Furthermore, short-read lengths limit the ability to characterize repetitive genomic regions, which are particularly challenging for fast k-mer-based genotypers. In the present study, we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing...
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to...
Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant...
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome. To address these limitations, the Human Pangenome...
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38...
Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous...
The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications. Although the resolution of these regions in the first complete assembly of a human genome, the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13), provided a model of their homology, it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain...