Samantha Zarate
- Genomics and Phylogenetic Studies
- Genomic variations and chromosomal abnormalities
- Genetics, Bioinformatics, and Biomedical Research
- Chromosomal and Genetic Variations
- Genomics and Rare Diseases
- Evolutionary Algorithms and Applications
- Cancer Genomics and Diagnostics
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Gene expression and cancer classification
- Molecular Biology Techniques and Applications
- Scientific Computing and Data Management
- Particle physics theoretical and experimental studies
- Genomics and Chromatin Dynamics
- Genetic Associations and Epidemiology
- Genetic and Clinical Aspects of Sex Determination and Chromosomal Abnormalities
- Machine Learning in Bioinformatics
- Evolution and Genetic Dynamics
- Prenatal Screening and Diagnostics
- Renal Diseases and Glomerulopathies
- Blood Coagulation and Thrombosis Mechanisms
- Hemophilia Treatment and Research
- Atrial Fibrillation Management and Outcomes
- Systemic Lupus Erythematosus Research
- Biomedical Text Mining and Ontologies
Regeneron (United States)
2023-2025
Johns Hopkins University
2020-2023
DNAnexus (United States)
2018-2021
Federico Santa María Technical University
2015-2016
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be...
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary...
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to...
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling methods. Here we use accurate linked and long reads to expand the benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels), and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38...
Rare coding variants that substantially affect function provide insights into the biology of a gene 1–3. However, ascertaining the frequency of such variants requires large sample sizes 4–8. Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more...
Background: Structural variants (SVs) are critical contributors to genetic diversity and genomic disease. To predict the phenotypic impact of SVs, there is a need for better estimates of both the occurrence and frequency of SVs, preferably from large, ethnically diverse cohorts. Thus, the current standard approach requires the use of short paired-end reads, which remain challenging to detect, especially at the scale of hundreds to thousands of samples. Findings: We present Parliament2, a consensus SV framework that leverages...
Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a genome benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle sample HG002. We assemble a single contig for each...
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 Mbp of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome to clinical and functional study. Here we demonstrate how the new reference universally improves read mapping and variant calling for 3,202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of novel variants per sample, a new frontier for evolutionary and biomedical discovery. Simultaneously,...
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure including long palindromes, tandem repeats, and segmental duplications 1–3. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished 4, 5. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029 base pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, revealing ampliconic...
Atrial fibrillation (AF) has a substantial genetic component. The importance of polygenic risk is well established, while the contribution of rare variants to disease warrants characterization in large cohorts.
Genome in a Bottle (GIAB) benchmarks have been widely used to help validate clinical sequencing pipelines and develop new variant calling methods. Here, we use accurate linked reads and long reads to expand the prior benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16% new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). For HG002, we include 92%...
Background: Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging. Results: In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution...
The EventIndex is the complete catalogue of all ATLAS events, keeping the references to all files that contain a given event in any processing stage. It replaces the TAG database, which had been in use during LHC Run 1. For each event it contains its identifiers, the trigger pattern and the GUIDs of the files containing it. Major use cases are event picking, feeding the Event Service used on some production sites, and technical checks of the completion and consistency of processing campaigns. The system design is highly modular so that its components (data collection system, storage system based...
Here we present Parliament2 – a structural variant caller which combines multiple best-in-class structural variant callers to create a highly accurate callset. This captures more events than the individual callers achieve independently. Parliament2 uses a call-overlap-genotype approach that is extensible to new methods and presents users the choice to run some or all of Breakdancer, Breakseq, CNVnator, Delly, Lumpy, and Manta. Parliament2 applies an additional parallelization framework to speed up certain callers and executes these in parallel, taking advantage...
Over the past 30 years, a community of scientists has pieced together every base pair of the human reference genome from telomere to telomere. Interestingly, most human genomics studies omit more than 5% of the genome from their analyses. Under "normal" circumstances, omitting any chromosome(s) from an analysis of the human genome would be cause for concern, with the exception being the sex chromosomes. Sex chromosomes in eutherians share an evolutionary origin as an ancestral pair of autosomes. In humans, they share 3 regions of high-sequence identity (∼98-100%), which, along...
Genome sequencing at population scale provides unprecedented access to the genetic foundations of human phenotypic diversity, but genotype-phenotype association analyses limited to small variants have failed to comprehensively characterize the genetic architecture of human health and disease because they ignore structural variants (SVs) known to contribute to phenotypic variation and pathogenic conditions 1–3. Here we demonstrate the significance of SVs when assessing genotype-phenotype associations and the importance of ethnic diversity in study design by analyzing SVs across...
The ATLAS EventIndex is a data catalogue system that stores event-related metadata for all (real and simulated) ATLAS events, on all processing stages. As it consists of different components that depend on other applications (such as distributed storage and different sources of information), we need to monitor the conditions of many heterogeneous subsystems to make sure everything is working correctly. This paper describes how we gather information about the EventIndex components and related subsystems: the Producer-Consumer architecture for data collection, health parameters...
Over the past 30 years, a community of scientists has pieced together every base pair of the human reference genome from telomere-to-telomere. Interestingly, most human genomics studies omit more than 5% of the genome from their analyses. Under 'normal' circumstances, omitting any chromosome(s) from an analysis of the human genome would be reason for concern, the exception being the sex chromosomes. Sex chromosomes in eutherians share an evolutionary origin as an ancestral pair of autosomes. In humans, they share three regions of high sequence identity (~98-100%),...
The ATLAS EventIndex System, developed for use in LHC Run 2, is designed to index every processed event in ATLAS, replacing the TAG System used in Run 1. Its storage infrastructure, based on the Hadoop open-source software framework, necessitates revamping how information in this system relates to other ATLAS systems. It will store more indexes since the fundamental mechanisms for retrieving these indexes will be better integrated into all stages of data processing, allowing more events from later stages of processing to be indexed than was possible with...
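The ATLAS EventIndex is the catalogue of the event-related metadata for the information collected from the ATLAS detector. The basic unit of this information is the event record, containing the event identification parameters, pointers to the files containing this event as well as trigger decision information. The main use case for the EventIndex is event picking, as well as data consistency checks for large production campaigns. The EventIndex employs the Hadoop platform for data storage and handling, as well as a messaging system for the collection of information, both at Tier-0, when the data are first produced, and from the Grid, when various types of derived data are produced. The EventIndex uses auxiliary information from other ATLAS sources...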
Researchers rely on the human reference genome as a baseline to identify genetic differences between individuals, which are crucial for understanding human physiology, disease, and evolution. In this study, we focused on the implications of the first-ever complete human genome, which improves the identification of human genetic variation and ushers in the beginning of a new era of human genetics.