Optimized sample selection for cost-efficient long-read population sequencing
1000 Genomes Project
Exome
DOI: 10.1101/gr.264879.120
Publication Date: 2021-04-02T21:05:58Z
AUTHORS (9)
ABSTRACT
An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), from which a few individuals are resequenced using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, the study of human variation has historically focused on individuals with European ancestry, but this represents a small fraction of the overall diversity. Addressing this, SVCollector identifies the optimal subset of individuals for resequencing by analyzing population-level VCF files from low-resolution genotyping studies. It then computes a ranked list of samples that maximizes the total number of variants present within a subset of a given size. To solve this optimization problem, SVCollector implements a fast, greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector to simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3000 Rice Genomes Project, and show that the rankings it computes are more representative than alternative naive strategies. When selecting an optimal subset of 100 samples in these cohorts, SVCollector identifies individuals from every subpopulation, whereas naive methods yield an unbalanced selection. Finally, we show that the number of variants present in cohorts selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
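The greedy heuristic mentioned in the abstract can be illustrated with a minimal sketch of greedy maximum coverage: at each step, pick the sample that contributes the most variants not yet covered by earlier picks. This is not SVCollector's code or interface. The function name `greedy_select`, the dictionary input, and the toy sample and variant identifiers are all assumptions made for illustration; a real run would start from genotypes parsed out of a population-level VCF file.

```python
def greedy_select(sample_variants, k):
    """Rank up to k samples by greedy maximum coverage.

    sample_variants: dict mapping a sample name to the set of variant IDs
    it carries (hypothetical input format, not SVCollector's).
    Returns (ranking, total_covered), where ranking is a list of
    (sample, newly_covered_count) pairs in selection order.
    """
    covered = set()                      # variants captured so far
    remaining = dict(sample_variants)    # samples not yet selected
    ranking = []
    for _ in range(min(k, len(remaining))):
        # Choose the sample adding the most unseen variants.
        # Iterating in sorted order makes ties resolve by sample name.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = len(remaining[best] - covered)
        ranking.append((best, gain))
        covered |= remaining.pop(best)
    return ranking, len(covered)


# Toy cohort: four samples and five variants (made-up identifiers).
toy = {
    "S1": {"v1", "v2", "v3"},
    "S2": {"v3", "v4"},
    "S3": {"v1", "v2"},
    "S4": {"v5"},
}

ranking, total = greedy_select(toy, 3)
print(ranking)  # [('S1', 3), ('S2', 1), ('S4', 1)]
print(total)    # 5
```

In this toy cohort, S3 is never chosen because all of its variants are already carried by S1. A naive ranking by raw variant count would have placed S3 ahead of S4 and missed v5. The greedy result is a heuristic and is not guaranteed to be optimal in general, which is presumably why the abstract also mentions an exact integer linear programming formulation.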
REFERENCES (26)
CITATIONS (6)