Current genomic deep learning architectures generalize across grass species but not alleles

DOI: 10.1101/2024.04.11.589024
Publication Date: 2024-04-14T01:43:08Z
ABSTRACT
Non-coding regions of the genome are just as important as coding regions for understanding the mapping from genotype to phenotype. Interpreting deep learning models trained on RNA-seq is an emerging method to highlight functional sites within non-coding regions. Most work on RNA abundance has been done in humans and mice, with little attention paid to plants. Here, we benchmark four genomic deep learning model architectures on genomes and RNA-seq data from 18 species closely related to maize and sorghum in the Andropogoneae. The Andropogoneae are a tribe of C4 grasses that have adapted to a wide range of environments worldwide since diverging millions of years ago. Hundreds of millions of years of evolution across these species have produced a large, diverse pool of training alleles sharing a common physiology. As input, we extracted the 1,026 base pairs upstream of each gene’s translation start site. We held out our test set and two species as our validation set, and trained each architecture on the remaining genomes. Within a panel of 26 lines, all architectures predict expression across genes moderately well but poorly across alleles. DanQ consistently ranked highest or second highest among the architectures, yet performance was generally very similar across architectures despite orders-of-magnitude differences in model size. This suggests that state-of-the-art supervised genomic deep learning models are able to generalize across species but not to sensitively separate alleles within species, the latter of which agrees with recent work in humans. We are releasing the preprocessed data and code for this work so that the community can evaluate new architectures on our across-species and across-allele tasks.
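To make the input definition concrete, the following is a minimal sketch of how a fixed-length window upstream of a translation start site might be extracted in a strand-aware way. It is not the authors' pipeline. The function names (`upstream_window`, `reverse_complement`), the 0-based coordinate convention, and the choice to pad truncated windows with `N` are all assumptions made for illustration; only the window length of 1,026 base pairs comes from the abstract.

```python
# Illustrative sketch only: names, coordinate convention, and N-padding are
# assumptions, not details taken from the paper.

UPSTREAM_BP = 1026  # window length stated in the abstract

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(chrom_seq: str, start_pos: int, strand: str,
                    length: int = UPSTREAM_BP, pad: str = "N") -> str:
    """Return `length` bases immediately upstream of a translation start site.

    Assumed convention: `chrom_seq` is the forward strand and `start_pos` is
    the 0-based forward-strand index of the first base of the start codon as
    read along the gene's own strand. For '+' genes that is the lowest
    coordinate of the codon; for '-' genes it is the highest.
    The result is always reported 5'->3' on the gene's strand.
    """
    if strand == "+":
        window = chrom_seq[max(start_pos - length, 0):start_pos]
        # If the chromosome start truncates the window, pad the 5' end.
        return window.rjust(length, pad)
    if strand == "-":
        window = chrom_seq[start_pos + 1:start_pos + 1 + length]
        # Pad the forward-strand right edge, which becomes the 5' end
        # after reverse-complementing.
        return reverse_complement(window.ljust(length, pad))
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


if __name__ == "__main__":
    # Toy forward-strand sequence with an ATG beginning at index 6.
    print(upstream_window("AAACCCATGGG", 6, "+", length=4))
    # Toy minus-strand gene: its start codon spans forward indices 2..4.
    print(upstream_window("CCCATTTGG", 4, "-", length=4))
```

Padding keeps every input the same length, which fixed-input architectures generally require. Whether genes close to a contig edge should be padded or dropped is a design choice this sketch does not settle.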