Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Coronavirus; 2019-20 coronavirus outbreak; False positive rate
DOI: 10.1101/2020.02.03.932350 Publication Date: 2020-02-05T03:35:10Z
ABSTRACT
As of February 20, 2020, the 2019 novel coronavirus (renamed to COVID-19) had spread to 30 countries, with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which resulted, between November 2002 and July 2003, in 8098 confirmed cases worldwide, with a 9.6% death rate and 774 deaths. Though COVID-19 has a death rate of 2.8% as of 20 February, the 75752 confirmed cases in a few weeks (December 8, 2019 to February 20, 2020) are alarming, with cases likely being under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of the taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
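To make the idea of alignment-free, signature-based classification more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a k-mer frequency vector as a stand-in for the genomic signature, Spearman's rank correlation as the similarity measure, and a 1-nearest-neighbour rule in place of the supervised learners the abstract mentions. The digital signal processing step is not shown. All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_signature(seq, k=3):
    """Frequency vector over all 4**k DNA k-mers (assumed stand-in for a genomic signature)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing symbols other than A, C, G, T are ignored
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def classify(query, labelled, k=3):
    """Return the label of the training sequence whose signature correlates best with the query."""
    q = kmer_signature(query, k)
    best = max(labelled, key=lambda item: spearman(q, kmer_signature(item[0], k)))
    return best[1]


# Toy data, purely illustrative (not real viral genomes)
training = [("AT" * 30, "AT-rich"), ("GC" * 30, "GC-rich")]
label = classify("ATATATATTATATATA", training, k=2)
```

No sequence alignment is performed at any point: each genome is reduced to a fixed-length numeric vector, so comparing two genomes costs the same regardless of how they would align. In practice a library routine such as `scipy.stats.spearmanr` could replace the hand-written rank correlation.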