LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer

Ensemble Learning
DOI: 10.1371/journal.pone.0204371 Publication Date: 2018-11-02T17:46:27Z
ABSTRACT
Although modern methods of whole-genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics because of their high cost and complexity and the large amount of sample required for analysis. Therefore, it is crucial to be able to identify a relatively small number of methylation sites that provide sufficient precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. The developed tool utilizes different selection methods to find an optimal, minimum necessary combination of factors, using cross-entropy loss metrics (LogLoss) to identify a subset of methylation sites. We show that the algorithm can work effectively with different methylation patterns using ensemble-based machine learning methods. Algorithm efficiency and robustness were evaluated on five genome-wide methylation datasets (totaling 626 samples); each dataset was classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94–0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (from 2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma, and colorectal carcinoma. For prostate adenocarcinoma, we showed that the identified differential variability methylation sites distinguish a cluster of samples with a higher recurrence rate (hazard ratio = 0.48, 95% CI: 0.05–0.92; log-rank test, p-value < 0.03). We also identified several clusters of correlated, interchangeable methylation sites that can be used for the elaboration of a biological interpretation of the resulting models and for further selection of the sites most suitable for designing diagnostic panels. LogLoss-BERAF is implemented as standalone Python code; the open-source code is freely available at https://github.com/bioinformatics-IBCH/logloss-beraf along with the models described in this article.
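The general idea of selecting a minimal feature subset by cross-entropy loss can be illustrated with a short sketch. This is not the LogLoss-BERAF implementation (see the repository above for that); it is a plain greedy backward elimination around a scikit-learn random forest, scored by cross-validated LogLoss. The function names `cv_logloss` and `backward_eliminate`, the `tolerance` parameter, and the synthetic data standing in for methylation values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cv_logloss(X, y, features, seed=0):
    """Cross-validated LogLoss of a random forest restricted to `features`."""
    model = RandomForestClassifier(n_estimators=30, random_state=seed)
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    # Out-of-fold class probabilities, so the loss is not computed on training data
    proba = cross_val_predict(model, X[:, features], y, cv=folds,
                              method="predict_proba")
    return log_loss(y, proba, labels=[0, 1])


def backward_eliminate(X, y, tolerance=0.01, seed=0):
    """Hypothetical helper: greedily drop one feature at a time while the
    cross-validated LogLoss stays within `tolerance` of the best loss seen."""
    selected = list(range(X.shape[1]))
    current = cv_logloss(X, y, selected, seed)
    best = current
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature
        trials = [
            (cv_logloss(X, y, [f for f in selected if f != drop], seed), drop)
            for drop in selected
        ]
        loss, drop = min(trials)
        if loss > best + tolerance:
            break  # every possible removal hurts too much; stop here
        selected.remove(drop)
        current = loss
        best = min(best, loss)
    return selected, current


# Synthetic stand-in for a samples-by-sites matrix with tumor / non-tumor labels
X, y = make_classification(n_samples=120, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
selected, final_loss = backward_eliminate(X, y)
print("selected feature indices:", selected)
print("cross-validated LogLoss:", round(final_loss, 3))
```

LogLoss is used as the stopping criterion here because it penalizes confident wrong probabilities, which accuracy alone does not capture. A larger `tolerance` yields smaller panels at the cost of a slightly higher loss.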