Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters
Identification
DOI: 10.1093/bioinformatics/btaf135
Publication Date: 2025-04-12T23:58:11Z
AUTHORS (5)
ABSTRACT
Advances in bacterial promoter predictors based on ML have greatly improved identification metrics. However, existing models overlooked the impact of negative datasets, previously identified in GC-content discrepancies between positive and negative datasets in single-species models. This study aims to investigate whether multiple-species models for promoter classification are inherently biased due to the selection criteria of negative datasets. We further explore whether the generation of synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias. Multiple-species predictors exhibited GC-content bias when using CDS as the negative dataset, suggested by specificity and sensitivity metrics in a species-specific manner, and investigated by dimensionality reduction. We demonstrated a reduction in this bias by employing the SRS dataset, with less detection of background noise in real genomic data. In both scenarios DNABERT showed the best metrics. These findings suggest that GC-balanced datasets can enhance the generalizability of promoter predictors across Bacteria. The source code of the experiments is freely available at https://github.com/maigonzalezh/MultispeciesPromoterClassifier.
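The idea of a GC-balanced negative dataset can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it assumes one simple scheme in which each synthetic random sequence matches the length and expected GC content of one promoter. The function names and the example sequences are hypothetical.

```python
import random


def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def synthetic_random_sequence(template: str, rng: random.Random) -> str:
    """Draw a random sequence with the template's length and expected GC content.

    Assumption: bases are sampled independently; G and C share the template's
    GC fraction equally, and A and T share the remainder equally.
    """
    gc = gc_content(template)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=len(template)))


def build_srs_negatives(promoters, seed=0):
    """Build one synthetic negative sequence per promoter (hypothetical helper)."""
    rng = random.Random(seed)
    return [synthetic_random_sequence(p, rng) for p in promoters]


# Made-up example sequences, for illustration only.
promoters = [
    "TTGACAATTAATCATCGGCTCGTATAATGTGTGGA",
    "GGCGCGCCTTGACGGCGCGCTATAATGCGCGCCGG",
]
negatives = build_srs_negatives(promoters)
```

Because bases are sampled independently, the GC content of each synthetic sequence matches its template only in expectation; a stricter variant could shuffle the template's own bases to match it exactly.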