NFDI4DS | UHH-SEMS - Publication Details

Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters

Identification

DOI: 10.1093/bioinformatics/btaf135
Publication Date: 2025-04-12T23:58:11Z

AUTHORS (5)

Marcelo González

Roberto E Durán

Michael Seeger

Mauricio Araya

Nicolás Jara

ABSTRACT

Advances in bacterial promoter predictors based on ML have greatly improved identification metrics. However, existing models overlooked the impact of negative datasets, previously identified in GC-content discrepancies between positive and negative datasets in single-species models. This study aims to investigate whether multiple-species models for promoter classification are inherently biased due to the selection criteria of negative datasets. We further explore whether the generation of synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias. Multiple-species predictors exhibited GC-content bias when using CDS as the negative dataset, suggested by specificity and sensitivity metrics in a species-specific manner, and investigated by dimensionality reduction. We demonstrated a reduction in this bias by employing the SRS dataset, with less detection of background noise in real genomic data. In both scenarios DNABERT showed the best metrics. These findings suggest that GC-balanced datasets can enhance the generalizability of promoter predictors across Bacteria. The source code of the experiments is freely available at https://github.com/maigonzalezh/MultispeciesPromoterClassifier.
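
One way to picture the synthetic random sequences (SRS) idea mentioned in the abstract is a negative set in which each sequence has the same length and the same number of G/C bases as a promoter from the positive set. The sketch below is an illustration under that assumption only; it is not the authors' implementation, which lives in the linked repository. The function names and the two example sequences are hypothetical.

```python
import random


def gc_content(seq: str) -> float:
    """Return the fraction of G or C bases in a DNA sequence."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def synthetic_random_sequence(template: str, rng: random.Random) -> str:
    """Build a random sequence with the same length and G/C count as `template`.

    Assumption for this sketch: matching the GC count per sequence is a
    reasonable stand-in for mimicking the GC-content distribution of promoters.
    """
    n_gc = sum(base in "GC" for base in template.upper())
    n_at = len(template) - n_gc
    # Draw the required number of strong (G/C) and weak (A/T) bases at random.
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(n_at)]
    # Shuffle so that base positions carry no signal from the template.
    rng.shuffle(bases)
    return "".join(bases)


def make_negative_set(promoters: list[str], seed: int = 0) -> list[str]:
    """Generate one GC-matched synthetic sequence per promoter."""
    rng = random.Random(seed)
    return [synthetic_random_sequence(p, rng) for p in promoters]


if __name__ == "__main__":
    # Hypothetical example sequences, not taken from the study's data.
    example_promoters = [
        "TTGACAATTAATCATCGGCTCGTATAATGTGTGGA",
        "GCGCGCATATGCGC",
    ]
    for promoter, negative in zip(example_promoters, make_negative_set(example_promoters)):
        print(f"{gc_content(promoter):.3f} {gc_content(negative):.3f} {negative}")
```

Because each synthetic sequence copies the G/C count of its template, a classifier trained on such a negative set cannot separate the classes by GC-content alone, which is the kind of bias the abstract attributes to CDS-based negatives.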

REFERENCES (32)

CITATIONS (0)
