Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT
DOI: 10.48550/arxiv.2201.02229
Publication Date: 2022-01-01
AUTHORS (5)
ABSTRACT
Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with post-translational modifications (PTMs) in biological knowledge databases such as IntAct. Annotation is mainly performed through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and to extract high confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, and tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), which we filter down to ~5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, which highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
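The filtering idea in the abstract (keep a prediction only when the ensemble's average confidence is high and the members vary little) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the population standard deviation as the measure of variation, and the thresholds `min_conf` and `max_spread` are all assumptions chosen for illustration. The `f1` helper simply applies the standard harmonic-mean formula to a precision and recall pair.

```python
from statistics import mean, pstdev


def ensemble_confidence(member_probs):
    """Summarise one example's predictions across ensemble members.

    member_probs: list with one entry per ensemble member, each entry
    being that member's class-probability list for the example.
    Returns (predicted_label, average_confidence, spread).
    """
    n_classes = len(member_probs[0])
    # Average each class probability over the ensemble members.
    avg = [mean([m[c] for m in member_probs]) for c in range(n_classes)]
    # The ensemble prediction is the class with the highest average.
    label = max(range(n_classes), key=lambda c: avg[c])
    # Variation: how much members disagree on the predicted class
    # (population standard deviation is an assumption of this sketch).
    spread = pstdev([m[label] for m in member_probs])
    return label, avg[label], spread


def is_high_quality(member_probs, min_conf=0.9, max_spread=0.05):
    """Keep a prediction only if confidence is high AND variation is low.

    The default thresholds are hypothetical, not values from the paper.
    """
    _, conf, spread = ensemble_confidence(member_probs)
    return conf >= min_conf and spread <= max_spread


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Three members that agree closely: kept.
agree = [[0.05, 0.95], [0.04, 0.96], [0.06, 0.94]]
# High average confidence, but one member disagrees noticeably: dropped.
disagree = [[0.0, 1.0], [0.0, 1.0], [0.2, 0.8]]

print(is_high_quality(agree), is_high_quality(disagree))
print(f1(58.1, 32.1))
```

The second example shows why variation is checked in addition to the average: its mean confidence passes the threshold, but the disagreement between members removes it. Applying `f1` to the rounded P = 58.1 and R = 32.1 gives roughly 41.35, in line with the reported F1-micro of 41.3.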