Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

DOI: 10.48550/arxiv.2201.02229 Publication Date: 2022-01-01
ABSTRACT
Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with post-translational modifications (PTMs) in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filter ~5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, which highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
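The filtering idea in the abstract (keep a prediction only when the ensemble's average confidence is high and its members agree) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the population standard deviation as the "variation" measure, and the thresholds `min_conf` and `max_spread` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def ensemble_predict(member_probs):
    """Combine per-member class probabilities into ensemble predictions.

    member_probs[m][i][c] is the probability that ensemble member m
    assigns to class c for example i (hypothetical input layout).
    Returns one (label, avg_confidence, spread) tuple per example.
    """
    n_examples = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    results = []
    for i in range(n_examples):
        # Average each class probability over all ensemble members.
        avg = [mean([member[i][c] for member in member_probs])
               for c in range(n_classes)]
        label = max(range(n_classes), key=lambda c: avg[c])
        # Spread of the winning class's probability across members
        # (assumed here to be the population standard deviation).
        spread = pstdev([member[i][label] for member in member_probs])
        results.append((label, avg[label], spread))
    return results


def select_high_quality(results, min_conf=0.9, max_spread=0.05):
    """Return indices of predictions with high confidence and low variation.

    The default thresholds are illustrative, not values from the paper.
    """
    return [i for i, (_, conf, spread) in enumerate(results)
            if conf >= min_conf and spread <= max_spread]


# Three hypothetical ensemble members, two examples, two classes.
member_probs = [
    [[0.95, 0.05], [0.90, 0.10]],
    [[0.97, 0.03], [0.30, 0.70]],
    [[0.96, 0.04], [0.95, 0.05]],
]
results = ensemble_predict(member_probs)
kept = select_high_quality(results)
print(kept)
```

In this toy input the members agree closely on the first example and disagree on the second, so only the first passes both thresholds. Raising `min_conf` or lowering `max_spread` trades recall for precision, which is the kind of tuning the abstract describes.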