CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models
DOI:
10.1093/bioinformatics/btad029
Publication Date:
2023-01-17T22:40:51Z
AUTHORS (9)
ABSTRACT
CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. The CATHe models trained on the 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. The code is available at https://github.com/vam-sin/CATHe, and the datasets can be accessed at https://zenodo.org/record/6327572. Supplementary data are available at Bioinformatics online.
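The general idea described in the abstract (a neural network classifying embeddings from a protein language model, with only highly reliable predictions retained) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the CATHe code base: the mean pooling of per-residue vectors, the single hidden layer, the random weights, the placeholder superfamily labels and the 0.9 probability cut-off are all hypothetical. In particular, the cut-off is an arbitrary stand-in and is not the threshold that yields the expected error rate of <0.5% reported above.

```python
import math
import random


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length vector per domain (assumed pooling)."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def dense(x, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(embedding, params):
    """One ReLU hidden layer followed by a softmax over superfamilies."""
    hidden = [max(0.0, h) for h in dense(embedding, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


def confident_assignment(probs, labels, threshold):
    """Return the top label only if its probability reaches the threshold, else None."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None


def random_params(in_dim, hidden_dim, n_classes, rng):
    """Untrained random weights, standing in for a trained model."""
    def layer(n_out, n_in):
        return [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return {
        "w1": layer(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": layer(n_classes, hidden_dim), "b2": [0.0] * n_classes,
    }


rng = random.Random(0)
labels = ["superfamily_A", "superfamily_B", "superfamily_C"]  # hypothetical placeholders
params = random_params(in_dim=8, hidden_dim=16, n_classes=len(labels), rng=rng)

# Fake per-residue embeddings for a 120-residue domain (a real model would supply these).
residues = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(120)]
embedding = mean_pool(residues)
probs = predict(embedding, params)
assignment = confident_assignment(probs, labels, threshold=0.9)  # 0.9 is arbitrary
```

Returning `None` for low-confidence domains mirrors the abstract's choice to annotate only domains with highly reliable predictions, trading coverage for a lower expected error rate.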