CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models

Sequence (biology) Protein sequencing Code (set theory)
DOI: 10.1093/bioinformatics/btad029 Publication Date: 2023-01-17T22:40:51Z
ABSTRACT
CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set.

The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned.

The code is available at https://github.com/vam-sin/CATHe, and the datasets can be accessed at https://zenodo.org/record/6327572.

Supplementary data are available at Bioinformatics online.
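The abstract restricts the Pfam annotations to highly reliable predictions with an expected error rate below 0.5%, but it does not say how that cut-off was derived. One common way to obtain such a filter is to choose a confidence threshold on held-out predictions whose correctness is known. The sketch below illustrates that idea only. The function names, the toy data and the thresholding rule are all assumptions made for illustration and are not taken from the CATHe code.

```python
from typing import List, Tuple


def pick_confidence_threshold(
    scored: List[Tuple[float, bool]],
    max_error_rate: float = 0.005,
) -> float:
    """Return the lowest confidence cut-off whose accepted predictions
    have an observed error rate <= max_error_rate on held-out data.

    `scored` holds (confidence, is_correct) pairs (hypothetical format).
    Returns float('inf') if no cut-off meets the target.
    """
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    best = float("inf")
    errors = 0
    i, n = 0, len(ranked)
    while i < n:
        # Consume all predictions sharing this confidence value together,
        # so the cut-off "confidence >= value" is well defined.
        j = i
        while j < n and ranked[j][0] == ranked[i][0]:
            if not ranked[j][1]:
                errors += 1
            j += 1
        # j predictions are accepted at this cut-off; `errors` of them are wrong.
        if errors / j <= max_error_rate:
            best = ranked[i][0]
        i = j
    return best


def keep_reliable(
    predictions: List[Tuple[str, str, float]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep (domain, superfamily) pairs whose confidence reaches the threshold."""
    return [(dom, sf) for dom, sf, conf in predictions if conf >= threshold]


# Toy held-out set (made-up numbers): confidence and whether the call was right.
held_out = (
    [(0.99, True)] * 300
    + [(0.95, True)] * 100
    + [(0.95, False)] * 1
    + [(0.80, True)] * 45
    + [(0.80, False)] * 5
)
threshold = pick_confidence_threshold(held_out, max_error_rate=0.005)

# Hypothetical new predictions: (domain id, predicted superfamily, confidence).
new_predictions = [
    ("domain_A", "superfamily_1", 0.99),
    ("domain_B", "superfamily_2", 0.90),
    ("domain_C", "superfamily_1", 0.95),
]
reliable = keep_reliable(new_predictions, threshold)
print(threshold, reliable)
```

The observed error rate is not guaranteed to grow steadily as the cut-off is lowered, so the sketch scans every distinct confidence value and keeps the lowest one that still meets the target, which accepts as many predictions as possible.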