- Machine Learning in Bioinformatics
- Genomics and Phylogenetic Studies
- Protein Structure and Dynamics
- Bioinformatics and Genomic Networks
- RNA and protein synthesis mechanisms
- Enzyme Structure and Function
- Microbial Metabolic Engineering and Bioproduction
- Advanced Proteomics Techniques and Applications
- SARS-CoV-2 and COVID-19 Research
- Genetics, Bioinformatics, and Biomedical Research
- Biomedical Text Mining and Ontologies
- Microbial Natural Products and Biosynthesis
- Cell Image Analysis Techniques
- Influenza Virus Research Studies
- Scientific Computing and Data Management
- Animal Virus Infections Studies
- Biofuel production and bioconversion
- Vaccines and immunoinformatics approaches
- Advanced Clustering Algorithms Research
- COVID-19 Clinical Research Studies
- Interferon and immune responses
- Ubiquitin and proteasome pathways
- RNA modifications and cancer
- Drug Transport and Resistance Mechanisms
- Lipid Membrane Structure and Behavior
University College London
2015-2024
Institute of Structural and Molecular Biology
2015-2024
European Bioinformatics Institute
2011-2012
MRC Laboratory of Molecular Biology
2012
University of Bristol
2012
University of Cambridge
2012
UCL Australia
2004
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature...
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as...
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation...
The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro,...
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it...
The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235 000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our 'current' putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable...
A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function....
CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an...
The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.1, http://www.cathdb.info). The resource comprises over 300 000 domain structures and over 53 million protein domains classified into 2737 homologous superfamilies, doubling the number of predicted protein domains in the previous version. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 100 000 additional protein domains. This article describes developments to the CATH-Gene3D resource over the last two years...
Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for...
CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification...
Summary: The MSAViewer is a quick and easy visualization and analysis JavaScript component for Multiple Sequence Alignment data of any size. Core features include interactive navigation through the alignment, application of popular color schemes, sorting, selecting and filtering. The MSAViewer is ‘web ready’: written entirely in JavaScript, compatible with modern web browsers and does not require any specialized software. The MSAViewer is part of the BioJS collection of components. Availability and Implementation: The MSAViewer is released as open source software under...
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates...
Deep-learning (DL) methods like DeepMind’s AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of...
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21...
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43 229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616 470 domain sequences classified into 23 876 sequence families. This results in the significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for...
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34 287 domain structures classified into 1383 superfamilies and 3285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH-protein family database (CATH-PFDB), contains a total of 310 000 domain sequences classified into 26 812 sequence families. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular...
The latest version of CATH (class, architecture, topology, homology) (version 3.2), released in July 2008 (http://www.cathdb.info), contains 114,215 domains, 2178 homologous superfamilies and 1110 fold groups. We have assigned 20,330 new domains, 87 new homologous superfamilies and 26 new folds since CATH release version 3.1. A total of 28,064 new domains have been assigned since our NAR 2007 database publication (CATH version 3.0). The CATH website has been completely redesigned and includes more comprehensive documentation. We have revisited the CATH architecture level as part of the development of a 'Protein...
This article provides an update of the latest data and developments within the CATH protein structure classification database (http://www.cathdb.info). The resource provides two levels of release: CATH-B, a daily snapshot of the latest structural domain boundaries and superfamily assignments, and CATH+, which adds layers of derived data, such as predicted sequence domains, functional annotations and functional clustering (known as Functional Families or FunFams). The most recent CATH+ release (version 4.2) provides a huge update in the coverage of structural data. This release increases the number of fully-...
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of globular domain annotations for millions of available protein sequences. Gene3D has previously featured in the Database issue of NAR and here we report a significant update to the Gene3D database. The current release, Gene3D v16, has significantly expanded its domain coverage over the previous version and now contains over 95 million domain assignments. We also report a new method for dealing with complex domain architectures that exist in Gene3D, arising from discontinuous domains. Amongst other updates, we have added...
SARS-CoV-2 has a zoonotic origin and was transmitted to humans via an undetermined intermediate host, leading to infections in humans and other mammals. To enter host cells, the viral spike protein (S-protein) binds to its receptor, ACE2, and is then processed by TMPRSS2. Whilst receptor binding contributes to the viral host range, S-protein:ACE2 complexes from other animals have not been investigated widely. To predict infection risks, we modelled S-protein:ACE2 complexes from 215 vertebrate species, calculated changes in the energy of the complex caused by mutations in each...
Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. FunFHMMer...
The Protein Data Bank in Europe-Knowledge Base (PDBe-KB, https://pdbe-kb.org) is a community-driven, collaborative resource for literature-derived, manually curated and computationally predicted structural and functional annotations of macromolecular structure data, contained in the Protein Data Bank (PDB). The goal of PDBe-KB is two-fold: (i) to increase the visibility and reduce the fragmentation of annotations contributed by specialist data resources, and to make these data more findable, accessible, interoperable and reusable (FAIR) and (ii) to place macromolecular structure data in their...
VarSite is a web server mapping known disease-associated variants from UniProt and ClinVar, together with natural variants from gnomAD, onto protein 3D structures in the Protein Data Bank. The analyses are primarily image-based and provide both an overview for each human protein, as well as a report for any specific variant of interest. The information can be useful in assessing whether a given variant might be pathogenic or benign. The structural annotations for each position in the protein include protein secondary structure, interactions with ligand, metal, DNA/RNA, other...
CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset...