A conditional protein diffusion model generates artificial programmable endonuclease sequences with enhanced activity

3102 Bioinformatics and Computational Biology; QH573-671; Machine Learning and Artificial Intelligence; Argonaute; Nucleose; Guide; Generic health relevance; 3101 Biochemistry and Cell Biology; Cytology; Article; 31 Biological Sciences; Biotechnology
DOI: 10.1101/2023.08.10.552783 Publication Date: 2023-08-14T18:25:15Z
ABSTRACT
Deep learning-based methods for generating functional proteins address the growing need for novel biocatalysts, allowing functionalities to be tailored precisely to specific requirements. These methods enable the creation of highly efficient and specialized proteins with wide-ranging applications in scientific, technological, and biomedical domains. This study establishes a pipeline for protein sequence generation with a conditional protein diffusion model, CPDiffusion, to deliver diverse protein sequences with enhanced functions. CPDiffusion accommodates protein-specific conditions, such as secondary structure and highly conserved amino acids (AAs). Without relying on extensive training data, CPDiffusion effectively captures highly conserved residues and sequence features for a specific protein family. We applied CPDiffusion to generate artificial sequences of Argonaute (Ago) proteins based on the backbone structures of wild-type (WT) Kurthia massiliensis Ago (KmAgo) and Pyrococcus furiosus Ago (PfAgo), which are complex multi-domain programmable endonucleases. The generated sequences deviate by up to nearly 400 AAs from their WT templates. Experimental tests demonstrated that the majority of generated proteins show unambiguous DNA cleavage activity for both KmAgo and PfAgo, with many of them exhibiting superior activity compared to the WT. These findings underscore CPDiffusion's remarkable success rate in generating, in a single step, novel sequences with enhanced activity for proteins of complex structure and function. This approach facilitates the design of enzymes with multi-domain molecular structures and intricate functions through in silico generation and screening, all accomplished without any supervision from labeled data.
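
To make the "deviate by up to nearly 400 AAs" figure concrete, the following minimal sketch shows one way such a deviation could be counted. It assumes the generated sequence has the same length as its WT template (plausible when generating on a fixed backbone, but an assumption here, not something the abstract states) and compares the two position by position. The function names and the toy sequences are hypothetical and are not taken from the paper or from any KmAgo or PfAgo sequence.

```python
def count_substitutions(wt: str, generated: str) -> int:
    """Count positions where two equal-length sequences differ.

    Assumption: no insertions or deletions, so no alignment is needed.
    """
    if len(wt) != len(generated):
        raise ValueError("sequences must have equal length for position-wise comparison")
    return sum(a != b for a, b in zip(wt, generated))


def sequence_identity(wt: str, generated: str) -> float:
    """Fraction of positions that are identical between the two sequences."""
    if not wt:
        raise ValueError("sequences must be non-empty")
    return 1.0 - count_substitutions(wt, generated) / len(wt)


# Toy 10-residue example (hypothetical sequences, for illustration only).
wt_toy = "MKTAYIAKQR"
gen_toy = "MKSAYLAKQR"

n_diff = count_substitutions(wt_toy, gen_toy)
identity = sequence_identity(wt_toy, gen_toy)
print(f"{n_diff} substitutions, identity = {identity:.2f}")
```

If generated and WT sequences differ in length, a pairwise alignment would be needed before counting differences; this sketch does not cover that case.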