Protein design is central to personalized medicine and drug discovery, and generative models for protein structures are proving especially useful in these areas. Traditional design methods, such as Rosetta, work by solving an optimization problem for the sequence that minimizes an energy function. These methods are often effective, but they are computationally demanding and limited by the accuracy of the underlying energy functions. Deep learning offers an alternative and has already shown enormous potential for capturing the complex relationships between protein sequences and their functions. Researchers from Profluent Bio, Berkeley, USA, introduce ProseLM, a deep learning method that addresses the limitations of general protein language models by conditioning them on structural knowledge, including non-protein context, while benefiting from the scaling trends of the underlying language models.

Why Are Protein Language Models So Powerful?

Deep learning models, specifically protein language models, have made significant strides toward deciphering the complicated language of proteins. For example, the ProGen2 models learn solely from protein sequences, using self-supervised objectives such as predicting amino acids from their surrounding sequence context. In doing so, they capture essential structural and functional properties of proteins, making them precious tools for protein design. Their major limitation, however, is that they rely heavily on known protein examples for fine-tuning, so their applicability is largely restricted to well-characterized protein families and functions. They cannot think outside the box; they can only learn from examples.
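As a concrete (and heavily simplified) illustration of this kind of self-supervised training, the sketch below trains a toy causal protein language model to predict each residue from the ones before it. The model, vocabulary layout, and sizes are invented for this post; they bear no relation to ProGen2's actual architecture.

```python
import torch
import torch.nn as nn

VOCAB = 22  # 20 amino acids plus two special tokens (assumed layout)

class TinyProteinLM(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # Causal mask: position i may only attend to positions <= i.
        L = tokens.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        return self.head(self.encoder(self.embed(tokens), mask=mask))

model = TinyProteinLM()
seqs = torch.randint(0, VOCAB, (8, 128))  # dummy batch of tokenized proteins
logits = model(seqs[:, :-1])              # predict the next residue everywhere
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seqs[:, 1:].reshape(-1))
loss.backward()  # the supervision signal comes from the sequences themselves
```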

ProseLM: A New Frontier

To move past these limitations, the researchers developed ProseLM, a new method for protein sequence design that adapts protein language models to incorporate structural and functional context. By exploiting language-model scaling and conditioning on non-protein context, including nucleic acids, ligands, and ions, ProseLM improves recovery of native residues by 4–5% at all model scales. The gains are even larger for residues that bind directly to non-protein elements, where the most capable ProseLM models reach recovery levels above 70%.

How It Works: Blending Structure and Function

ProseLM is built on the ProGen2 language models but incorporates structural and functional information through parameter-efficient adaptation. At the core of the model are conditional adapter layers that inject structural context into the language model's embeddings. Adapters are bottleneck operations added to every layer of the model; they lightly perturb each layer's output to fold in conditioning information without significantly increasing computational cost, as the sketch below illustrates.
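Here is a minimal sketch of a conditional bottleneck adapter in PyTorch. The class name, dimensions, and exact wiring are assumptions made for illustration, not taken from the ProseLM codebase.

```python
import torch
import torch.nn as nn

class ConditionalAdapter(nn.Module):
    """Illustrative bottleneck adapter: down-project the frozen layer's
    output, mix in per-residue conditioning features, project back up,
    and add the result residually so the base model is only lightly nudged."""
    def __init__(self, d_model=1024, d_bottleneck=64, d_context=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.cond = nn.Linear(d_context, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden, context):
        # hidden:  (batch, length, d_model)   output of a frozen LM layer
        # context: (batch, length, d_context) structural/functional features
        z = torch.relu(self.down(hidden) + self.cond(context))
        return hidden + self.up(z)

adapter = ConditionalAdapter()
hidden = torch.randn(2, 100, 1024)  # dummy language-model activations
context = torch.randn(2, 100, 128)  # dummy structural context
out = adapter(hidden, context)      # same shape as hidden, gently steered
```

Zero-initializing the up-projection means the adapted network initially reproduces the frozen base model exactly, and because only the small adapter is trained, the added parameter and compute cost stays modest.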

A pre-trained causal encoder provides the structural context, capturing inter-residue features and structural spans. Its design, which includes message-passing and invariant-point message-passing layers that condition on later residues, makes it well suited to design tasks where parts of the sequence remain fixed.
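As a rough illustration of the message-passing component (the invariant-point layers are omitted here, and all names and shapes are invented for this sketch), a single update over a k-nearest-neighbour residue graph could look like this:

```python
import torch
import torch.nn as nn

class ResidueMessagePassing(nn.Module):
    """One illustrative message-passing step over a residue graph: each
    residue gathers messages from its K nearest neighbours and updates
    its features residually."""
    def __init__(self, d_node=128, d_edge=64):
        super().__init__()
        self.msg = nn.Sequential(
            nn.Linear(2 * d_node + d_edge, d_node), nn.ReLU(),
            nn.Linear(d_node, d_node))

    def forward(self, nodes, edges, neighbors):
        # nodes:     (L, d_node)    per-residue features
        # edges:     (L, K, d_edge) features of each residue's K edges
        # neighbors: (L, K)         indices of the K nearest residues
        nbr = nodes[neighbors]                           # (L, K, d_node)
        src = nodes.unsqueeze(1).expand_as(nbr)          # (L, K, d_node)
        m = self.msg(torch.cat([src, nbr, edges], -1))   # per-edge messages
        return nodes + m.mean(dim=1)                     # aggregate, update

layer = ResidueMessagePassing()
nodes = torch.randn(100, 128)                 # 100 residues
edges = torch.randn(100, 16, 64)              # 16 neighbours per residue
neighbors = torch.randint(0, 100, (100, 16))
updated = layer(nodes, edges, neighbors)      # (100, 128)
```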

Results: Advancing Protein Design

ProseLM's performance was measured using several metrics, including perplexity, a measure of how well the model predicts the native sequence (lower is better), and native sequence recovery; both are sketched in code below. In each case, significant gains over prior methods were observed, with perplexity decreasing monotonically as more structural or functional context was provided. Notably, ProseLM-XL improved the median recovery rate by 3.59% over the causal encoder, demonstrating the method's effectiveness at correctly predicting native residues.
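For readers unfamiliar with these metrics, the snippet below shows how both can be computed. The tensors here are random stand-ins, not ProseLM outputs.

```python
import torch

def perplexity(native_log_probs: torch.Tensor) -> float:
    # Exponentiated average negative log-likelihood of the native residues;
    # lower means the model finds the native sequence less surprising.
    return torch.exp(-native_log_probs.mean()).item()

def native_recovery(designed: torch.Tensor, native: torch.Tensor) -> float:
    # Fraction of designed positions that match the native amino acid.
    return (designed == native).float().mean().item()

log_probs = torch.log(torch.rand(150).clamp(min=1e-3))  # dummy log-probs
designed = torch.randint(0, 20, (150,))                 # dummy design
native = torch.randint(0, 20, (150,))                   # dummy native sequence
print(f"perplexity: {perplexity(log_probs):.2f}")
print(f"recovery:   {native_recovery(designed, native):.1%}")
```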

Real-World Applications: Genome Editors and Antibodies

The researchers used ProseLM to optimize genome editors for activity in human cells, boosting base editing efficiency by up to 50%. They also redesigned therapeutic antibodies with ProseLM, arriving at a PD-1 binder with an impressive 2.2 nM affinity. These results demonstrate ProseLM's potential in real-world applications, from enhancing genome editing tools to developing high-affinity therapeutic antibodies.

Discussion: The Future of Protein Design

Compared with traditional approaches to this problem, ProseLM benefits directly from a formulation that makes decisions jointly at the amino acid level and the protein level, allowing it to capture finer details of protein interactions and functionality. And because it conditions on the protein's full structural and functional context rather than on sequence alone, it can capture broader relational nuances and dependencies, yielding a more elaborate and accurate protein representation.

GOProteinGNN, another recently introduced protein language model, also leverages graph neural network (GNN)-based knowledge injection for protein representation learning. The two differ in focus: ProseLM enhances protein sequence design by folding structural and functional information into a protein language model, while GOProteinGNN combines protein language models with protein knowledge graphs to improve protein representation learning. In GOProteinGNN, knowledge-graph information is injected directly into the protein embeddings during the encoder stage, so the resulting representations pair sequential insights with graph-enhanced knowledge through full, contextually enriched encoding.

Conclusion:

ProseLM is a big step forward in protein design: it effectively embeds structural and functional context into protein language models. The resulting protein representations are far more robust and contextual, modeling complex relationships and dependencies that traditional methods often miss. ProseLM outperformed the top prior models across a range of tasks, making it a leading approach to context-aware protein design!

While this technology is still under active development, it has the power to revolutionize biotechnology and medicine through the creation of proteins designed for virtually any desired need. ProseLM adds another piece to the rapidly growing protein engineering puzzle, bringing us closer to completing the picture and opening up new avenues for research and innovation. Indeed, the future of protein design is shining bright, and with tools such as ProseLM, the possibilities may be endless. We hope researchers continue to bring out more such protein modeling architectures!

Article Source: Reference Paper | Original code and trained models are available on GitHub.

Important Note: bioRxiv releases preprints that have not yet undergone peer review. These papers should therefore not be treated as conclusive evidence, used to direct clinical practice, or relied upon to influence health-related behavior; the information they present is not yet considered established or confirmed.


Neermita

Neermita Bhattacharya is a consulting Scientific Content Writing Intern at CBIRT. She is pursuing a B.Tech in computer science at IIT Jodhpur. She has a niche interest in the amalgamation of biological concepts and computer science and wishes to pursue higher studies in related fields. Her hobbies include swimming, ballet, playing the violin, guitar, and ukulele, singing, drawing and painting, reading novels, playing indie video games, and writing short stories. She is excited to delve deeper into bioinformatics, genetics, and computational biology, and possibly help the world through research!
