OpenFold, an artificial intelligence (AI) research consortium, has announced the release of two new tools that aim to improve protein structure prediction: SoloSeq and OpenFold-Multimer. These open-source tools represent important advances in integrating protein language models with structure prediction software, as well as modeling protein complexes more accurately.
SoloSeq, developed on Amazon Web Services (AWS), combines a protein large language model (LLM) with structure prediction. It is described as the first fully open-source model in this category, and it ships with its training code. Organizations can use this code to fine-tune the model or train new models on their proprietary data, a step towards more collaborative progress in the field.
SoloSeq Integrates Large Language Model with OpenFold
The first tool, SoloSeq, combines OpenFold’s existing protein structure prediction capabilities with a new protein large language model (LLM). This integration eliminates the need for a separate pre-processing step to generate a multiple sequence alignment (MSA) that is required by OpenFold and other structure predictors like AlphaFold.
Having been trained on most known protein sequences, SoloSeq’s LLM can summarize evolutionary information very rapidly rather than having to query sequence databases to find homologs. This makes the overall prediction over 10x faster on average than standard OpenFold, with nearly identical accuracy.
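To make the source of the speedup concrete, the short Python sketch below models one prediction as an input-preparation stage followed by a folding stage, and compares an MSA-based pipeline with a single-sequence pipeline. The function names and all timings are hypothetical and chosen only for illustration; they are not measurements of OpenFold or SoloSeq.

```python
def pipeline_seconds(featurize_s: float, fold_s: float) -> float:
    """Total wall-clock time for one prediction: input preparation plus folding."""
    return featurize_s + fold_s


def speedup(msa_search_s: float, embed_s: float, fold_s: float) -> float:
    """Ratio of MSA-based pipeline time to single-sequence (LLM embedding) pipeline time."""
    return pipeline_seconds(msa_search_s, fold_s) / pipeline_seconds(embed_s, fold_s)


# Hypothetical timings in seconds (assumptions, not benchmark results).
example = speedup(msa_search_s=600.0, embed_s=5.0, fold_s=55.0)
print(f"speedup: {example:.1f}x")
```

Under this simple model, the gain depends almost entirely on how much of the total time the database search consumed: if the search is short relative to folding, replacing it changes little.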
Advantages of SoloSeq’s Integrated LLM Architecture
Speed and Efficiency:
The integrated LLM architecture of SoloSeq confers several advantages. Firstly, it makes predictions over 4x faster than OpenFold and AlphaFold, albeit with slightly lower accuracy. This speed makes it very useful for large-scale structural screens where throughput is critical.
Handles Designed Proteins:
Additionally, SoloSeq can take artificially designed proteins as input, such as those generated by ProteinMPNN and RFdiffusion. Alignment-based predictors like AlphaFold, which rely on evolutionary information, do not handle these non-natural proteins well.
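The toy Python sketch below illustrates why a designed protein can leave an alignment-based predictor with little to work with: a search for similar sequences may simply return nothing. It is a deliberately simplified stand-in, assuming equal-length, pre-aligned sequences and a plain identity threshold; real homology search tools use far more sophisticated methods. The sequences, the `count_homologs` helper, and the 0.3 threshold are all hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy comparison assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_homologs(query: str, database: list[str], min_identity: float = 0.3) -> int:
    """Count database entries whose identity to the query meets the threshold."""
    return sum(identity(query, s) >= min_identity for s in database)


# Hypothetical toy data: short made-up sequences, not real proteins.
database = ["MKTAYIAKQR", "MKTAYLAKQR", "MRTAYIAKQK", "MKSAYIAKQR"]
natural_like = "MKTAYIGKQR"   # resembles the database entries
designed = "WPCHHDENNF"       # shares no positions with any entry

print(count_homologs(natural_like, database))  # 4
print(count_homologs(designed, database))      # 0
```

With no homologs, an alignment would contain only the query itself, which is the situation a single-sequence model is built to handle.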
Enables Template-Based Modeling:
SoloSeq is the first protein LLM to enable template-based modeling in addition to ab initio structure prediction. This allows it to leverage structural information from homologous proteins to improve accuracy.
Fully Open Source:
Unlike previous protein LLMs from Meta AI and DeepMind, SoloSeq’s code, weights, and even training code are fully open-source. This enables organizations to fine-tune the models on proprietary data or create entirely new models.
Robust Energy Function Approximation:
Experiments show that SoloSeq has learned an extremely robust approximation of the implicit protein energy function used to fold structures. This physical accuracy helps it generate high-quality predictions.
OpenFold-Multimer: Improved Modeling of Protein Complexes
The second tool announced, OpenFold-Multimer, focuses specifically on better modeling of homomeric and heteromeric protein complexes. It is the first open-source complex modeling toolkit that includes training code.
Building on DeepMind’s earlier AlphaFold-Multimer research, the new OpenFold code can create more accurate multimer models through end-to-end retraining focused on protein-protein interactions. Users can also fine-tune the model on their own proprietary complex structure data.
This capability to train customized OpenFold variant models makes OpenFold-Multimer uniquely valuable for predicting protein assemblies. Accurate modeling of complexes could have major impacts on structural biology, biopharma, and enzyme engineering.
Commitment to Open Science to Drive Innovation
Both tools exemplify OpenFold’s commitment to open science – releasing not just prediction code but also training code. This enables the broader life sciences community to continuously improve protein structure modeling rather than be limited by closed-source solutions.
Industry leaders welcomed OpenFold-Multimer and SoloSeq as catalyzing high-speed BioAI innovation and revolutionary developments in fields relying on accurate protein structure prediction. The open availability grants pharmaceutical, agricultural, and synthetic biology companies equal opportunities to utilize and customize state-of-the-art AI.
Conclusion
The protein structure prediction field has been transformed in the last few years, with AlphaFold achieving astounding progress. However, key opportunities remain to improve prediction speed, handle designed proteins, leverage homology information, and model complexes.
OpenFold’s latest open-source tools make strides on these fronts. Moreover, the consortium’s commitment to releasing training code ensures models can be tailored rather than treated as black boxes. SoloSeq and OpenFold-Multimer represent the next phase in democratizing access to transformative AI capabilities so all of science can unlock protein structures faster.
Story source: Reference Paper
Dr. Tamanna Anwar is a Scientist and Co-founder of the Centre of Bioinformatics Research and Technology (CBIRT). She is a passionate bioinformatics scientist and a visionary entrepreneur. Dr. Tamanna has worked as a Young Scientist at Jawaharlal Nehru University, New Delhi. She has also worked as a Postdoctoral Fellow at the University of Saskatchewan, Canada. She has several scientific research publications in high-impact research journals. Her latest endeavor is the development of a platform that acts as a one-stop solution for all bioinformatics related information as well as developing a bioinformatics news portal to report cutting-edge bioinformatics breakthroughs.