Researchers led by Alex Rives at Meta (formerly known as Facebook) have created an extensive protein database that reveals the structures of hundreds of millions of metagenomic proteins. DeepMind’s AlphaFold released predictions for some 220 million protein structures from DNA databases in 2022; this study goes further, predicting about 617 million protein structures, most of them from proteins whose origin is unknown and which have never been characterized.
Metagenomics studies microbes in their natural habitats, including the intricate microbial communities in which they often live. It is therefore an effective way to understand how proteins, and the microorganisms that carry them, have adapted through evolution. Proteins are found throughout the natural world, from the soil to the depths of the oceans, so identifying novel proteins and their structures is crucial. By revealing billions of previously unknown protein sequences, metagenomics is beginning to show the astounding breadth and diversity of these proteins, which could accelerate the discovery of proteins for real-world applications.
Metagenomic proteins are the structures we know the least about, says Alexander Rives, the research lead of Meta AI’s protein team, who believes these proteins have the potential to provide great insight into biology.
According to the preprint posted on bioRxiv, the Meta AI team has shown how a large language model can speed up high-resolution structure prediction from the primary amino acid sequence alone. The model was trained on the sequences of known proteins, each written as a chain of letters in which every letter stands for one of the 20 amino acids. The network then learned to “autocomplete” proteins, and the team reports that it produces atomic-resolution 3D structures up to 60 times faster than existing methods without a major loss of resolution or accuracy.
Learning the Language of Amino Acids
Determining a protein’s structure is critical to understanding its biological function. Laboratory methods for doing so, such as X-ray crystallography and cryo-electron microscopy, are laborious and expensive, and they do not work for every protein. The search space is also enormous: a protein just 100 amino acids long, built from 20 different amino acids, can have any of 20^100 (roughly 10^130) possible sequences, each of which may fold differently. Some proteins adopt ordered 3D structures while others remain disordered, which makes the problem highly taxing even for computers.
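The figure quoted above is 20 raised to the 100th power. To get a feel for its size, here is a minimal Python sketch; the variable names are illustrative only and do not come from the study.

```python
import math

ALPHABET_SIZE = 20   # the 20 standard amino acids
CHAIN_LENGTH = 100   # residues in the example protein

# Python integers have arbitrary precision, so this value is exact.
num_sequences = ALPHABET_SIZE ** CHAIN_LENGTH

# Express the size as a power of ten for readability.
order_of_magnitude = CHAIN_LENGTH * math.log10(ALPHABET_SIZE)

print(f"20^100 has {len(str(num_sequences))} digits")       # 131 digits
print(f"20^100 is about 10^{order_of_magnitude:.1f}")       # about 10^130.1
```

For comparison, that is far more possibilities than any laboratory or computer could ever examine one at a time, which is why methods that learn patterns from known sequences are attractive.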
But there are patterns hidden in protein sequences. Evolution must pick amino acids that fit together in the folded structure, so by examining these patterns we can often deduce information about a protein’s structure. Back in 2019, the Meta AI team analyzed millions of protein sequences using a form of self-supervised learning called masked language modeling, in which parts of a sequence are hidden and the model learns to “autocomplete” the missing residues from the evolutionary patterns in the rest of the sequence. In 2020, ESM1b, Meta AI’s protein language model, helped the scientific community study the evolution of SARS-CoV-2 variants. And in 2022, Meta AI upgraded ESM1b to ESM2, which it describes as the largest language model of proteins to date.
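The masking step behind this kind of training can be sketched in a few lines of Python. This is only an illustration of how a training example might be prepared, not Meta AI’s implementation: the `mask_sequence` helper, the `#` mask symbol, the 15% mask fraction, and the example sequence are all assumptions made for this sketch.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
MASK = "#"                            # placeholder symbol chosen for this sketch

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked_sequence, targets).

    targets maps each hidden position to the residue a model would be
    asked to recover from the surrounding context.
    """
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]  # remember the true residue
        residues[pos] = MASK          # hide it from the model
    return "".join(residues), targets

example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
masked, targets = mask_sequence(example)
print(masked)
print(targets)
```

A real protein language model would then be trained to predict the hidden residues in `targets` from the visible ones; the sketch stops at preparing the input and the answers.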
Using ESMFold, the team predicted the structures of more than 617 million proteins in just two weeks. AlphaFold, in contrast, can take minutes to generate a single prediction, although the structures predicted by ESMFold are not as accurate as those predicted by AlphaFold.
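As a rough check on that scale, the short Python sketch below converts the two figures above into an average rate. It assumes exactly 14 days of continuous running and gives the aggregate rate across whatever hardware was used, which this article does not specify.

```python
STRUCTURES = 617_000_000        # predictions reported above
DAYS = 14                       # "two weeks", assumed to mean exactly 14 days
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = DAYS * SECONDS_PER_DAY
per_second = STRUCTURES / total_seconds   # aggregate rate, not per machine
per_day = STRUCTURES / DAYS

print(f"{per_second:.0f} structures per second")          # 510 structures per second
print(f"{per_day / 1e6:.1f} million structures per day")  # 44.1 million structures per day
```

At minutes per structure on a single machine, the same job would take far longer, which is the practical point of the speed comparison.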
Cheaper, Quicker, and Easier – Meta AI Solves it All
Modern techniques for predicting structure must scan extensive protein databases to find related sequences, because they need a large number of evolutionarily related sequences as input in order to extract the patterns associated with structure. The language model instead picks up these evolutionary patterns during its training on protein sequences, which enables high-resolution three-dimensional structure prediction directly from a single protein sequence.
Burkhard Rost, a computational biologist at the Technical University of Munich, is impressed by the speed and accuracy of Meta AI’s model, but its lower accuracy compared with AlphaFold is still considered a limitation.
Final Thoughts
The twenty-letter language of proteins, which stems from the four-letter language of genes, makes up the language called ‘Life.’ This language is more than a billion years old, yet humans have still not fully understood it, and its mysteries are waiting to be decoded.
Ever since AI entered biology, it has helped us comprehend the vast expanse of natural variation. Even the most sophisticated computing tools have so far been unable to fully decode the language of proteins, but AI has the potential to help us understand it. ESMFold is promising because it gives us new tools for comprehending the natural world.
DeepMind currently has no plans to include metagenomic structure predictions in its database, although AlphaFold representatives say the possibility has not been ruled out.
Prof. Martin Steinegger of Seoul National University, who combines machine learning and big data technologies to understand microbial populations, recently used AlphaFold to predict the structures of 30 million metagenomic proteins. Steinegger believes that with the release of Meta AI’s atlas, the investigation of these metagenomic structures will soon explode. Alex Rives and his team anticipate that this comprehensive structural atlas and their fast protein folding models will advance science and deepen our comprehension of the natural world.
Article Sources: Reference Paper | Reference Article | Reference Article | ESM Metagenomic Atlas
Shwetha is a consulting scientific content writing intern at CBIRT. She has completed her Master’s in biotechnology at the Indian Institute of Technology, Hyderabad, with nearly two years of research experience in cellular biology and cell signaling. She is passionate about science communication, cancer biology, and everything that strikes her curiosity!