Researchers at Argonne National Laboratory, together with partners from six universities, NVIDIA, and Cerebras Inc., have created GenSLMs, large language models (LLMs) that learn the evolutionary landscape of the SARS-CoV-2 genome. GenSLMs was awarded the 2022 Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research.
Rapid whole-genome sequencing has enabled the tracking of emerging variants of viruses such as SARS-CoV-2. Despite its slow mutation rate, SARS-CoV-2 has produced numerous variant strains with distinct mutation patterns over the past few years, many exhibiting novel characteristics such as increased antigenicity, transmissibility, and fitness. The Centers for Disease Control and Prevention (CDC) in the United States categorizes SARS-CoV-2 variants into four classifications:
1) Variants being monitored (VBM)
2) Variants of interest (VOI)
3) Variants of concern (VOC)
4) Variants of high consequence (VOHC)
The Alpha, Beta, Delta, and Omicron variants are the major SARS-CoV-2 variants.
Artificial intelligence (AI) and machine learning (ML) have repeatedly shown the potential to revolutionize pandemic surveillance in real time. Instead of waiting for new variants to emerge before identifying VOCs, AI/ML approaches can use deep sequencing data to detect mutations in viral proteins and characterize evolutionary patterns that can help anticipate future VOCs.
The authors, Zvyagin et al., trained the largest biological LLMs to date, using codon tokenization on a diverse set of 110 million bacterial gene sequences. These are the first foundation models trained on raw nucleotide sequences, and they significantly improve predictive accuracy when identifying VOCs.
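Codon tokenization means the model reads nucleotide sequences in non-overlapping 3-base units, the same units the ribosome reads, giving a 64-token vocabulary instead of 4 single bases. A minimal sketch (illustrative only, not the authors' code; `codon_tokenize` and `VOCAB` are hypothetical names):

```python
# Illustrative sketch of codon-level tokenization: split a nucleotide
# sequence into 3-base codons, then map each codon to an integer id.
from itertools import product


def codon_tokenize(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping 3-base codon tokens."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


# The full codon vocabulary: 4^3 = 64 tokens over the alphabet {A, C, G, T}.
VOCAB = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}

tokens = codon_tokenize("ATGGCGTTTAAA")
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['ATG', 'GCG', 'TTT', 'AAA']
print(len(VOCAB))  # 64
```

In practice the GenSLMs vocabulary also includes special tokens (padding, sequence boundaries); the sketch shows only the codon core.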
To capture the correct context and longer-range interactions in genome-scale data, the authors propose and validate a novel hierarchical transformer-based model that combines Generative Pre-trained Transformers (GPT) on individual gene sequences with a diffusion-based model over the whole genome. Using its generative capabilities, the researchers applied this model to predict the evolution of SARS-CoV-2.
They demonstrated the training of foundation models on both traditional (GPU-based) systems and emerging AI accelerator hardware, achieving strong time-to-solution. Furthermore, their scaling benchmarks show that while training GenSLMs is computationally intensive, performance is sustained throughout the training session.
Evolution Demystified By A Cryptic Code
According to Venkatram Vishwanath, a co-author of the paper and data science lead at the Argonne Leadership Computing Facility (ALCF), large language models are critical to achieving the vision of AI for science across multiple scientific fields.
The researchers used sophisticated supercomputing resources such as Polaris, a Hewlett Packard Enterprise system, and the Cerebras CS-2 AI platform at the ALCF. This study also made use of NVIDIA's Selene supercomputer. Polaris and Selene are powerful GPU-accelerated (graphics processing unit) supercomputers, whereas the CS-2 system is unique: the ALCF AI Testbed's CS-2 accelerator is purpose-built for deep-learning workloads.
Prior studies have shown that protein language models can track evolution and generate wholly new proteins with novel structures and functions. However, the corresponding author, Ramanathan, notes that this research was the first attempt to run an LLM-based model at the gene level rather than at the previously explored protein level.
In the cell, genes are first transcribed into messenger RNA (mRNA). The mRNA then exits the nucleus and reaches the ribosome, where it is translated into protein. While previous work demonstrated language models capable of describing evolutionary changes at the protein level, the scientists needed to dive deeper, to the gene level, to discover VOCs.
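The transcription-and-translation pipeline above can be sketched in a few lines. This is a hedged illustration using a small excerpt of the standard genetic code (the function names are hypothetical, not from the paper); `*` marks a stop codon:

```python
# Sketch of the central dogma: DNA -> mRNA (transcription), then
# mRNA codons -> amino acids (translation). Excerpt of the standard
# genetic code only; unknown codons map to 'X'.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GCU": "A",
    "GGC": "G", "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon-by-codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' = codon not in excerpt
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGTTTGGCTAA")))  # MFG
```

Many distinct DNA sequences translate to the same protein, which is exactly why a model operating on the gene level sees changes that a protein-level model cannot.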
High-quality data was therefore critical in this study, since the models needed reliable information about VOCs. The Bacterial and Viral Bioinformatics Resource Center web resources and the Houston Hospital System supplied the integrated data and analysis tools needed to support the investigation. To better understand the virus, the researchers studied 1.5 million high-quality SARS-CoV-2 whole-genome sequences obtained from the resource center and the Houston hospital. It's worth noting that nucleotide sequences carry a considerably larger vocabulary than protein sequences alone.
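The vocabulary point can be made concrete: codon tokens span 64 symbols versus the 20 standard amino acids, and synonymous (silent) mutations change a codon without changing the protein, so they are invisible to protein-level models. A small sketch using standard genetic-code facts (illustrative only; `is_synonymous` is a hypothetical helper):

```python
# Synonymous mutations: all four of these codons encode alanine ('A'),
# so a mutation among them leaves the protein unchanged while still
# carrying signal at the nucleotide level.
ALANINE_CODONS = {"GCU", "GCC", "GCA", "GCG"}


def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons encode the same amino acid (alanine, in this excerpt)."""
    return codon_a in ALANINE_CODONS and codon_b in ALANINE_CODONS


# A GCU -> GCA mutation is silent at the protein level.
print(is_synonymous("GCU", "GCA"))  # True

codon_vocab, protein_vocab = 4 ** 3, 20
print(codon_vocab, protein_vocab)  # 64 20
```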
Ramanathan and his team have demonstrated that these models can advance research. He believes their research could pave the way for future pandemic surveillance and further argues that this discovery could lead to protein engineering applications or the modeling of entire organisms.
The authors note that, as with natural language models, issues of noise and bias in the data remain unsolved, and they invite the community to help design appropriate test harnesses for extensively evaluating GenSLM-like models.
An immediate application would be to combine GenSLMs with protein structure prediction tools such as AlphaFold/OpenFold and faster protein-folding methods to model the immune escape and fitness that affect the virus's capacity to adapt to its host. Likewise, incorporating experimental and biophysical data from antibody assays, molecular docking, and other quantitative methods into the GenSLM workflow could help direct the training of these models to focus on potential future variants of concern.
Shwetha is a consulting scientific content writing intern at CBIRT. She has completed her Master’s in biotechnology at the Indian Institute of Technology, Hyderabad, with nearly two years of research experience in cellular biology and cell signaling. She is passionate about science communication, cancer biology, and everything that strikes her curiosity!