Researchers at Argonne National Laboratory, together with partners from six universities, NVIDIA, and Cerebras Inc., have created GenSLMs, large language models (LLMs) that learn the evolutionary landscape of the SARS-CoV-2 genome. GenSLMs was awarded the 2022 Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research.
Rapid whole-genome sequencing has enabled the tracking of emerging variants of viruses such as SARS-CoV-2. Despite its slow mutation rate, SARS-CoV-2 has produced numerous variant strains with distinct mutation patterns over the past few years, many exhibiting novel characteristics such as increased antigenicity, transmissibility, and fitness. The Centers for Disease Control and Prevention (CDC) in the United States categorizes SARS-CoV-2 variants into four classifications:
1) Variants being monitored (VBM)
2) Variants of interest (VOI)
3) Variants of concern (VOC)
4) Variants of high consequence (VOHC)
The Alpha, Beta, Delta, and Omicron variants are the major SARS-CoV-2 variants.
Artificial intelligence (AI) and machine learning (ML) have repeatedly shown the potential to revolutionize pandemic surveillance in real time. Instead of waiting for new variants to emerge before identifying VOCs, AI/ML approaches can use deep sequencing data to detect mutations in viral proteins and characterize evolutionary patterns that can help anticipate future VOCs.
The authors, Zvyagin et al., trained the largest biological LLMs to date, using codon tokenization on a diverse set of 110 million bacterial gene sequences. These are the first foundation models trained on raw nucleotide sequences, and they significantly improve predictive accuracy when identifying VOCs.
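Codon tokenization means the model reads nucleotide sequences in non-overlapping 3-base units, the same units the ribosome reads, giving a 64-token vocabulary instead of 4 single bases. A minimal sketch (illustrative only, not the authors' code; `codon_tokenize` and `VOCAB` are hypothetical names):

```python
# Illustrative sketch of codon-level tokenization: split a nucleotide
# sequence into 3-base codons, then map each codon to an integer id.
from itertools import product


def codon_tokenize(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping 3-base codon tokens."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


# The full codon vocabulary: 4^3 = 64 tokens over the alphabet {A, C, G, T}.
VOCAB = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}

tokens = codon_tokenize("ATGGCGTTTAAA")
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['ATG', 'GCG', 'TTT', 'AAA']
print(len(VOCAB))  # 64
```

In practice the GenSLMs vocabulary also includes special tokens (padding, sequence boundaries); the sketch shows only the codon core.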
To capture the correct context and longer-range interactions in genome-scale data, the authors propose and validate a novel hierarchical transformer-based model that combines Generative Pre-trained Transformers (GPT) on individual gene sequences with a diffusion-based model over the whole genome. Using its generative capabilities, the researchers applied this model to predict the evolution of SARS-CoV-2.
They demonstrated the training of foundation models on both traditional (GPU-based) systems and emerging AI accelerator hardware, achieving strong time-to-solution. Furthermore, their scaling benchmarks show that while training GenSLMs is computationally intensive, performance is sustained throughout the training session.
Evolution Demystified By A Cryptic Code
According to Venkatram Vishwanath, a co-author of the paper and data science lead at the Argonne Leadership Computing Facility (ALCF), large language models are critical to achieving the vision of AI for science across multiple scientific fields.
The researchers used sophisticated supercomputing resources such as Polaris, a Hewlett Packard Enterprise system, and the Cerebras CS-2 AI platform at the ALCF. This study also made use of NVIDIA's Selene supercomputer. Polaris and Selene are powerful GPU-accelerated (graphics processing unit) supercomputers, whereas the CS-2 system is unique: the ALCF AI Testbed's CS-2 accelerator is purpose-built for deep-learning workloads.
Prior studies have shown that protein language models can track evolution and generate wholly new proteins with novel structures and functions. However, the corresponding author, Ramanathan, notes that this research was the first attempt to run an LLM-based model at the gene level rather than at the previously explored protein level.
In the cell, genes are first transcribed into messenger RNA (mRNA). The mRNA then exits the nucleus and reaches the ribosome, where it is translated into protein. While previous work demonstrated language models capable of describing evolutionary changes at the protein level, the scientists needed to dive deeper, to the gene level, to discover VOCs.
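The transcription-and-translation pipeline above can be sketched in a few lines. This is a hedged illustration using a small excerpt of the standard genetic code (the function names are hypothetical, not from the paper); `*` marks a stop codon:

```python
# Sketch of the central dogma: DNA -> mRNA (transcription), then
# mRNA codons -> amino acids (translation). Excerpt of the standard
# genetic code only; unknown codons map to 'X'.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GCU": "A",
    "GGC": "G", "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon-by-codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' = codon not in excerpt
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGTTTGGCTAA")))  # MFG
```

Many distinct DNA sequences translate to the same protein, which is exactly why a model operating on the gene level sees changes that a protein-level model cannot.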
High-quality data was therefore critical in this study, since the models needed reliable information about VOCs. The Bacterial and Viral Bioinformatics Resource Center web resources and the Houston Hospital System supplied the integrated data and analysis tools needed to support the investigation. To better understand the virus, the researchers studied 1.5 million high-quality SARS-CoV-2 whole-genome sequences obtained from the resource center and the Houston hospital. It's worth noting that nucleotide sequences carry a considerably larger vocabulary than protein sequences alone.
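The vocabulary point can be made concrete: codon tokens span 64 symbols versus the 20 standard amino acids, and synonymous (silent) mutations change a codon without changing the protein, so they are invisible to protein-level models. A small sketch using standard genetic-code facts (illustrative only; `is_synonymous` is a hypothetical helper):

```python
# Synonymous mutations: all four of these codons encode alanine ('A'),
# so a mutation among them leaves the protein unchanged while still
# carrying signal at the nucleotide level.
ALANINE_CODONS = {"GCU", "GCC", "GCA", "GCG"}


def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons encode the same amino acid (alanine, in this excerpt)."""
    return codon_a in ALANINE_CODONS and codon_b in ALANINE_CODONS


# A GCU -> GCA mutation is silent at the protein level.
print(is_synonymous("GCU", "GCA"))  # True

codon_vocab, protein_vocab = 4 ** 3, 20
print(codon_vocab, protein_vocab)  # 64 20
```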
Ramanathan and his team have demonstrated that these models can advance research. He believes their research could pave the way for future pandemic surveillance and further argues that this discovery could lead to protein engineering applications or the modeling of entire organisms.
The authors note that, as with natural language models, issues of noise and bias in the data remain unsolved, and they invite the community to help design appropriate test harnesses for extensively evaluating GenSLM-like models.
An immediate application would be to combine GenSLMs with protein structure prediction tools such as AlphaFold/OpenFold and faster protein-folding methods to model the immune escape and fitness that affect the virus's capacity to adapt to its host. Likewise, incorporating experimental and biophysical data from antibody assays, molecular docking, and other quantitative methods into the GenSLM workflow could help direct the training of these models to focus on potential future variants of concern.
Shwetha is a consulting scientific content writing intern at CBIRT. She has completed her Master’s in biotechnology at the Indian Institute of Technology, Hyderabad, with nearly two years of research experience in cellular biology and cell signaling. She is passionate about science communication, cancer biology, and everything that strikes her curiosity!