A team of researchers at The University of Texas at Austin has developed XVir, a novel algorithm that uses a transformer-based deep learning model to identify viral sequences in the genome of a tumor. It is a much faster and more compact method of analyzing genomes acquired from tumors for possible causative oncogenic viruses. XVir works on sequencing reads, which are short segments of genetic material, and classifies each one as viral or human. The approach exhibits high detection accuracy, demands less computational power, and compares favorably with previously developed deep learning algorithms built for the same purpose.
The link between cancer and viruses
Viruses are one of the most mysterious elements of biology. They are difficult to detect and avoid and are capable of causing widespread global pandemics, and developing vaccines against them is a tricky process. This is because they aren’t exactly alive or dead; they rely on the mechanism of their host for their ‘activation’ and survival and are, therefore, difficult to detect outside of their target host. Keeping in mind the several inconveniences they already bring to the table, imagine the danger they pose if they were to possess the capability of causing one of the most feared diseases of all time: cancer. Unfortunately, viruses of this type do, in fact, exist, and there’s a name for them: oncogenic viruses.
Some of the most widespread oncogenic viruses that have been studied extensively are the Epstein-Barr virus (EBV), hepatitis B and C viruses (HBV and HCV), the human immunodeficiency virus (HIV), and the human papillomavirus (HPV). The exact mechanism by which viruses induce tumors is not yet clearly understood, and the rapid evolution of the viral genome once it is incorporated into the genome of its host cell makes it difficult to link the sequences to the viral family they belong to. Viral oncoproteins are believed to disrupt the regulatory mechanisms and ‘checkpoints’ present in cells and promote uncontrolled cell division, inducing carcinogenesis.
Previous attempts to identify viral sequences
Identifying viral sequences in the genomes of cancer cells can be an important step toward determining the cause of the cancer and studying the mechanisms involved. However, this is also the most challenging part. Viruses tend to evolve rapidly, so the viral sequences found in a tumor can diverge widely from the original, known sequences outside the host, which makes them difficult to identify accurately. Incomplete genome databases for oncogenic viruses are another barrier to accurate identification of causative viruses.
Various computational approaches have been developed over the years to analyze sequencing data obtained from DNA, RNA, and protein sequences. An early approach, ViFi, utilized a collection of Hidden Markov Models (HMMs). An HMM infers hidden states from observed data using a model built from known examples; here, established viral genomes were used as the reference dataset. This made it possible to identify short viral sequences, or viral ‘reads,’ including those that had evolved away from the reference genomes. Another algorithm, VirFinder, used k-mer frequencies to identify viral reads from genomic data that was directly obtained from clinical samples. Small sequences of a given length ‘k’ are referred to as k-mers. Because DNA has four bases, the total number of possible k-mers is 4^k. The use of neural networks came into play with ViRNAtrap, an algorithm that uses convolutional neural networks (CNNs) to identify short viral RNA reads.
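To make the k-mer idea concrete, here is a minimal Python sketch of counting k-mers in a read and turning the counts into a frequency vector over all 4**k possible DNA k-mers. It is only an illustration of the concept, not VirFinder's actual code; the function names and the toy read are hypothetical.

```python
from collections import Counter
from itertools import product


def count_kmers(read: str, k: int) -> Counter:
    """Count every overlapping k-mer in a read (sliding window, step 1)."""
    read = read.upper()
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def kmer_frequencies(read: str, k: int) -> dict:
    """Return a relative frequency for each of the 4**k possible DNA k-mers."""
    counts = count_kmers(read, k)
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


read = "ACGTACGTAC"  # hypothetical toy read, not real data
freqs = kmer_frequencies(read, 2)
print(len(freqs))                # 4**2 = 16 possible 2-mers
print(count_kmers(read, 2)["AC"])  # "AC" occurs 3 times in the toy read
```

A classifier can then be trained on such frequency vectors; which classifier to use is a separate design choice that this sketch does not cover.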
State-of-the-art methods to identify viral reads came about with the development of DeepVirFinder and DeepViFi. DeepVirFinder uses shallow CNNs to recognize viral contigs. Contigs are continuous stretches of sequence assembled from overlapping DNA segments. DeepViFi uses a hybrid pipeline consisting of a transformer followed by a random forest classifier. Transformers are deep learning (DL) models that use a process called ‘self-attention’ to relate every position in the input sequence to every other position, which lets them capture context and identify the relevant parts of the data. The transformer learns embeddings of the viral genome sequences, and the random forest classifier then uses these embeddings to classify the viral reads. A common disadvantage of both methods is the large size of the models involved, the CNNs in one case and the transformer in the other; this significantly increases the computational load and run-time of the algorithm.
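The self-attention step can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the architecture of DeepViFi or XVir: it assumes one-hot encoded bases, a single attention head, and identity query/key/value projections, whereas real transformers learn these projections and add positional information. All names here are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def one_hot(base):
    """Encode a DNA base as a 4-dimensional one-hot vector."""
    return [1.0 if base == b else 0.0 for b in "ACGT"]


def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Score this position against every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all position vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return outputs, weights


X = [one_hot(b) for b in "ACGA"]  # hypothetical 4-base toy read
out, weights = self_attention(X)
print([round(w, 3) for w in weights[0]])  # attention of position 0 over all positions
```

In this toy case, the first base attends most strongly to itself and to the other "A" in the read, which is the sense in which attention links related positions regardless of how far apart they are.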
XVir is a compact and efficient deep-learning solution
XVir is an effort to use deep learning methods and transformers to identify oncogenic viral reads. It is more compact and takes up much less space than DeepViFi and DeepVirFinder. It learns from reads drawn from both the viral and the human (tumor) genomes, taking into account the impact of non-coding regions on the induction of carcinogenesis as well.
To test the abilities of this algorithm, an experimental setting was created. Here, the HPV genome from the Papillomavirus Episteme database was used, and the cancer sequences from the human genome were taken from the GRCh38.p14 assembly. ART, a simulator that generates read sequences for testing next-generation sequencing (NGS) tools, was used to generate the same number of reads for both genomes. These reads were combined to form the sample dataset and were divided into three datasets: training, validation, and test datasets, in a ratio of 8:1:1, respectively.
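The balanced 8:1:1 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the read strings are placeholders rather than ART output, and the function name, labels (1 for viral, 0 for human), and fixed seed are all hypothetical choices.

```python
import random


def split_reads(viral_reads, human_reads, seed=0):
    """Label, shuffle, and split reads into train/validation/test at 8:1:1."""
    labeled = [(r, 1) for r in viral_reads] + [(r, 0) for r in human_reads]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labeled)
    n = len(labeled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = labeled[:n_train]
    val = labeled[n_train:n_train + n_val]
    test = labeled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Placeholder reads standing in for simulator output (equal counts per class).
viral = [f"viral_read_{i}" for i in range(50)]
human = [f"human_read_{i}" for i in range(50)]
train, val, test = split_reads(viral, human)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting matters here because the viral and human reads are concatenated in blocks; without it, the test set would contain only one class.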
The training dataset is used to learn the parameters of the model. The validation dataset is used for hyperparameter tuning and for selecting the appropriate model for identification, and finally, the test dataset is used to compare the results from XVir with those obtained from previously used algorithms, specifically DeepVirFinder and DeepViFi. The read data can also be visualized in two dimensions, based on the Hamming distance between pairs of reads.
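As an illustration of the distance measure, the Hamming distance between two equal-length reads is the number of positions at which they differ. The sketch below computes a pairwise distance matrix for a few toy reads. The reads and function names are hypothetical, and the step that projects these distances into two dimensions is not shown because the article does not say which method was used.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length reads differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires reads of equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(reads):
    """Symmetric matrix of Hamming distances between all pairs of reads."""
    n = len(reads)
    return [[hamming_distance(reads[i], reads[j]) for j in range(n)]
            for i in range(n)]


reads = ["ACGTAC", "ACGTTC", "TTGTAC"]  # hypothetical toy reads
D = pairwise_hamming(reads)
for row in D:
    print(row)
# [0, 1, 2]
# [1, 0, 3]
# [2, 3, 0]
```

A matrix like this could be passed to any dimensionality-reduction method that accepts precomputed distances to obtain a two-dimensional plot.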
Comparison of XVir with DeepVirFinder and DeepViFi
To perform this comparison, the other two algorithms were trained on the same data used for XVir. The accuracies were found to be 0.967, 0.995, and 0.767 for XVir, DeepVirFinder, and DeepViFi, respectively, showing that XVir outperformed DeepViFi and was only marginally less accurate than DeepVirFinder. Additionally, XVir generated its output eight times as fast as DeepVirFinder and 40% faster than DeepViFi, with a parameter count reported to be 25% smaller than those of the other two. This smaller model size saves a lot of space and provides a lighter computational load with a high-speed output. The use of a transformer rather than CNNs also lets the model attend to every position in a read, capturing intricate details that CNN-based approaches may miss.
Overall, XVir acts as a much more efficient and lighter pipeline for identifying viral reads from a large genomic dataset of a tumor. It is lighter on the computer, more accurate than DeepViFi, and nearly as accurate as DeepVirFinder, so the gains in speed and size come at only a small cost in accuracy. Because transformers can capture a wider range of information in a given dataset, the approach also has the potential to analyze the longer sequences produced by platforms like Oxford Nanopore Technologies and Pacific Biosciences.
The code for XVir is available at https://github.com/shoryaconsul/XVir
Swasti is a scientific writing intern at CBIRT with a passion for research and development. She is pursuing a BTech in Biotechnology from Vellore Institute of Technology, Vellore. Her interests lie in exploring the rapidly growing and integrated sectors of bioinformatics, cancer informatics, and computational biology, with a special emphasis on cancer biology and immunological studies. She aims to introduce the readers of her articles to the exciting developments bioinformatics has to offer in biological research today.