A group of researchers at New York University has developed an interpretable neural network that can explain how it reaches its predictions. This matters for genomics as well as for artificial intelligence (AI) and machine learning (ML) more broadly, because until now it has often been unclear how AI and ML tools arrive at their accurate predictions.
The neural network was created to answer biological questions in genomics. It was aimed at helping us better understand splicing, the step in which RNA transcribed from DNA is processed into its mature form before it is used to produce proteins. The researchers found that a hairpin-like conformation of RNA can halt the splicing process, and they verified this finding experimentally in the lab using standard protocols.
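To make the idea of a hairpin-like conformation concrete, here is a minimal Python sketch that scans an RNA sequence for a short stretch followed, a few bases later, by its reverse complement, which is the sequence pattern that lets a strand fold back on itself. This is an illustration only and not the researchers' method: the function names, the toy sequence, and the stem and loop lengths are all assumptions, and the sketch ignores G-U wobble pairs and folding energetics.

```python
# Toy stem-loop (hairpin) finder. Assumption: perfect Watson-Crick pairing only.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def find_hairpins(rna, stem=4, min_loop=3, max_loop=8):
    """Return (stem_start, loop_start, partner_start) for each candidate hairpin."""
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        target = reverse_complement(rna[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop  # where the pairing strand would begin
            if j + stem > len(rna):
                break
            if rna[j:j + stem] == target:
                hits.append((i, i + stem, j))
    return hits

toy_rna = "AAGGCCUUUUGGCCAA"  # made-up sequence: GGCC, a UUUU loop, then GGCC
hairpins = find_hairpins(toy_rna)
```

In the toy sequence, the stem `GGCC` at position 2 pairs with the `GGCC` at position 10 (which is its own reverse complement), leaving the four U bases as the loop.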
Machine learning has transformed many approaches to scientific discovery and experimental design. ML algorithms can learn complex relationships between the inputs and outputs of a given workflow. However, a major limitation of existing ML models, including state-of-the-art (SOTA) methods, is that they offer no insight into how they reach their conclusions. While the neural network developed in this study was designed primarily for interpretability, the accuracy of its predictions is on par with SOTA models. Its interpretability comes from a built-in visualization component that traces the model's own decision-making. In the case of splicing, it traces every step from the input sequence to the output splicing prediction.
Mysteries contained within the process of splicing
The logic that regulates splicing still poses many questions, most of which remain unanswered. Put simply, splicing removes introns (non-coding regions of RNA) and joins the remaining exons (coding regions) to produce the final RNA transcript.
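The removal of introns and joining of exons can be sketched in a few lines of Python. The sequence and exon coordinates below are made up for illustration (exons in uppercase, introns in lowercase), and the `splice` helper is a hypothetical name; real splicing is carried out by cellular machinery that recognizes splice sites, which this toy does not model.

```python
def splice(pre_mrna, exons):
    """Join exon segments (0-based, end-exclusive coordinates), dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Toy pre-mRNA: three exons (uppercase) separated by two introns (lowercase).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAUUAC" + "guaagcccacag" + "UGGUAA"
exons = [(0, 6), (18, 24), (36, 42)]

# All three exons included.
full_transcript = splice(pre_mrna, exons)

# Exon skipping: the middle exon is left out of the final transcript.
skipped_transcript = splice(pre_mrna, [exons[0], exons[2]])
```

The two transcripts differ only in whether the middle exon is present, which is the kind of outcome a splicing model tries to predict from sequence.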
The function of the protein encoded by the exons that remain after splicing depends heavily on which exons are included and on the order in which they appear in the coding sequence. However, the exact mechanism by which exon sequences determine which exons are included and which are skipped remains unknown. Splicing is also highly sensitive: even a single change to the nucleotide sequence, whether an insertion, a deletion, or a point mutation, can drastically alter the structure and function of the resulting protein.
Information brought to us by the model
The model shows that designing a network for interpretability does not have to come at the cost of predictive accuracy. It is based on synthetic datasets generated using immortalized cell lines. The researchers took the synthetic route for several reasons:
- Synthetic datasets can contain far more data points than existing genomic datasets, whose size is limited by the number of exons present in the genome.
- Exons in conventional genomic datasets are usually flanked by additional elements, such as promoters and splice sites, which complicate interpretation.
- Synthetic datasets eliminate the possibility of overlapping RNA codes.
Synthetic datasets are therefore the driving element in improving the interpretability of the neural network.
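The reasoning behind the synthetic route can be illustrated with a small sketch: random exon sequences are placed inside one fixed surrounding context, so the library can be made as large as needed and the exon is the only thing that varies. Everything here is an assumption for illustration, including the function name, the 70-nucleotide exon length, and the placeholder flanking sequences; it does not reproduce the researchers' actual constructs.

```python
import random

# Placeholder constant context shared by every construct (not real sequences).
FIXED_UPSTREAM = "GGGAGACC"
FIXED_DOWNSTREAM = "GGUAAGUC"

def make_synthetic_library(n, exon_len=70, seed=0):
    """Build n constructs that differ only in a random, unique exon sequence."""
    rng = random.Random(seed)  # seeded so the library is reproducible
    exons = set()
    while len(exons) < n:
        exons.add("".join(rng.choice("ACGU") for _ in range(exon_len)))
    return [FIXED_UPSTREAM + exon + FIXED_DOWNSTREAM for exon in sorted(exons)]

library = make_synthetic_library(1000)
```

Because the flanking context is identical in every construct, any difference in measured splicing between two library members can be attributed to the exon sequence itself.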
Using the model, the researchers identified two exon-skipping features that play an important role in RNA regulation and subsequently validated them experimentally. These exon-skipping features were characterized by the presence of RNA-binding proteins (RBPs) or a complex.
The model can also quantify the features that individual exons confer, which points to potential applications in medicine and biotechnology. In these areas, it could support studies that edit target exons in the genome or RNA of an organism to correct splicing, as well as the development of RNA-based therapeutics such as antisense oligonucleotides. It also pointed to biochemical processes that merit further investigation; in particular, it was observed that splicing decisions can be described in terms of additive quantities. The interpretable neural network can also help scientists refine or prune redundant hypotheses.
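One way to picture additive quantities is a model in which each sequence feature contributes a signed strength, the strengths are summed, and the sum is squashed into a predicted probability that the exon is included. The sketch below is a generic illustration of that idea and not the architecture used in the study; the feature names, the numbers, and the choice of a logistic function are all assumptions.

```python
import math

def predict_inclusion(contributions, bias=0.0):
    """Sum signed feature strengths and map the total to a probability (0 to 1)."""
    total = bias + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-total))

def explain(contributions):
    """Rank features by how strongly they push the prediction in either direction."""
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical features for one exon: positive favors inclusion, negative favors skipping.
features = {"enhancer_motif": 1.5, "silencer_motif": -0.5, "stem_loop": -2.0}

inclusion_probability = predict_inclusion(features)
ranked_features = explain(features)
```

Because the prediction is just a sum passed through a fixed function, each feature's share of the decision can be read off directly, which is what makes an additive design easy to interpret.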
Further scope for research and conclusion
More work is needed on the model before researchers can fully understand splicing logic and the mechanisms that regulate it. Splicing outcomes also depend on the expression levels of RNA-binding proteins, which vary between cell types, and this dependence was not explored in depth in this study. It could be addressed by generating synthetic splicing datasets, as done here, in cell types relevant to specific RBPs, and then pairing these datasets with interpretable models to capture regulatory features specific to those cells.
Models that are interpretable by design can be further applied toward unveiling complex codes that regulate biomolecular processing. Generating a large amount of data for such models will accelerate our understanding of biological codes.
Swasti is a scientific writing intern at CBIRT with a passion for research and development. She is pursuing a BTech in Biotechnology at Vellore Institute of Technology, Vellore. Her interests lie in exploring the rapidly growing and integrated sectors of bioinformatics, cancer informatics, and computational biology, with a special emphasis on cancer biology and immunological studies. She aims to introduce the readers of her articles to the exciting developments bioinformatics has to offer in biological research today.