AlphaFold’s breakthrough in protein structure prediction has ignited enthusiasm for applying similar methods to predict the 3D structures of RNAs. Achieving this, however, presents significant challenges. A review paper titled ‘When will RNA get its AlphaFold moment?’, recently published in Nucleic Acids Research, discusses in depth the challenges scientists have faced over the years while attempting to develop AlphaFold-like deep learning models for studying and predicting the 3D structures of RNA.
A variety of techniques, ranging from physics-based to machine learning-based methods, have been employed to predict RNA structures. The review stresses that deep learning methods rely on large amounts of data, and that in RNA structure prediction the limited availability of structure and alignment data poses a significant challenge. Sequence data has also been found to be biased, and much of it does not meet quality standards. Addressing these issues is crucial to enabling further research into RNA structure prediction and opening new avenues.
The Importance of RNA
It is important to understand RNA’s composition and its relevance to biological functions. One of its most important and best-studied roles is in protein translation. There are four types of RNA whose functions are currently known to us:
- Ribosomal RNA (rRNA): It acts as a catalyst in processes where ribosomes are used to produce proteins.
- Transfer RNA (tRNA): It delivers amino acids to the ribosome, where they are added to the growing protein chain.
- Messenger RNA (mRNA): It carries protein-coding information to the ribosome. The untranslated regions of mRNAs, and of some viral RNAs, also have regulatory effects.
- Non-coding RNA (ncRNA): While it is not as well understood as the RNAs mentioned above, it is believed to have regulatory functions as well. The ncRNA of the animal genome is its best-studied form. Its function depends on stable RNA structures, such as tRNA and ribosomes, and on transient structures like the spliceosome. The spliceosome is a structure that removes introns from pre-mRNA, which is then processed further to generate mRNA.
Understanding how the different types of RNA function is important for several reasons. RNA has potential applications in drug discovery and design, as well as in novel forms of therapy. Studying it can also help answer questions about the origin of life and can help counter bacterial drug resistance by inhibiting specific parts of ribosomes.
A typical RNA molecule is composed of nucleotides, which act as its basic building blocks. Each nucleotide consists of a ribose sugar, a nitrogenous base, and a phosphate group. The bases are stacked on each other and stabilized by van der Waals interactions. The geometry of the RNA backbone varies with the pucker of the ribose sugar. Due to the presence of d-orbitals in the phosphorus atoms, phosphate groups are among the most complex components of RNA molecules. RNA molecules also contain non-Watson-Crick base pairings throughout, which arise from differing hydrogen bonding patterns. These patterns are especially characteristic of RNA. Such pairings make up large portions of RNA and are capable of forming motifs, mediating several interactions, and creating binding sites for proteins.
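The distinction between Watson-Crick and non-Watson-Crick pairings can be sketched in a few lines of Python. This is only a rough illustration that labels a pair by the identity of its two bases; real annotation schemes also consider which edges of the bases interact and their 3D geometry, which is ignored here. The function names and the toy hairpin are hypothetical and do not come from the review.

```python
from collections import Counter

# Simplifying assumption: a pair is labelled by base identity alone.
WATSON_CRICK = {frozenset("AU"), frozenset("GC")}
WOBBLE = {frozenset("GU")}


def classify_pair(base_i: str, base_j: str) -> str:
    """Label a base pair as Watson-Crick, wobble, or other non-Watson-Crick."""
    pair = frozenset((base_i.upper(), base_j.upper()))
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in WOBBLE:
        return "wobble"
    return "other non-Watson-Crick"


def count_pair_types(sequence: str, pairs: list[tuple[int, int]]) -> dict[str, int]:
    """Count pair labels for 0-based index pairs (i, j) in a sequence."""
    return dict(Counter(classify_pair(sequence[i], sequence[j]) for i, j in pairs))


# Toy hairpin (hypothetical): a three-pair stem closing an AAA loop.
toy_sequence = "GAGAAAUUC"
toy_pairs = [(0, 8), (1, 7), (2, 6)]
pair_counts = count_pair_types(toy_sequence, toy_pairs)
```

In this toy hairpin, the G-C and A-U pairs are counted as Watson-Crick and the G-U pair as a wobble pair. Any other combination, such as A-G, falls into the catch-all non-Watson-Crick label.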
The Issue of Quantity and Quality
Protein databases are far more extensive than RNA databases. The Protein Data Bank (PDB) contains 25 times more protein-related data than RNA-related data. Only a limited number of high-resolution RNA structures are currently available. This is a major drawback, since high-resolution structures are considered the most reliable ones to use in research. Structures that have been resolved recently tend to be of lower resolution. The difference in quantity and quality between protein and RNA databases can be illustrated using the examples of Pfam and Rfam.
Pfam uses hidden Markov models to analyze and quantify data related to proteins. These are more convenient and easier to use than the covariance models used in Rfam, which are much more expensive computationally. Rfam’s models also take the secondary structures of RNA into consideration. Determining homology between related RNA molecules and creating new alignments between them is difficult as well.
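To see why secondary structure matters for RNA alignments, it helps to look at covariation: two alignment columns that form a base pair tend to change together so that pairing is preserved. The sketch below measures this with mutual information between columns. This is a simpler, swapped-in technique chosen for illustration and is not the covariance-model machinery that Rfam uses. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols: list) -> float:
    """Shannon entropy (in bits) of the observed symbol frequencies."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j."""
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    joint = list(zip(col_i, col_j))
    # MI(i, j) = H(i) + H(j) - H(i, j)
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Toy alignment (hypothetical): columns 0 and 4 change together,
# while columns 1 to 3 are fully conserved.
toy_msa = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]
mi_covarying = mutual_information(toy_msa, 0, 4)
mi_conserved = mutual_information(toy_msa, 1, 2)
```

Here the covarying columns 0 and 4 score 2 bits, while the fully conserved columns 1 and 2 score 0. A perfectly conserved column carries no covariation signal at all, which is one reason small, highly conserved alignments say little about structure.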
RNA alignments are much smaller than protein alignments. They also tend to be highly conserved and very simple, which makes them inadequate for studying the far more complex 3D structures of RNA. Overall, Rfam does not contain enough data to train ML models, which require much more.
Many non-Watson-Crick base pairs are not aligned in Rfam, and those that are aligned have been handled inconsistently. Many structures in Rfam also remain unfolded due to the limitations mentioned above, which is a drawback since RNA is best studied in its folded, 3D form.
Building families automatically is a promising approach, as automatically built families are already used in established deep learning models like AlphaFold. The approach has not been used for RNA yet, but research is being performed in this area.
It is necessary to build better datasets for RNA. The RNA community can contribute by adding more data in an accessible manner. Existing data also needs to be diversified. Both changes would improve the training datasets available to machine learning models. Higher-resolution data, with appropriate geometric structures, needs to be collected for a greater number of RNA motifs.
Further research and data are required on how the charged oxygen atoms of phosphate groups take part in hydrogen bonding. Standards for benchmarking the quality of predicted 3D structures need to be established, and datasets of multiple sequence alignments (MSAs) need to grow in both size and variation. Annotating ncRNA in metagenomes can lead to sequence diversification. The Tree of Life projects at the Sanger Institute are expected to provide enough sequences to expand RNA databases in the near future.
Machine learning methods are fast, but their behavior is difficult to predict. They are also highly data-hungry, which is a major obstacle when using them for RNA structure prediction. Newer methods like transfer learning are less data-hungry and can potentially be used for RNA structure prediction. 3D RNA structure prediction methods will most likely see considerable development and application after the 2020s are over.
Article source: Reference Paper
Swasti is a scientific writing intern at CBIRT with a passion for research and development. She is pursuing a BTech in Biotechnology from Vellore Institute of Technology, Vellore. Her interests lie in exploring the rapidly growing and integrated sectors of bioinformatics, cancer informatics, and computational biology, with a special emphasis on cancer biology and immunological studies. She aims to introduce the readers of her articles to the exciting developments bioinformatics has to offer in biological research today.