A collaboration between researchers in the UK, Spain, and Germany used experimental methods to explore amino acid sequence space and showed that, for a number of proteins, genetic architecture can be relatively simple. They also noted that the artificial intelligence models conventionally used for biological prediction can be far more complex than the proteins they are used to analyze. The number of possible protein sequences is enormous: a chain of 100 amino acids can be assembled in 20^100 different ways, a number larger than the number of atoms in the universe. It is therefore practically impossible to analyze these high-dimensional sequence spaces exhaustively by experiment, and so far scientists have covered only a small portion of them. Deep neural networks have been used to study high-dimensional sequence spaces, but they have not yet provided sufficient insight into the genetic architecture of proteins. The models created by the researchers also account for the relationship between changes in free energy and expressed phenotypes.
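To make the scale concrete, the short Python sketch below computes the size of the sequence space for a 100-residue protein, assuming any of the 20 standard amino acids can occupy each position. The figure of roughly 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate used here as an assumption; it is not a number taken from the study.

```python
# Size of the sequence space for a protein of a given length,
# assuming each position can hold any of the 20 standard amino acids.
NUM_AMINO_ACIDS = 20

# Rough, commonly quoted order-of-magnitude estimate (an assumption here).
ATOMS_IN_UNIVERSE = 10 ** 80


def sequence_space_size(length: int) -> int:
    """Return the number of distinct amino acid sequences of this length."""
    return NUM_AMINO_ACIDS ** length


size = sequence_space_size(100)
print(f"20^100 has {len(str(size))} digits")
print(f"Larger than ~10^80 atoms: {size > ATOMS_IN_UNIVERSE}")
```

Python integers have arbitrary precision, so the value is computed exactly rather than as a floating-point approximation.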

Experimental exploration of proteins

Understanding the genetic architecture of proteins is key to genetic prediction and to building biophysical models that remain interpretable in high-dimensional sequence spaces. Massively parallel experiments have successfully quantified the effects of single amino acid changes and of double mutants in small proteins. Higher-order mutants are harder to analyze with these methods simply because too many genotype combinations are possible. The number of combinations is 2^n, where n is the number of amino acid positions that can each carry a mutation. Combining random mutations also tends to produce unfolded proteins, because only a minuscule fraction of the variants of a protein domain actually fold. Surveying millions of combinatorial genotypes is therefore of little value, as it will most likely yield trivial predictions about unfolded proteins.
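The combinatorial growth described above can be sketched in a few lines of Python. The sketch assumes each of n positions is either wild type (0) or mutant (1), which gives 2^n (2 to the power n) genotypes. The function names are illustrative and do not come from the study.

```python
from itertools import product


def count_genotypes(n_sites: int) -> int:
    """Number of genotypes when each site is either wild type or mutant."""
    return 2 ** n_sites


def enumerate_genotypes(n_sites: int) -> list:
    """List every genotype as a tuple of 0 (wild type) and 1 (mutant).

    Only practical for small n_sites, since the list has 2**n_sites entries.
    """
    return list(product((0, 1), repeat=n_sites))


# The count doubles with every additional mutable site.
for n in (1, 2, 3, 10, 20, 30):
    print(f"{n:>2} sites -> {count_genotypes(n):,} genotypes")

# Explicit enumeration is feasible only for a handful of sites.
print(enumerate_genotypes(2))
```

With 30 sites the count already exceeds one billion, which is why exhaustive enumeration is avoided for anything but toy cases.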

Computational approaches – have they been useful?

Generative artificial intelligence (AI) and deep neural networks (DNNs) have been useful for predicting diverse proteins, predicting the impact of combining mutations, and designing proteins. However, they have not been instrumental in giving insight into genetic architecture. The architecture of the AI models themselves is complicated, which makes their results even harder to interpret. The complexity of AI models might seem to imply that genetic architecture is just as complicated, but this inference is undermined by the success of simple statistical models, such as Potts Hamiltonian models, in analyzing genetic architecture. These models operate in high-dimensional sequence spaces and consider only conserved positions and pairwise covariation across a multiple sequence alignment (MSA).
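As a rough illustration of how little a Potts model needs, the sketch below scores a sequence using only per-position terms (fields) and pairwise terms (couplings), following the standard textbook form E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j). All parameter values are invented for this toy example. In practice the fields and couplings would be inferred from an MSA, a step not shown here.

```python
from itertools import combinations


def potts_energy(sequence, fields, couplings):
    """Potts Hamiltonian: E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    fields:    list of dicts, fields[i][residue] -> h_i(residue)
    couplings: dict, couplings[(i, j)][(res_i, res_j)] -> J_ij, with i < j
    Missing entries are treated as zero.
    """
    energy = 0.0
    # Single-site terms capture conservation at each position.
    for i, residue in enumerate(sequence):
        energy -= fields[i].get(residue, 0.0)
    # Pairwise terms capture covariation between positions.
    for i, j in combinations(range(len(sequence)), 2):
        pair = (sequence[i], sequence[j])
        energy -= couplings.get((i, j), {}).get(pair, 0.0)
    return energy


# Toy, made-up parameters for a three-position sequence.
toy_fields = [{"A": 1.0, "G": 0.2}, {"L": 0.5}, {"K": 0.8, "E": 0.1}]
toy_couplings = {(0, 2): {("A", "K"): 0.6}}

print(potts_energy("ALK", toy_fields, toy_couplings))  # favoured sequence
print(potts_energy("GLE", toy_fields, toy_couplings))  # less favoured
```

Lower energy corresponds to a sequence the model considers more compatible with the protein family.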

A peek into the experimental model 

The model built by the researchers explores the genetic architecture of high-dimensional protein sequence spaces, with over 10^10 genotypes and 30 dimensions, using experimental approaches. To accomplish this, they enriched for functional protein sequences. Accounting for pairwise energetic couplings between mutations improved the model's predictive performance. Energetic couplings are sparse across the domain and strongly associated with the protein's three-dimensional structure. The couplings also offered mechanistic hints: connections were stronger between residues that are in contact in the structure. The researchers also found that coupling strength decreased with distance along the protein backbone.
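A minimal sketch of this kind of energy model follows, under stated assumptions. The folding free energy of a genotype is taken as the wild-type value plus additive single-mutation effects plus pairwise coupling terms, and the free energy is mapped to a fraction of folded protein with a two-state Boltzmann relation. The functional form, parameter names, and numbers are illustrative assumptions and are not taken from the paper.

```python
import math
from itertools import combinations

R_KCAL = 0.001987       # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15  # assumed temperature


def folding_energy(genotype, dg_wildtype, single_effects, pair_couplings):
    """Free energy of folding for a genotype given as a tuple of 0/1 flags.

    single_effects[i]      -> energy change of mutation i alone
    pair_couplings[(i, j)] -> extra coupling energy when i and j co-occur (i < j)
    """
    mutated = [i for i, present in enumerate(genotype) if present]
    dg = dg_wildtype
    dg += sum(single_effects[i] for i in mutated)
    # Sparse couplings: pairs without an entry contribute nothing.
    dg += sum(pair_couplings.get(pair, 0.0) for pair in combinations(mutated, 2))
    return dg


def fraction_folded(dg):
    """Two-state model: fraction folded = 1 / (1 + exp(dG / RT))."""
    return 1.0 / (1.0 + math.exp(dg / (R_KCAL * TEMPERATURE_K)))


# Hypothetical parameters in kcal/mol for three mutable sites.
DG_WT = -2.0
SINGLES = [0.5, 1.0, 0.3]
COUPLINGS = {(0, 1): -0.4}

for genotype in [(0, 0, 0), (1, 1, 0), (1, 1, 1)]:
    dg = folding_energy(genotype, DG_WT, SINGLES, COUPLINGS)
    print(genotype, round(dg, 2), round(fraction_folded(dg), 3))
```

Because the coupling table is sparse, the number of parameters stays small even when the number of possible genotypes is very large.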

Biophysical Energy Models

Biophysical energy models were incorporated into the analysis to improve prediction, and they proved very effective at predicting fold stability across many genotypes. The couplings in these models are sparsely distributed across protein domains, so the models compress large amounts of data into relatively few parameters. The results were compared with previously published mutagenesis datasets, including those for tRNA, alternatively spliced exons, and protein interaction surfaces. The comparison showed that a simple landscape was common to all of these datasets, suggesting that such simplicity may be a general feature of molecular systems.

Importance of energy-based models

Compared with AI-based models, the researchers' model, which is parameterized by energetic couplings, can provide more interpretable mechanistic insight into genetic architecture. Such models hold great potential for pandemic forecasting based on pathogen analysis, clinical variant interpretation, and protein engineering for biotechnology.


A major challenge in using energy-based models is accounting for the energetic couplings of every mutation in the protein under study. Quantifying the mutations that arise in homologous regions is one way to accomplish this. The goal of the study was to show that first-order energetic couplings can provide enough information for protein prediction tasks, although the researchers acknowledge the importance of higher-order genetic interactions for protein stability. Future work will need to address how much weight higher-order energetic interactions deserve, especially in large sequence spaces that contain structurally homologous proteins with low sequence identity. These questions can be addressed through experimental approaches.

Article source: Reference Paper

Important Note: bioRxiv releases preprints that have not yet undergone peer review. As a result, it is important to note that these papers should not be considered conclusive evidence, nor should they be used to direct clinical practice or influence health-related behavior. It is also important to understand that the information presented in these papers is not yet considered established or confirmed.

Swasti is a scientific writing intern at CBIRT with a passion for research and development. She is pursuing a BTech in Biotechnology at Vellore Institute of Technology, Vellore. Her interests lie in the rapidly growing and integrated fields of bioinformatics, cancer informatics, and computational biology, with a special emphasis on cancer biology and immunological studies. She aims to introduce readers to the exciting developments bioinformatics has to offer in biological research today.

