In the fast-paced world of drug discovery, one of the most crucial tasks is understanding protein-ligand interactions. With accurate algorithms for predicting protein-ligand interactions, effective drug candidates can be identified much more quickly than through traditional approaches that rely on costly experimental testing. Researchers from the University of Edinburgh have proposed BALM, which stands for Binding Affinity Learning Model. This joint study, performed in collaboration between leading universities and research institutes, shows how protein and ligand language models, when properly trained, can improve binding affinity predictions. By building on pretrained language models, BALM provides a new perspective on estimating how well a ligand will dock or bind to a target protein, a question of central importance in drug development.
The Need for Better Predictions in Drug Discovery
Drug discovery takes a great deal of time and money. One of the most critical bottlenecks is the ability to predict how ligands, or small molecules, will bind to their target proteins in the body. This binding largely determines the effectiveness of a drug: the strength of the interaction dictates whether the target protein is activated, inhibited, or otherwise modulated. Traditionally, binding affinities are established experimentally, using methodologies such as X-ray crystallography, nuclear magnetic resonance (NMR), and molecular docking studies. Although these methods are reliable, they are quite costly and time-consuming.
Machine learning and bioinformatics have made this task easier and quicker. The researchers trained domain-specific language models for proteins and ligands to enhance the prediction of binding affinities. Existing binding data help the models capture the complex interactions between proteins and ligands without the need for lengthy experimental validation, thus accelerating drug development.
The Concept Behind BALM: A Unified Framework
BALM (Binding Affinity Learning Model) was built to assess the binding affinity between proteins and ligands, having been trained on both protein sequences and ligand SMILES strings. It is designed to maximize the cosine similarity between protein and ligand embeddings for binding pairs and to minimize it, widening the angle between the vectors, for non-binding pairs.
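For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. A minimal Python example (with made-up toy vectors) makes the geometry concrete:

```python
import torch

def cosine(p: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    # cos(theta) = (p . l) / (||p|| * ||l||), ranging from -1 to 1
    return (p * l).sum() / (p.norm() * l.norm())

p = torch.tensor([1.0, 0.5, -0.2])  # toy "protein" embedding
l = torch.tensor([0.8, 0.4, 0.1])   # toy "ligand" embedding
print(cosine(p, l))  # ~0.96: the vectors point in nearly the same direction
```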
In practice, the framework involves the following steps:
Input: The model takes in two forms of input: protein sequences and ligand SMILES strings, a text-based notation for expressing the structure of a molecule.
Encoding: Protein sequences are encoded with ESM-2, a pretrained protein language model, while ligand SMILES strings are encoded with ChemBERTa-2, a ligand-specific language model.
Feature Projection: After the protein and ligand features are encoded separately, these encodings are projected into a shared latent space where cosine similarity can be computed. Cosine similarity is a metric that measures how similar two vectors are, with values ranging from -1 to 1. In this case, a higher cosine similarity score indicates a stronger binding interaction, while lower scores suggest weaker interactions or none at all.
Learning the Interaction: BALM learns to separate protein-ligand pairs that bind (high affinity) from those that do not (low affinity) by comparing the predicted binding affinity with the experimentally measured one. This is driven by a mean squared error loss: the model starts from random estimates and learns to predict affinities through gradient descent. A code sketch of the full pipeline follows this list.
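To make the pipeline concrete, here is a minimal PyTorch sketch of the workflow described above. The Hugging Face checkpoint names and the mean-pooling step are plausible assumptions consistent with the stated embedding sizes, not necessarily the paper's exact configuration:

```python
import torch.nn as nn
from transformers import AutoModel

class BALMSketch(nn.Module):
    """Sketch of a BALM-style model: two pretrained language models,
    two projection heads, and a cosine-similarity output."""
    def __init__(self, proj_dim=256):
        super().__init__()
        # ESM-2 (640-dim embeddings) for proteins, ChemBERTa-2 (384-dim) for ligands.
        self.prot_encoder = AutoModel.from_pretrained("facebook/esm2_t30_150M_UR50D")
        self.lig_encoder = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MTR")
        # Fully connected layers with ReLU project both into a shared 256-dim space.
        self.prot_proj = nn.Sequential(nn.Linear(640, proj_dim), nn.ReLU())
        self.lig_proj = nn.Sequential(nn.Linear(384, proj_dim), nn.ReLU())

    def forward(self, prot_tokens, lig_tokens):
        # Mean-pool token embeddings into one vector per sequence (an assumption).
        p = self.prot_encoder(**prot_tokens).last_hidden_state.mean(dim=1)
        l = self.lig_encoder(**lig_tokens).last_hidden_state.mean(dim=1)
        # Cosine similarity in the shared space serves as the predicted binding score.
        return nn.functional.cosine_similarity(self.prot_proj(p), self.lig_proj(l))
```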
The Models Behind BALM: ESM-2 and ChemBERTa-2
ESM-2 is an advanced transformer-based model trained on over 220 million protein sequences. It follows a bidirectional transformer architecture trained by masking amino acids in the sequences, which facilitates the prediction of protein structure, function, and evolution. This masked language modeling approach allows the model to represent each residue in the context of the whole sequence rather than one residue at a time.
ChemBERTa-2 builds on the RoBERTa architecture and utilizes deep learning to process molecular data. It was trained on SMILES strings of chemical compounds using both masked language modeling (MLM) and multi-task regression (MTR) objectives. MTR enhances the model's ability to predict several molecular properties, making it a great option for interpreting a ligand's chemical features.
Both ESM-2 and ChemBERTa-2 deliver embeddings of proteins and ligands. These embeddings are projected into a common space using fully connected layers followed by a ReLU activation function. This provides a consistent framework for treating the protein and ligand embeddings, which originally have different sizes (640 and 384 dimensions, respectively), by mapping both to a shared size of 256 where their similarity can be compared.
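As a usage illustration, the sketch from the previous section could be applied to a toy protein-ligand pair as follows; the input sequences are arbitrary examples, and the tokenizer checkpoints are the same assumed ones as before:

```python
from transformers import AutoTokenizer

prot_tok = AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D")
lig_tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MTR")

model = BALMSketch()
protein = prot_tok("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
ligand = lig_tok("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin
score = model(protein, ligand)  # predicted binding score (cosine similarity)
```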
Training and Datasets: How BALM Learns Binding Affinities
During the training of BALM, protein-ligand pairs are compared using cosine similarity, where a higher cosine similarity (that is, a smaller angle between the embedding vectors) indicates a stronger binding relationship between a protein and ligand. During this phase, the model is trained with a mean squared error loss on the binding score: experimental affinities are rescaled to the range of -1 to 1 and compared with the predicted scores to calculate the loss.
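A single training step under these assumptions might look like the sketch below; the optimizer choice and learning rate are illustrative, not the paper's reported settings:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

def train_step(prot_tokens, lig_tokens, scaled_affinity):
    # scaled_affinity: experimental binding affinity rescaled to [-1, 1],
    # as a tensor with one value per pair in the batch.
    optimizer.zero_grad()
    pred = model(prot_tokens, lig_tokens)   # predicted binding score
    loss = loss_fn(pred, scaled_affinity)   # mean squared error
    loss.backward()                         # backpropagate the error
    optimizer.step()                        # one gradient-descent update
    return loss.item()
```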
The team utilized several benchmark datasets to train and assess the efficacy of the BALM model:
BindingDB: This dataset includes more than 52,000 protein-ligand interactions. After filtering and cleaning, approximately 25,000 interactions were used to train the model, with biased interactions removed to deliver more accurate results.
LP-PDBBind: This dataset provides experimentally resolved protein-ligand complexes, split so that closely related complexes do not leak between the training and test sets. Training the model on its high-quality subsets helped enhance generalizability.
USP7 and Mpro Datasets: These datasets center on specific proteins, such as COVID-19's main protease (Mpro), and their interactions with specific inhibitors. They were meant to further explore the performance of BALM on specialized therapeutic targets.
Evaluating Performance: Data Splits for Robust Testing
To evaluate BALM's performance and generalization ability, the research group tested several data splits:
Random Split: protein-ligand pairs are divided randomly into training, validation, and test sets.
Cold Target Split: the proteins in the training, validation, and test sets are completely disjoint, so the model is forced to generalize to new, unseen proteins.
Cold Drug Split: ligands are divided in the same way, examining the model's ability to predict interactions with unseen drugs.
Scaffold Split: ligands are grouped by their core scaffolds, ensuring that the test set contains no drugs with chemical scaffolds similar to those in the training set; a sketch of such a split follows this list.
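As an illustration, a scaffold split along these lines could be implemented with RDKit's Bemis-Murcko scaffolds. The grouping and size-balancing strategy below is an assumption for demonstration, not the paper's exact procedure:

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    # Group ligands by their core (Murcko) scaffold.
    groups = defaultdict(list)
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(smi)
    # Assign whole scaffold groups to train or test, so no core
    # structure ever appears in both sets.
    train, test = [], []
    target = test_fraction * len(smiles_list)
    for _, members in sorted(groups.items(), key=lambda kv: len(kv[1])):
        (test if len(test) < target else train).extend(members)
    return train, test
```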
Conclusion
BALM represents a significant advancement in the field of protein-ligand binding affinity prediction. With the combined use of the protein and ligand language models ESM-2 and ChemBERTa-2, BALM not only enhances predictive performance but also generalizes well across datasets and splits. By making use of this tool, researchers and pharmaceutical companies could shorten the drug discovery cycle and identify drug candidates earlier and more efficiently, leading to the development of more precise and reliable therapies. As the field progresses, models like BALM will harness the power of machine learning and biological data to greatly improve our prediction and comprehension of the underlying complexities of molecular interactions.
Article Source: Reference Paper | All the code containing scripts for data processing, model training, and evaluation is publicly accessible on GitHub.
Important Note: bioRxiv releases preprints that have not yet undergone peer review. As a result, it is important to note that these papers should not be considered conclusive evidence, nor should they be used to direct clinical practice or influence health-related behavior. It is also important to understand that the information presented in these papers is not yet considered established or confirmed.
Anchal is a consulting scientific writing intern at CBIRT with a passion for bioinformatics and its miracles. She is pursuing an MTech in Bioinformatics from Delhi Technological University, Delhi. Through engaging prose, she invites readers to explore the captivating world of bioinformatics, showcasing its groundbreaking contributions to understanding the mysteries of life. Besides science, she enjoys reading and painting.