Researchers in the Department of Chemical Engineering at the University of Rochester, New York, have introduced three sequence-based deep learning models for peptide property prediction. The models use recurrent neural networks to predict hemolysis, solubility, and nonfouling behavior from peptide sequences, and they run in a serverless architecture intended to improve accessibility and support open science. The sequence-based solubility predictor, MahLooL, outperforms existing state-of-the-art predictors on short peptide sequences, and the hemolysis and nonfouling models produce competitive results.
Serverless Deep Learning Models for Peptide Property Prediction
To address the challenges of server maintenance and hosting costs, the researchers introduce a serverless approach to web-based deep learning models for peptide property prediction. Running the models without a dedicated server reduces dependence on cloud providers and promotes accessible, cost-effective reproducibility in machine learning research.
The model designed by Ansari and White predicts the solubility, hemolytic, and nonfouling (resistance to non-specific interactions) properties of a peptide from its amino acid sequence. It combines a Recurrent Neural Network (RNN) with client-side JavaScript frameworks, so inference runs in the user's browser rather than on a cloud provider's servers. Users can therefore run the model on their own devices, including cellphones and laptops with limited computing resources, without installing anything beforehand.
The developers of MahLooL set out to address the following issues:
- Cheminformatics research is often constrained by a lack of resources and funding at institutions in less privileged nations. Web-based services open up opportunities for innovation because they are easy to access regardless of community, region, or funding.
- Machine learning work is often published without unrestricted public access to the source code, the training and testing data, and the findings, which holds back open science. Moreover, deep learning inference typically relies on GPUs and on specialized third-party or self-hosted servers, which adds expense and limits access for low-resource research facilities.
Given this, the developers expect serverless models to remove the reliance on third-party or self-hosted servers. This should help narrow the gap in research opportunities for less privileged communities by making cheminformatics analysis more accessible, more flexible, and less costly.
Training Datasets
The hemolysis model was trained on 9,316 sequences (19.6% positives and 80.4% negatives) containing only L- and canonical amino acids, taken from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP v3). Hemolysis is the disruption of erythrocyte (red blood cell, RBC) membranes, which shortens the life span of the cells. Activity is defined by extrapolating from dose-response curves to the concentration at which 50% of RBCs are lysed. A peptide is labelled hemolytic if this concentration is below 100 μg/ml, since a lower value means that less peptide is needed to lyse the cells. The model predicts whether a query peptide is likely to cause RBC lysis. Identifying non-hemolytic antimicrobials is crucial for safe applications against bacterial infections.
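The labelling rule can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function and constant names are hypothetical, and the direction of the comparison is an assumption based on how the activity is defined (it is the concentration that lyses 50% of RBCs, so a lower value indicates a more hemolytic peptide).

```python
# Threshold taken from the article; the constant name is made up for this sketch.
HEMOLYSIS_THRESHOLD_UG_PER_ML = 100.0


def is_hemolytic(hc50_ug_per_ml: float) -> bool:
    """Label a peptide from its extrapolated 50%-lysis concentration.

    Assumption: a concentration below the threshold counts as hemolytic,
    because less peptide is then needed to lyse half of the RBCs.
    """
    return hc50_ug_per_ml < HEMOLYSIS_THRESHOLD_UG_PER_ML


def positive_fraction(hc50_values) -> float:
    """Fraction of measurements that would be labelled hemolytic."""
    labels = [is_hemolytic(value) for value in hc50_values]
    return sum(labels) / len(labels)
```

Applied to a full set of measurements, `positive_fraction` gives the kind of class balance reported for the training data (19.6% positives).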
The solubility model, MahLooL, was trained on 18,453 sequences (47.6% positives and 52.4% negatives) from the PROSO II database. The nonfouling model, which predicts resistance to nonspecific interactions, was trained on 3,600 positive sequences. Its negative examples are based on 13,585 sequences from insoluble and hemolytic peptides, as well as scrambled positives. Understanding nonspecific interactions is essential in cheminformatics studies and effective drug design.
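One plausible reading of "scrambled positives" is that the residues of each positive sequence are shuffled, which keeps the amino acid composition but destroys the order. The sketch below illustrates that reading only. The function names, the seed, and the rule of discarding shuffles that collide with a known positive are assumptions, not details from the paper.

```python
import random


def scramble(sequence: str, rng: random.Random) -> str:
    """Return the same residues in a random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def scrambled_negatives(positives, seed: int = 0):
    """Build negative examples by shuffling each positive sequence.

    A shuffle that happens to reproduce a known positive is dropped,
    so no sequence ends up carrying both labels.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    negatives = []
    for sequence in positives:
        shuffled = scramble(sequence, rng)
        if shuffled not in positive_set:
            negatives.append(shuffled)
    return negatives
```

Because composition is preserved, a model cannot separate these negatives from the positives by amino acid counts alone and has to use the order of residues.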
Model Architecture
Deep learning models are valuable for large cheminformatics datasets because they extract features from raw high-dimensional data more effectively than classical machine learning algorithms. MahLooL is built as a recurrent neural network (RNN), using a sequential model from the Keras framework with the TensorFlow deep learning library as the back end, to identify position-invariant patterns in the peptide sequence. The Long Short-Term Memory (LSTM) layers of the RNN extract sequence correlations and dependency information between distant N- and C-terminal amino acid residues within the peptide sequences.
The peptide sequences are represented as integer-encoded vectors, in which each integer corresponds to the index of an amino acid. The integer-encoded sequence is passed to an embedding layer, which converts each amino acid index into a fixed-length vector. The output of the embedding layer goes to either a double-stacked bidirectional LSTM (bi-LSTM) layer or a single LSTM layer, which recognizes patterns along the sequence even when they are separated by large gaps. The bi-LSTM variant is used for solubility and hemolysis, and the single LSTM for nonfouling. The concatenated output is normalized and fed to a dropout layer with a rate of 10%, followed by a dense neural network with a rectified linear unit (ReLU) activation function. This is repeated three times, and the final single-node dense layer uses a sigmoid activation function to constrain the final prediction to a value between 0 and 1. Ablation analysis was performed to evaluate the contribution of different architectural components to the model’s performance, and it showed the bi-LSTM to be the component that contributes most. The trained models are implemented in JavaScript and loaded in a web browser.
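The encoding step and the layer stack can be sketched in Python with Keras. This is a hedged illustration, not the authors' implementation: the alphabet ordering, the use of index 0 for padding, the maximum length, the embedding size, the LSTM and dense widths, and the choice of layer normalization are all assumptions, since the article gives none of them. Only the overall layer order, the 10% dropout, the ReLU and sigmoid activations, and the three repetitions come from the description above.

```python
# Assumed alphabet: the 20 canonical amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_INDEX = 0  # assumption: index 0 is reserved for padding
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, max_len: int = 200) -> list:
    """Integer-encode a peptide and right-pad it to a fixed length."""
    sequence = sequence.upper()
    if len(sequence) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    try:
        indices = [AA_TO_INDEX[aa] for aa in sequence]
    except KeyError as err:
        raise ValueError(f"non-canonical residue: {err.args[0]}") from None
    return indices + [PAD_INDEX] * (max_len - len(indices))


def build_model(bidirectional: bool = True, embed_dim: int = 32,
                lstm_units: int = 64, dense_units=(64, 32, 16)):
    """Assemble the layer stack; all sizes are placeholder assumptions."""
    # Imported here so that encode() works without TensorFlow installed.
    from tensorflow import keras
    layers = keras.layers

    model = keras.Sequential()
    # 20 residues plus the padding index; padded positions are masked.
    model.add(layers.Embedding(input_dim=len(AMINO_ACIDS) + 1,
                               output_dim=embed_dim, mask_zero=True))
    if bidirectional:
        # Double-stacked bi-LSTM (solubility and hemolysis).
        model.add(layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True)))
        model.add(layers.Bidirectional(layers.LSTM(lstm_units)))
    else:
        # Single LSTM (nonfouling).
        model.add(layers.LSTM(lstm_units))
    # Normalize, drop out 10%, dense + ReLU, repeated three times.
    for units in dense_units:
        model.add(layers.LayerNormalization())
        model.add(layers.Dropout(0.1))
        model.add(layers.Dense(units, activation="relu"))
    # A single sigmoid node keeps the prediction between 0 and 1.
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```

Under this alphabet ordering, `encode("ACD", max_len=5)` returns `[1, 2, 3, 0, 0]`. Calling `build_model(bidirectional=False)` gives the single-LSTM variant described for nonfouling.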
Conclusion
MahLooL achieves 70% accuracy on solubility prediction. For short peptides (18−50 residues long), it outperforms DSResSol, which uses Squeeze-and-Excitation (SE) residual networks with dilated convolutional neural networks (CNNs) to predict the solubility of a peptide from its sequence. Beyond solubility, the same architecture achieved competitive performance in nonfouling and hemolysis prediction compared with state-of-the-art methods. The developers plan to update the model each year. Moreover, the serverless approach promises to ease the constraints faced by less privileged universities and to extend cheminformatics research to a larger community, promoting open science.
Article Source: Reference Paper | Data Availability: GitHub | Models Availability: JavaScript
Aditi is a consulting scientific writing intern at CBIRT, specializing in explaining interdisciplinary and intricate topics. As a student pursuing an Integrated PG in Biotechnology, she is driven by a deep passion for experiencing multidisciplinary research fields. Aditi is particularly fond of the dynamism, potential, and integrative facets of her major. Through her articles, she aspires to decipher and articulate current studies and innovations in the Bioinformatics domain, aiming to captivate the minds and hearts of readers with her insightful perspectives.