CRISPR interference (CRISPRi) is one of the most widely used CRISPR technologies in bacteria. In CRISPRi, a catalytically dead Cas protein (dCas) that cannot cleave DNA is targeted to a gene of choice to interfere with its transcription. Because silencing depends on complex interactions between guide and target features, accurately predicting guide efficiency remains a challenge. To tackle this problem, researchers introduced a mixed-effect random forest machine learning model, MERF, which uses both guide-specific and gene-specific information to predict CRISPRi guide efficiency. MERF is trained on large datasets from multiple genome-wide CRISPRi essentiality screens in E. coli K12 MG1655. It outperformed the other models it was compared with at predicting guide efficiency.

Introduction

The main applications of CRISPR-Cas in bacteria have come from using dCas as a platform technology that delivers effectors to a specific locus in a programmable fashion. CRISPRi is the simplest example: the dCas protein itself serves as the effector, silencing gene expression by physically blocking the binding or progression of RNA polymerase. CRISPRi has opened up a range of biological applications, from silencing individual genes for genetic studies to performing genome-wide fitness screens and engineering genetic circuits. Because CRISPRi can directly target particular genes of interest, it avoids the need for large mutant libraries to achieve gene saturation.

Reliable prediction of guide efficiency will therefore become increasingly important as applications of CRISPRi expand. Existing tools and methods are trained on limited data and lack features that are crucial for scaling up and optimizing CRISPRi applications in bacteria. By integrating data from three gene essentiality screens in E. coli, the researchers’ mixed-effect machine learning approach provides a general strategy for learning CRISPRi guide efficiency when only indirect measurements are available.

Methods

  • Training datasets: Data on gRNA sequences, targeted genes, gene positions, and fitness effects were taken from three published genome-wide CRISPRi essentiality screens in E. coli K12 MG1655.
  • Feature engineering: For each gRNA and gene, 138 sequence, thermodynamic, genomic, and transcriptomic features were extracted.
  • Cross-validation: Model performance was evaluated with tenfold cross-validation for both depletion and guide efficiency prediction.
  • Predictive models: i) The Mixed-Effect Random Forest (MERF) model was used to achieve higher accuracy while accounting for both guide and gene effects. ii) Other models, including simple linear regression, elastic net, SVR, and histogram-based gradient boosting, were trained for comparison. iii) Deep learning models, namely a custom 1D CNN and CGx_CRISPRi, were also implemented and trained.
  • Model interpretation: Feature importance and feature interactions in MERF and the other tree-based models were analyzed with TreeExplainer from the SHAP package.
  • Validation: GFP silencing efficiency in E. coli and S. Typhimurium was measured by flow cytometry and compared with predicted values. Model performance was also validated on a saturating CRISPRi screen targeting purine biosynthesis genes in E. coli under competitive growth conditions.
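
The feature engineering and cross-validation steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names (`gc_content`, `one_hot`, `guide_features`, `genewise_folds`), the short toy sequences, and the handful of features shown are assumptions chosen for readability, whereas the study extracted 138 features. The fold helper keeps all guides that target the same gene in the same fold, which is one simple way to evaluate a model on genes it has not seen.

```python
BASES = "ACGT"

def gc_content(seq):
    """Fraction of G and C bases in a guide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def one_hot(seq):
    """Flatten a sequence into a position-wise one-hot vector (A, C, G, T)."""
    return [int(base == b) for base in seq.upper() for b in BASES]

def guide_features(seq, distance_to_start):
    """Toy feature vector: GC content, distance to start codon, one-hot bases."""
    return [gc_content(seq), distance_to_start] + one_hot(seq)

def genewise_folds(genes, n_folds):
    """Assign each guide a fold so that all guides of a gene share one fold."""
    fold_of_gene = {}
    for gene in genes:
        if gene not in fold_of_gene:
            # Genes are dealt to folds round-robin in order of first appearance.
            fold_of_gene[gene] = len(fold_of_gene) % n_folds
    return [fold_of_gene[gene] for gene in genes]

# Hypothetical guides: (sequence, targeted gene, distance to start codon).
# Sequences are shortened to 8 nt here purely to keep the example small.
guides = [
    ("ACGTGCCA", "purA", 12),
    ("TTGACGGC", "purA", 48),
    ("GGCATTAC", "purB", 30),
    ("CCGGAATT", "purC", 5),
]
features = [guide_features(seq, dist) for seq, _, dist in guides]
folds = genewise_folds([gene for _, gene, _ in guides], n_folds=2)
```

Each row of `features` could then be passed to any regression model, with `folds` deciding which guides are held out together.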

Key Findings

  • Gene-Specific Effects Dominate Depletion: Models that considered only guide features had limited predictive power (ρ ~ 0.25). Incorporating gene features such as GC content and gene expression, which has the strongest effect on depletion, significantly improved prediction accuracy (ρ ~ 0.66).
  • Data Fusion Enhances Prediction: Combining multiple datasets improved accuracy and generalizability. Tree-based models such as random forests and gradient-boosted trees benefited the most from data fusion, with Spearman correlations increasing from the roughly 0.62-0.68 observed for single-dataset models.
  • MERF Model Isolates Guide Efficiency: MERF separates guide efficiency from gene-specific effects and outperforms the Pasteur model. Its accuracy in predicting guide efficiency was confirmed by gene-wise cross-validation.
  • SHAP Analysis Reveals Design Rules: The distance of the target site from the start codon affects efficiency, and the first 60 bases are the most important. This distance effect is stronger for guides that target the first gene in an operon. Nucleotides around the PAM sequence also influence efficiency: for example, a cytosine downstream of the PAM increases efficiency, whereas a guanine at the same position decreases it.
  • Independent Validation with Purine Screen: Data fusion and the MERF model were used to improve performance on a saturating screen of purine biosynthesis genes. MERF achieved higher Spearman correlations and a higher positive predictive value.
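
The idea behind the MERF finding above, separating a guide-level signal from a gene-level offset, can be illustrated with a deliberately simplified sketch. It is not MERF itself: a one-feature least-squares fit stands in for the random forest, a plain per-gene mean residual stands in for the random effect (MERF estimates these within an expectation-maximization procedure), and the data, the gene names, and the function `fit_guide_and_gene_effects` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_guide_and_gene_effects(xs, ys, genes, n_iter=100):
    """Alternate between a guide-level fit and per-gene offsets (backfitting)."""
    offset = {gene: 0.0 for gene in genes}
    slope, intercept = 0.0, 0.0
    for _ in range(n_iter):
        # Guide-level model, fitted after removing the current gene offsets.
        adjusted = [y - offset[g] for y, g in zip(ys, genes)]
        slope, intercept = fit_line(xs, adjusted)
        # Gene offsets: mean residual of each gene under the guide-level model.
        residuals = {}
        for x, y, g in zip(xs, ys, genes):
            residuals.setdefault(g, []).append(y - (slope * x + intercept))
        offset = {g: sum(r) / len(r) for g, r in residuals.items()}
    return slope, intercept, offset

# Hypothetical noise-free data: depletion = -2 * guide_feature + gene offset.
genes = ["geneA"] * 3 + ["geneB"] * 3 + ["geneC"] * 3
xs = [0, 1, 2, 1, 2, 3, 2, 3, 4]
true_offset = {"geneA": 0.0, "geneB": -3.0, "geneC": -1.0}
ys = [-2.0 * x + true_offset[g] for x, g in zip(xs, genes)]

naive_slope, _ = fit_line(xs, ys)  # ignores the gene effect entirely
slope, intercept, offset = fit_guide_and_gene_effects(xs, ys, genes)
```

On this toy data the naive fit returns a slope of -2.25, because the gene offsets are correlated with the guide feature across genes, while the alternating fit converges to the true guide-level slope of -2. The same confounding is why depletion alone is only an indirect measurement of guide efficiency.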

Future applications

  • Multiplexed CRISPRi: Fitness interaction screening and metabolic engineering are among the potential applications of the MERF model.
  • Large-scale applications: Better guide design tools are required to ensure robust results, and MERF addresses this need by enabling efficient guide screening.
  • Generalizability: The techniques are applicable to CRISPRi with different Cas proteins and bacteria, and potentially to other CRISPR technologies.
  • Characterizing new dCas proteins: AutoML and AI approaches can be used to identify design rules for efficient silencing.
  • New insights into CRISPRi behavior: The study revealed unexpected features, such as distinct patterns of depletion across genes, which can be decreasing, mixed, or constant.

Conclusion

Accurate prediction of CRISPRi guide efficiency matters because CRISPRi is the leading technique for silencing gene expression in bacteria, yet its design rules remain poorly defined. The researchers developed a best-in-class prediction algorithm for guide silencing efficiency by systematically investigating the factors that influence guide depletion in genome-wide essentiality screens, with the surprising discovery that gene-specific features substantially impact prediction. Their mixed-effect random forest regression model provides better estimates of guide efficiency than the existing methods and models it was compared with.

The researchers further applied methods from explainable AI to extract interpretable design rules from the model. Beyond yielding a predictor for CRISPRi guide efficiency, the process of model development and validation provided several new insights into the behavior of CRISPRi screens. CRISPRi itself can be used to modulate biosynthetic pathways to optimize the production of a particular metabolite for industrial applications. It can also be used to engineer synthetic regulatory circuits or metabolic networks, where collections of gRNAs coordinately downregulate and upregulate associated genes and pathways.

All of these applications depend critically on the silencing efficiency of the selected guides. Genetic screens already routinely employ tens of thousands of guides simultaneously, and it is impractical to test each guide’s efficiency individually. This problem will only grow as the scale of applications increases through CRISPR array technologies, which allow multiplexed expression of suites of guides to dissect and engineer increasingly complex phenotypes. To address this, the researchers trained machine learning methods, including deep learning models, on tens of thousands of guide efficiency measurements and fused datasets to further improve performance. The result is a mixed-effect random forest that distinguishes features affecting guide efficiency from effects caused by the targeted gene while learning from several independent CRISPRi screens. It provides a general strategy for learning CRISPRi guide efficiency.
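
The separation of guide-level and gene-level contributions can be written compactly. What follows is the general mixed-effects random forest formulation, offered as a sketch; reading the clusters $i$ as targeted genes and the fixed-effect features $X_i$ as guide features is an interpretation of the description above, not a statement of the authors' exact specification.

```latex
y_i = f(X_i) + Z_i b_i + \varepsilon_i,
\qquad b_i \sim \mathcal{N}(0, D),
\qquad \varepsilon_i \sim \mathcal{N}(0, R_i)
```

Here $y_i$ collects the measured depletion values for the guides in cluster $i$, $f$ is the random forest learned on the fixed-effect features, $Z_i b_i$ is the cluster-specific random effect, and $\varepsilon_i$ is residual noise. Under this reading, $f(X_i)$ is the part of the signal attributed to the guide, and $Z_i b_i$ absorbs what is specific to the targeted gene.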

Although the researchers focused on applications of CRISPRi with dCas9 in E. coli, the techniques they developed are, in principle, generic and could be extended to CRISPRi with any catalytically dead nuclease in any bacterium of interest, or even to entirely different CRISPR technologies. The approach should be broadly applicable to predicting the efficiency of CRISPR guides in systems where only indirect measurements of guide activity are available.

Article Source: Reference Paper | Reference Article | Implementation of the final MERF model is available at Webserver


Anchal is a consulting scientific writing intern at CBIRT with a passion for bioinformatics and its miracles. She is pursuing an MTech in Bioinformatics from Delhi Technological University, Delhi. Through engaging prose, she invites readers to explore the captivating world of bioinformatics, showcasing its groundbreaking contributions to understanding the mysteries of life. Besides science, she enjoys reading and painting.
