A new deep learning method developed by a team of researchers at Chalmers can generate regulatory DNA sequences that control gene expression in a gene-specific manner. Despite significant sequence divergence from natural DNA, in vivo measurements reveal that 57% of the highly expressed synthetic sequences surpass the expression levels of the corresponding highly expressed natural controls. With the help of this technology, vaccines, drugs against severe diseases, and alternative foods could be developed and produced much more quickly and at a significantly lower cost.
The researchers investigated whether de novo functional regulatory DNA, encompassing the whole gene regulatory structure and providing desired gene expression levels, can be produced from the information in natural regulatory sequences alone.
The major goal is to directly correlate natural genomic data from Saccharomyces cerevisiae with the functional DNA regulatory sequence space, allowing for the controlled design of expression systems.
Gene expression is the process through which a gene in a cell is activated to produce RNA and proteins. The level of gene expression can be assessed through the RNA, the protein produced from the RNA, or the function of the protein in the cell. To learn directly from genomic and transcriptomic data, a deep learning technique based on generative adversarial networks (GAN) has been prototyped. ExpressionGAN can generate regulatory DNA with predetermined target mRNA levels across the full gene regulatory structure, including coding and neighboring non-coding regions, and can traverse the complete regulatory sequence-expression landscape in a gene-specific way.
In order to create synthetic regulatory DNA with variable expression levels, it is common practice to stack many functional sequence motifs.
Finding functional sequence variants typically requires numerous iterations of experimental screening or the evaluation of huge batches of sequences. Because relating sequence to expression is intrinsically difficult and mutagenesis-based techniques are resource-intensive, the DNA investigated is mostly restricted to short portions of single regulatory regions and to particular reporter genes. In the end, this limits how well gene expression can be regulated and falls short of the main design goal.
An alternative approach is to directly construct valid sequences by learning functional and biologically viable sequence spaces, which is made possible by deep neural networks. Ultimately, the complete gene regulatory framework needs to be adjusted in order to precisely control gene expression.
Based on recent advances in modeling DNA and protein spaces, it can therefore be inferred that modern generative deep neural networks are capable of learning the whole DNA regulatory landscape directly from natural genomic sequences and transcriptome data. By using information from the entire gene regulatory structure, including the coding region, de novo regulatory DNA with highly accurate target expression levels can be synthesized. This would overcome the drawbacks of current approaches and enable exact, gene-specific navigation of the regulatory sequence space in conceivably any organism and tissue.
Deep neural frameworks for de novo synthesis of regulatory DNA structures
Model for designing natural-like regulatory DNA: Multiple deep neural networks were tested with sequence data from various region combinations as input and median TPM (transcripts per million) values as the target. The tests showed that only the entire gene regulatory structure spans the key regulatory features necessary for predicting the full dynamic range of mRNA expression levels.
The following DNA sequence characteristics were examined:
(i) Compositional validity of the DNA sequences;
(ii) Sequence similarity metrics; and
(iii) Known cis-regulatory grammar.
For instance, as training progressed, the average number of regulatory motifs, such as transcription factor binding sites (TFBS), and of their combinations grew, peaking at 200,000 training rounds.
In each sequence, the model accurately reproduced the known DNA regulatory grammar as well as features of fundamental sequence composition, such as GC content and UTR sizes.
This includes:
(i) Core promoter elements, including the TATA box in promoters, and canonical motifs of transcription factor binding sites (TFBS) from the JASPAR database, detected using FIMO at a q-value threshold of 0.05;
(ii) Kozak sequences in 5′ UTRs;
(iii) Termination-related motifs, such as placement, efficiency, and poly-AT motifs in 3′ UTRs and terminators;
(iv) Previously discovered expression-related motifs and motif associations; and
(v) Sites predicted to be nucleosome-depleted.
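As a rough illustration of what checking for one of these elements can look like, the sketch below counts TATA-box-like sites with a regular expression for the consensus TATAWAWR. This is a simplification and not the procedure used in the study, which relied on FIMO and JASPAR motif matrices; the function names and the example promoter string are hypothetical.

```python
import re

# Consensus TATA box TATAWAWR (W = A or T, R = A or G), written as a regex.
# The lookahead lets overlapping occurrences be counted as well.
# Assumption: a plain consensus match stands in for a scored motif scan.
TATA_BOX = re.compile(r"(?=TATA[AT]A[AT][AG])")


def motif_positions(seq):
    """Return 0-based start positions of TATA-box-like sites in seq."""
    return [match.start() for match in TATA_BOX.finditer(seq.upper())]


def count_tata_boxes(seq):
    """Count TATA-box-like sites, including overlapping ones."""
    return len(motif_positions(seq))


# Hypothetical promoter fragment, for illustration only.
promoter = "GGCGTATAAAAGGCTATATATACC"
print(count_tata_boxes(promoter), motif_positions(promoter))
```

A consensus regex treats every mismatch as a miss, whereas a position weight matrix scan scores partial matches, so counts from the two approaches will generally differ.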
Gene-specific navigation of sequence evolution
To use the generative model to create regulatory DNA with desired expression levels, an optimization approach was developed to guide sequence evolution. In a nutshell, the regulatory DNA sequence generator and the gene expression predictor are coupled within a well-validated activation-maximization framework to form the ExpressionGAN joint deep neural network architecture. The idea is to exploit the generator's capacity to create regulatory sequences with realistic DNA characteristics, while the predictor steers and fine-tunes the generated sequences toward the desired gene expression levels.
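The coupling of generator and predictor can be pictured with a toy example. The sketch below is not ExpressionGAN: both "networks" are hypothetical stand-ins with random weights, the gradient is estimated by finite differences, and a backtracking step size is an added convenience. It only shows the general idea of activation maximization, in which the latent input of a fixed generator is adjusted until a fixed predictor reports a value close to the target.

```python
import math
import random

random.seed(0)
LATENT_DIM, SEQ_LEN = 6, 20
ALPHABET = "ACGT"

# Hypothetical stand-ins for trained networks: random weights, illustration only.
W_GEN = [[random.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(SEQ_LEN * 4)]
W_PRED = [[random.gauss(0, 1) for _ in range(4)] for _ in range(SEQ_LEN)]


def generator(z):
    """Map a latent vector to per-position nucleotide probabilities (softmax)."""
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in W_GEN]
    probs = []
    for pos in range(SEQ_LEN):
        row = logits[4 * pos: 4 * pos + 4]
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def predictor(probs):
    """Toy linear 'expression' score of a probabilistic sequence."""
    return sum(p * w for prow, wrow in zip(probs, W_PRED) for p, w in zip(prow, wrow))


def loss(z, target):
    """Squared distance between predicted and target expression."""
    return (predictor(generator(z)) - target) ** 2


def gradient(z, target, eps=1e-5):
    """Central finite-difference estimate of the loss gradient w.r.t. z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss(up, target) - loss(down, target)) / (2 * eps))
    return grad


def optimize_latent(z, target, n_steps=100):
    """Gradient descent on the latent vector; weights of both models stay fixed."""
    z = list(z)
    for _ in range(n_steps):
        grad = gradient(z, target)
        lr = 1.0
        while lr > 1e-8:  # halve the step until the loss improves
            candidate = [zi - lr * gi for zi, gi in zip(z, grad)]
            if loss(candidate, target) < loss(z, target):
                z = candidate
                break
            lr *= 0.5
    return z


z0 = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
target = 2.0
z_opt = optimize_latent(z0, target)
initial_loss, final_loss = loss(z0, target), loss(z_opt, target)
sequence = "".join(ALPHABET[row.index(max(row))] for row in generator(z_opt))
print(initial_loss, final_loss, sequence)
```

Because only the latent vector changes, every candidate stays within whatever sequence space the generator can produce, which is the property the article attributes to the combined architecture.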
Augmented control of expression using whole gene regulatory structure
The regions used were the promoter (400 bp), the UTRs (100 and 250 bp, respectively), and the terminator (250 bp). In addition to these whole regions, two shorter sections of the promoter were employed: a core promoter region 170 bp upstream of the transcription start site (TSS) and an 80 bp proximal promoter region positioned between 170 and 90 bp upstream of the TSS.
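To make the region layout concrete, the following sketch splits a concatenated regulatory sequence into the four whole regions by the lengths quoted above. It rests on assumptions that are not stated in the article: that the 100 bp and 250 bp UTR lengths refer to the 5′ and 3′ UTR in that order, that the regions are concatenated in the order shown, and that the coding sequence is left out. The names `REGION_LENGTHS` and `split_regions` are hypothetical.

```python
# Region lengths in bp as quoted in the text. Assumption: the two UTR lengths
# are given in 5'-then-3' order, and the coding sequence is omitted.
REGION_LENGTHS = {"promoter": 400, "utr5": 100, "utr3": 250, "terminator": 250}


def split_regions(construct):
    """Split a concatenated regulatory sequence into its named regions."""
    expected = sum(REGION_LENGTHS.values())
    if len(construct) != expected:
        raise ValueError(f"expected {expected} bp, got {len(construct)} bp")
    regions, start = {}, 0
    for name, length in REGION_LENGTHS.items():
        regions[name] = construct[start:start + length]
        start += length
    return regions


# Placeholder construct: each region filled with a single base for visibility.
construct = "A" * 400 + "C" * 100 + "G" * 250 + "T" * 250
regions = split_regions(construct)
print({name: len(seq) for name, seq in regions.items()})
```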
Generated regulatory DNA containing sequence elements that regulate gene expression.
To contrast high and low expression levels, ExpressionGAN-generated sequences were sampled after 50,000 optimizer iterations and divided into high- and low-expression bins of 10,000 samples each. This also served to confirm that the sequences were genuine and distinct from any natural sequence. Across the promoters, terminators, and UTRs, the GC content of the high-expression sequences was either lower (by 12%) or higher (by 7%) than that of the low-expression sequences, depending on the region. This difference in GC content was significant (Wilcoxon rank-sum test, p-value 1e-16). Taken together, the results indicate that ExpressionGAN can synthesize features associated with both low and high expression throughout the entire gene regulatory structure, in line with existing knowledge, and that it recapitulates the primary functional characteristics of the core promoter.
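The kind of comparison described here can be sketched in a few lines. The example below computes GC content for two small bins of sequences and applies a two-sided Wilcoxon rank-sum test using the normal approximation, with average ranks for ties and no tie correction of the variance. The sequences are invented for illustration and the function names are hypothetical; the study itself compared bins of 10,000 generated sequences.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Ties receive average ranks; the variance is not tie-corrected,
    which is a simplification.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented toy bins, for illustration only.
high_bin = ["ATATATAT", "ATATATGA", "TTAAGCAT", "AATTAATG", "TATAAACG"]
low_bin = ["GCGCATGC", "GGCCATAT", "GCGGCCAT", "CCGGTTGC", "GCATGCGC"]

gc_high = [gc_content(s) for s in high_bin]
gc_low = [gc_content(s) for s in low_bin]
z_score, p_value = rank_sum_test(gc_high, gc_low)
print(z_score, p_value)
```

A negative z-score here means the first bin tends to have lower GC content than the second.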
Generated regulatory DNA for in vivo gene expression control
The total number of cis-regulatory motifs rose steadily with the predicted gene expression level, and the selected generated sequences showed features comparable to those of natural sequences, suggesting that they may be functional. Lastly, a key benefit of ExpressionGAN is that the naturally occurring regulatory principles built into the generator constrain the exploration of regulatory sequence space, so the predictor chooses among realistic and biologically plausible sequences.
Producing accurate DNA regulatory sequences that match the entire gene regulatory structure requires a generative model, here one based on a generative adversarial network (GAN). The fundamental question is whether de novo functional regulatory DNA, encompassing the whole gene regulatory structure and achieving desired gene expression levels, can be produced from knowledge of natural regulatory sequences alone.
A mere 100 bp promoter sequence can be built in over 10^60 different ways, accounting for more DNA variation than all of the planet’s living things combined. Due to the tremendous diversity of eukaryotic organisms and the complexity of eukaryotic gene regulation, it is difficult and sometimes impossible to experimentally explore even a small portion of such a massive sequence space.
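The size of this sequence space follows from having four possible bases at each of 100 positions, which gives 4^100, or roughly 1.6 x 10^60, distinct sequences. A short check of that arithmetic (variable names are illustrative):

```python
SEQ_LEN = 100   # promoter length in bp
N_BASES = 4     # A, C, G, T

# Python integers have arbitrary precision, so this is exact.
n_sequences = N_BASES ** SEQ_LEN

print(f"{n_sequences:.3e}")
print(n_sequences > 10 ** 60)
```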
Cutting-edge deep learning models are utilized to directly learn and map the functional DNA regulatory sequence space to gene expression levels in Saccharomyces cerevisiae, allowing for the controlled construction of expression systems.
Article Source: Reference Paper
Riya Vishwakarma is a consulting content writing intern at CBIRT. Currently, she's pursuing a Master's in Biotechnology from Govt. VYT PG Autonomous College, Chhattisgarh. With a strong inclination towards research, she is tech-savvy and has a keen interest in content writing and digital handling. She has worked as a writer for three years, gaining experience in literary writing, and looks forward to many more.