Synthetic data generation in omics produces datasets that mimic real-world biological data, which is useful for training and evaluating genomic analysis tools, studying differential expression, and exploring data structure. Previously, researchers created Precious1GPT, a multimodal transformer trained on metadata, transcriptomic, and methylation data to predict biological age and identify dual-purpose therapeutic targets that may be linked to both the aging process and age-related disorders. In this work, the authors present Precious2GPT, a multimodal architecture that combines Conditional Diffusion (CDiffusion) with decoder-only Multi-omics Pretrained Transformer (MoPT) models trained on gene expression and DNA methylation data. In synthetic data generation, Precious2GPT outperforms CDiffusion, MoPT, and Conditional Generative Adversarial Networks (CGANs). The newest and most sophisticated model in the series is Precious3GPT, a multimodal AI that can integrate and interpret omics-level data from many tissues and cell lines. It covers a range of experimental conditions and brings together proteomics, RNA sequencing, and DNA methylation data from diverse species in a single model.
Introduction
In omics, particularly genomics, transcriptomics, and proteomics, synthetic data generation means building artificial datasets that replicate the properties of actual biological data. Its benefits include advancing computational methods, protecting privacy, and augmenting scarce real-world data. Copula-based techniques and generative adversarial networks (GANs) are among the models used to produce synthetic genomic data, including bulk RNA-seq data and DNA sequences. A more recent addition to deep learning is diffusion models, which learn to gradually transform a simple noise distribution into the target data distribution by reversing a step-by-step noising (diffusion) process. Large language models (LLMs) such as Generative Pre-trained Transformer 2 (GPT-2) have made major contributions to language generation, understanding, and prediction, as well as to sequential data analysis. Their application to omics data, however, is still at a very early stage.
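To make the diffusion idea above concrete, here is a minimal sketch of the forward (noising) half of a standard denoising-diffusion formulation, which a generative model is then trained to reverse. It is an illustration only: the linear noise schedule, the function names, and the toy expression values are all assumptions, and nothing here is taken from the CDiffusion model described in the paper.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t] = product of (1 - beta) up to step t: the share of signal variance kept."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, alpha_bar, rng):
    """Draw x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, 1)."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * value + spread * rng.gauss(0.0, 1.0) for value in x0]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
x0 = [2.5, -1.0, 0.3, 4.1]                  # hypothetical normalised expression values
alpha_bars = cumulative_signal(linear_beta_schedule(1000))

early = noise_sample(x0, alpha_bars[0], rng)    # almost identical to x0
late = noise_sample(x0, alpha_bars[-1], rng)    # almost pure Gaussian noise
print("signal kept at first step:", alpha_bars[0])
print("signal kept at last step:", alpha_bars[-1])
```

After the first step the sample is nearly unchanged, while after the last step almost none of the original signal remains. Generation runs in the opposite direction: a trained network starts from noise and removes it step by step, optionally conditioned on labels such as tissue or age.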
Understanding Precious2GPT
The work offers a novel approach to producing multi-omics data by combining a diffusion model with a language model. The authors applied deep learning models including Multi-omics Pretrained Transformers (MoPT), Conditional Diffusion (CDiffusion), and Conditional GANs (CGANs), in addition to Precious2GPT (P2GPT), which combines the CDiffusion and MoPT models. By concentrating on transcriptome and DNA methylation data, P2GPT addresses problems in tissue classification, age prediction, and the identification of critical signaling pathways, with the two combined strategies working in concert to give it an advantage on these tasks. A case study involving colorectal cancer demonstrated P2GPT's ability to generate accurate simulated biological data for bioinformatic studies.
More about Precious3GPT
A genuinely multimodal transformer-based model, Precious3GPT was trained to mimic case-control study workflows, with a focus on chemical perturbations. To perform several tasks in drug discovery, aging research, and synthetic data generation, the researchers tokenize multiple data types and train a single global model. Because it has an API that is compatible with ChatGPT, Precious3GPT can be integrated into custom AI-based workflows to support complex research pipelines.
Efficacy and Applications
The transformer-based model trained well and could generate new data conditioned on parameters such as tissue type or age. However, it initially struggled to produce tabular data from continuous gene expression and DNA methylation values. To address this, the authors used the GPT-2 architecture with a modified vocabulary. They developed an encoding scheme in which individual tokens represent each gene and its associated omics value. Treating gene-omics data as pseudo-text made it usable by a transformer-based model. MoPT, presented as the first language model adaptation tailored specifically to the biomedical field for producing tabular data, demonstrated the potential of transformer architectures in bioinformatics applications.
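The pseudo-text idea can be sketched in a few lines. The example below is a hypothetical illustration, not the encoding used by MoPT: the token formats (`<key=value>`, `<bin_N>`), the number of bins, the value range, and the gene values are all assumptions chosen for readability. It shows one common way to turn a row of continuous omics values into discrete tokens, by binning each value and pairing it with a gene token, and how to map generated tokens back to a table row.

```python
def value_to_bin(value, low, high, num_bins):
    """Map a continuous value onto an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)          # keep value == high in the last bin

def bin_to_value(index, low, high, num_bins):
    """Return the centre of a bin, used as the reconstructed value."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

def encode_sample(conditions, profile, low=0.0, high=16.0, num_bins=64):
    """Turn metadata plus a gene -> value mapping into a flat token sequence."""
    tokens = [f"<{key}={val}>" for key, val in conditions.items()]
    for gene, value in profile.items():
        tokens.append(gene)                  # one token naming the gene
        tokens.append(f"<bin_{value_to_bin(value, low, high, num_bins)}>")
    return tokens

def decode_sample(tokens, low=0.0, high=16.0, num_bins=64):
    """Invert encode_sample: recover metadata and approximate values."""
    conditions, profile = {}, {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("<") and "=" in token:
            key, val = token[1:-1].split("=", 1)
            conditions[key] = val
            i += 1
        else:
            index = int(tokens[i + 1][len("<bin_"):-1])
            profile[token] = bin_to_value(index, low, high, num_bins)
            i += 2
    return conditions, profile

# Hypothetical sample: condition labels plus log-scale expression values.
conditions = {"tissue": "colon", "age": "60"}
profile = {"TP53": 7.3, "MYC": 10.0, "GAPDH": 13.9}

tokens = encode_sample(conditions, profile)
decoded_conditions, decoded_profile = decode_sample(tokens)
print(tokens)
print(decoded_profile)
```

Binning loses some precision (at most half a bin width per value), which is the price of fitting continuous measurements into a fixed vocabulary. Placing the condition tokens first means a decoder-only model can be prompted with a tissue and an age and asked to continue the sequence.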
Chronic inflammation, a hallmark of aging, is associated with dysregulated immune function, altered cell lineage, and changes in signaling pathways. These processes, which the P2GPT model's out-of-scope (OOS) experiments brought to light, contribute to the rising disease burden in the elderly population. The model's findings underscore the value of sophisticated computational models for understanding the molecular foundations of aging and identifying possible treatment strategies. The synthetic data the model generates make it possible to identify biologically significant pathways and processes, demonstrating the potential of such models to help mitigate the negative consequences of aging.
Conclusion
The research presents a hybrid method, Precious2GPT (P2GPT), that combines the advantages of the MoPT and CDiffusion models to produce high-quality multi-omics data covering DNA methylation and gene expression. By reducing the limitations of the individual models, this approach improves the generation process. It may be used in data analysis, algorithm development, and privacy protection in multi-omics research. Comparing the quality of the data produced by the separate CGAN, CDiffusion, and MoPT models, and evaluating their specificity to different species and tissue types, shows how effective the hybrid strategy is. By addressing the model's remaining limitations, future research aims to improve its generalisability, precision, and thoroughness, making it a valuable resource for translational medical studies and biological discoveries.
Deotima is a consulting scientific content writing intern at CBIRT. She is currently pursuing a Master's in Bioinformatics at Maulana Abul Kalam Azad University of Technology. As an emerging scientific writer, she is eager to apply her expertise to making intricate scientific concepts comprehensible to readers from diverse backgrounds. Deotima has a particular passion for Structural Bioinformatics and Molecular Dynamics.