Exploring bulk RNA-seq data has become a cornerstone for uncovering gene expression patterns and their relationship to phenotypes in biomedical research. One of the key tools in this domain is DESeq2, a widely used R package for differential expression analysis. However, a recent trend towards Python-based bioinformatics tools has highlighted the need for a native Python package to perform differential expression analysis (DEA) with generalized linear models on bulk RNA-seq data. To address this need, researchers from Owkin France introduced PyDESeq2, a Python implementation of the DESeq2 workflow. PyDESeq2 produces results similar to those of its R counterpart and is easier to combine with Python-based data science tools.
Bridging the Gap: DESeq2 in Python
PyDESeq2’s re-implementation yields results that align closely with those of DESeq2. It achieves a higher model likelihood when modeling raw counts with a negative binomial distribution. PyDESeq2 also shows speed improvements on large datasets in experiments conducted on The Cancer Genome Atlas (TCGA) data.
Python is widely used in the data science community, and PyDESeq2 capitalizes on this by interfacing directly with modern Python-based data science tools. Because it builds on well-maintained and efficient scientific computing packages such as NumPy and SciPy, PyDESeq2 provides a familiar environment for researchers and a smoother workflow.
Workarounds exist, such as Python-to-R bindings like rpy2, but they come with their own challenges. These bindings require installing and maintaining packages in both Python and R, which adds computational overhead and can limit user control. PyDESeq2 removes these hurdles by offering a native Python solution for differential expression analysis.
Implementation and Features
PyDESeq2 follows the differential expression analysis methodology introduced by Love et al. (2014), modeling raw counts using a negative binomial distribution. Dispersion parameters are estimated for each gene through a negative binomial generalized linear model (GLM) and subsequently shrunk towards a global trend curve. These dispersions are then employed to compute gene-wise log-fold changes (LFC) and perform Wald tests for differential expression.
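To make these statistical ingredients concrete, here is a minimal Python sketch of two of them, using only the standard library. The first is the negative binomial log-likelihood of a single count, parameterized by a mean mu and a dispersion alpha so that the variance is mu + alpha * mu^2. The second is the two-sided Wald test on a log-fold change. The function names are illustrative and are not part of the PyDESeq2 API; the package fits these quantities gene by gene with GLMs, which is considerably more involved.

```python
import math


def nb_log_likelihood(count, mu, alpha):
    """Log-probability of one count under a negative binomial distribution.

    Parameterization: mean = mu, variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha  # "size" parameter; a smaller alpha means less overdispersion
    return (
        math.lgamma(count + r)
        - math.lgamma(r)
        - math.lgamma(count + 1)
        + r * math.log(r / (r + mu))
        + count * math.log(mu / (r + mu))
    )


def wald_test(lfc, lfc_se):
    """Two-sided Wald test: z = LFC / SE, p-value from the standard normal."""
    z = lfc / lfc_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For instance, a log-fold change whose estimate is 1.96 standard errors away from zero gives a two-sided p-value of roughly 0.05.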
In version 0.3.5, PyDESeq2 mirrors the default DESeq2 settings. It includes the variance-stabilizing transformation and supports differential expression analysis with Wald tests in single-factor and n-level multi-factor designs with categorical factors. PyDESeq2’s code is organized around two classes: the DeseqDataSet class, which handles the data-modeling steps, and the DeseqStats class, which handles statistical tests and optional log-fold change (LFC) shrinkage. Generalized linear models (GLMs) are fitted using the SciPy and statsmodels Python packages.
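As an illustration of what a multi-factor design with categorical factors means in practice, the sketch below builds a design matrix with an intercept and one indicator column per non-reference level of each factor. This is a simplified, hypothetical helper written with the standard library only: the function name and the column naming scheme are assumptions made for this example and do not reproduce how PyDESeq2 constructs its design matrix internally.

```python
def build_design_matrix(samples, reference_levels):
    """Build an intercept + indicator-column design matrix.

    samples: list of dicts mapping each factor name to that sample's level.
    reference_levels: dict mapping each factor name to its reference level;
        its keys define which factors enter the design.
    """
    contrasts = []
    for factor, ref in reference_levels.items():
        levels = sorted({sample[factor] for sample in samples})
        # One column per level other than the reference level.
        contrasts.extend((factor, level, ref) for level in levels if level != ref)

    columns = ["intercept"] + [f"{f}_{lvl}_vs_{ref}" for f, lvl, ref in contrasts]
    rows = [
        [1.0] + [1.0 if sample[f] == lvl else 0.0 for f, lvl, _ in contrasts]
        for sample in samples
    ]
    return columns, rows
```

With a two-level factor and a three-level factor, this yields four columns: the intercept, one column for the first factor, and two for the second. Each coefficient fitted on such a column is a log-fold change relative to the reference level.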
As an integral part of the scverse ecosystem, PyDESeq2 seamlessly integrates with the anndata data structure. This ensures that PyDESeq2 analyses can be effortlessly imported to and from any scverse package, providing a cohesive environment for omics research.
Comparative Analysis: PyDESeq2 vs. DESeq2 on TCGA Datasets
In a comparative analysis on eight bulk RNA-seq datasets from TCGA, PyDESeq2 returns sets of significant genes and pathways that are very similar to those returned by DESeq2. It also achieves higher model likelihood for the dispersion and log-fold change parameters on the majority of genes, with comparable speeds.
Conclusion and Future Perspectives
PyDESeq2 emerges as a formidable tool for bulk RNA-seq differential expression analysis, filling a crucial void in the Python omics ecosystem. By releasing this package, the developers aim to promote the adoption of modern data science Python tools in gene expression analysis. PyDESeq2 offers speed, reliability, and seamless integration with the Python data science landscape. Its release marks a significant milestone in advancing the accessibility and efficiency of differential expression analysis methodologies. Looking ahead, PyDESeq2’s future updates promise to include support for continuous covariates and likelihood-ratio tests, further enhancing its capabilities.
Article source: Reference Paper
Dr. Tamanna Anwar is a Scientist and Co-founder of the Centre of Bioinformatics Research and Technology (CBIRT). She is a passionate bioinformatics scientist and a visionary entrepreneur. Dr. Tamanna has worked as a Young Scientist at Jawaharlal Nehru University, New Delhi. She has also worked as a Postdoctoral Fellow at the University of Saskatchewan, Canada. She has several scientific research publications in high-impact research journals. Her latest endeavor is the development of a platform that acts as a one-stop solution for all bioinformatics related information as well as developing a bioinformatics news portal to report cutting-edge bioinformatics breakthroughs.