Single-cell histone post-translational modification (scHPTM) assays, such as scCUT&Tag or scChIP-seq, have emerged as powerful tools for mapping epigenomic landscapes within complex tissues. These assays enable the exploration of various histone modifications at the single-cell level, holding the key to unlocking crucial mechanisms in development and disease. However, conducting scHPTM experiments and analyzing the resulting data remain challenging because the field lacks consensus guidelines. To address this, a computational benchmark study involving over ten thousand experiments was conducted to identify key parameters and computational choices for obtaining high-quality representations of scHPTM data. This article reviews the research and highlights its findings, aiming to improve the understanding and application of scHPTM assays.
Exploring Histone PTMs at the Single-Cell Level
Histone post-translational modifications (PTMs) are pivotal epigenetic marks that regulate transcription, nucleosome positioning, and chromatin structure, and that contribute to DNA repair, development, and disease. Investigating these epigenetic events in detail is imperative, as they are involved in disease occurrence, environmental adaptation, and cell differentiation.
Recent advancements in single-cell technology have made it possible to analyze histone PTMs at the single-cell level through scHPTM assays, such as single-cell chromatin immunoprecipitation followed by sequencing (scChIP-seq) and single-cell cleavage under targets and tagmentation (scCUT&Tag). scHPTM assays have already led to exciting discoveries, including the identification of epigenetic factors influencing cancer response to chemotherapy. The future holds immense potential for scHPTM in uncovering and understanding diverse epigenetic mechanisms.
The Barriers to scHPTM Success
While scHPTM has tremendous potential, it is still in its nascent stage, which makes deriving biologically important insights from raw experimental scHPTM data difficult. Among the several challenges that need to be addressed, two are critical. The first is designing experiments effectively while balancing the number of cells against coverage per cell. The second is efficiently analyzing raw experimental data to transform it into biologically meaningful representations for downstream analysis, such as cell classification or lineage inference.
While these challenges have been studied extensively for other single-cell assays, such as scRNA-seq and scATAC-seq (single-cell sequencing Assay for Transposase-Accessible Chromatin), the scHPTM field lacks comparable research, leaving scientists without rational guidelines for efficient experiment design and data analysis.
However, there is an opportunity to leverage similarities between raw experimental data in scHPTM and scATAC-seq, especially in terms of sequence reads capturing epigenomic signals distributed over the genome. Incorporating methods used in scATAC-seq into scHPTM presents potential, but there are two major differences to consider.
One difference is the distribution of sequencing reads. In scATAC-seq, reads cover small regions of about 1 kbp (kilobase pair), whereas the regulatory regions captured by scHPTM vary widely in size, from 5 kbp up to 2,000 kbp. The second major difference is coverage: scHPTM yields only a few hundred to a few thousand reads per cell, while scATAC-seq and scRNA-seq yield several thousand or even tens of thousands of reads per cell. With such sparse coverage, only a small percentage of the regions expected to carry a histone modification have even a single read indicating its presence, which makes it harder to identify modified regions confidently.
The Quest to Eliminate the Setbacks of scHPTM
To overcome these challenges, scientists from Google Research and PSL Research University, France, conducted a computational benchmark study. The study performed over ten thousand experiments to evaluate the influence of various parameters and computational choices on the optimal design of experiments and analysis of scHPTM data, aiming to obtain high-quality representations of the data. The study focused on factors such as the number of cells, coverage per cell, cell selection, matrix construction algorithm, feature selection, and dimension reduction algorithm. Of particular importance was understanding how these factors impact the quality of dimension reduction, which is crucial for downstream tasks like clustering, cell type identification, and differential enrichment analysis.
Instead of relying solely on known cell types to assess the quality of data representation, the researchers took a different approach by incorporating the use of labels derived from co-assays, such as RNA or protein measurements. This approach was adopted due to the limited availability of scHPTM datasets with high-quality labels, allowing the researchers to work with continuous cell states independently of labels. However, it was assumed that cells with similar epigenomic profiles measured by scHPTM would exhibit similar RNA or protein expression patterns. While this assumption holds true for enhancer markers, exceptions arise in the case of repression markers.
It is important to note that this approach has previously been used successfully to evaluate scATAC-seq pipelines. In addition to the neighbor score based on co-assays, the researchers used the Adjusted Mutual Information (AMI) or Adjusted Rand Index (ARI) computed against labels provided by the original authors to quantify their findings. One potential limitation of this choice is that it may bias the evaluation toward the authors' initial method.
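To make the label-based metrics concrete, here is a minimal sketch (with toy, hypothetical labels, not data from the study) of how AMI and ARI compare a clustering derived from an embedding against author-provided cell-type labels:

```python
# Sketch: scoring a predicted clustering against reference cell-type labels
# with the two chance-corrected metrics used in the benchmark.
# The labels below are invented toy data for illustration only.
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

author_labels = ["B", "B", "T", "T", "NK", "NK"]  # hypothetical cell types
predicted = [0, 0, 1, 1, 1, 2]                    # clusters from an embedding

ami = adjusted_mutual_info_score(author_labels, predicted)
ari = adjusted_rand_score(author_labels, predicted)
print(f"AMI={ami:.2f}, ARI={ari:.2f}")
```

Both scores are corrected for chance, so a random clustering scores near 0 and a perfect match scores 1, regardless of how many clusters are used.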
Key Findings from the Computational Benchmark Study
The benchmark study focused on computational frameworks capable of generating moderate-dimensional vector representations (10-50 dimensions) for each cell. The study evaluated nine popular methods for analyzing the count matrices generated in scHPTM analysis: cisTopic, Signac, SnapATAC, PeakVI, SCALE, ChromSCape with TF-IDF (ChromSCape_LSI), ChromSCape with CPM normalization (ChromSCape_PCA), and NMF with either no normalization or a TF-IDF transformation (TFIDF-NMF). The findings revealed that ChromSCape_LSI, TFIDF-NMF, and Signac consistently outperformed the other methods on both mouse brain and human PBMC datasets. These three methods share the use of a TF-IDF transformation on the count matrix, indicating the efficacy of TF-IDF-based approaches.
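The transformation shared by the top performers can be sketched in a few lines: apply TF-IDF to the cells-by-bins count matrix to down-weight ubiquitous bins, then reduce to a moderate-dimensional embedding with truncated SVD (the LSI recipe). This is an illustrative sketch on random counts, not the benchmark's actual pipeline:

```python
# Minimal LSI-style pipeline: TF-IDF followed by truncated SVD.
# The random Poisson matrix stands in for a real cells-by-bins scHPTM matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
counts = rng.poisson(0.2, size=(100, 500))  # 100 cells x 500 genomic bins

tfidf = TfidfTransformer().fit_transform(counts)  # down-weight common bins
embedding = TruncatedSVD(n_components=10, random_state=0).fit_transform(tfidf)
print(embedding.shape)  # one 10-dimensional vector per cell
```

The resulting per-cell vectors are what downstream steps such as clustering or neighbor-score evaluation operate on.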
As far as optimization of parameters is concerned, there were many findings, and some were rather surprising. The observations made are as follows:
- The choice of matrix construction algorithm had a much larger impact on performance than initially anticipated.
- Using the best bin size for matrix construction can improve performance by up to 80% compared with using the worst bin size.
- Even though enhancer marks accumulate in small peaks, they surprisingly benefitted from larger bin sizes.
- However, bin sizes of several hundred kbp fail to capture finer details and local enrichments, while very small bin sizes may not produce reliable results.
- To resolve this dilemma, the scientists propose following up with differential enrichment analysis, which can recover finer details even when larger bins are used for the representation.
- The performance of the various methods (except PeakVI) usually plateaued as the number of cells increased. This might be explained by the models' low complexity, yet unexpectedly, higher-complexity models did not outperform lower-complexity ones.
- Higher coverage per cell improved the performance of all the methods, a fact that can be exploited when designing future experiments.
- Most surprisingly, feature selection using variance or coverage criteria had a negative impact on performance, possibly because of the low coverage per cell.
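The matrix-construction step whose bin size matters so much above can be illustrated with a small sketch: assigning each cell's fragment midpoints to fixed-width genomic bins to build the cells-by-bins count matrix. This is a simplified, single-chromosome toy version, not the tooling used in the study:

```python
# Sketch: build a cells-by-bins count matrix from per-cell fragment
# midpoints for a chosen bin size (toy single-chromosome example).
import numpy as np

def bin_count_matrix(fragments, chrom_length, bin_size):
    """fragments: dict mapping cell id -> list of fragment midpoints (bp)."""
    n_bins = int(np.ceil(chrom_length / bin_size))
    cells = sorted(fragments)
    matrix = np.zeros((len(cells), n_bins), dtype=int)
    for row, cell in enumerate(cells):
        bins = np.asarray(fragments[cell]) // bin_size  # midpoint -> bin index
        np.add.at(matrix, (row, bins), 1)               # count reads per bin
    return cells, matrix

# Toy data: two cells on a 1 Mbp chromosome, binned at 100 kbp.
frags = {"cellA": [10_000, 15_000, 450_000], "cellB": [990_000]}
cells, m = bin_count_matrix(frags, chrom_length=1_000_000, bin_size=100_000)
print(cells, m.shape)  # ['cellA', 'cellB'] (2, 10)
```

Rerunning the same function with a different `bin_size` changes the matrix that every downstream method consumes, which is why the benchmark found this seemingly mundane choice to have such a large effect.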
Epigenetics is a fascinating field, and applying epigenetic principles to biological challenges, especially those pertaining to human health, holds great promise. Technologies like scHPTM have immense potential to provide novel insights into development and disease. However, rational guidelines are crucial to improving scHPTM performance and achieving superior results. The benchmark study reviewed in this article represents a pioneering step in the development of scHPTM, providing direction for harnessing the marvels of epigenetics. By understanding the key parameters and computational choices identified in the study, researchers can optimize scHPTM experiments and analysis, unlocking the full potential of this powerful technique.
Article Source: Reference Paper
Neegar is a consulting scientific content writing intern at CBIRT. She's a final-year student pursuing a B.Tech in Biotechnology at Odisha University of Technology and Research. Neegar's enthusiasm is sparked by the dynamic and interdisciplinary aspects of bioinformatics. She possesses a remarkable ability to elucidate intricate concepts using accessible language. Consequently, she aspires to amalgamate her proficiency in bioinformatics with her passion for writing, aiming to convey pioneering breakthroughs and innovations in the field of bioinformatics in a comprehensible manner to a wide audience.