Researchers from the Jackson Laboratory for Mammalian Genetics, USA, have built RNA Strain-Match, a quality control tool that matches RNA data to their corresponding genotypes. During large-scale next-generation sequencing (NGS), there is a significant chance that some samples will be switched or mislabeled, so it is beneficial to have tools within the analysis pipeline that confirm samples match their genetic information. RNA Strain-Match works directly with sequence alignment files (files describing how RNA sequences align to the reference genome) and does not require RNA variant call format (VCF) files (files listing the variations found in RNA sequences compared to the reference genome). RNA Strain-Match was tested on samples from two distinct mouse models, where it identified and corrected mismatches in 50 out of 379 (13%) samples. The tool will be valuable for research groups working with similar data.
Boosting Accuracy in Sample Labeling with RNA Strain-Match
Quality control (QC) steps are indispensable in NGS (a method used to determine the precise order of nucleotides in DNA or RNA). QC ensures that all samples are correctly labeled with their corresponding metadata (contextual information about the data being studied). QC steps involve checking specific properties of samples to ascertain their identity. For example, markers that differ between the male and female sex chromosomes can confirm the sex of a sample, and genetic features that vary across samples, such as transgenic insertions, can also help confirm identity.
The most reliable technique, however, is to compare the genomic sequence of the samples to a reference or to genetically similar samples. This is where RNA Strain-Match, developed by Jon A. L. Willcox and his team, comes in. It matches RNA sequencing data to its corresponding metadata by using genetic markers called single nucleotide polymorphisms (SNPs). The SNPs it relies on have only one alternative allele or form and are located in the coding regions of autosomal chromosomes. RNA Strain-Match takes alignment data in the form of BAM (Binary Alignment/Map) files as input.
BAM files are convenient to work with because they are routinely generated during the processing of single-nucleus, single-cell, or bulk RNA sequencing data. Popular aligners like STAR and Cell Ranger produce BAM files before transcripts are counted. BAM files can also be fed in as input without any additional formatting steps.
The Origins of RNA Strain-Match
RNA Strain-Match was originally created to check whether single-nucleus RNA sequencing (NucSeq) data from the AD-BXD panel of mice were correctly assigned to their respective samples. The AD-BXD panel contains F1 offspring obtained by crossing transgenic mice with various strains from a genetically diverse group of mice, the BXD (B6xD2) genetic reference panel of inbred mice. The challenge in genotyping these mice is that each genetic variant is shared by around 50% of the strains in the panel, so no single variant is unique to one strain. Hence, it is essential to genotype SNPs distributed widely across the genome so that enough recombinant sites are examined for credible strain identification.
These locations ought to exhibit either a homozygous state for the B allele (B/B) or a heterozygous state containing alleles from both the B6 and D2 mice (B/D). However, NucSeq data are frequently contaminated by minute amounts of cell-free RNA reads (short segments of genetic information) from other unrelated samples. This can result in a few reads carrying the D allele, even in mice that are supposed to be homozygous for the B allele. To resolve this issue, RNA Strain-Match provides options to set thresholds for read depth (the number of reads covering a specific location) and ALT-allele fraction (the proportion of reads containing the alternative allele). The thresholds ensure that matching scores are not affected by low-level contamination.
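To make the role of the two thresholds concrete, here is a minimal Python sketch of how a genotype might be called from read counts at one SNP. The function name, the default threshold values, and the symmetric cut-off for homozygous-alternative calls are all assumptions chosen for illustration; they are not the tool's actual defaults or code.

```python
def call_genotype(ref_reads, alt_reads, min_depth=10, min_alt_fraction=0.1):
    """Call a genotype at one biallelic SNP from REF and ALT read counts.

    min_depth and min_alt_fraction are illustrative values, not the
    defaults used by RNA Strain-Match.
    Returns "0/0", "0/1", "1/1", or None when coverage is too low.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None  # too few reads to make a confident call
    alt_fraction = alt_reads / depth
    if alt_fraction < min_alt_fraction:
        # A handful of ALT reads is treated as contamination, not heterozygosity
        return "0/0"
    if alt_fraction > 1 - min_alt_fraction:
        return "1/1"
    return "0/1"


# 2 contaminating ALT reads out of 100 are ignored
print(call_genotype(98, 2))
# A roughly balanced site is called heterozygous
print(call_genotype(12, 10))
```

With these assumed settings, a site with 98 reference reads and 2 alternative reads is still called homozygous reference, which is the behavior the thresholds are meant to guarantee.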
While RNA Strain-Match has been tested with mouse data in this study, it is applicable to any RNA data as long as the relevant genotyping data are available.
Revealing the Magic: How RNA Strain-Match Works
In this study, RNA Strain-Match was used to uncover valuable insights into mouse brain samples, and the procedure is as follows:
Unveiling the RNA Strain-Match Pipeline:
The researchers ran RNA Strain-Match, which is available on GitHub, to check each sample's RNA data against known strain genotypes. Alongside the BAM alignment files, they supplied a table containing the genotype of each strain as input.
Understanding Genotype Data:
The genotype data represents the specific combination of genetic variations in each strain. Numbers express the genotype and degree of zygosity; for example, 0/0 indicates homozygous reference, 1/1 homozygous alternative, and 0/1 indicates heterozygosity.
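The numeric encoding can be read mechanically. The short Python sketch below interprets a genotype string for a biallelic SNP; the helper names are hypothetical and the code only illustrates the 0/0, 0/1, 1/1 convention described here.

```python
def alt_allele_count(genotype):
    """Count ALT alleles in a biallelic genotype string such as "0/1"."""
    alleles = genotype.replace("|", "/").split("/")
    return sum(int(allele) for allele in alleles)


def zygosity(genotype):
    """Translate a genotype string into a plain-language description."""
    labels = {
        0: "homozygous reference",
        1: "heterozygous",
        2: "homozygous alternative",
    }
    return labels[alt_allele_count(genotype)]


for gt in ("0/0", "0/1", "1/1"):
    print(gt, "->", zygosity(gt))
```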
Aligning RNA Sequencing Data:
To align the NucSeq data obtained from the samples, the researchers employed the Cell Ranger count pipeline. This process aligned the reads to the GRCm38 or GRCm39 genome builds, providing a reference for comparison.
Matching and Scoring Genetic Variations:
Matching scores were calculated to assess the similarity between samples and known strains. By analyzing autosomal coding SNPs, researchers could determine the percentage of informative SNPs that matched the strains, providing valuable insights into the genetic composition of the samples.
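The scoring idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tool's implementation: here an "informative" SNP is taken to be one with a confident genotype call in the sample and a known genotype in the strain, and the strain names, SNP identifiers, and function names are hypothetical.

```python
def match_score(sample_calls, strain_genotypes):
    """Percentage of informative SNPs where sample and strain genotypes agree.

    sample_calls maps SNP id -> genotype string, or None for no confident call.
    Returns None if no SNP is informative.
    """
    informative = [
        snp for snp, call in sample_calls.items()
        if call is not None and snp in strain_genotypes
    ]
    if not informative:
        return None
    matches = sum(sample_calls[snp] == strain_genotypes[snp] for snp in informative)
    return 100.0 * matches / len(informative)


def best_match(sample_calls, strain_table):
    """Score a sample against every strain and return (best strain, all scores)."""
    scores = {name: match_score(sample_calls, gts) for name, gts in strain_table.items()}
    scored = {name: s for name, s in scores.items() if s is not None}
    return max(scored, key=scored.get), scores


# Hypothetical data: snp3 has no confident call and is skipped
sample = {"snp1": "0/0", "snp2": "0/1", "snp3": None, "snp4": "0/1"}
strains = {
    "StrainA": {"snp1": "0/0", "snp2": "0/1", "snp3": "0/1", "snp4": "0/1"},
    "StrainB": {"snp1": "0/1", "snp2": "0/1", "snp3": "0/0", "snp4": "0/0"},
}
print(best_match(sample, strains))
```

A sample whose best-scoring strain differs from its labeled strain would then be flagged for review.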
Unveiling Transgene Status:
In the case of the 5XFAD mice, a model for Alzheimer’s disease, the presence of specific transgenes was of utmost importance. Using read counts and normalized values, researchers confirmed the presence of 5XFAD-APP and 5XFAD-PSEN1 transgenes, essential for amyloid plaque formation.
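As a rough illustration of checking transgene status from normalized read counts, the Python sketch below scales transgene reads to reads per million and applies a cut-off. The normalization scheme, the cut-off value, and the function names are assumptions; the article does not specify how the study normalized its counts.

```python
def reads_per_million(transgene_reads, total_reads):
    """Scale a raw read count to reads per million total reads (assumed scheme)."""
    return 1e6 * transgene_reads / total_reads


def is_transgenic(app_reads, psen1_reads, total_reads, min_rpm=1.0):
    """Call a sample transgenic only if both transgenes exceed the cut-off.

    min_rpm is an illustrative threshold, not a value from the study.
    """
    app_rpm = reads_per_million(app_reads, total_reads)
    psen1_rpm = reads_per_million(psen1_reads, total_reads)
    return app_rpm >= min_rpm and psen1_rpm >= min_rpm


# Hypothetical counts for a carrier and a non-carrier
print(is_transgenic(500, 300, 10_000_000))
print(is_transgenic(2, 0, 10_000_000))
```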
Sample Sex Determination:
Another fascinating aspect was the determination of sample sex. By analyzing the Xist and Ddx3y genes, researchers could distinguish between males and females. A higher Xist/Ddx3y ratio indicated a female sample, while a lower ratio indicated a male sample.
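A ratio-based sex call of this kind can be sketched as follows in Python. The pseudocount (added to avoid dividing by zero when Ddx3y has no reads) and the ratio cut-off are assumptions made for the example, not values reported by the study.

```python
def call_sex(xist_reads, ddx3y_reads, pseudocount=1.0, ratio_cutoff=1.0):
    """Infer sample sex from the Xist/Ddx3y read-count ratio.

    Returns a (sex, ratio) tuple. pseudocount and ratio_cutoff are
    illustrative choices.
    """
    ratio = (xist_reads + pseudocount) / (ddx3y_reads + pseudocount)
    sex = "female" if ratio > ratio_cutoff else "male"
    return sex, ratio


# Hypothetical counts: high Xist with no Ddx3y, then the reverse
print(call_sex(400, 0))
print(call_sex(1, 150))
```

Comparing the inferred sex with the labeled sex for every sample then highlights candidates for mislabeling.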
Unmasking Genetic Discrepancies: RNA Strain-Match Uncovers Surprising Findings
This study examined the strain-matching accuracy of 379 samples, consisting of 294 B6xBXD and 85 CC samples. The analysis revealed that 88% of the samples returned the best matching score for their assigned strain, with an average score of 98% ± 3%. However, closer inspection revealed a number of inconsistencies. Among the remaining samples were 8 pairwise sample swaps (16 mismatched samples), 1 false strain assignment, 15 mismatches due to an offset in sample labels, and 15 samples affected by a strip tube label swap. Interestingly, the match rate for B6xBXD124 samples (mean = 85% ± 1%) was significantly lower than for other strains (mean = 99% ± 1%). Despite this, B6xBXD124 samples still showed a closer match to the BXD124 strain than to any other strain. The study further investigated the genotypes and identified a notable disagreement between B6xBXD strains and the genotypes of the reference samples from which the NucSeq data were derived. The lowest aligned match was found between the B6 and D2 strains, averaging 2.72% ± 1.52% for B6 matching with D2 and 1.97% ± 1.28% for D2 matching with B6. These findings underscore the significant variation and potential genetic differences between these strains.
During the analysis of 5XFAD-APP and -PSEN1 normalized read counts, a discrepancy in transgene status (transgenic/non-transgenic) was identified in 27 samples. Out of these, 25 mismatches were attributed to the sample swaps and offset issues that were resolved after correction based on the strain analysis. However, two additional samples labeled as transgenic exhibited low reads associated with 5XFAD; interestingly, these samples were obtained from the hippocampus (HC) and cortex (Ctx) regions of the same mouse. To validate the genotype, qPCR-based analysis was performed using genomic DNA extracted from tail tissue, confirming that these samples were indeed non-transgenic. These findings highlight the importance of carefully verifying transgene status and conducting appropriate genotype validation to ensure an accurate interpretation of experimental results.
The study also investigated the accuracy of sex labeling by using the Xist/Ddx3y ratio. In this analysis, 20 samples were identified with mislabeled sex, 19 of which were also found to have mislabeled strains. Interestingly, one sample that had an offset issue was identified only through a sex mismatch, as it shared both the correct strain and transgene status with another sample. Moreover, an additional sample (mouse ID: 1575) was identified as a female B6xD2Gpnmb mouse with a single X chromosome (XO). To further confirm this finding, the mean D-allele fractions on the X chromosome were calculated and compared to those of other B6xD2Gpnmb mice from the same study. The suspected XO mouse had a mean D-allele fraction on the X chromosome that was similar to the values obtained from the three male B6xD2Gpnmb mice.
In contrast, the XX-female B6xD2Gpnmb mouse had a significantly higher mean D-allele fraction (0.29 ± 0.2). This approach not only helps identify samples with mislabeled sex but also enables the detection of spontaneous sex-chromosome aneuploidy. These findings emphasize the importance of accurately determining the sex of experimental samples and highlight a useful method for validating sex-chromosome status.
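The X-chromosome comparison rests on a simple average, sketched here in Python. The function name and the example read counts are hypothetical, and the sketch assumes per-SNP counts of B-allele and D-allele reads on the X chromosome are already available.

```python
def mean_d_allele_fraction(snp_counts):
    """Average D-allele fraction across X-chromosome SNPs.

    snp_counts is a list of (b_reads, d_reads) tuples, one per SNP.
    SNPs with no reads are skipped. Returns None if nothing is covered.
    """
    fractions = [d / (b + d) for b, d in snp_counts if b + d > 0]
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


# Hypothetical counts: only B-allele reads, as expected with a single B6-derived X
single_x = [(10, 0), (20, 0), (0, 0)]
# Hypothetical counts: reads from both alleles, as expected with two X chromosomes
two_x = [(5, 5), (15, 5)]
print(mean_d_allele_fraction(single_x))
print(mean_d_allele_fraction(two_x))
```

A labeled female whose mean fraction sits with the males rather than with the XX female is the pattern that pointed to the XO mouse in the study.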
With RNA Strain-Match, every sample tested showed a clearly better match between its RNA data and the correct strain than with any other tested strain. The tool also allowed researchers to identify and rectify discrepancies in the transgene and sex labels of the samples. Catching these errors prevented a loss of statistical power in the best-case scenario and the presentation of erroneous results in the worst-case scenario for the corresponding studies. The study in which 41 out of 159 (26%) samples were incorrectly assigned was particularly susceptible. Fortunately, in this instance, the QC steps were performed prior to any analyses, avoiding the dissemination of flawed conclusions. In all other studies, analyses were rerun using the corrected sample identification before publication, ensuring the accuracy and reliability of the results. The findings underscore the role of RNA Strain-Match in preserving the integrity of genetic studies and reinforce the importance of rigorous quality control measures. RNA Strain-Match also has the potential to be extended to protein data, hastening progress in biological research.
Neegar is a consulting scientific content writing intern at CBIRT. She's a final-year student pursuing a B.Tech in Biotechnology at Odisha University of Technology and Research. Neegar's enthusiasm is sparked by the dynamic and interdisciplinary aspects of bioinformatics. She possesses a remarkable ability to elucidate intricate concepts using accessible language. Consequently, she aspires to amalgamate her proficiency in bioinformatics with her passion for writing, aiming to convey pioneering breakthroughs and innovations in the field of bioinformatics in a comprehensible manner to a wide audience.