A team of researchers from the Broad Institute of MIT and Harvard, Brigham and Women’s Hospital, and Harvard Medical School has demonstrated a computational technique called HI-CNV that identifies millions of copy number variants in the human genome and links them to health-related traits such as height, lipid levels, and bone density by exploiting haplotype sharing within biobank cohorts. The study offers insights into the architecture and complexity of the human genome as a whole.
Introduction
The human genome harbors millions of large structural variants, including copy number variants (CNVs). The phenotypic impact of such polymorphism is largely unknown, however, because single nucleotide polymorphism (SNP) array data and related genomic analyses have so far yielded only large CNVs. Rare CNVs have been shown to be associated with neuropsychiatric disorders. Several past studies have investigated the role of pathogenic CNVs while confronting challenges such as how to group CNVs for association testing and how to filter out associations that merely reflect linkage disequilibrium (LD) with nearby SNPs. To address these challenges, a team of scientists at the Broad Institute of MIT and Harvard, Brigham and Women’s Hospital, and Harvard Medical School has developed a computational method that detects 15 million CNVs in the UK Biobank, approximately six times more than previous analyses of the same data.
The computational approach is called HI-CNV (Haplotype-Informed Copy Number Variation). The tool increases the power of CNV detection in large cohorts by pooling information across individuals who share an extended SNP haplotype. The underlying idea is that in large biobank cohorts, population-polymorphic CNVs are usually carried by several individuals who have co-inherited the variant on a shared haplotype originating from a common ancestor. Hence, the power to detect a CNV can be increased by combining evidence of its presence (e.g., from genotyping array intensity data) across multiple carriers.
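The pooling idea can be illustrated with a toy calculation. The sketch below is not the HI-CNV model: it assumes a simple Gaussian model of probe intensities, hypothetical parameter values (`del_mean`, `normal_mean`, `sd`), and made-up intensity data, and it assumes that all haplotype neighbors carry the same copy number state. Under those assumptions, evidence that is weak in one individual adds up across carriers.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def deletion_llr(intensities, del_mean=-0.5, normal_mean=0.0, sd=0.3):
    """Log-likelihood ratio (deletion vs. normal copy number) for one sample.

    The means and standard deviation are hypothetical illustration values.
    """
    return sum(
        gaussian_logpdf(x, del_mean, sd) - gaussian_logpdf(x, normal_mean, sd)
        for x in intensities
    )

def pooled_llr(samples):
    """Sum the evidence over samples assumed to share the same haplotype."""
    return sum(deletion_llr(s) for s in samples)

# Hypothetical probe intensities at two probes inside a candidate deletion.
focal = [-0.30, -0.25]
neighbors = [[-0.35, -0.25], [-0.28, -0.32], [-0.40, -0.15]]

single = deletion_llr(focal)              # weak evidence from one noisy sample
pooled = pooled_llr([focal] + neighbors)  # stronger evidence across carriers
print(f"single-sample LLR: {single:.3f}, pooled LLR: {pooled:.3f}")
```

Each individual's intensities are ambiguous on their own, but because the log-likelihood ratios add, the pooled score is several times larger than the single-sample score.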
To identify individuals who are likely to share a segment of the genome inherited from a common ancestor, the researchers adapted recent approaches for finding identity-by-descent (IBD) segments, such as that of Zhou et al. Specifically, for each haplotype of each individual in the cohort, the team used a PBWT-based (positional Burrows-Wheeler transform) algorithm to identify its closest “haplotype neighbors,” i.e., the longest IBD matches, and a hidden Markov model (HMM) to detect CNVs co-inherited on the shared haplotypes. To apply HI-CNV to the SNP-array genotyping probe intensity data available for the UK Biobank cohort, the researchers also developed a probabilistic model of allele-specific intensity measurements that captures how genotyping probes within a CNV produce different intensity measurements from probes outside CNVs.
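To give a feel for how an HMM segments probe intensities into copy number states, here is a minimal single-sample Viterbi sketch. It is an illustration only: the three states, Gaussian emission means, standard deviation, priors, and transition probability are all hypothetical, and the actual HI-CNV model additionally uses haplotype neighbors and allele-specific intensities, which are omitted here.

```python
import math

STATES = ("deletion", "normal", "duplication")
# Hypothetical emission means for a log-intensity ratio, with a shared SD.
MEANS = {"deletion": -0.5, "normal": 0.0, "duplication": 0.3}
SD = 0.15
STAY = 0.98  # hypothetical probability of keeping the same state at the next probe
LOG_PRIOR = {
    "deletion": math.log(0.005),
    "normal": math.log(0.99),
    "duplication": math.log(0.005),
}

def log_emission(x, state):
    """Gaussian log density of intensity x under the given state."""
    mean = MEANS[state]
    return -0.5 * math.log(2 * math.pi * SD * SD) - (x - mean) ** 2 / (2 * SD * SD)

def log_transition(prev, cur):
    """Log probability of moving from state prev to state cur."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))

def viterbi(intensities):
    """Return the most likely state sequence for a list of probe intensities."""
    if not intensities:
        return []
    scores = {s: LOG_PRIOR[s] + log_emission(intensities[0], s) for s in STATES}
    backpointers = []
    for x in intensities[1:]:
        new_scores, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + log_transition(p, cur))
            new_scores[cur] = (
                scores[best_prev] + log_transition(best_prev, cur) + log_emission(x, cur)
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical intensities: three consecutive probes show a clear drop.
probes = [0.02, -0.05, -0.48, -0.55, -0.51, 0.03, 0.01]
calls = viterbi(probes)
print(calls)
```

With these made-up values, the three low-intensity probes are called as a deletion and the flanking probes as normal copy number, because the emission evidence outweighs the cost of two state changes.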
After optimizing HI-CNV, the team applied it to all UK Biobank participants with SNP-array genotyping, focusing the analysis on CNVs in the 452,500 UK Biobank participants of European ancestry. The approach detected more than six times as many CNVs per individual as the widely used PennCNV method. Validation against whole genome sequencing (WGS) showed a validation rate of 91% for HI-CNV, equivalent to that of PennCNV. To assess the sensitivity and generalizability of the approach, the researchers also applied it to 179,538 Biobank Japan participants and observed a validation rate of 93%.
Next, to discover new CNVs and their phenotypic impacts, the team conducted association and fine-mapping analyses. They used a combination of single-variant and burden-style analyses to test three categories of CNVs (gene-level, CNV-level, and probe-level) for associations with 56 heritable quantitative traits, spanning biological and anthropometric measures such as lung function, bone mineral density, blood cell indices, blood pressure, and blood serum biomarkers. The analysis pipeline, which fine-maps associations involving rare variants and applies a pairwise LD (linkage disequilibrium) filter, identifies likely causal variants efficiently.
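As a rough illustration of what a gene-level burden-style test does, the sketch below collapses all CNVs affecting one gene into a single carrier flag and compares the trait between carriers and non-carriers. Everything in it is hypothetical: the CNV identifiers, the trait values, and the use of a simple Welch-style statistic. The study's own pipeline works at biobank scale and is not reproduced here.

```python
import math
from statistics import mean, variance

def gene_burden_flags(calls_per_person, gene_cnvs):
    """Flag each person who carries at least one CNV assigned to the gene."""
    return [bool(calls & gene_cnvs) for calls in calls_per_person]

def burden_test(trait_values, carrier_flags):
    """Return (effect, z): carrier minus non-carrier mean and a Welch-style statistic."""
    carriers = [t for t, c in zip(trait_values, carrier_flags) if c]
    others = [t for t, c in zip(trait_values, carrier_flags) if not c]
    if len(carriers) < 2 or len(others) < 2:
        raise ValueError("need at least two carriers and two non-carriers")
    effect = mean(carriers) - mean(others)
    se = math.sqrt(variance(carriers) / len(carriers) + variance(others) / len(others))
    return effect, effect / se

# Hypothetical data: heights in cm and CNV calls for eight people.
heights = [170.0, 168.5, 172.0, 171.5, 169.0, 164.0, 163.0, 165.5]
calls = [set(), set(), set(), {"dup_X"}, set(), {"del_A"}, {"del_B"}, {"del_A"}]
gene_cnvs = {"del_A", "del_B"}  # distinct rare deletions hitting the same gene

flags = gene_burden_flags(calls, gene_cnvs)
effect, z = burden_test(heights, flags)
print(f"carrier effect: {effect:.2f} cm, z = {z:.2f}")
```

Collapsing `del_A` and `del_B` into one flag is what distinguishes a burden-style test from a single-variant test: variants that are individually too rare to test can still contribute to a combined signal.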
This analysis pipeline yielded 269 fine-mapped CNV-trait associations at 97 loci involving 252 likely causal CNVs. The CNV calls involved in these associations had a higher WGS-based validation rate (94%) than the overall call set. The associations spanned almost all of the phenotype categories considered, with blood cell phenotypes accounting for the majority of likely causal associations (137 of 269).
The study identified 97 loci involved in the 269 fine-mapped CNV-trait associations, of which approximately 72 had not been identified in previous CNV association studies. Many CNV associations also corroborated target genes implicated by coding variant association studies, such as rare height-reducing deletions in CRISPLD2 and ADAMTS17, a rare sex hormone binding globulin (SHBG)-increasing deletion in HGFAC, and a rare IGF-1-decreasing partial deletion of MSR1.
The researchers also carried out two corroboratory analyses to confirm the robustness of HI-CNV. The first focused on associations involving CNVs predicted to cause loss of function (pLoF) of a recognized target gene, comparing the effects of pLoF CNVs with the effects of ultra-rare pLoF SNP and indel variants in the same gene, as reported by Backman et al. The results showed broadly consistent effect sizes between pLoF CNVs and pLoF SNP/indel variants. The second analysis provided confirmatory evidence supporting CNV associations that implicated previously unidentified gene-trait relationships.
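One common way to check whether two effect estimates are consistent is a z-test on their difference. The sketch below is a generic illustration and is not taken from the study: the effect sizes and standard errors are invented, and it assumes the two estimates are independent.

```python
import math

def effect_difference_z(beta_a, se_a, beta_b, se_b):
    """z-statistic for the difference between two independent effect estimates."""
    return (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical effects (in trait standard deviations) for one gene.
beta_cnv, se_cnv = -0.42, 0.10  # pLoF CNV carriers
beta_snv, se_snv = -0.35, 0.08  # pLoF SNP/indel carriers

z = effect_difference_z(beta_cnv, se_cnv, beta_snv, se_snv)
consistent = abs(z) < 1.96  # no significant difference at the 5% level
print(f"z = {z:.3f}, consistent = {consistent}")
```

A small absolute z-value, as in this made-up case, means the two classes of variant have statistically indistinguishable effects on the trait, which is the pattern expected if both disrupt the same gene.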
Given the diverse potential for functional impact that CNVs exhibited in the quantitative trait analyses, the team also examined CNV associations with disease traits. Across a total of 757 disease phenotypes obtained from the UK Biobank, 68 significant associations remained after LD-clumping; 64 of these involved syndromic CNVs, and three involved other known loci (HBA and HBB for thalassemia). The results highlight the challenge of performing disease analyses in healthy population cohorts and suggest that larger CNV call sets or case-control studies will be necessary to discover new CNV-disease associations.
Final Thoughts
The results demonstrate a significant advantage of haplotype-informed structural variant analysis that leverages the widespread distant relatedness within large biobank cohorts. Applying HI-CNV allowed the researchers to explore CNV-phenotype associations in the UK Biobank and revealed many ways in which genetic variation influences complex traits. Along with the biological discoveries, the study also provides an analytical approach for handling the statistical subtleties of association and fine-mapping analyses on large genomic datasets.
Article Source: Reference Paper
Mahi Sharma is a consulting Content Writing Intern at the Centre of Bioinformatics Research and Technology (CBIRT). She is a postgraduate with a Master's in Immunology degree from Amity University, Noida. She has interned at CSIR-Institute of Microbial Technology, working on human and environmental microbiomes. She actively promotes research on microcosmos and science communication through her personal blog.