Most disease-associated variants identified by genome-wide association studies lie outside protein-coding regions. Identifying candidate regulatory variants and their target genes, and so prioritizing putative biological mechanisms for functional follow-up experiments, can be difficult. To address this, scientists from the National Cancer Institute, NIH, USA, created FORGEdb, a standalone web application that integrates several datasets to provide information on target genes, transcription factor binding sites, and associated regulatory elements for more than 37 million variants. FORGEdb scores give researchers a quantitative assessment of each variant’s relative importance for functional studies.


Genome-wide association studies (GWAS) have linked more than 232,000 distinct variants to more than 3000 traits and diseases. For most of these loci, however, the target genes, pathways, and mechanisms of action remain poorly understood, and the regions themselves have not been studied in depth. GWAS variants are enriched in sequences that regulate gene expression. To interpret these variants in the context of gene regulation, researchers have relied on tools built on extensive mapping data from Roadmap Epigenomics, ENCODE, and BLUEPRINT. These tools, however, do not include expanded expression quantitative trait locus (eQTL) data from large consortia such as the Genotype-Tissue Expression Project (GTEx) or the eQTLGen project, nor high-dimensional ENCODE data from newer technologies.

Exploring FORGEdb

FORGEdb is a new web-based application that may help prioritize and analyze genomic variants for experimental study. By including data from newer technologies not covered by popular web tools, it offers a more thorough view of possible regulatory function. Across a variety of cell and tissue types, FORGEdb annotates variants for positional overlap with DNase I hotspots, histone mark broadPeaks, and chromatin states. It also annotates variants with the following:

  • CADD scores
  • TF motifs
  • ENCODE4 regulatory element CRISPR sgRNAs
  • Activity-By-Contact data
  • Contextual Analysis of TF Occupancy scores
  • The nearest gene from RefSeq

A FORGEdb score, a measure of the evidence for functional relevance, was computed for every variant. The score draws on five main categories of experimental evidence: transcription factor binding, chromatin accessibility, Activity-By-Contact 3D genomic interactions, histone marks, and differential gene expression. To keep the score transparent and interpretable, it was built from datasets from ENCODE, BLUEPRINT, and the Roadmap Epigenomics consortium. The categories were weighted equally to avoid bias, giving a final score between 0 and 10. The scoring was run on 37 million variants, and the resulting distribution was roughly normal. The score was further validated against massively parallel reporter assay (MPRA) data, where it correlated with functional importance.

FORGEdb provides a range of functional genomic annotations, which fall into two classes: positional overlap (the variant lies within a genomic region demarcated by the annotation) and variant-level features (allelic differences at the locus are associated with a particular feature). The positional-overlap annotations, which describe genomic context, include DNase I hotspots, TF motifs, ABC data, CRISPR regulatory element sgRNAs, and histone mark broadPeaks. The variant-level annotations, which carry allele-specific information, include the GTEx and eQTLGen eQTL databases, CATO scores, Zoonomia PhyloP scores, and CADD scores. Taken together, these annotations give a thorough picture of both the regional genomic context and the allele-specific effects of individual SNPs.
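The positional-overlap idea can be illustrated with a minimal sketch. It assumes each annotation track is available as a sorted list of non-overlapping, half-open `(start, end)` intervals on one chromosome. The function name and the coordinates are hypothetical and are not taken from FORGEdb itself.

```python
from bisect import bisect_right

def overlaps_any(position, regions):
    """Return True if `position` falls inside any interval in `regions`.

    `regions` is assumed to be a sorted list of non-overlapping,
    half-open (start, end) intervals on a single chromosome.
    """
    starts = [start for start, _ in regions]
    # Index of the last interval whose start is <= position, or -1 if none.
    i = bisect_right(starts, position) - 1
    return i >= 0 and regions[i][0] <= position < regions[i][1]

# Made-up hotspot coordinates, for illustration only.
hotspots = [(100, 200), (500, 650)]
print(overlaps_any(150, hotspots))  # SNP inside the first interval
print(overlaps_any(300, hotspots))  # SNP between intervals
```

A variant-level annotation, by contrast, would be a lookup keyed on the variant and its alleles rather than an interval test.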

FORGEdb’s Ability to Target Specific Genes

FORGEdb links single-nucleotide polymorphisms (SNPs) to candidate regulatory regions, transcription factor (TF) binding information, and target genes. Using genome-wide epigenomic tracks from the ENCODE, Roadmap Epigenomics, and BLUEPRINT consortia, it annotates variants for overlap with DNase I hotspots, histone mark broadPeaks, and chromatin states across different cell and tissue types. It integrates TF binding data through overlap with TF motifs and through SNP-specific Contextual Analysis of TF Occupancy (CATO) scores. It connects SNPs to target genes in two ways: through overlap with enhancer-to-promoter looping regions from Activity-By-Contact (ABC) data, and through allele-specific eQTL annotations drawn from the large-scale GTEx and eQTLGen datasets.

To rank genetic variants by the strength of evidence for a functional role, a new scoring system, the FORGEdb score, was created. It aggregates all annotations relevant to gene regulation into a single score while keeping the calculation transparent and accessible. The score is computed from five independent lines of evidence for a regulatory function: expression quantitative trait locus (eQTL), Activity-By-Contact (ABC) interaction, histone mark ChIP-seq broadPeak, TF motif and CATO score, and DNase I hotspot. Scores range from 0 to 10, with 9 or 10 indicating a high degree of evidence for functional impact. The score helps prioritize variants for functional research by summarizing several lines of experimental data at once.
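The scoring logic described above can be sketched in a few lines. The article states only that five lines of evidence are weighted equally on a 0 to 10 scale, which implies 2 points per line. How the TF line is divided between TF motif and CATO score (1 point each below) is an assumption made for this sketch, as are all class, field, and function names.

```python
from dataclasses import dataclass

@dataclass
class VariantEvidence:
    """Presence or absence of each annotation for one SNP (hypothetical container)."""
    dnase_hotspot: bool = False
    histone_broadpeak: bool = False
    abc_interaction: bool = False
    eqtl: bool = False
    tf_motif: bool = False
    cato_score: bool = False

def forgedb_style_score(ev: VariantEvidence) -> int:
    """Sum equally weighted evidence lines into a 0-10 score."""
    score = 0
    # Four lines of evidence contribute 2 points each when present.
    for present in (ev.dnase_hotspot, ev.histone_broadpeak,
                    ev.abc_interaction, ev.eqtl):
        score += 2 if present else 0
    # Assumption: the TF line is split into 1 point for a motif overlap
    # and 1 point for a CATO score.
    score += 1 if ev.tf_motif else 0
    score += 1 if ev.cato_score else 0
    return score

example = VariantEvidence(dnase_hotspot=True, eqtl=True, tf_motif=True)
print(forgedb_style_score(example))  # 2 + 2 + 1 = 5
```

Because each component is a simple presence test, a score can always be traced back to the annotations that produced it, which is the transparency the scoring system aims for.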

The usefulness of FORGEdb scores in GWAS analysis was assessed across 30 traits and diseases. For each of the 30 phenotypes, there was a strong positive correlation between ranked SNP bins and mean FORGEdb score: higher FORGEdb scores were associated with more significant (smaller) GWAS p-values. In fine-mapping studies, SNPs in the 95% credible sets were enriched for higher FORGEdb scores. Together, these results show that FORGEdb scores track both GWAS association strength and membership in 95% credible sets, indicating their potential for prioritizing SNPs across a variety of human traits and diseases.
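A binned comparison of this kind can be sketched as follows. The binning procedure here (equal-sized bins after ranking SNPs by p-value) is an assumption, not the authors' documented method, and the input values are invented purely to exercise the function; they are not results from the study.

```python
def mean_score_by_pvalue_bin(snps, n_bins):
    """Rank SNPs by GWAS p-value and return the mean score in each bin.

    `snps` is a list of (p_value, score) pairs. Bin 0 holds the least
    significant SNPs (largest p-values); the last bin holds the most
    significant. Any remainder goes into the last bin.
    """
    if n_bins < 1 or len(snps) < n_bins:
        raise ValueError("need at least one SNP per bin")
    ranked = sorted(snps, key=lambda s: s[0], reverse=True)
    size = len(ranked) // n_bins
    means = []
    for b in range(n_bins):
        start = b * size
        end = (b + 1) * size if b < n_bins - 1 else len(ranked)
        chunk = ranked[start:end]
        means.append(sum(score for _, score in chunk) / len(chunk))
    return means

# Invented toy values, for illustration only.
toy = [(0.5, 2), (0.4, 3), (1e-3, 5), (1e-4, 6), (1e-8, 8), (1e-9, 9)]
print(mean_score_by_pvalue_bin(toy, 3))
```

A rank correlation between bin index and mean score would then quantify the trend across bins.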


FORGEdb lets researchers analyze variants and their regulatory context, and it has both strengths and limitations. On the limitations side, it does not include sequence constraint information from SiPhy or chromatin accessibility quantitative trait locus (caQTL) data. On the strengths side, it provides TF motif annotations and CATO scores for allele-specific, DNase-seq-based TF binding, and it integrates allele-specific association data from GTEx and eQTLGen along with CADD scores, which offer additional insight into the potentially deleterious consequences of variant alleles. Despite its limitations, FORGEdb remains a useful tool for researchers looking for a thorough platform to annotate SNPs and understand functional elements in the genome, especially with respect to gene regulation and allele-specific effects. FORGEdb scores also correlate with functional importance in MPRA data, so they may yield more informative results than RegulomeDB and HaploReg.

Article Source: Reference Paper | FORGEdb is available via a web browser (https://forgedb.cancer.gov/ and https://forge2.altiusinstitute.org/files/forgedb.html).

Deotima is a consulting scientific content writing intern at CBIRT. Currently she's pursuing Master's in Bioinformatics at Maulana Abul Kalam Azad University of Technology. As an emerging scientific writer, she is eager to apply her expertise in making intricate scientific concepts comprehensible to individuals from diverse backgrounds. Deotima harbors a particular passion for Structural Bioinformatics and Molecular Dynamics.


