Researchers at Mount Sinai Institute introduced Variant-to-Phenotype (V2P), a new machine learning framework that predicts not only whether a genetic variant is harmful but also what kind of disease phenotype it leads to, making variant interpretation more precise and clinically useful.
Explosion of Genetic Data and the Rise of Computational Approaches for Phenotype Prediction
Advances in high-throughput sequencing technologies have made it much easier and cheaper to generate massive amounts of genetic data. As a result, researchers now have access to millions of human sequence variants.
Despite this flood of data, most variants remain uncharacterized, and many are classified as variants of uncertain significance, meaning we don’t know if they are harmful, benign, or what disease they might cause.
To address this issue, scientists have developed computational prediction methods that automatically estimate the effect of variants. These methods have improved over time, thanks to:
- Better modeling approaches, such as machine learning and deep learning
- Improved data curation (cleaner, more comprehensive variant databases)
The progress has been steady over the past few decades, but issues in relating variants to specific disease phenotypes still persist.
An Introduction to Variant-to-Phenotype (V2P) and Problems with Current Prediction Tools
Modern sequencing technologies are producing vast amounts of genetic data, including single-nucleotide variants (SNVs) and insertions or deletions (indels). These variants can occur in both coding and non-coding regions of DNA.
Traditional computational tools that assess whether a genetic variant is harmful or benign usually treat all pathogenic variants as a single class. Many handle only one variant type (for example, SNVs but not indels) or focus only on coding variants and ignore non-coding ones. Their major limitation, however, is that they do not distinguish which disease phenotype a variant causes. As a result, they miss the subtle relationship between genotype (variant) and phenotype (disease outcome).
Instead of just predicting “pathogenic vs benign”, Variant-to-Phenotype (V2P) is a multi-task, multi-output machine learning model that predicts whether a variant is pathogenic and which of the 23 top-level Human Phenotype Ontology (HPO) categories it is likely to affect. Conditioning the pathogenicity prediction on phenotype makes the predictions more biologically meaningful.
It’s also designed to overcome the limitations explained above by predicting pathogenicity for both SNVs and indels and working across coding and non-coding regions of DNA.
Architecture and Training of the V2P Model
V2P is an ensemble of six gradient boosted decision tree models, each a multi-label classifier, trained on a skewed dataset of ~500,000 pathogenic and benign variants. The data is annotated with gene, protein, and network-level features.
All models use LightGBM v3.3.5, a gradient boosted decision tree framework. It was chosen because it consistently performs well on tabular biological data and handles complex feature interactions. Some phenotypes, such as thoracic cavity abnormalities, are rare in the dataset. To limit the bias this skew introduces, V2P uses random multi-label oversampling (minority phenotype samples were oversampled by 25%). This gives the model stronger signals for underrepresented classes.
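The oversampling step described above can be sketched in plain Python. This is a minimal illustration and not V2P's actual code: the function name `random_multilabel_oversample`, the imbalance-ratio rule used to decide which labels count as "minority", and the toy labels are all assumptions. Only the 25% figure comes from the article.

```python
import random
from collections import Counter


def random_multilabel_oversample(labels, percent=25, seed=0):
    """Return sample indices (originals plus random clones) and the minority labels.

    labels  -- list of sets, one set of phenotype labels per variant
    percent -- number of clones to add, as a percentage of the original size
    """
    rng = random.Random(seed)
    counts = Counter(label for sample in labels for label in sample)
    max_count = max(counts.values())
    # Imbalance ratio: how many times rarer a label is than the most common one.
    imbalance = {label: max_count / n for label, n in counts.items()}
    mean_ir = sum(imbalance.values()) / len(imbalance)
    # Assumption: a label is "minority" if its ratio exceeds the mean ratio.
    minority = {label for label, ir in imbalance.items() if ir > mean_ir}
    # Candidate pool: every sample carrying at least one minority label.
    pool = [i for i, sample in enumerate(labels) if sample & minority]
    n_clones = int(len(labels) * percent / 100) if pool else 0
    clones = [rng.choice(pool) for _ in range(n_clones)]
    return list(range(len(labels))) + clones, minority


# Hypothetical toy labels; "thoracic" is the rare phenotype here.
toy_labels = [
    {"nervous"}, {"nervous"}, {"nervous"}, {"nervous", "cardio"},
    {"cardio"}, {"cardio"}, {"nervous"}, {"thoracic", "cardio"},
]
indices, minority = random_multilabel_oversample(toy_labels, percent=25)
print(len(toy_labels), "->", len(indices), "samples; minority labels:", minority)
```

Because a cloned sample keeps all of its labels, oversampling a rare phenotype also adds copies of any common phenotype that co-occurs with it, which is one reason multi-label resampling is usually kept modest.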
V2P incorporates phenotype information during training. It uses diverse biological features, including gene-level, protein-level, and network-interaction features, to improve accuracy. Unlike a binary pathogenicity score, its output is a set of 24 probability values ranging from 0 to 1. These values represent the likelihood that a variant is pathogenic and the likelihood that it maps to each of the 23 HPO categories.
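To make the 24-value output concrete, here is a hedged sketch of how such a vector might be read. The layout (pathogenicity first, then 23 phenotype probabilities), the 0.5 cutoff, the function name `interpret_prediction`, and the placeholder category names are assumptions made for illustration; they are not taken from the V2P code.

```python
def interpret_prediction(probs, category_names, threshold=0.5):
    """Split a probability vector into a pathogenicity call and phenotype hits.

    probs          -- assumed layout: [pathogenicity, category_1, ..., category_23]
    category_names -- names for the phenotype categories, in the same order
    threshold      -- assumed decision cutoff, not a value from the paper
    """
    if len(probs) != len(category_names) + 1:
        raise ValueError("expected one pathogenicity value plus one value per category")
    pathogenicity = probs[0]
    # Keep every category whose probability reaches the cutoff (multi-label output).
    hits = [(name, p) for name, p in zip(category_names, probs[1:]) if p >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return {
        "pathogenic": pathogenicity >= threshold,
        "pathogenicity": pathogenicity,
        "phenotypes": hits,
    }


# Placeholder names standing in for the 23 top-level HPO categories.
names = [f"HPO_category_{i}" for i in range(1, 24)]
example = [0.92] + [0.05] * 23
example[3] = 0.81   # third phenotype category
example[10] = 0.60  # tenth phenotype category
result = interpret_prediction(example, names)
print(result["pathogenic"], result["phenotypes"])
```

A variant can exceed the cutoff for several categories at once, which is what distinguishes a multi-label output from a single-class prediction.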
Evidence of Improvement
V2P was tested on high-quality datasets such as HGMD and ClinVar. Compared against other variant effect predictors across different datasets, V2P:
- Shows higher accuracy in phenotype-specific predictions
- Successfully identifies pathogenic variants in real patient sequencing data and simulated datasets
- Outperforms existing tools in initial benchmarking
V2P performs well because it exploits the relationships between pathogenic variants and disease phenotypes during training. This conditioning makes its predictions more accurate than those of traditional tools that consider only pathogenicity in general.
Conclusion
Traditional variant effect predictors treat pathogenic variants as a single homogeneous class, focus narrowly on coding SNVs, and fail to capture the phenotype-specific outcomes of genetic variation. This limits their utility in clinical genomics. Variant-to-Phenotype (V2P), an ensemble model, predicts whether a variant is pathogenic or benign and also assigns it to one or more phenotype categories.
Article Source: Reference Paper | Reference Article | Code availability: GitHub/Zenodo.
Disclaimer:
The research discussed in this article was conducted and published by the authors of the referenced paper. CBIRT has no involvement in the research itself. This article is intended solely to raise awareness about recent developments and does not claim authorship or endorsement of the research.
Saniya is a graduating Chemistry student at Amity University Mumbai with a strong interest in computational chemistry, cheminformatics, and AI/ML applications in healthcare. She aspires to pursue a career as a researcher, computational chemist, or AI/ML engineer. Through her writing, she aims to make complex scientific concepts accessible to a broad audience and support informed decision-making in healthcare.












