This article explores how machine learning (ML) techniques can be used to identify and analyze biomarkers from next generation sequencing (NGS) data. Biomarkers are biological molecules or characteristics that indicate the presence or severity of a particular disease or condition. They play a crucial role in medical diagnosis, treatment, and prognosis, and their discovery and validation are an important area of biomedical research. NGS is a powerful tool that allows researchers to analyze large amounts of genetic data quickly and accurately. Combining machine learning with NGS data makes it possible to identify and validate new biomarkers that improve our understanding of diseases and lead to more effective treatments.


Traditional classification strategies based on biological characteristics could not explain why breast cancers with seemingly similar clinical and pathological profiles differed in treatment response and clinical outcome, which prompted a search for the deeper mechanisms driving these differences. Gene expression profiling of these tumors, followed by clustering, identified subclasses within the samples that correlated strongly with clinical outcome. These studies went on to propel the identification of key players characteristic of each subclass, helped reinvent treatment regimens tailored to the underlying genetic variation, and led to biomarkers that better predict diagnostic, treatment, and prognostic outcomes.


The Human Genome Project, with a budget of around 3 billion dollars, was initially expected to take 15 years. It was completed in 13, helped by improvements to the Sanger method such as capillary electrophoresis that runs 384 reactions simultaneously. Fast forward to 2008: with the advent of NGS, the genome of Nobel laureate James Watson was sequenced in two months at an estimated cost of 1 million dollars. Fast forward to today, and the Guinness World Record for the fastest DNA sequencing technique stands at 5 hours and 2 minutes to sequence a complete patient genome, set at Stanford University. With NGS more accessible and affordable than ever, there is increasing momentum in applying genetic knowledge to improve clinical care and predict clinical outcomes.


NGS, or next generation sequencing, is a DNA sequencing approach that allows millions of DNA fragments to be sequenced in parallel, massively speeding up the entire process. The initial library preparation steps fragment the isolated DNA, and adapters are attached to both ends of each fragment so that it can bind to complementary sequences on the NGS platform. The bound fragments then undergo clonal amplification, which creates a cluster of identical copies around each fragment. This allows NGS technologies to sequence just nanograms of sample, unlike Sanger sequencing. The fragments are sequenced in parallel by the addition of fluorescently labeled nucleotides. Certain NGS technologies also offer multiplex sequencing, in which more than one sample is sequenced per run by adding barcodes unique to each sample to its DNA fragments.


With the ability to sequence millions of reads comes the burden of gigabytes of data that has to be systematically pooled and analyzed. The raw sequence reads, generated as FASTQ files, require further processing before they are ready for analysis. FASTQ files contain the sequenced bases and a Phred quality score for each base, which estimates the probability that the sequencer called the base at that position incorrectly. The data then needs to be cleaned of adapter sequences, PCR duplicates, and low-quality reads before being assembled into genomes or transcriptomes.
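
A Phred score Q relates to the error probability P through Q = -10 log10(P), so Q = 20 means a 1 in 100 chance that the base call is wrong. The sketch below decodes the quality line of a FASTQ record, assuming the Phred+33 ASCII encoding (older files may use a different offset). The record and the function names are hypothetical and only serve as illustration.

```python
def decode_quality(quality_line, offset=33):
    """Turn a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(ch) - offset for ch in quality_line]


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_prob(quality_line):
    """Average per-base error probability of one read."""
    scores = decode_quality(quality_line)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = ["@read_1", "GATTACA", "+", "IIII5+!"]
scores = decode_quality(record[3])
print(scores)                      # per-base Phred scores
print(mean_error_prob(record[3]))  # a simple per-read quality summary
```

A read filter could then drop reads whose mean error probability exceeds a chosen cutoff; the cutoff itself is a judgment call that depends on the study.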

Downstream analysis of the data branches out depending on the type of omics data, namely genomics, transcriptomics, or epigenomics. Genomics is the study of the whole genome of an organism and allows the identification of mutations and genomic variants within it. Transcriptomics assesses RNA molecules to study differential expression or instances of alternative splicing. Epigenomics, the more recent addition to the field, studies modifications that affect phenotype without altering the DNA sequence, such as DNA methylation and histone modifications.


Disease-related biomarkers are measurable indicators of various disease characteristics. They can be used to assess the risk of developing a disease, to diagnose it and determine its prognosis, or to predict the response to treatment. To a certain extent, an individual’s disease manifestations are influenced by their unique genetic makeup. The disease itself takes on highly heterogeneous forms, fueled by an equally heterogeneous genetic basis. Traditional molecular biomarkers fall short of capturing this variation, paving the way for a new class of genomic biomarkers.

Tumors can exhibit clonal heterogeneity in their mutational signatures and in their microenvironment constituents, both of which influence their metastatic patterns and drug responses. For instance, breast cancer is well known for its heterogeneity, even in patients with similar clinical profiles. Some of the common gene-based biomarker signatures applied clinically for breast cancers are Oncotype DX, MammaPrint, and PAM50. Oncotype DX uses 21 genes to predict tumor metastasis, relapse, tumor stage, and the benefit of chemotherapy along with hormone therapy in early-stage ER-positive, HER2-negative breast cancers. MammaPrint uses a group of 70 genes to predict metastasis and recurrence risk of early-stage ER-positive or ER-negative breast cancer. PAM50 uses a set of 50 genes to classify breast cancer samples into one of five intrinsic subtypes, as well as to predict whether chemotherapy is required as a treatment option.

All three tests were developed through tumor gene signatures assessed from microarray or RT-PCR data. With NGS analysis giving a greater coverage of genetic data, biomarker discovery could benefit from a surplus of candidate genes. Furthermore, it doesn’t restrict biomarkers to just genes but opens up the potential for mutations and epigenetic modifications. But how do we shortlist the potential candidates from the hundreds and thousands presented to us?


Artificial Intelligence (AI) has penetrated almost all niches of our lives: gathering, analyzing, and delivering data, thus making day-to-day decision-making tasks much easier. Hence it is no wonder that its influence has extended to tackling the huge quantities of data that NGS delivers to find patterns and profiles that have a relevant translational impact on medical research. Biomarker discovery is one such field where its influence is becoming increasingly prominent. Through both supervised and unsupervised machine learning techniques, many studies have reported omics-derived robust predictors for disease-based outcomes.

Unsupervised Machine Learning

Unsupervised machine learning groups unlabelled data based on similarities in their distributions. Weighted gene co-expression network analysis (WGCNA) is an unsupervised method that relies on hierarchical clustering to group correlated genes into modules. These modules are assessed for their association with the clinical traits under study. Hub genes within each module can be evaluated for their potential as biomarkers. Such studies can also be supplemented with differentially expressed gene (DEG) analysis. A study on gastric cancer transcriptome data identified four disease-specific biomarkers through WGCNA and DEG analysis, with further validation in vitro. In another study using rheumatoid arthritis data, WGCNA and DEG analysis together identified 22 key genes associated with the disease phenotype, whose validity was assessed through classification models.
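
To give a feel for what "hub gene" means, here is a minimal sketch of two ingredients of a co-expression network: a soft-thresholded adjacency (the absolute Pearson correlation raised to a power) and each gene's connectivity (the sum of its adjacencies). It is not the full WGCNA procedure, which additionally uses topological overlap and hierarchical clustering to define modules. The expression values, gene names, and the power `beta = 6` are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def connectivity(expr, beta=6):
    """Sum of soft-thresholded adjacencies |r|^beta to all other genes."""
    genes = list(expr)
    return {
        g: sum(abs(pearson(expr[g], expr[h])) ** beta for h in genes if h != g)
        for g in genes
    }


# Hypothetical expression of four genes across five samples.
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.1, 2.1, 2.9, 4.2, 5.1],
    "GENE_C": [2.0, 4.1, 6.0, 8.2, 9.9],
    "GENE_D": [3.0, 1.0, 4.0, 1.0, 5.0],  # weakly correlated with the rest
}

k = connectivity(expr)
hub = max(k, key=k.get)  # the most connected gene is the hub candidate
print(k)
print(hub)
```

Raising the correlation to a power suppresses weak correlations, which is why the weakly correlated gene ends up with almost no connectivity while the three co-expressed genes reinforce each other.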

Other methods, such as principal component analysis (PCA), perform dimensionality reduction by identifying new orthogonal axes (principal components), ordered by how much of the variation in the input data each one captures. The samples are then projected onto the first few components, typically the first two, to reveal clusters of samples based on their expression profiles, and the genes that weigh most heavily on those components point to the major contributors to inter-cluster variation. PCA is not a clustering technique powerful enough to serve on its own as the main component of biomarker identification, but it can be used in tandem with other techniques to provide further evidence for a study.
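
The following sketch shows the mechanics of PCA on a tiny made-up dataset: center the data, build the gene-by-gene covariance matrix, and find the direction of greatest variance. Power iteration is used here only to keep the example dependency-free; in practice a numerical library would be used. All sample values and gene names are hypothetical.

```python
import math


def center(rows):
    """Subtract each gene's mean so that every column has mean zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(X):
    """Sample covariance matrix (genes x genes) of centered data X."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]


def first_component(C, iters=200):
    """Power iteration: leading eigenvector and eigenvalue of C."""
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(p))
                     for a in range(p))
    return v, eigenvalue


# Rows are samples, columns are genes (hypothetical values).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
samples = [[1.0, 5.0, 0.1], [2.0, 5.1, 0.2], [9.0, 4.9, 0.1], [10.0, 5.0, 0.2]]

X = center(samples)
C = covariance(X)
loadings, variance = first_component(C)

# Share of total variance captured by the first component.
explained = variance / sum(C[j][j] for j in range(len(C)))
# Position of each sample along the first component.
scores = [sum(x * l for x, l in zip(row, loadings)) for row in X]
# Gene with the largest absolute loading drives the separation.
top_gene = gene_names[max(range(len(loadings)), key=lambda j: abs(loadings[j]))]
print(top_gene, explained, scores)
```

In this toy data the first two samples and the last two fall on opposite sides of the first component, and the gene with the largest loading is the one responsible for that split.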

Supervised Machine Learning

Supervised ML techniques are trained on labeled data for classification or regression tasks. In NGS data analysis, the features are the omics data, while the labels are the clinical outcomes. Though most supervised ML models are not designed to identify key features by themselves, they can perform this task quite effectively in combination with feature selection techniques. Various feature selection methods exist, such as filter methods, wrapper methods, or a combination of the two, while some ML techniques, such as random forest, have feature selection embedded within their algorithm. Filter methods remove features based on their statistical properties, such as correlation or variance, and the remaining features are used to develop the prediction model. In a wrapper method, a model is built on a subset of features and, based on the model's performance, features are added to or removed from that subset. In practice, the two are usually applied in succession to bring down the large array of features available through NGS. MammaPrint started from microarray data of 25,000 genes, which were brought down to 5,000 by filtering out genes with less than a two-fold change in expression, then to 231 through a filter method that removed genes having a low correlation with disease outcome, and finally to 70 with a wrapper method in conjunction with an ML classifier.
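
The filter-then-wrapper idea can be sketched as follows. This is an illustration of the general pattern, not MammaPrint's actual procedure: the correlation cutoff of 0.3, the nearest-centroid classifier, the leave-one-out evaluation, and the toy data are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_by_correlation(X, y, min_abs_r=0.3):
    """Filter step: keep columns that vary and correlate with the label."""
    keep = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if len(set(col)) > 1 and abs(pearson(col, y)) >= min_abs_r:
            keep.append(j)
    return keep


def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
        point = [X[i][j] for j in feats]
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(point, centroids[c])))
        correct += pred == y[i]
    return correct / len(X)


def forward_select(X, y, candidates):
    """Wrapper step: add one feature at a time while accuracy improves."""
    selected, best = [], 0.0
    improved = True
    while improved:
        improved, choice = False, None
        for j in candidates:
            if j in selected:
                continue
            acc = loo_accuracy(X, y, selected + [j])
            if acc > best:
                best, choice, improved = acc, j, True
        if improved:
            selected.append(choice)
    return selected, best


# Rows are samples, columns are four hypothetical genes; y is the outcome.
X = [[1.0, 5.0, 2.0, 1.0], [1.2, 5.0, 3.0, 2.0], [0.9, 5.0, 2.0, 1.5],
     [3.0, 5.0, 2.0, 1.8], [3.1, 5.0, 3.0, 2.5], [2.9, 5.0, 2.0, 2.2]]
y = [0, 0, 0, 1, 1, 1]

candidates = filter_by_correlation(X, y)
selected, accuracy = forward_select(X, y, candidates)
print(candidates, selected, accuracy)
```

The filter step is cheap because it never trains a model, which is why it is typically run first to shrink the candidate list before the more expensive wrapper step evaluates model performance.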


One of the major things to understand about ML is that data is king. To build models and derive inferences that can be translated into clinical settings, the data has to be of good quality. Often the biggest issue is a lack of good-quality data, which undermines reproducibility. Problems can arise at the very beginning of the process, in the sequencing stage. Tools such as FastQC and MultiQC can be used to assess the quality of the sequenced reads so that poor-quality ones can be eliminated if required.

The raw read counts obtained for each gene are influenced by gene length and sequencing depth, which makes direct comparisons unfair. Normalization brings the raw counts into comparable ranges. RPKM (reads per kilobase of exon per million mapped reads) and its paired-end counterpart FPKM (fragments per kilobase of exon per million mapped fragments) correct for both sequencing depth and gene length and are suited to comparing genes within a sample. TPM (transcripts per million) also corrects for both, but it normalizes for gene length first and then scales every sample to the same total of one million, so a gene's TPM value reads as its share of that sample's transcripts and is easier to compare across samples and conditions than RPKM or FPKM.
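
The standard definitions of RPKM and TPM can be written out directly from raw counts and gene lengths. The sketch below uses made-up counts and lengths for three hypothetical genes in one sample.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads.

    RPKM = count * 1e9 / (total mapped reads * gene length in bp)
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million.

    First divide each count by gene length in kilobases, then rescale
    so that the values of one sample sum to one million.
    """
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: rate[g] / scale for g in counts}


# Hypothetical read counts and gene lengths (in base pairs).
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
print(sum(tpm_values.values()))  # the TPM values of a sample sum to one million
```

GENE_B has three times the reads of GENE_A but is also three times as long, so both normalizations assign them the same value. The difference between the two measures is the fixed per-sample total that TPM guarantees and RPKM does not.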

Another factor to take into account is batch effects, which are technical differences caused by external factors such as the sequencing machine or the technician who ran it. Quantile normalization is commonly performed to reduce these differences. It ranks the genes within each sample based on their read values, takes the average of the gene expression values at each rank across all the samples, and substitutes those averages as the new values for the genes.
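
The three steps of quantile normalization (rank within each sample, average across samples at each rank, substitute the averages back) can be sketched as below. The sample names and values are hypothetical, and ties are broken by position in this simple version, whereas fuller implementations usually average tied ranks.

```python
def quantile_normalize(samples):
    """Quantile-normalize expression values.

    samples: dict mapping sample name -> list of values, with genes in
    the same order in every list. Returns a dict of the same shape.
    """
    names = list(samples)
    n_genes = len(samples[names[0]])

    # Average of the values found at each rank across all samples.
    sorted_cols = [sorted(samples[s]) for s in names]
    rank_means = [sum(col[r] for col in sorted_cols) / len(names)
                  for r in range(n_genes)]

    result = {}
    for s in names:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: samples[s][i])
        normalized = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            normalized[gene_idx] = rank_means[rank]
        result[s] = normalized
    return result


# Four genes measured in three hypothetical samples.
raw = {
    "S1": [5.0, 2.0, 3.0, 4.0],
    "S2": [4.0, 1.0, 3.0, 2.0],
    "S3": [3.0, 4.0, 6.0, 8.0],
}

normalized = quantile_normalize(raw)
for name, values in normalized.items():
    print(name, values)
```

After normalization every sample contains exactly the same set of values, only assigned to different genes, so the overall distributions are identical while each gene keeps its rank within its sample.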

It is well known that ML models are plagued by the curse of dimensionality, which becomes more evident with omics data that presents thousands of genes to analyze, far more than the number of samples. Only a small portion of these genes contributes to the disease state. One way to filter out the non-contributing genes is through one of the feature selection methods mentioned above and/or through DEG analysis.

Class imbalance, an uneven number of samples across the classes provided as input, becomes especially problematic during unsupervised clustering, where a class with few samples offers too little data to stand out and be clustered separately. Hence, when dealing with data that has an imbalanced class distribution, supervised ML combined with a feature selection method is preferred over unsupervised ML.


The interplay of machine learning and NGS data has the potential to produce breakthrough discoveries in the field of personalized medicine. The success of genomic biomarkers in representing more complex, heterogeneous diseases is a preview of the possibilities they present. Though many studies have been published and much more research is under way in this field, only a small fraction of it is carried forward to in vitro and in vivo analysis. With further advancements in NGS, this is likely to improve, making genomic medicine a staple of clinical care.


  • Next Generation Sequencing has made genomic data both widely available and affordable.
  • It has opened new avenues into large-scale genomic studies underlying the disease pathology.
  • Genomic biomarkers that better capture the heterogeneity within the disease manifestation, progression, and outcome are the superior choices in complex diseases such as cancer.
  • The growth of momentum in genomic biomarker discovery is driven by the high-throughput data brought in by the NGS and the endless potential of data analysis strategies provided by machine learning.
  • Some of the challenges faced in the analysis of NGS data for biomarker discovery by machine learning include the quality of sequences, the effect of gene length and sequencing depth, batch effects, high dimensionality, and class imbalance.

Catherene Tomy is a consulting Content Writing Intern at the Centre of Bioinformatics Research and Technology (CBIRT). She has a master’s degree in Molecular Medicine from Amrita University with research experience in the fields of bioinformatics, cell biology, and molecular biology. She loves to pull apart complex concepts and weave a story around them.

Dr. Tamanna Anwar is a Scientist and Co-founder of the Centre of Bioinformatics Research and Technology (CBIRT). She is a passionate bioinformatics scientist and a visionary entrepreneur. Dr. Tamanna has worked as a Young Scientist at Jawaharlal Nehru University, New Delhi. She has also worked as a Postdoctoral Fellow at the University of Saskatchewan, Canada. She has several scientific research publications in high-impact research journals. Her latest endeavor is the development of a platform that acts as a one-stop solution for all bioinformatics related information as well as developing a bioinformatics news portal to report cutting-edge bioinformatics breakthroughs.

