Researchers at the University of Chicago developed 24 binary classifiers of MTB drug resistance status across eight anti-MTB drugs and three machine learning algorithms: logistic regression, random forest, and a 1D CNN. The models were trained on a dataset of 10,575 MTB isolates obtained from 16 countries, with an extended pan-genome reference used to detect genetic features. The models might aid in developing novel combinatorial treatments against multidrug-resistant TB.
With an estimated 1.5 million deaths in 2019, tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), is still the world’s deadliest infectious disease. A 6-month course of four first-line medications, isoniazid (INH), rifampicin (RIF), ethambutol (EMB), and pyrazinamide (PZA), is presently recommended for drug-susceptible TB. Second-line drugs were developed to treat TB that is resistant to first-line drugs, and such treatment lasts at least nine months and up to 20 months. Drug-resistant tuberculosis continues to threaten worldwide tuberculosis control efforts. In 2019, the World Health Organization estimated that nearly half a million people worldwide developed rifampicin-resistant tuberculosis (RR-TB), with 78% of those developing multidrug-resistant TB (MDR-TB). Because culture-based diagnostic techniques are time-consuming, it is critical to establish the drug susceptibility profile of a TB infection quickly.
To overcome these limitations and detect antibiotic resistance more efficiently, researchers have traditionally used association rule methods to predict antimicrobial resistance. These methods are built on whole-genome sequencing (WGS) data and identify mutations linked to antimicrobial resistance (AMR).
On a large and diverse dataset of MTB isolates, the researchers report a study of MTB drug resistance classification using two traditional ML methods, logistic regression (LR) and random forest (RF), together with a deep neural network architecture, the 1D CNN. The researchers also compared the performance of the ML classifiers against that of the Mykrobe predictor, a state-of-the-art statistical modeling tool.
They downloaded whole-genome sequencing (WGS) data for 10,575 MTB isolates from the Sequence Read Archive (SRA) database and obtained the corresponding lineage and phenotypic drug susceptibility test (DST) data from the 100,000 Genomes Project and the CRyPTIC Consortium to prepare the training data and labels. When training and assessing ML models, the phenotypic DST results for the drugs were used as labels. The majority of isolates in this dataset are susceptible, while a minority are resistant, across the four first-line medications listed above and four second-line drugs: amikacin (AMK), capreomycin (CM), kanamycin (KM), and ofloxacin (OFX).
Rather than using a single-strain reference, the researchers used all references from the Comprehensive Antibiotic Resistance Database (CARD) to form reference clusters that serve as a pan-genome reference, incorporating references from different bacteria. For variant discovery, sequencing reads were aligned to these reference clusters. Using a pan-genome reference yields more trustworthy alignments and more accurate variant detection.
The researchers used a command-line tool called ARIBA to look for potential genomic markers that could help classify MTB drug resistance. It produced a read depth file, a summary file for alignment quality, and a report file with identified mutations and AMR-associated genes.
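As a rough illustration of such a workflow, the sketch below builds the three ARIBA subcommand lines (getref, prepareref, run) for one paired-end sample without executing them. The file and directory names are hypothetical placeholders, and the options the study actually used are not stated here.

```python
def ariba_commands(reads_1, reads_2, prefix="out.card", run_dir="out.run"):
    """Build (but do not run) ARIBA command lines for one paired-end sample.

    All paths are illustrative placeholders, not the study's settings.
    """
    ref_dir = prefix + ".prepareref"
    return [
        # Download the CARD reference sequences and their metadata.
        ["ariba", "getref", "card", prefix],
        # Cluster the references and prepare them for read mapping.
        ["ariba", "prepareref", "-f", prefix + ".fa", "-m", prefix + ".tsv", ref_dir],
        # Map the reads to the reference clusters and report variants.
        ["ariba", "run", ref_dir, reads_1, reads_2, run_dir],
    ]


for cmd in ariba_commands("sample_1.fastq.gz", "sample_2.fastq.gz"):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run on a machine where ARIBA is installed.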
They next filtered out low-quality mappings and gathered 263 genetic features from the remaining high-quality mappings, including novel coding region variants, well-studied resistance-causing variants, and the presence of AMR-associated genes, each discovered in at least one of the 10,575 isolates. They also included indicator variables for each of the 19 lineages, resulting in a total of 282 features in their feature vector.
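The arithmetic behind the 282-feature vector (263 genetic features plus 19 lineage indicators) can be sketched as follows. This is a minimal illustration that assumes the genetic features are binary presence/absence flags; the lineage names and the helper build_feature_vector are hypothetical.

```python
def build_feature_vector(variant_flags, lineage, lineages):
    """Concatenate genetic feature flags with a one-hot lineage encoding."""
    if lineage not in lineages:
        raise ValueError(f"unknown lineage: {lineage}")
    one_hot = [1 if name == lineage else 0 for name in lineages]
    return list(variant_flags) + one_hot


# Hypothetical isolate: 263 genetic features and one of 19 lineages.
LINEAGES = [f"lineage_{i}" for i in range(1, 20)]  # placeholder names
flags = [0] * 263
flags[5] = 1  # pretend one resistance marker was detected
vector = build_feature_vector(flags, "lineage_4", LINEAGES)
print(len(vector))  # 282
```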
The researchers used two classic machine learning techniques, RF and LR, on sample sets labeled with phenotypic DST data to train MTB AMR classifiers for the eight drugs (first-line and second-line), with a feature vector of 282 features for each sample.
A CNN is a type of deep neural network that works with multi-dimensional data. Because deep learning algorithms need a lot of computing power, the researchers used feature selection to ensure that only relevant features were used as input.
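One common way to do this kind of selection is a filter method, which scores each feature against the label independently of any model and keeps the top k. The sketch below uses a chi-square score on binary features purely as a stand-in; the scoring function the study used is not stated here, and the toy data are invented.

```python
def chi_square(feature, labels):
    """Chi-square statistic for a 2x2 table of binary feature vs. binary label."""
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A constant feature or constant label gives an empty margin: no signal.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(columns, labels, k):
    """Return the indices of the k highest-scoring features, in index order."""
    scores = [chi_square(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


labels = [1, 1, 1, 0, 0, 0]  # 1 = resistant, 0 = susceptible (toy data)
columns = [
    [1, 1, 1, 0, 0, 0],  # perfectly associated with resistance
    [1, 0, 1, 0, 1, 0],  # weakly associated
    [0, 0, 0, 0, 0, 0],  # never observed, carries no signal
]
print(select_top_k(columns, labels, 2))  # [0, 1]
```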
After feature selection, the top 42 (RIF), 68 (INH), 113 (PZA), and 125 (EMB) drug-specific features were retained. There were 42 features common to all four sets. The authors also applied the same feature selection process to the second-line drugs and compared their results to WHO’s recently released list of AMR-associated mutations in MTB. Overall, 78.8% of the variants they selected are also on the WHO list. After selecting the essential features, they designed and developed a multi-input CNN architecture that combines sequential and non-sequential features.
The researchers used tenfold cross-validation to train and validate 24 binary AMR status classifiers across eight (first- and second-line) drugs and three machine learning algorithms: LR, RF, and a custom 1D CNN. To compare their models with a rule-based method, they also evaluated the state-of-the-art antimicrobial resistance (AMR) prediction tool Mykrobe predictor on the identical sample sets for each of the eight TB drugs. To compare the different approaches, the precision, sensitivity, specificity, accuracy, F1-score, and G-mean were determined.
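Tenfold cross-validation holds out each tenth of the samples once for testing while training on the other nine tenths. The sketch below is a minimal pure-Python version; stratifying the folds by class is an assumption made here because resistant isolates are the minority, and the function names are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing the classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


labels = [1] * 20 + [0] * 80  # toy cohort: 20 resistant, 80 susceptible
for train, test in train_test_splits(labels):
    # Every fold here has 90 training and 10 test samples, 2 of them resistant.
    print(len(train), len(test), sum(labels[i] for i in test))
```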
Different measures were calculated to assess the four techniques’ effectiveness. The F1-score is the harmonic mean of precision and sensitivity, and it weights the two equally. They added the geometric mean of sensitivity and specificity (G-mean) as an additional metric because the F1-score does not account for True Negatives (TN). For all four first-line medications and one second-line drug, the three ML approaches outperformed the rule-based method Mykrobe predictor in terms of F1-score, with the 1D CNN classifier achieving the best overall results. In the case of EMB, their best model increased sensitivity from 72.4% to 94.5%, suggesting that 1D CNN models can capture more complicated or subtle genetic signals.
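For reference, all six metrics follow from the four confusion-matrix counts, with the F1-score taken as the harmonic mean of precision and sensitivity and the G-mean as the geometric mean of sensitivity and specificity. The sketch below uses invented counts, not results from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # recall on resistant isolates
    specificity = ratio(tn, tn + fp)  # recall on susceptible isolates
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "g_mean": g_mean,
    }


# Invented counts for illustration only.
m = binary_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because TN appears only in specificity and accuracy, the F1-score is unchanged by the number of correctly classified susceptible isolates, which is why the G-mean is a useful companion on imbalanced data.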
In silico approaches include statistical association rules and machine learning. On a broad and diverse MTB isolate cohort, the researchers built ML models to classify resistance to first-line TB drugs and compared them with a statistical rule-based strategy. The results indicate that ML models are much more reliable and accurate than the rule-based Mykrobe predictor in predicting TB drug resistance across the four first-line medicines.
“We did not do hyperparameter tuning as part of our first-stage investigation for MTB AMR classification, but it is a potential technique to improve our models in the future. We can also include novel non-coding area variants and bigger variants as additional features, and compare the computationally expensive wrapper-type feature selection techniques to the filter-based one utilized in this study,” the researchers noted. In future studies, reliable SNP detection methods could be used to add low-frequency variants as extra features for training the ML models.
Story Source: Kuang, X., Wang, F., Hernandez, K.M. et al. Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN. Sci Rep 12, 2427 (2022)
Paper: https://doi.org/10.1038/s41598-022-06449-4
Code Availability: https://github.com/KuangXY3/MTB-AMR-classification-CNN