Mobile genetic elements (MGEs), which include transposable elements (transposons), plasmids, and viruses, are genetic sequences that can move within a genome or between cells. They are found in species as varied as bacteria, plants, animals, and humans, and they rely on host cells to replicate and propagate. Plasmids and viruses drive horizontal gene transfer, which shapes the evolution of communities in diverse ecosystems. Recent metagenomic sequence analyses have uncovered a wide variety of viral genomes, but plasmids remain comparatively understudied.
geNomad is a new tool for the simultaneous detection of plasmids and viruses. It outperforms previous identification tools by combining alignment-free and gene-based models. Applying geNomad to metagenomic data has also revealed previously missed RNA and giant virus sequences, which improves the understanding of viral diversity. It is efficient enough for large-scale surveys of publicly available sequencing data. Overall, the tool improves the ability to investigate and characterize MGEs and sheds light on their role in shaping biological communities and ecosystems.
The geNomad framework for classification and annotation
The geNomad framework works in stages to detect and annotate plasmids and viruses. It combines the advantages of alignment-free and gene-based classification models in two branches. In the sequence branch, geNomad classifies sequences from their nucleotide composition, using an IGLOO-based neural network that captures non-local relationships to build a global representation of each sequence. In parallel, in the marker branch, the framework predicts proteins with a specialized version of Prodigal and annotates them by matching them against a large database of marker protein profiles. A range of genomic features, such as gene density and marker content, is then calculated and fed into a classification model that generates a confidence score for each class. An attention mechanism weights the contributions of the two branches according to the presence of specific markers in the input sequence. To improve classification reliability, geNomad offers a score calibration mechanism that allows true probabilities and false discovery rates to be estimated. Sequences classified as viral are assigned to International Committee on Taxonomy of Viruses (ICTV) taxa. The output includes rich metadata as well as the nucleotide and amino acid sequences of the identified plasmids and viruses.
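As a rough illustration of the kind of genomic features the marker branch summarizes, the sketch below computes gene density and marker content for a single sequence. The `Gene` record, its field names, and the exact feature definitions are assumptions made for this example and are not geNomad's actual feature set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    start: int  # 1-based inclusive start coordinate
    end: int    # 1-based inclusive end coordinate
    marker_class: Optional[str] = None  # "chromosome", "plasmid", "virus", or None


def genomic_features(genes, seq_length):
    """Return simple per-sequence features (hypothetical definitions)."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    n_genes = len(genes)
    # Assumes genes do not overlap; overlapping genes would be double counted.
    coding_bases = sum(g.end - g.start + 1 for g in genes)
    features = {
        "gene_density": n_genes / seq_length * 1000,  # genes per kilobase
        "coding_fraction": coding_bases / seq_length,
    }
    # Fraction of genes that hit a marker specific to each class.
    for cls in ("chromosome", "plasmid", "virus"):
        hits = sum(1 for g in genes if g.marker_class == cls)
        features[f"{cls}_marker_fraction"] = hits / n_genes if n_genes else 0.0
    return features


example_genes = [
    Gene(1, 300, "virus"),
    Gene(400, 699),
    Gene(800, 1000, "virus"),
    Gene(1200, 1500, "plasmid"),
]
example_features = genomic_features(example_genes, seq_length=2000)
```

A feature dictionary like this would be the input to the gene-based classifier; the real tool computes many more features than the handful shown here.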
A dataset of marker protein profiles
geNomad relies on a dataset of 227,897 protein profiles for chromosome, plasmid, and virus classification. These profiles were constructed by clustering diverse protein sequences from various sources, followed by specificity assessment for each class. To mitigate taxonomic bias, overrepresented sequences were downweighted. The dataset is dominated by virus-specific markers (69.2%) and provides taxonomic and functional information. Functional enrichment analysis revealed distinct functions associated with each class.
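To make the idea of class specificity with downweighting concrete, here is a minimal sketch. It assumes a very simple scheme in which each cluster member is weighted by the inverse of the number of members from the same taxon, so that every taxon contributes equally. This weighting rule and the function name are illustrative assumptions, not the procedure used to build the geNomad markers.

```python
from collections import Counter, defaultdict


def class_specificity(members):
    """Weighted class composition of one protein cluster.

    members: list of (class_label, taxon) tuples.
    Each member gets weight 1 / (number of members from the same taxon),
    so an overrepresented taxon cannot dominate the result (assumed scheme).
    """
    if not members:
        raise ValueError("cluster has no members")
    taxon_counts = Counter(taxon for _, taxon in members)
    weighted = defaultdict(float)
    for cls, taxon in members:
        weighted[cls] += 1.0 / taxon_counts[taxon]
    total = sum(weighted.values())
    return {cls: w / total for cls, w in weighted.items()}


# Eight near-identical phage proteins from one taxon, one from another phage,
# and one chromosomal protein.
cluster = [("virus", "phage_A")] * 8 + [("virus", "phage_B"), ("chromosome", "host_X")]
specificity = class_specificity(cluster)
```

Without weighting, this cluster would look 90% viral; with the assumed weighting it is two thirds viral, which better reflects that only three distinct taxa support it.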
geNomad accurately identifies plasmids and viruses
The performance of geNomad was assessed and compared with that of other plasmid and virus identification tools. The test datasets were diverse, and geNomad consistently categorized sequences accurately, even those that differed from the training data. It showed high sensitivity and precision in both plasmid and virus classification. For plasmids, geNomad achieved the best overall performance, surpassing even PlasX, which had high precision but low sensitivity. Misclassification, such as labeling viruses as plasmids, was very rare compared with other tools. For viruses, geNomad again outperformed all other tools while maintaining high sensitivity and precision. A benchmark on representative genomes from the ICTV showed geNomad's superiority in identifying viruses across host categories, including challenging cases.
Sensitive and precise identification of proviruses
geNomad excels at detecting proviruses, which are viruses integrated into the host genome. It uses a conditional random field model to score genomic regions with a high concentration of viral markers. This method delineates provirus boundaries while eliminating false viral islands, and it extends provirus edges up to neighboring tRNAs and integrases.
In benchmarks against other tools on the TIGER dataset, geNomad identified proviruses with high precision and sensitivity. Its predicted proviral regions also showed lower contamination and higher completeness than those of other tools, making it a robust choice for provirus detection.
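The idea of labeling stretches of genes as host or provirus with a chain model can be sketched as follows. This is a simplified two-state model decoded with the Viterbi algorithm, using hand-picked scores rather than a trained conditional random field. The state names, the per-gene labels ("C" for a chromosome marker, "V" for a virus marker, "U" for unannotated), and all score values are assumptions made for illustration.

```python
def viterbi_provirus(gene_labels, emit, trans):
    """Return the highest-scoring host/provirus state for each gene.

    emit[state][label] and trans[prev_state][cur_state] are additive
    (log-scale) scores; here they are hand-set, not learned.
    """
    if not gene_labels:
        return []
    states = ("host", "provirus")
    # best[s]: score of the best path that ends in state s at the current gene.
    best = {s: emit[s][gene_labels[0]] for s in states}
    back = []  # back[i][s]: best previous state if gene i + 1 is in state s
    for label in gene_labels[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] + trans[p][cur])
            new_best[cur] = best[prev] + trans[prev][cur] + emit[cur][label]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


emit = {
    "host": {"C": 1.0, "V": -2.0, "U": 0.0},
    "provirus": {"C": -2.0, "V": 1.0, "U": 0.0},
}
# Switching states is penalized, which suppresses isolated viral hits.
trans = {
    "host": {"host": 0.0, "provirus": -3.0},
    "provirus": {"host": -3.0, "provirus": 0.0},
}
labels = ["C", "C", "V", "C", "C", "V", "U", "V", "V", "C", "C"]
path = viterbi_provirus(labels, emit, trans)
```

With these scores, the lone viral marker at the third gene stays labeled as host, while the run of viral and unannotated genes later in the sequence is called as a single provirus. The switching penalty is what removes small false viral islands in this toy model.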
geNomad is designed for both user-friendliness and efficiency. It offers various installation methods, a comprehensive command line interface, and a web application through NMDC EDGE. It can be seamlessly integrated into larger workflows, making it versatile for different users. In benchmark tests, geNomad demonstrated exceptional speed, outperforming several comparable tools and significantly reducing processing times, especially when compared to VirSorter2 and PlasX. This efficiency allows geNomad to handle large datasets, exemplified by its use in processing approximately 260 million scaffolds for the IMG/VR and IMG/PR databases, the largest virus and plasmid sequence databases available.
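For readers who want to call geNomad from a larger Python workflow, a minimal wrapper might look like the sketch below. It assumes geNomad is already installed, that its database has been downloaded, and that the `genomad end-to-end` command takes the input FASTA, the output directory, and the database directory as positional arguments. The function names and file paths are hypothetical.

```python
import shutil
import subprocess


def build_genomad_command(fasta, out_dir, db_dir):
    """Assemble an end-to-end geNomad call as an argument list."""
    return ["genomad", "end-to-end", str(fasta), str(out_dir), str(db_dir)]


def run_genomad(fasta, out_dir, db_dir):
    """Run geNomad and raise an error if it is missing or exits with a failure."""
    if shutil.which("genomad") is None:
        raise RuntimeError("genomad executable not found on PATH")
    subprocess.run(build_genomad_command(fasta, out_dir, db_dir), check=True)


# Hypothetical paths, shown only to illustrate the call.
command = build_genomad_command("contigs.fna", "genomad_output", "genomad_db")
```

Keeping command construction separate from execution makes the wrapper easy to check inside a pipeline without actually launching the tool.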
Discovering the RNA and giant viruses
geNomad significantly enhances the discovery of RNA and giant viruses, addressing limitations in existing tools. In metatranscriptomes, geNomad successfully identified RNA virus genome sequences, classifying 99.9% of RdRP gene-containing sequences as viral. Importantly, it also detected RNA virus genomes lacking the RdRP gene, in contrast to other tools that classified only 43.7% of such sequences as viral. Additionally, geNomad effectively uncovered novel clades of giant viruses in 28,865 metagenome assemblies, placing 11,414 scaffolds in the Nucleocytoviricota tree. The tool’s application to soil metagenomes led to the identification of 235 more Nucleocytoviricota scaffolds, revealing several previously unknown clades of giant viruses. This underscores geNomad’s valuable contribution to expanding our understanding of RNA and giant virus diversity in various environments.
Overview of the methodology of geNomad
The dataset for geNomad is composed of prokaryotic and eukaryotic genomes, plasmid sequences, and virus sequences. The data is filtered to remove redundancy and irrelevant sequences like prophages and provirus-like regions. Eukaryotic genomes are clustered based on their amino acid identity (AAI) to reduce taxonomic bias. After this, a comprehensive protein profile database is built to aid in classifying sequences. This database includes protein alignments from various sources, both external and de novo clusters. The profiles are used to create Hidden Markov Models (HMMs) and are associated with specific functions and gene ontology terms.
Two types of classification models are used: a gene-based classifier and a sequence-based classifier. The gene-based classifier uses features related to gene structure and marker annotation and is trained as a decision forest classification model. The sequence-based classifier is trained using supervised contrastive learning with an IGLOO encoder and a dense neural network classifier. The hyperparameters of both models are optimized for performance. Finally, an aggregator model combines the outputs of the gene-based and sequence-based classifiers using an attention mechanism, and the result is fed to a dense neural network layer with softmax activation to produce the final scores.
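The aggregation step can be illustrated with a toy example. The sketch below blends the two branch outputs using attention weights that depend on marker content, then applies a softmax. It is a deliberate simplification: the attention parameters `w` and `b` are hand-picked rather than learned, the dense layer is replaced by an identity mapping, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(gene_scores, seq_scores, marker_fraction, w=4.0, b=-1.0):
    """Blend per-class scores from the two branches.

    Scores are ordered as [chromosome, plasmid, virus]. The gene branch
    receives more attention as marker_fraction grows (assumed behavior).
    """
    # Attention logits: the sequence branch is the fixed reference at 0.
    attn_gene, attn_seq = softmax([w * marker_fraction + b, 0.0])
    mixed = [attn_gene * g + attn_seq * s for g, s in zip(gene_scores, seq_scores)]
    # Identity "dense layer" followed by softmax to obtain class scores.
    return softmax(mixed)


gene_branch = [0.0, 0.0, 3.0]  # gene-based branch favors virus
seq_branch = [3.0, 0.0, 0.0]   # sequence-based branch favors chromosome
marker_rich = aggregate(gene_branch, seq_branch, marker_fraction=1.0)
marker_poor = aggregate(gene_branch, seq_branch, marker_fraction=0.0)
```

When the sequence is rich in markers, the gene-based branch dominates and the virus class wins; when no markers are present, the sequence-based branch dominates instead.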
A score calibration model is trained using artificial communities with varying proportions of chromosome, plasmid, and virus sequences. This model helps calibrate geNomad scores based on the empirical composition of sequence data.
Several existing tools for virus, plasmid, and chromosome detection were included in the performance benchmarks to assess geNomad's effectiveness. The assessment measured the sensitivity of geNomad and the other tools at controlled false discovery rates (FDR). The tools were executed with default parameters, and score cutoffs were chosen to keep the FDR close to 5% or 10% to ensure a fair evaluation.
Together, the processes, data, and models used to develop and evaluate geNomad reflect a rigorous approach to building a versatile and accurate tool for identifying mobile genetic elements in sequencing data.
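Comparing tools at a fixed FDR amounts to picking, for each tool, the most permissive score cutoff whose predictions still meet the FDR target and then reporting sensitivity at that cutoff. The sketch below shows one straightforward way to do this. The function name and the toy scores are assumptions for illustration, not the benchmark code used for geNomad.

```python
def sensitivity_at_fdr(scores, labels, max_fdr=0.05):
    """Return (cutoff, sensitivity) for the lowest cutoff with FDR <= max_fdr.

    scores: classifier scores, higher means more likely to be the target class.
    labels: True where the sequence truly belongs to the target class.
    A sequence is predicted positive when its score is >= cutoff.
    Returns (None, 0.0) if no cutoff satisfies the FDR constraint.
    """
    total_positives = sum(1 for label in labels if label)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = (None, 0.0)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Sequences with identical scores enter the prediction set together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fdr = fp / (tp + fp)
        if fdr <= max_fdr:
            best = (cutoff, tp / total_positives)
    return best


toy_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40]
toy_labels = [True, True, True, False, True, True, False, False]
strict = sensitivity_at_fdr(toy_scores, toy_labels, max_fdr=0.05)
relaxed = sensitivity_at_fdr(toy_scores, toy_labels, max_fdr=0.20)
```

On this toy data, the strict 5% target stops at a cutoff of 0.90 and recovers three of the five true positives, while the relaxed 20% target allows a cutoff of 0.60 and recovers all five.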
Detecting plasmids and viruses in sequencing data is critical in genomics because it reveals the diversity of these mobile genetic elements and their significant effects on evolution, ecology, and medicine. geNomad is built on an enormous library of marker protein profiles with thorough functional and taxonomic annotations. This database is a valuable resource in its own right, providing researchers with data for a variety of applications. In comparative assessments, geNomad surpasses existing tools in both classification performance and computational efficiency. It also excels in the taxonomic classification of viruses and in gene functional annotation. As a result, geNomad is a valuable tool for plasmid and virus researchers.
Prachi is an enthusiastic M.Tech Biotechnology student with a strong passion for merging technology and biology. This journey has propelled her into the captivating realm of Bioinformatics. She aspires to integrate her engineering prowess with a profound interest in biotechnology, aiming to connect academic and real-world knowledge in the field of Bioinformatics.