uDance, a novel method that builds highly accurate and scalable phylogenetic trees using a divide-and-conquer approach, has been developed jointly by scientists from the University of California, San Diego, and Arizona State University. Phylogenetic trees are like family trees for living organisms; they help us understand how different species are related to each other through evolution. Previous methods suffered from inaccuracy and inefficiency when handling large numbers of organisms. uDance, by contrast, refines different parts of a phylogenetic tree independently and can also build on existing trees. Using uDance, the scientists generated a tree of around 200,000 genomes from 387 marker genes and a large amount of amino acid data, with improved accuracy and scalability.
The Significance of uDance
The vast array of species on our planet translates into a large quantity of genetic material to investigate. There are two common approaches: one emphasizes many genes from relatively few organisms, and the other covers many organisms using only a few genes. The objective is to include both a large number of genes and a large number of organisms.
To achieve this goal, scientist Siavash Mirarab and his team developed uDance. Its divide-and-conquer strategy breaks the task into smaller parts and then merges the results to build the family tree. This approach not only allows efficient analysis of a massive amount of genetic data but also makes it possible to update the tree as more data becomes available.
uDance created a family tree for over 199,000 organisms using 387 marker genes while remaining computationally efficient, a substantial improvement over previous methods.
How uDance Works
- uDance takes three types of data as input: a starting tree, genetic information from known organisms (backbone sequences), and new genetic data from unknown organisms (query sequences).
- The new genetic data is placed on the starting tree based on the similarities with known genes from the backbone sequences.
- The tree is then divided into smaller groups to manage the huge dataset, and representative sequences are selected for each group.
- A genetic tree is generated for each group to decipher the relationships within that group.
- Finally, the genetic trees from all the groups are combined to create a final family tree that includes both known and unknown organisms.
Even though some sequences may not fit well into the tree and have to be removed, they can still be used for future updates.
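The steps above can be illustrated with a toy sketch. It is not uDance's actual implementation: the Hamming-distance placement, the `max_dist` cutoff, the simple agglomerative clustering, and every function name here are assumptions chosen only to show the place, divide, build, and merge pattern on a tiny example.

```python
from itertools import combinations


def hamming(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def place_queries(backbone, partition_of, queries, max_dist=0.5):
    """Assign each query to the partition of its closest backbone sequence.

    Queries farther than max_dist (a hypothetical cutoff) from every backbone
    sequence are set aside so they can be retried in a later update.
    """
    placed, set_aside = {}, []
    for name, seq in queries.items():
        closest = min(backbone, key=lambda b: hamming(seq, backbone[b]))
        if hamming(seq, backbone[closest]) > max_dist:
            set_aside.append(name)
        else:
            placed.setdefault(partition_of[closest], []).append(name)
    return placed, set_aside


def build_subtree(seqs):
    """Toy average-linkage clustering; returns a nested tuple of leaf names."""
    clusters = [(name, [name]) for name in sorted(seqs)]

    def avg_dist(pair):
        (_, a), (_, b) = pair
        total = sum(hamming(seqs[x], seqs[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=avg_dist)
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


def divide_and_conquer(backbone, partition_of, queries, max_dist=0.5):
    """Place queries, build one subtree per partition, then join the subtrees."""
    placed, set_aside = place_queries(backbone, partition_of, queries, max_dist)
    subtrees = []
    for part in sorted(set(partition_of.values())):
        members = {n: s for n, s in backbone.items() if partition_of[n] == part}
        members.update({n: queries[n] for n in placed.get(part, [])})
        subtrees.append(build_subtree(members))
    # Joining subtrees under one root is a placeholder for a real stitching step.
    return tuple(subtrees), set_aside


# Made-up example data: two backbone groups and three query sequences.
backbone = {"A1": "AAAAAAAA", "A2": "AAAAAAAT", "B1": "CCCCCCCC", "B2": "CCCCCCCG"}
partition_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
queries = {"q1": "AAAAAATT", "q2": "CCCCCCGG", "q3": "GGGGGGGG"}

tree, set_aside = divide_and_conquer(backbone, partition_of, queries)
print(tree)
print("set aside:", set_aside)
```

In this example, `q1` and `q2` land in the groups of their nearest backbone sequences, while `q3` is too distant from everything and is set aside rather than forced into the tree.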
Accuracy Exhibited by uDance in Simulations
To evaluate the accuracy of uDance in building very large phylogenetic trees, the researchers tested it on a simulated dataset of 10,000 taxa and 500 genes. The simulations covered multiple scenarios with different levels of disagreement between gene trees and the species tree, caused by horizontal gene transfer and incomplete lineage sorting. uDance showed low error, outperforming other phylogenomic methods such as ASTRAL and concatenation in most cases.
Even under increased gene tree discordance, uDance retained its accuracy, achieving the lowest mean normalized Robinson-Foulds (nRF) distances in almost all model conditions. The exception was the high discordance (HD)-P5 condition, in which estimated gene trees differ from the species tree by 78%. uDance also fared better in computational efficiency and scalability because it takes advantage of distributed computing clusters, reducing the time and resources needed.
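For readers unfamiliar with the nRF metric, here is a minimal sketch of how a normalized Robinson-Foulds distance can be computed. It assumes trees are written as nested tuples of leaf names and normalizes by the total number of non-trivial bipartitions in both trees, which is one common convention; the study's exact tooling is not described here, and the function names are illustrative.

```python
def leaf_set(tree):
    """All leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the leaf set induced by the tree's internal branches."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # fixed reference leaf gives each split a unique form
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that does not contain the reference leaf.
        side = all_leaves - below if ref in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def normalized_rf(tree1, tree2):
    """Share of splits found in exactly one of the two trees (0 = same topology)."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same leaves")
    b1, b2 = bipartitions(tree1), bipartitions(tree2)
    total = len(b1) + len(b2)
    return len(b1 ^ b2) / total if total else 0.0


# Two five-leaf trees that disagree only on how a, b and c are grouped.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
print(normalized_rf(t1, t2))  # 0.5: one of the two splits in each tree is shared
```

A value of 0 means the two trees have identical branching patterns, and a value of 1 means they share no internal branches.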
uDance’s Epic Phylogenomic Reconstruction of 200,000 Microbial Genomes
The true power of uDance was unveiled when it successfully reconstructed a phylogeny of roughly 200,000 microbial genomes. The researchers gathered 656,574 Archaea and Bacteria genomes from the NCBI and curated a dataset of 296,745 genomes with multiple sequence alignments spanning 656,907 amino acid sites. With uDance, they incrementally updated the backbone tree to include 15,953 genomes (16K tree) and then updated it further to include 199,330 genomes (200K tree). Genomes that were near-duplicates of others, unrecognizable, or of poor quality were removed. The updated tree showed a clear distinction between Archaea and Bacteria and covered a much larger variety of microbial groups, with a significant increase in representation for certain groups compared to the previous WoL (Web of Life) tree, also called the 10K tree.
Analysis of the Performance of uDance Trees
On the Basis of Resolution
- The 10K, 16K and 200K trees show a clear separation between Archaea and Bacteria, bolstering the robustness of the phylogenetic analyses.
- At the phylum level, the trees provided similar classifications for various microbial groups. However, some differences appeared, associated with short branches, and they were not evenly distributed across the trees.
- Some bacterial groups in the 10K and 16K trees shifted position in the 200K tree, possibly because the additional genomic data in the 200K tree gives a better representation of those groups.
- An observation in the 200K tree implies that the Terrabacteria group plays a fundamental role in the evolutionary history of non-CPR bacteria (CPR stands for Candidate Phyla Radiation).
- Contrary to other recent analyses, the 200K tree placed CPR separately from Terrabacteria.
- Terrabacteria was placed between Terrabacteria and a group known as Gracilicutes, which is consistent with findings from other recent studies.
- These findings highlight the need for further investigations into microbial relationships.
On the Basis of Published Taxonomies
- GTDB Taxonomy is a classification system for microbial groups that, when compared with the trees created using uDance, revealed some similarities as well as some differences.
- The uDance trees exhibited varying degrees of consistency with major super-phyla of the NCBI taxonomy as updates were applied. Some groups in uDance trees showed higher consistency with taxonomy over iterations, whereas others remained strongly paraphyletic.
- In spite of the differences, uDance trees demonstrated relatively stable degrees of discordance with NCBI groups and retained similar topological dissimilarity compared to the GTDB tree.
On the Basis of Branch Support
- Significant differences were observed between the gene trees and the overall species tree, with some gene trees closely matching the species tree and others differing substantially from it.
- Tree partitions (groups of organisms) also showed varying levels of discordance with the species tree, so the ASTRAL method was used to assign branch support to the species and partition trees.
- Branch support indicates how confident the algorithm is in the accuracy of a particular branch. Partitions with a greater variety of organisms tended to have higher branch support, and the 200K tree had more highly supported branches than the 16K tree.
Phylogenetic trees help us understand how organisms are evolutionarily related, but given the vast diversity of organisms, accurate trees are difficult to obtain. uDance’s divide-and-conquer method gives it an edge over existing phylogenomic methods in accuracy and scalability. It handles large datasets and can update trees incrementally as new data is added, without starting from scratch. It is almost fully consistent with traditional taxonomies, except for minor differences that stem from naming issues. It also improves downstream microbiome analysis and shows how models of evolution vary across different parts of the tree, which many existing methods miss.
Article Source: Reference Paper
Neegar is a consulting scientific content writing intern at CBIRT. She's a final-year student pursuing a B.Tech in Biotechnology at Odisha University of Technology and Research. Neegar's enthusiasm is sparked by the dynamic and interdisciplinary aspects of bioinformatics. She possesses a remarkable ability to elucidate intricate concepts using accessible language. Consequently, she aspires to amalgamate her proficiency in bioinformatics with her passion for writing, aiming to convey pioneering breakthroughs and innovations in the field of bioinformatics in a comprehensible manner to a wide audience.