An international team of researchers led by the University of California San Diego’s Department of Computer Science and Engineering has demonstrated that the La Jolla Assembler (LJA), a new genome assembly algorithm, vastly improves large genome reconstruction: the process of arranging DNA snippets into complete genomes, a crucial part of genomic sequencing.

Furthermore, LJA lowers error rates and makes assembly of the entire human genome easier to scale. This will facilitate massive population studies, in which thousands or millions of people’s genomes are sequenced and compared in order to better understand the genetic factors that lead to disease. The findings were published in the journal Nature Biotechnology this week.

Image Description: UC San Diego computer scientist Pavel Pevzner’s team adopted a computational approach called de Bruijn graphs, which helped them assemble millions of “reads” into complete genomes. This technique models a genome as a complex road network that connects various cities (short genomic fragments) and finds ways to traverse the network using each road exactly once.
Image Source: https://ucsdnews.ucsd.edu/pressrelease/algorithm-scales-ability-to-assemble-complete-genomes

“We used LJA to completely reconstruct almost half of the chromosomes in the human genome in a completely automatic fashion,” says Pavel Pevzner, the Ronald R. Taylor Distinguished Professor of Computer Science and senior author on the paper. The La Jolla Assembler (LJA) is a fast technique that uses the Bloom filter, sparse de Bruijn graphs, and disjointig generation to enable automated assembly of long, high-fidelity (HiFi) reads.
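As a rough illustration of the Bloom filter component named above, here is a minimal Python sketch that records which k-mers have been seen. This is an illustrative toy under assumed parameters (filter size, hash scheme), not LJA’s actual data structure:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for k-mer membership (illustration only, not
    LJA's implementation). False positives are possible at a rate set by
    the filter size; false negatives never occur."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for kmer in ("ATGGC", "TGGCG", "GGCGT"):
    bf.add(kmer)
print("ATGGC" in bf)   # True
print("CCCCC" in bf)   # False (with overwhelming probability)
```

The appeal of such a probabilistic index is memory: membership of billions of k-mers can be tested with a few bits each, at the cost of a tunable false-positive rate.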

Compared to previous assembly methods that use long, high-fidelity (HiFi) reads, this resulted in a five-fold reduction in assembly errors. The precision of this method will be valuable in large population studies of complex and little-studied regions of the human genome, such as centromeres or antibody-generating sites.

Genome assemblers are computer programs that reassemble genomes from a set of shorter sequences (reads). For many years, researchers relied almost exclusively on short-read methods, which generate reads of up to 300 nucleotides. These produced critical genomic data, but they also left gaps in genomic sequences, many of which were in biomedically significant areas. As a result, the Human Genome Project, which was completed two decades ago, left thousands of unassembled sections – unknown DNA with clinical and scientific implications.

“This incomplete human genome assembly produced a revolution in biology and medicine 20 years ago,” says Anton Bankevich, a postdoctoral researcher in the Department of Computer Science and Engineering and first author on the paper. “However, the missing pieces of the genome may hold many more secrets.”

Long HiFi reads (greater than 10,000 nucleotides) have lately become popular among scientists, allowing them to sequence whole human and animal genomes. The Telomere-to-Telomere (T2T) consortium produced the first complete human genome last year, which was a significant milestone. This effort, however, required a great deal of human labor and would be nearly impossible to scale to hundreds, let alone millions, of genomes.

To automate the process and boost speed and accuracy, Pevzner’s team turned to de Bruijn graphs, a computational technique that helped them assemble millions of reads into whole genomes. This method, which represents a genome as a complicated road network linking numerous towns (short genomic segments) and finds ways to traverse the network using each road exactly once, was devised by Dutch mathematician Nicolaas de Bruijn and has since become a sequencing workhorse. In some ways, history was repeating itself: Pevzner and others employed de Bruijn graphs to make sense of short reads more than 20 years ago.
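The road-network analogy can be sketched in a few lines of Python: nodes are (k-1)-mers (the “towns”), each k-mer drawn from the reads is a road between them, and a traversal that uses every road exactly once (an Eulerian path, found here with Hierholzer’s algorithm) spells out the genome. The reads, genome, and k below are made up for illustration; real assemblers like LJA handle errors, repeats, and billions of k-mers:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    # Nodes are (k-1)-mers ("towns"); each distinct k-mer is an edge ("road").
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):          # sorted for a deterministic traversal
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    # Hierholzer's algorithm: walk every edge exactly once, then spell
    # the sequence from the chain of visited nodes.
    out_deg = {n: len(e) for n, e in graph.items()}
    in_deg = defaultdict(int)
    for edges in graph.values():
        for dst in edges:
            in_deg[dst] += 1
    # Start where out-degree exceeds in-degree (the sequence's left end).
    start = next((n for n in graph if out_deg[n] - in_deg[n] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(n[-1] for n in path[1:])

reads = ["ATGGCG", "GCGTGC", "GTGCA"]   # overlapping reads of "ATGGCGTGCA"
genome = eulerian_path(de_bruijn_graph(reads, k=3))
print(genome)                            # → ATGGCGTGCA
```

Note how none of the reads spans the whole sequence, yet the graph traversal reconstructs it; scaling this idea to the error-prone, repeat-rich human genome is where the real algorithmic difficulty lies.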

“Although it looks like simply applying this 20-year old technique to HiFi reads would lead to excellent human genome assemblies, all previously developed algorithmic ideas fall apart when faced with constructing the enormously complex de Bruijn graph of the human genome,” said Andrey Bzikadze, a co-author on the paper and a graduate student in the Bioinformatics and Systems Biology Program at UC San Diego. “Reusing old methods would require a prohibitive amount of computer memory, making them impossible to implement.”

This problem is solved by LJA, which reduces both the data footprint and assembly errors. It paves the way for faster and more accurate large-scale population studies, in which scientists will need to assemble millions of genomes to find the gene sequences that cause disease or confer good health.

Assembling a single genome is not enough to drive biological progress. Scientists learn how different genomes function, and how they relate to disease, by comparing them. This requires scaling up genome assembly efforts and developing algorithms that yield assemblies of the same quality as the T2T human genome, but automatically.

Story Source: Bankevich, A., Bzikadze, A.V., Kolmogorov, M. et al. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-022-01220-6


