The key to unlocking genomic secrets is here. Researchers from Northwestern University, in collaboration with Stony Brook University, have developed DNABERT-2, a better alternative to the previously used foundation models DNABERT and Nucleotide Transformer. Those models work by breaking genomic sequences into fixed-size k-mers, which takes a toll on computing efficiency. DNABERT-2, in contrast, uses Byte Pair Encoding (BPE), a compression algorithm, to process genomic sequences, which improves computing efficiency and overall performance while reducing processing time and memory usage. The effectiveness of DNABERT-2 was tested using a benchmark called Genome Understanding Evaluation (GUE), and the results showed that DNABERT-2 is roughly three times more efficient than DNABERT while outperforming it on 23 out of 28 datasets.
A Brief Background of Foundation Models
Genomic sequences are storehouses of crucial information on an organism’s genetic makeup, and over the years, foundation models have proven their immense applicability in analyzing and deriving insights from those sequences. Numerous downstream tasks, such as promoter prediction, gene expression prediction, DNA methylation prediction, chromatin state analysis, and variant effect prediction, have become greatly simplified thanks to these models. They have improved our understanding of transcriptional regulation, the effects of regulatory elements, and the role of non-coding variants in human diseases and traits.
Two of the early foundation models used in genomic analysis were DNABERT and Nucleotide Transformer. Their mechanism involved breaking the genomic sequence into fixed-size k-mers (fixed-length substrings composed of the A, T, C, and G bases of DNA), a process termed tokenization. The tokenization employed in DNABERT and Nucleotide Transformer comes in two types: overlapping k-mer tokenization and non-overlapping k-mer tokenization.
Even though DNABERT drove real advances in genomic analysis, it had several limitations:
- It was trained only on the human genome; the genomes of other species were not included in its training data. Hence, it was not well suited to multi-species genomic analysis.
- Overlapping k-mer tokenization often led to information leakage: even when parts of a sequence were masked for the model to predict, the exposed, overlapping tokens gave away too much information, preventing the model from learning complex patterns in the data.
- Non-overlapping k-mer tokenization introduced an entirely different problem: even slight changes to a sequence could produce drastically different tokenizations, leading to inconsistencies and inefficient performance.
The Nucleotide Transformer suffers from similar limitations.
Overcoming the Limitations of DNABERT with DNABERT-2
Researcher Zhihan Zhou, along with his team, developed a much-improved version of DNABERT called DNABERT-2. Instead of relying on k-mer tokenization, the improved model uses Byte Pair Encoding (BPE), a compression-based tokenization algorithm widely used in natural language processing. The key difference between the two approaches is that BPE does not produce tokens of the same length: it iteratively merges frequently occurring pairs of nucleotides and genomic segments to build a variable-length vocabulary of tokens. This mechanism offers several advantages. It prevents information leakage, shortens the input sequences, and improves computational efficiency. Additionally, the variability in token length makes the learning task richer: the model must predict both the number and the type of nucleotides in a masked region entirely on its own, which research on language models has shown to be beneficial in many scenarios.
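To make the idea concrete, below is a minimal, illustrative sketch of the BPE merge procedure on a DNA string. The function names and the number of merges are hypothetical choices for illustration and are not taken from the DNABERT-2 codebase.

```python
# A minimal sketch of BPE on DNA text: start from single nucleotides and
# repeatedly merge the most frequent adjacent pair into a new, longer token.
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of the chosen pair with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sequence = "ATGCGATATATGCGC"
tokens = list(sequence)            # ['A', 'T', 'G', 'C', ...]
vocab = set(tokens)
for _ in range(5):                 # a handful of merges, purely for illustration
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)
    vocab.add(pair[0] + pair[1])   # a variable-length token enters the vocabulary

print(tokens)   # tokens of varying length after the merges
print(vocab)
```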
The “Rigorous” Training and Mechanism of DNABERT-2 Revealed
DNABERT-2 was trained using masked language modeling (MLM), and two types of datasets were utilized:
- Pre-training Dataset: This dataset included genetic information from the human genome and 135 other species, as opposed to the dataset of DNABERT, which included only the human genome. The pre-training dataset is 12 times larger than DNABERT’s, and care was taken to remove unclear or incomplete sequences.
- Benchmark Dataset: A benchmark dataset was also created to evaluate the performance of DNABERT-2 in understanding genetic sequences compared to other models. The benchmark included 28 datasets, consisting of sequences varying from 70 to 1000 characters in length, derived from humans, mice, viruses, and yeast.
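For readers unfamiliar with the masked language modeling objective mentioned above, the sketch below shows the core idea in PyTorch: a fraction of tokens is hidden and the model is trained to recover them. The special-token IDs and the 15% masking rate are illustrative assumptions, not values taken from the DNABERT-2 implementation.

```python
# A minimal sketch of MLM masking for already-tokenized DNA sequences.
import torch

MASK_ID, PAD_ID = 4, 0          # hypothetical special-token IDs
mask_prob = 0.15                # masking rate commonly used in BERT-style models

def mask_tokens(input_ids: torch.Tensor):
    """Return (masked inputs, labels) for one MLM training step."""
    labels = input_ids.clone()
    # Choose roughly 15% of the non-padding positions to mask.
    candidates = input_ids != PAD_ID
    mask = (torch.rand_like(input_ids, dtype=torch.float) < mask_prob) & candidates
    labels[~mask] = -100         # ignore unmasked positions in the loss
    masked = input_ids.clone()
    masked[mask] = MASK_ID       # replace the chosen tokens with [MASK]
    return masked, labels

# Example: a toy batch of token IDs (values are arbitrary).
batch = torch.tensor([[7, 12, 5, 9, 31, 0, 0],
                      [8, 8, 22, 17, 3, 14, 6]])
masked, labels = mask_tokens(batch)
# The model is trained to predict `labels` at the masked positions of `masked`.
```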
For tokenization, DNABERT-2 combines the SentencePiece tokenizer with the BPE algorithm to compress the sequences. Merging is repeated until a vocabulary of variable-length tokens capable of representing the entire corpus is obtained. To determine the optimum vocabulary size, many options were tested: a larger vocabulary allows sequences to be represented with fewer tokens, improving computational efficiency, but a vocabulary that is too large can degrade the model’s performance, as it has to deal with many rare tokens.
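As a rough illustration of this step, the following sketch trains a SentencePiece BPE vocabulary on a plain-text file of DNA sequences. The file names and the vocabulary size are assumptions for illustration and are not necessarily the exact settings used for DNABERT-2.

```python
# A hedged sketch of training a SentencePiece BPE vocabulary on raw DNA text.
import sentencepiece as spm

# 'genome_corpus.txt' is a hypothetical plain-text file with one DNA sequence
# per line (characters A, C, G, T).
spm.SentencePieceTrainer.train(
    input="genome_corpus.txt",
    model_prefix="dna_bpe",
    model_type="bpe",           # Byte Pair Encoding merges frequent pairs
    vocab_size=4096,            # larger vocabularies yield shorter token sequences
    character_coverage=1.0,     # keep all four nucleotides
)

# Tokenize a sequence with the learned variable-length vocabulary.
sp = spm.SentencePieceProcessor(model_file="dna_bpe.model")
tokens = sp.encode("ATGCGATATATGCGCATTAGC", out_type=str)
print(tokens)   # variable-length pieces; the exact split depends on the corpus
```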
Apart from algorithmic improvements, several architectural adjustments were made as well to improve the model’s performance, such as:
- Attention with Linear Biases (ALiBi): This technique replaces positional embeddings with distance-based penalties on the attention scores, enabling the model to handle longer sequences during training and prediction (a minimal sketch appears after this list).
- FlashAttention: This technique speeds up the attention computations.
- Low-Rank Adaptation (LoRA): This technique makes fine-tuning more efficient by updating only small low-rank weight matrices instead of all model parameters.
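Below is the minimal sketch of the ALiBi idea referenced in the list above: a penalty proportional to the distance between positions is added to the attention logits in place of positional embeddings. The per-head slopes use a simple geometric sequence for readability; the ALiBi paper specifies its own slope schedule, and this is not DNABERT-2’s exact implementation.

```python
# A minimal, illustrative ALiBi-style bias for a bidirectional encoder.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) bias to add to attention logits."""
    # One slope per head: 1/2, 1/4, 1/8, ... (simplified geometric sequence).
    slopes = torch.tensor([2.0 ** (-(h + 1)) for h in range(num_heads)])
    # Absolute distance between query position i and key position j.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (L, L)
    # Farther-apart positions receive a larger negative bias, per head.
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=4, seq_len=6)
# scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5 + bias   # added before softmax
```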
How DNABERT-2 Fares Against Other Models
DNABERT-2 was evaluated against DNABERT and Nucleotide Transformer using the GUE benchmark, which covers seven different genomic analysis tasks. The two main aspects considered were how well the models performed on the different genetic tasks and how computationally efficient they were.
Despite having more parameters, DNABERT-2 uses significantly fewer Floating Point Operations (FLOPs) during computation than the other two models. It achieved performance comparable to or better than both, while using 56 times less GPU time. DNABERT-2 outperformed the other models on 23 out of the 28 GUE datasets, with a margin of about six absolute points. When DNABERT-2 was further pre-trained on diverse genetic data, it again demonstrated improved performance with much less extra computational cost.
Out of the three models, DNABERT-2 emerged as the clear winner in the GUE evaluation.
Conclusion
The advent of foundation models has greatly advanced genomic analysis, and this advancement can be driven further by improved foundation models that counter the drawbacks of existing ones. DNABERT-2 appears to be the much-needed step forward in this regard. Its novel algorithmic approach and architectural adjustments make it efficient in terms of computation, memory, and parameters, and it has been able to surpass DNABERT and Nucleotide Transformer, which were widely used earlier. Regarding future prospects, DNABERT-2 shows potential for scaling up in model size, paving the way for even more powerful genome language models and substantially improving the quality of genomic analysis.
Story Source: Reference Paper | DNABERT-2 code, data, and pre-trained model are publicly available on GitHub
Neegar is a consulting scientific content writing intern at CBIRT. She's a final-year student pursuing a B.Tech in Biotechnology at Odisha University of Technology and Research. Neegar's enthusiasm is sparked by the dynamic and interdisciplinary aspects of bioinformatics. She possesses a remarkable ability to elucidate intricate concepts using accessible language. Consequently, she aspires to amalgamate her proficiency in bioinformatics with her passion for writing, aiming to convey pioneering breakthroughs and innovations in the field of bioinformatics in a comprehensible manner to a wide audience.