Omics datasets generated by high-throughput sequencing contain vast amounts of nucleotide data that must be analyzed to extract meaningful biological insights. A common early step is identifying and extracting specific sequences of interest, such as genetic barcodes or variants. Existing tools have limitations for this task, which motivated the development of Flexiplex, a versatile sequence searching and demultiplexing tool.

The Need for Flexible Sequence Searching in Omics Data

High-throughput sequencing techniques allow entire genomes, transcriptomes, and more to be read at base-pair resolution. For example, a single RNA sequencing experiment can generate billions of short sequence reads. To analyze this flood of data, researchers must first extract the reads relevant to their project. This could involve:

  • Finding reads from cells expressing certain genes or mutations
  • Demultiplexing reads by sample barcode to attribute them correctly
  • Identifying reads from specific cell types in a mixture

Standard command line tools have drawbacks for sequence searching on large omics datasets:

  • grep finds exact matches only, with no tolerance for sequencing errors
  • Other tools allow mismatches but are slow on big data
  • Most search for a small number of sequences, not thousands of barcodes
  • Output formats may not work with downstream analysis
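The gap between exact and error-tolerant search can be illustrated with a small sketch. The function below is not Flexiplex's implementation; it is a textbook semi-global edit-distance search written for illustration, and the function name and the example read are hypothetical.

```python
def best_approximate_match(pattern: str, text: str) -> tuple[int, int]:
    """Return (edit_distance, end_index) of the best match of pattern in text.

    Semi-global alignment: the whole pattern must be aligned, but the match
    may start and end anywhere in the text at no cost.
    """
    # prev[i] = cost of aligning pattern[:i] so that it ends at the previous
    # text position. Before any text is read, that costs i deletions.
    prev = list(range(len(pattern) + 1))
    best_dist, best_end = prev[-1], 0
    for j, base in enumerate(text, start=1):
        curr = [0]  # a match may start at any text position for free
        for i, p in enumerate(pattern, start=1):
            curr.append(min(
                prev[i - 1] + (p != base),  # match or substitution
                prev[i] + 1,                # extra base in the read
                curr[i - 1] + 1,            # base missing from the read
            ))
        if curr[-1] < best_dist:
            best_dist, best_end = curr[-1], j
        prev = curr
    return best_dist, best_end


target = "ACGTACGT"
read = "TTTTACGAACGTTTT"  # hypothetical read: the target with one substitution

print(target in read)                        # False: exact search misses it
print(best_approximate_match(target, read))  # (1, 12): found with one edit
```

An exact search reports nothing for this read, while the edit-distance search finds the target ending at position 12 with a single substitution. This quadratic-time version is only meant to show the idea; production tools rely on much faster algorithms.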

Specialized demultiplexing tools exist but have limitations:

  • Designed for specific experiment types like 10x single-cell RNA-seq
  • Often cannot handle highly noisy reads like long-read sequencing
  • Many are complex to set up and require extensive computing resources

There is a need for a flexible sequence search and demultiplexing tool that is fast, lightweight, and customizable to diverse omics applications.

Introducing Flexiplex for Versatile Sequence Analysis

To address the limitations of existing methods, researchers at the Walter and Eliza Hall Institute of Medical Research developed Flexiplex, an efficient tool for searching reads and demultiplexing barcodes. Flexiplex has several key features:

  • Finds approximate matches allowing substitutions, insertions, and deletions
  • Demultiplexes reads by finding the best barcode match from a large list
  • Splits chimeric reads containing multiple barcodes
  • Easy to install and run with minimal dependencies
  • Customizable – flanking sequences, barcode lists, error tolerance, etc.
  • Multithreaded for fast processing of large datasets

Flexiplex combines two alignment components:

  • Edlib for rapid flanking sequence search
  • Custom dynamic programming method for optimal barcode alignment

This combination provides both speed and tolerance of sequencing errors in the barcode and UMI regions.
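The two-stage idea (locate the flank, then align the adjacent bases against a barcode list) can be sketched as follows. This is a simplified illustration, not Flexiplex's code: the flank is located with an exact `str.find` as a stand-in for the Edlib search, the barcode is taken as a fixed-length window, and the function names, the `max_edits` threshold, and the example sequences are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # match or substitution
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
            ))
        prev = curr
    return prev[-1]


def assign_barcode(read, left_flank, barcode_length, known_barcodes, max_edits=2):
    """Return the unique closest known barcode, or None if there is none."""
    # Stage 1 (stand-in): exact flank search instead of an approximate one.
    start = read.find(left_flank)
    if start == -1 or not known_barcodes:
        return None
    begin = start + len(left_flank)
    candidate = read[begin:begin + barcode_length]
    if len(candidate) < barcode_length:
        return None
    # Stage 2: score the candidate against every known barcode.
    scored = sorted((levenshtein(candidate, bc), bc) for bc in known_barcodes)
    best_dist, best_barcode = scored[0]
    if best_dist > max_edits:
        return None  # too many errors to trust the assignment
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # tie between two barcodes: ambiguous
    return best_barcode


known = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
read = "GGCTTCCGATCT" + "AAAACCGC" + "TTTTTTTT"  # barcode with one error
print(assign_barcode(read, "CTTCCGATCT", 8, known))  # AAAACCCC
```

Rejecting ties and distant matches is one possible design choice for keeping false assignments low; the thresholds a real tool uses may differ.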

Flexiplex can be used in two modes:

  1. Search for user-provided sequences allowing mismatches
  2. Discover novel barcodes directly from the data
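The second mode can be pictured as counting the sequences that follow the flank across many reads and keeping the frequent ones. The sketch below assumes exactly that and nothing more; it is not Flexiplex's discovery procedure, and the function name, the exact flank search, and the `min_count` cutoff are hypothetical simplifications.

```python
from collections import Counter


def discover_barcodes(reads, left_flank, barcode_length, min_count=2):
    """Count candidate barcodes found right after the flank in each read."""
    counts = Counter()
    for read in reads:
        start = read.find(left_flank)  # exact search, for simplicity
        if start == -1:
            continue
        begin = start + len(left_flank)
        candidate = read[begin:begin + barcode_length]
        if len(candidate) == barcode_length:
            counts[candidate] += 1
    # Keep candidates seen often enough, most frequent first.
    return [(bc, n) for bc, n in counts.most_common() if n >= min_count]


reads = [
    "GATCTAAAATT",
    "CCGATCTAAAAGG",
    "GATCTCCCCAA",
    "TTTTTTT",  # no flank, so it is skipped
]
print(discover_barcodes(reads, "GATCT", 4))  # [('AAAA', 2)]
```

Barcodes seen only once are more likely to be sequencing errors than real cells, which is why a frequency cutoff of some kind is a natural filter here.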

Benchmarking Flexiplex’s Performance

To validate Flexiplex’s capabilities, the researchers tested it on real and simulated sequencing datasets, comparing it to leading specialized tools.

Accurate Sequence Search in Low-Error Short Reads

Flexiplex was first benchmarked for searching known sequences in low-error Illumina data. The Chen et al. single-cell mixture dataset contains a distinguishing sequence for each of three cell lines:

  • Fusion gene unique to MCF-7 cells
  • Viral gene unique to HEK293T cells
  • SNP unique to T47D cells

Using 34-54 bp segments from these genes, Flexiplex efficiently extracted matching reads:

  • Processed 200 million reads in 24 minutes (1 thread)
  • 10 times faster than similar tools like seqkit grep

Allowing 1-2 mismatches boosted sensitivity:

  • Found 97% more reads for MCF-7 fusion vs. grep’s exact matching
  • Cellular barcode analysis showed high precision – almost no false positives

This demonstrates Flexiplex’s power for fast and accurate sequence search even in low-error data.

Demultiplexing Noisy Long Reads

Next, the researchers tested demultiplexing cellular barcodes from noisy Oxford Nanopore long reads. On Ebrahimi et al.’s simulated ONT dataset, Flexiplex correctly demultiplexed the most reads across all error rates. To validate this on real data, the researchers used Tian et al.’s scmixology 2 dataset:

  • A pool of 5 cell lines with ONT cDNA reads
  • Matched Illumina data for orthogonal cell line validation

When its assignments were compared against those derived from Illumina SNPs and short-read barcodes, Flexiplex achieved the highest accuracy:

  • 99% concordant cell line assignments between methods
  • Outperformed specialized tools like scTagger and FLAMES
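A concordance figure of this kind is the fraction of shared barcodes that both methods assign to the same cell line. The sketch below shows one plausible way to compute it; the function name and the toy assignments are hypothetical and do not come from the benchmark.

```python
def concordance(assign_a: dict, assign_b: dict) -> float:
    """Fraction of barcodes present in both mappings with the same label."""
    shared = assign_a.keys() & assign_b.keys()
    if not shared:
        raise ValueError("no barcodes in common")
    agree = sum(assign_a[bc] == assign_b[bc] for bc in shared)
    return agree / len(shared)


# Toy example: barcode -> cell line, from two hypothetical methods.
long_read = {"bc1": "line_A", "bc2": "line_B", "bc3": "line_C", "bc4": "line_A"}
short_read = {"bc1": "line_A", "bc2": "line_B", "bc3": "line_A", "bc4": "line_A"}
print(concordance(long_read, short_read))  # 0.75
```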

Flexiplex also split chimeric reads effectively, further boosting performance. This shows Flexiplex’s robustness for demultiplexing even highly erroneous reads.
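Splitting a chimeric read amounts to finding every adapter occurrence and cutting the read so that each piece carries one barcode. The following is a minimal sketch under strong assumptions: adapters are matched exactly and only on the forward strand, and the function name and sequences are hypothetical. It does not reproduce how Flexiplex handles chimeras.

```python
def split_chimeric(read: str, adapter: str) -> list[str]:
    """Cut a read before each adapter occurrence after the first one."""
    positions = []
    pos = read.find(adapter)
    while pos != -1:
        positions.append(pos)
        pos = read.find(adapter, pos + len(adapter))
    if len(positions) < 2:
        return [read]  # zero or one adapter: not chimeric under this rule
    # Segment boundaries: read start, each later adapter, read end.
    boundaries = [0] + positions[1:] + [len(read)]
    return [read[a:b] for a, b in zip(boundaries, boundaries[1:])]


chimera = "GATCTAAAACCCC" + "GATCTGGGGTTTT"
print(split_chimeric(chimera, "GATCT"))
# ['GATCTAAAACCCC', 'GATCTGGGGTTTT']
```

Each resulting segment can then be demultiplexed on its own, so reads that would otherwise be assigned to a single cell are attributed to the correct ones.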

Discovering Cell Barcodes from Scratch

Finally, Flexiplex’s ability to discover novel barcodes without any prior barcode list was tested. The leading tools were compared on Tian et al.’s data and three hiPSC ONT datasets from You et al. Flexiplex showed sensitivity and specificity competitive with other tools for recovering true barcodes. Critically, Flexiplex was 4 to 40 times faster than specialized tools like scTagger and BLAZE for barcode discovery.

Conclusions

Flexiplex enables fast and customizable analysis of sequencing reads to extract biological signals from noise. Benchmarks on real datasets demonstrate:

  • High accuracy – finds true matches robustly, even with errors
  • Speed – processes data rapidly, leveraging multithreading
  • Low resource – memory efficient compared to alternatives
  • Easy to use – simple install and runtime

Flexiplex balances generality and specialization – it adapts to diverse experiments while remaining highly performant. It addresses the growing need for efficient sequence search and demultiplexing as omics datasets scale exponentially. With its combination of accuracy, speed, flexibility, and usability, Flexiplex represents an important new addition to the omics analysis toolkit.

Article source: Reference Paper | Flexiplex is available on GitHub

Dr. Tamanna Anwar is a Scientist and Co-founder of the Centre of Bioinformatics Research and Technology (CBIRT). She is a passionate bioinformatics scientist and a visionary entrepreneur. Dr. Tamanna has worked as a Young Scientist at Jawaharlal Nehru University, New Delhi. She has also worked as a Postdoctoral Fellow at the University of Saskatchewan, Canada. She has several scientific research publications in high-impact research journals. Her latest endeavor is the development of a platform that acts as a one-stop solution for all bioinformatics related information as well as developing a bioinformatics news portal to report cutting-edge bioinformatics breakthroughs.