Simple Science


Revolutionizing Genome Size Estimation with LRGE

New tool LRGE improves accuracy in genome size estimation using long-read sequencing.

Michael B Hall, Lachlan J M Coin



LRGE: Smart Genome Estimation. New software provides fast, accurate genome size estimates.

Genome size is a crucial aspect of genetics, playing a key role in areas like genome assembly and the study of evolution. Estimating it becomes particularly tricky for organisms that are not commonly studied in labs, and for diverse or repetitive genetic data. It can be especially difficult with data from recent sequencing technologies that produce long reads.

The Challenge of Accurate Estimation

Current genome size estimation methods often concentrate on short-read data, which presents its own set of challenges. These methods typically demand substantial computing power or depend on already assembled genomes, which limits their effectiveness with the latest long-read sequencing technologies from companies like Pacific Biosciences and Oxford Nanopore Technologies.

As technology progresses, generating high-quality bacterial genome assemblies is becoming easier. With increasing amounts of data being produced, automated systems for tasks like identifying genetic variants and assembling genomes are now common in the field. However, many of these systems still require users to provide estimates of genome size, or they may attempt to calculate these sizes automatically. Unfortunately, existing tools for size estimation usually focus on short-read data and don’t handle the higher error rates that come with long reads very well. This can lead to many inaccurate results.

A New Method for Genome Size Estimation

The new method uses long-read overlap data to provide accurate genome size estimates without relying on already assembled references or on k-mers, the short fixed-length subsequences that many existing tools count to estimate size. By focusing on overlaps between reads, it draws on signal from across the entire genome, which makes it a strong alternative to older approaches.

The method involves analyzing how individual reads overlap with each other. For each query read, it considers the number of overlaps expected with a set of target reads, given the read lengths and a minimum overlap threshold, and from this calculates a per-read estimate of genome size. The median of these per-read estimates is then taken as the final genome size estimate. Using the median makes the result robust to outliers, such as reads that do not overlap with anything at all.
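As a rough illustration of this idea, the sketch below assumes a simple uniform-placement model: on a genome of size G, a target read of length t overlaps a query read of length q by at least m bases with probability of roughly (q + t - 2m) / G. Solving for G from the observed overlap count gives a per-read estimate. The model, the function names, and the default threshold are illustrative assumptions, not LRGE's actual implementation.

```python
import math
from statistics import median


def per_read_estimate(query_len, target_lens, n_overlaps, min_overlap):
    """Genome size estimate from a single query read (assumed model).

    Expected overlaps = sum over targets of (q + t - 2m) / G,
    so G is estimated as that sum divided by the observed overlap count.
    """
    if n_overlaps == 0:
        # A read with no overlaps gives no finite estimate.
        return math.inf
    total = sum(query_len + t - 2 * min_overlap for t in target_lens)
    return total / n_overlaps


def genome_size(query_lens, target_lens, overlap_counts, min_overlap=100):
    """Median of the per-read estimates (min_overlap default is hypothetical)."""
    estimates = [
        per_read_estimate(q, target_lens, n, min_overlap)
        for q, n in zip(query_lens, overlap_counts)
    ]
    return median(estimates)


# Toy data: 100 target reads of 10 kb, five query reads of 10 kb.
targets = [10_000] * 100
queries = [10_000] * 5
counts = [2, 2, 0, 1, 4]  # observed overlaps per query read
print(genome_size(queries, targets, counts))
```

In this toy example one query read has no overlaps, so its estimate is infinite. A mean over the five estimates would also be infinite, whereas the median stays at a finite value, which is the robustness the method relies on.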

The Software Behind the Method

The software implementing this new estimation technique is called LRGE and is written in the Rust programming language. It uses a tool called minimap2 to generate the overlaps. The software offers two strategies for size estimation: the “two-set” strategy, where the query and target reads are different, and the “all-vs-all” strategy, where both sets of reads are identical.

The two-set strategy has the advantage of using a smaller query set, which allows for quicker estimation, while the all-vs-all strategy ignores overlaps of reads with themselves. The software has been tested against other methods such as GenomeScope2, Mash, and Raven to compare their accuracy and efficiency.
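To make the overlap-counting step concrete, here is a minimal sketch that tallies overlaps per query read, assuming the overlaps are available as text in minimap2's PAF format (query name in column 1, target name in column 6). How LRGE consumes minimap2's output internally is not described here, so the function name and the `skip_self` flag are hypothetical.

```python
from collections import defaultdict


def count_overlaps(paf_lines, skip_self=True):
    """Count distinct overlapping target reads per query read from PAF lines.

    skip_self=True mirrors the all-vs-all case, where a read's overlap
    with itself is ignored.
    """
    targets_by_query = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        query, target = fields[0], fields[5]
        if skip_self and query == target:
            continue
        targets_by_query[query].add(target)
    return {query: len(targets) for query, targets in targets_by_query.items()}


def paf(query, target):
    """Build a toy PAF record; only the two name columns matter here."""
    cols = [query, "1000", "0", "900", "+", target, "1000", "100", "1000",
            "800", "900", "0"]
    return "\t".join(cols)


records = [paf("r1", "r1"), paf("r1", "r2"), paf("r1", "r2"),
           paf("r1", "r3"), paf("r2", "r1")]
print(count_overlaps(records))
```

Reads with no overlaps never appear in the PAF output, so a caller would need to treat any query missing from the result as having zero overlaps.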

Testing the New Approach

A large-scale evaluation using thousands of bacterial long-read sequencing runs helped to confirm the effectiveness of LRGE against existing methods. The evaluations included reads from both Oxford Nanopore and Pacific Biosciences, with known high-quality assemblies serving as benchmarks for comparison.

Additionally, while LRGE was initially focused on bacteria, the method was also tested on eukaryotic organisms, including yeast and fruit flies, to see how well it handles larger and more complex genomes.

Accuracy and Performance

When looking at the results, it became clear that both strategies provided similar estimates, and LRGE generally outperformed other tools in terms of accuracy, especially with ONT data. However, it was noted that Raven, a genome assembly tool, performed exceptionally well with PacBio data.

Interestingly, LRGE showed a tendency to underestimate genome size when read depth varied dramatically across the genetic material being analyzed. For instance, when some regions were covered by hundreds of thousands of reads, the estimates could skew much lower than the true size. Conversely, low-quality reads sometimes led to much larger estimates because fewer overlaps were detected.

Providing a Confidence Range

Each estimate generated by LRGE comes with a confidence range, indicating where the actual genome size is likely to fall. By analyzing percentile ranges of the estimates, the researchers found a range that contained the true genome size in over 90% of cases.
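One plausible way to derive such a range is to take a lower and an upper percentile of the per-read estimates. The sketch below uses the nearest-rank percentile definition. The percentile bounds are left as parameters because the values LRGE uses are not stated here, and the function names are hypothetical.

```python
import math


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def confidence_range(estimates, lower_pct, upper_pct):
    """Return (lower, upper) percentiles of the per-read estimates."""
    ordered = sorted(estimates)
    return (
        nearest_rank_percentile(ordered, lower_pct),
        nearest_rank_percentile(ordered, upper_pct),
    )


# Toy per-read estimates; the infinite value stands for a read with no overlaps.
estimates = [100, 200, 300, 400, 500, 600, 700, 800, 900, math.inf]
# The 20th and 80th percentiles are arbitrary choices for this example.
print(confidence_range(estimates, 20, 80))
```

Because infinite estimates sort to the top, they only affect the range if the upper percentile reaches into them.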

Efficiency in Runtime and Resource Use

The computational resources used by LRGE also showed promising results, as it operated relatively quickly and required less memory compared to other estimation methods. While there were some outlier cases where the runtime spiked, especially when faced with challenging data, overall, LRGE proved to be a more efficient choice.

Overall Implications

In conclusion, LRGE stands out as a reliable and efficient way to estimate genome size tailored to the new long-read sequencing techniques. By focusing on read overlap data, it successfully avoids the limitations of earlier k-mer-based methods and performs well across diverse datasets, including those from both bacteria and more complex eukaryotic organisms.

The advantages of LRGE extend beyond just accurate estimation; it also demands fewer computational resources than other existing tools and performs comparably to assembly-based methods while being much quicker. This flexibility and efficiency make LRGE a valuable asset in the field of bioinformatics, aiding various applications ranging from genome assembly to evolutionary research.

In the world of genetics, where size sometimes matters, having a tool that can give reliable estimates without breaking the bank on computing power is undoubtedly a win. With LRGE, scientists can feel confident in their genome size estimations, helping to pave the way toward a clearer understanding of genetic material and its implications. Who knew genome size estimation could be so exciting?

Original Source

Title: Genome size estimation from long read overlaps

Abstract: Summary: Accurate genome size estimation is an important component of genomic analyses, though existing tools are primarily optimised for short-read data. We present LRGE, a novel tool that uses read-to-read overlap information to estimate genome size in a reference-free manner. LRGE calculates per-read genome size estimates by analysing the expected number of overlaps for each read, considering read lengths and a minimum overlap threshold. The final size is taken as the median of these estimates, ensuring robustness to outliers such as reads with no overlaps. Additionally, LRGE provides an expected confidence range for the estimate. LRGE outperforms k-mer-based methods in both accuracy and computational efficiency and produces genome size estimates comparable to those from assembly-based approaches, like Raven, while using significantly less computational resources. We validate LRGE on a large, diverse bacterial dataset and confirm it generalises to eukaryotic datasets. Availability and implementation: Our method, LRGE (Long Read-based Genome size Estimation from overlaps), is implemented in Rust and is available as a precompiled binary for most architectures, a Bioconda package, a prebuilt container image, and a crates.io package as a binary (lrge) or library (liblrge). The source code is available at https://github.com/mbhall88/lrge under an MIT license.

Authors: Michael B Hall, Lachlan J M Coin

Last Update: 2024-12-02 00:00:00

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2024.11.27.625777

Source PDF: https://www.biorxiv.org/content/10.1101/2024.11.27.625777.full.pdf

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to biorxiv for use of its open access interoperability.
