Advancements in Genome Assembly with RAFT Tool
RAFT improves genome assembly by reducing gaps in sequences.
Building accurate models of human genomes is a major task in genetics. Each person carries two copies of the genome, called haplotypes, and reconstructing both of them completely remains difficult. Recent work has tried to create these complete sequences, called telomere-to-telomere (T2T) assemblies, using advanced sequencing techniques. The challenge is to produce high-quality genomes that clearly show the variations between the two haplotypes.
Sequencing Technologies
Modern sequencing technologies, like those from Pacific Biosciences and Oxford Nanopore, help scientists collect long pieces of DNA code, which are crucial for creating these accurate genome models. These techniques provide DNA segments that are longer than those from older methods, making it easier to piece together the whole genome. The longer these pieces are, the better the chance of creating a full picture without missing important details.
Genome Assembly Process
The process of assembling a genome from these reads involves several steps. First, scientists find overlaps between the different DNA pieces. Next, they correct any errors in the reads. After that, they build a graph that links these reads based on where they match up. Finally, they identify paths through this graph to recreate the genome sequence.
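The overlap, graph, and path steps above can be sketched on a toy example. This is a minimal illustration, assuming error-free reads and exact suffix-prefix matches; the function names (`overlap_len`, `build_overlap_graph`, `walk`) and the three short reads are hypothetical, and real assemblers use approximate alignment and far more careful graph traversal. The error-correction step is skipped entirely.

```python
from typing import Dict, List, Tuple

def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Longest suffix of `a` that equals a prefix of `b`, if at least `min_len` long; else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads: List[str], min_len: int = 3) -> Dict[str, List[Tuple[str, int]]]:
    """Directed graph: an edge a -> (b, k) means the last k bases of a match the first k of b."""
    graph: Dict[str, List[Tuple[str, int]]] = {r: [] for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap_len(a, b, min_len)
                if k:
                    graph[a].append((b, k))
    return graph

def walk(graph: Dict[str, List[Tuple[str, int]]], start: str) -> str:
    """Follow the strongest overlap from each read and spell out the merged sequence."""
    seq, node, seen = start, start, {start}
    while graph[node]:
        nxt, k = max(graph[node], key=lambda edge: edge[1])
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        seq += nxt[k:]   # append only the part of the next read that is new
        seen.add(nxt)
        node = nxt
    return seq

reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads)
print(walk(graph, "ACGTAC"))  # ACGTACGGATTC
```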
However, simplifying the graph can cause complications. Some reads fit entirely within longer reads, and these so-called contained reads are removed. This can inadvertently cut connections that are needed to form a complete and accurate representation of the genome. The removal of contained reads has therefore been identified as a significant issue in genome assembly.
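To make the idea of a contained read concrete, here is a minimal sketch. It assumes each read is described by hypothetical genome coordinates, which a real assembler does not know; in practice containment is inferred from read-to-read overlaps. The names `is_contained` and `drop_contained` are illustrative only.

```python
from typing import List, Tuple

Read = Tuple[str, int, int]  # (name, start, end), half-open interval [start, end)

def is_contained(inner: Read, outer: Read) -> bool:
    """True if `inner` lies entirely within a strictly longer read `outer`."""
    nested = outer[1] <= inner[1] and inner[2] <= outer[2]
    longer = (outer[2] - outer[1]) > (inner[2] - inner[1])
    return nested and longer

def drop_contained(reads: List[Read]) -> List[Read]:
    """Keep only the reads that are not contained in any other read."""
    return [r for r in reads if not any(is_contained(r, o) for o in reads)]

reads = [("r1", 0, 100), ("r2", 20, 60), ("r3", 80, 150)]
kept = drop_contained(reads)
print([name for name, _, _ in kept])  # ['r1', 'r3']
```

Here `r2` disappears because `r1` covers the same stretch, while `r3` survives because it extends past the end of `r1`.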
Assembly Gaps
When reads are removed, gaps can appear in the assembly, which scientists refer to as assembly gaps. These gaps often occur in areas where the genetic variation between the two versions of a genome is low. So, when one version is covered by a longer read, the reads that belong to the other version might get dropped. This can create gaps in the final sequence, which are problematic for accurate assembly.
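The scenario above can be reproduced with a toy diploid example. This is a sketch under simplifying assumptions: reads carry a known haplotype label and genome coordinates, the two haplotypes differ only at the hypothetical positions in `VARIANTS`, and a read that covers no variant is treated as indistinguishable from the other haplotype. None of these names or numbers come from the source.

```python
from typing import List, Tuple

Read = Tuple[str, int, int]  # (haplotype, start, end), half-open interval

GENOME_LEN = 100
VARIANTS = [10, 90]  # hypothetical heterozygous positions; nothing differs in between

def covers_variant(read: Read) -> bool:
    _, start, end = read
    return any(start <= v < end for v in VARIANTS)

def looks_contained(inner: Read, outer: Read) -> bool:
    """`inner` is nested in a longer read and cannot be told apart from it by sequence."""
    nested = outer[1] <= inner[1] and inner[2] <= outer[2]
    longer = (outer[2] - outer[1]) > (inner[2] - inner[1])
    indistinguishable = inner[0] == outer[0] or not covers_variant(inner)
    return nested and longer and indistinguishable

def uncovered(reads: List[Read], hap: str) -> List[Tuple[int, int]]:
    """Intervals of the given haplotype that no read from that haplotype covers."""
    gaps, pos = [], 0
    for start, end in sorted((s, e) for h, s, e in reads if h == hap):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < GENOME_LEN:
        gaps.append((pos, GENOME_LEN))
    return gaps

reads = [("A", 0, 100), ("B", 0, 40), ("B", 30, 70), ("B", 60, 100)]
kept = [r for r in reads if not any(looks_contained(r, o) for o in reads)]

print(uncovered(reads, "B"))  # []
print(uncovered(kept, "B"))   # [(40, 60)]
```

The middle read of haplotype B covers no variant, so it looks like a piece of the long haplotype A read and is dropped. The two remaining B reads no longer overlap, which leaves a gap on that haplotype.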
Previous Solutions
Researchers have proposed various methods to tackle the issue of assembly gaps. Some algorithms make certain assumptions about the length of the reads or the amount of coverage provided by the sequencing process. These assumptions, however, do not always hold in real-world sequencing, especially for complex genomes that are highly repetitive.
Some of the tools created to recover these important reads work in simple cases but fail in more complicated scenarios. Others rely on extremely long reads to rescue data, and such reads may not always be available.
Calculating Assembly Gaps
Understanding how often assembly gaps occur can help researchers make better choices about sequencing strategies. By analyzing different sequencing setups, scientists can estimate how likely it is for gaps to appear in their data. This knowledge can guide decisions about which sequencing methods to use for particular genomes.
One method developed for this purpose, CGProb, works from a model of the sequencing process, such as the read-length distribution and the sequencing depth, to estimate how often assembly gaps arise. It can help predict where assembly gaps are most likely to occur and identify factors that contribute to these gaps.
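As a rough illustration of how gap frequency depends on the sequencing setup, the sketch below runs a small Monte Carlo experiment. It is not the method described in the source; it is a toy model assuming a single heterozygous site in the middle of a short genome, uniformly placed reads, and a caller-supplied read-length sampler. All names and default values (`gap_frequency`, `reads_per_hap`, `genome_len`, and so on) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Read = Tuple[int, int, int]  # (haplotype 0 or 1, start, end)

def uncovered_length(reads: List[Read], hap: int, genome_len: int) -> int:
    """Number of positions in [0, genome_len) not covered by any read of `hap`."""
    pos, missing = 0, 0
    for start, end in sorted((s, e) for h, s, e in reads if h == hap):
        if start > pos:
            missing += start - pos
        pos = max(pos, end)
    return missing + max(0, genome_len - pos)

def drop_contained(reads: List[Read], het_site: int) -> List[Read]:
    """Remove reads nested in a strictly longer read they cannot be distinguished from."""
    def hidden(inner: Read, outer: Read) -> bool:
        nested = outer[1] <= inner[1] and inner[2] <= outer[2]
        longer = (outer[2] - outer[1]) > (inner[2] - inner[1])
        same_seq = inner[0] == outer[0] or not (inner[1] <= het_site < inner[2])
        return nested and longer and same_seq
    return [r for r in reads if not any(hidden(r, o) for o in reads)]

def gap_frequency(sample_len: Callable[[random.Random], int],
                  reads_per_hap: int = 30, genome_len: int = 2000,
                  trials: int = 200, seed: int = 0) -> float:
    """Fraction of trials in which removing contained reads uncovers new positions."""
    rng = random.Random(seed)
    het_site = genome_len // 2
    hits = 0
    for _ in range(trials):
        reads: List[Read] = []
        for hap in (0, 1):
            for _ in range(reads_per_hap):
                start = rng.randrange(genome_len)
                reads.append((hap, start, start + sample_len(rng)))
        kept = drop_contained(reads, het_site)
        if any(uncovered_length(kept, h, genome_len) > uncovered_length(reads, h, genome_len)
               for h in (0, 1)):
            hits += 1
    return hits / trials

# Mixed read lengths versus a perfectly uniform read length.
print(gap_frequency(lambda rng: rng.choice([100, 200, 800])))
print(gap_frequency(lambda rng: 300))  # 0.0
```

With a single fixed read length, no read can be nested inside a strictly longer one, so this toy model never loses coverage to contained-read removal. Mixed lengths make such losses possible, and the exact frequency depends on the chosen parameters.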
Introducing RAFT
To further minimize assembly gaps, a new tool called RAFT was developed. This tool shortens long DNA reads into pieces of equal length, creating a more uniform read-length distribution. By doing so, RAFT aims to prevent the removal of important reads that previously led to assembly gaps.
RAFT uses the read alignments to decide where to cut: most of each read is fragmented, but regions that span repeats are left intact. The goal is to keep the sequence that helps stitch together complex regions of the genome while making the overall read-length distribution more uniform.
RAFT Process
In the RAFT workflow, scientists start with long, error-checked reads and alignment information. The process involves identifying portions of reads that can be fragmented while retaining those that cover complex or repetitive areas. This dual approach ensures that reads that could help bridge gaps in the genome remain intact, while others are cut down to size.
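The fragmentation idea can be sketched as follows. This is not RAFT's implementation; it is a simplified illustration that assumes the repeat-spanning regions of a read have already been identified from alignments and are passed in as hypothetical `protected` intervals. The function name `fragment_read` and the parameter `frag_len` are also assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in read coordinates

def fragment_read(read_len: int, frag_len: int, protected: List[Interval]) -> List[Interval]:
    """Split [0, read_len) into pieces of about `frag_len`, never cutting inside a protected interval."""
    if frag_len <= 0:
        raise ValueError("frag_len must be positive")
    protected = sorted(protected)
    fragments: List[Interval] = []
    start = 0
    while start < read_len:
        end = min(start + frag_len, read_len)
        # If the planned cut falls strictly inside a protected interval,
        # move the cut to the end of that interval instead.
        for p_start, p_end in protected:
            if p_start < end < p_end:
                end = min(p_end, read_len)
        fragments.append((start, end))
        start = end
    return fragments

print(fragment_read(90, 30, []))            # [(0, 30), (30, 60), (60, 90)]
print(fragment_read(100, 30, [(50, 75)]))   # [(0, 30), (30, 75), (75, 100)]
```

In the second call the cut planned at position 60 would split the protected region, so it is pushed to position 75 and that fragment comes out longer than the others. Fragments without protected regions all share the same length, which is what makes the read-length distribution more uniform.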
After RAFT processes the reads, they are passed on to a genome assembly tool to create the final genome representation. This updated workflow has been shown to be effective in reducing assembly gaps and improving overall genome quality.
Testing RAFT's Effectiveness
To evaluate how well RAFT performs, researchers conducted experiments using both simulated and real datasets, counting the assembly gaps that remained with and without RAFT. In simulations, RAFT significantly reduced the number of gaps. On real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, the authors report a twofold increase in contig NG50 and in the number of haplotype-resolved T2T contigs compared to Hifiasm.
Results of the Evaluation
The results of the evaluation indicated that using RAFT in combination with existing genome assembly tools leads to a better assembly that minimizes gaps. When comparing datasets generated through standard methods to those processed with RAFT, researchers found that the new method produced assemblies with longer contiguous segments and fewer interruptions.
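Contiguity of this kind is commonly summarized with NG50, the metric reported in the paper's abstract: the length of the contig at which the contigs of that length or longer together cover at least half of the estimated genome size. A minimal sketch, using made-up contig lengths and a hypothetical function name `ng50`:

```python
from typing import Iterable

def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Length L such that contigs of length >= L cover at least half of `genome_size`.

    Returns 0 if the contigs together cover less than half of the genome.
    """
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0

# Same total sequence, but the second assembly has fewer breaks.
print(ng50([50, 30, 20, 10], genome_size=120))  # 30
print(ng50([80, 20, 10], genome_size=120))      # 80
```

Closing a gap merges two contigs into one longer contig, which is why fewer interruptions tend to show up as a higher NG50.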
RAFT's runtime is also worth noting. It adds a processing step, and therefore extra time, compared with running an assembler alone, but the gain in assembly quality makes it a worthwhile addition to genome assembly workflows.
Conclusion
The assembly of genomes from sequencing data presents a complex challenge, especially when variations between two haplotype sequences need to be resolved. The introduction of RAFT provides a practical solution to the problem of assembly gaps caused by the deletion of contained reads. By producing reads of more uniform length while retaining repeat-spanning segments, RAFT enhances the overall quality of genome assembly.
Moving forward, continuous advancements in sequencing technologies and assembly methods will likely contribute to even more accurate models of genetic information. Tools like CGProb and RAFT are steps in the right direction that help scientists address current limitations in genome assembly, leading to more robust and continuous genomes.
Title: Telomere-to-telomere assembly by preserving contained reads
Abstract: Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the overlap-based algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. However, this procedure is not guaranteed to be safe. In practice, it occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
Authors: Chirag Jain, S. S. Kamath, M. Bindra, D. Pal
Last Update: 2024-03-12 00:00:00
Language: English
Source URL: https://www.biorxiv.org/content/10.1101/2023.11.07.565066
Source PDF: https://www.biorxiv.org/content/10.1101/2023.11.07.565066.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to biorxiv for use of its open access interoperability.