Revisiting SARS-CoV-2 Data with Viridian Tool
Viridian improves sequencing accuracy for tracking COVID-19 variants.
Zamin Iqbal, M. Hunt, A. S. Hinrichs, D. Anderson, L. Karim, B. L. Dearlove, J. Knaggs, B. Constantinides, P. W. Fowler, G. Rodger, T. L. Street, S. F. Lumley, H. Webster, T. Sanderson, C. Ruis, B. Kotzen, N. De Maio, L. N. Amenga-Etego, D. S. Amuzu, M. Avaro, G. A. Awandare, R. Ayivor-Djanie, T. Barkham, M. Bashton, E. M. Batty, Y. Bediako, D. De Belder, E. Benedetti, A. Bergthaler, S. A. Boers, J. Campos, R. A. A. Carr, Y. Y. C. Chen, F. Cuba, M. E. Dattero, W. Dejnirattisai, A. T. Dilthey, K. O. Duedu, L. Endler, I. Engelmann, N. M. Francisco, J. Fuchs, Gnimpieba
Table of Contents
- The Challenge of Data Collection
- Understanding the Errors
- The Need for Reprocessing Data
- Introducing the Viridian Tool
- Evaluating the Performance of Viridian
- Analyzing Data from the Early Omicron Wave
- The Global Sequencing Effort
- Building a High-Quality Phylogenetic Tree
- The Impact of High-Quality Data
- Conclusion
- Original Source
- Reference Links
In late 2019, a new virus, SARS-CoV-2, emerged and went on to cause the COVID-19 pandemic. Scientists quickly realized that they needed to track how the virus changed over time in order to manage its spread and develop effective vaccines. A key way to do this is genetic sequencing, which lets researchers read the virus's genome and follow its evolution. Analyzing these genetic sequences was challenging, however, especially early in the pandemic, when the volume of data grew rapidly.
The Challenge of Data Collection
Before the pandemic, scientists typically worked with small sets of genetic data, often fewer than 5,000 samples. The data they used was usually well-organized and collected from known sources, such as hospitals or public health organizations. In 2020, this changed dramatically. The pandemic created a massive demand for quick data collection and analysis, pushing scientists and bioinformaticians to their limits.
Many of the systems and tools used for data analysis were not ready for the sudden influx of samples. Researchers had to adapt quickly, often prioritizing speed over accuracy. This led to several problems, including errors in the genetic sequences that would be used for future studies and vaccine development.
Understanding the Errors
As the virus spread, it mutated. Most samples were sequenced by "amplicon sequencing," in which the genome is copied as a set of overlapping fragments (amplicons) that tile across its full length. Each amplicon is amplified with a pair of primers and then sequenced. As the virus evolved, mutations in primer-binding sites could stop an amplicon from amplifying, leaving a stretch of the genome with little or no data. Such gaps are known as dropouts.
Many assembly workflows handled missing data incorrectly, filling positions that had no coverage with the bases of the reference genome. When a dropout covered a site where the sample actually carried a mutation, the resulting sequence appeared to have reverted to the ancestral state, which was misleading. These systematic errors had real consequences for scientists trying to track the virus's evolution.
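The effect of filling a dropout with reference bases can be illustrated with a toy consensus caller. This is a minimal sketch, not the behaviour of any specific workflow: the function name `call_consensus`, the `min_depth` threshold, and the eight-base genome are all assumptions made for illustration.

```python
from collections import Counter

def call_consensus(reference, pileup, min_depth=2, fill_with_reference=False):
    """Call one base per position from a pileup (one list of observed bases per position).

    Positions with fewer than min_depth reads are masked with 'N', or, to mimic
    the problematic behaviour described above, filled with the reference base.
    """
    consensus = []
    for ref_base, observed in zip(reference, pileup):
        if len(observed) < min_depth:
            consensus.append(ref_base if fill_with_reference else "N")
        else:
            # Majority vote over the bases seen at this position.
            consensus.append(Counter(observed).most_common(1)[0][0])
    return "".join(consensus)

reference = "ACGTACGT"
# Toy sample: the true genome carries a T at index 2 (the reference has G),
# but the amplicon covering indices 2-4 dropped out, so no reads exist there.
pileup = [["A"] * 5, ["C"] * 5, [], [], [], ["C"] * 5, ["G"] * 5, ["T"] * 5]

filled = call_consensus(reference, pileup, fill_with_reference=True)
masked = call_consensus(reference, pileup)
print(filled)  # the mutation at index 2 is silently replaced by the reference base
print(masked)  # the dropout is reported as unknown instead
```

In the filled sequence the sample looks identical to the reference at the mutated site, which is how an artefactual reversion enters a phylogeny. Masking loses the same information but does not assert anything false.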
The Need for Reprocessing Data
Given these challenges, it became essential to revisit the genetic data collected during the pandemic. The goal was to identify and correct the errors that had crept in due to the hurried nature of the previous analysis. By reassembling the data with a consistent workflow, researchers could produce a high-quality dataset that would better serve future studies.
Introducing the Viridian Tool
To address the issues with existing data, a new tool called Viridian was developed. It was designed specifically for processing amplicon sequencing data from different technologies, including Illumina and Oxford Nanopore. One of its key features is the ability to automatically identify the amplicon scheme used to generate the data.
Viridian works in several stages. First, it checks the data to determine which primers were used during the sequencing. Then, it samples the reads for each amplicon to ensure that a sufficient depth of data is collected, which helps improve the accuracy of the final sequence.
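The first two stages can be sketched in a few lines. This is an illustrative toy under stated assumptions, not Viridian's implementation: the scheme names, amplicon coordinates, `tolerance`, and `target_depth` are hypothetical, and each read is assumed to span a whole amplicon and to be already mapped, so it is represented only by its start and end coordinates.

```python
import random
from collections import defaultdict

# Hypothetical toy schemes: amplicon name -> (start, end) on the reference.
SCHEMES = {
    "scheme_A": {"a1": (0, 400), "a2": (300, 700), "a3": (600, 1000)},
    "scheme_B": {"b1": (0, 550), "b2": (450, 1000)},
}

def assign_amplicon(read_start, read_end, amplicons, tolerance=10):
    """Return the amplicon whose ends match the read ends within tolerance, else None."""
    for name, (start, end) in amplicons.items():
        if abs(read_start - start) <= tolerance and abs(read_end - end) <= tolerance:
            return name
    return None

def identify_scheme(reads, schemes=SCHEMES):
    """Score each scheme by how many reads it explains and return the best one."""
    scores = {
        scheme: sum(assign_amplicon(s, e, amps) is not None for s, e in reads)
        for scheme, amps in schemes.items()
    }
    return max(scores, key=scores.get), scores

def sample_per_amplicon(reads, amplicons, target_depth, seed=0):
    """Keep at most target_depth reads per amplicon, chosen at random."""
    rng = random.Random(seed)
    by_amplicon = defaultdict(list)
    for s, e in reads:
        name = assign_amplicon(s, e, amplicons)
        if name is not None:
            by_amplicon[name].append((s, e))
    return {
        name: rng.sample(group, min(target_depth, len(group)))
        for name, group in by_amplicon.items()
    }

# Toy reads as (start, end): uneven depth across amplicons plus a few off-target reads.
reads = [(2, 398)] * 50 + [(301, 702)] * 8 + [(598, 1001)] * 30 + [(100, 250)] * 3
best, scores = identify_scheme(reads)
sampled = sample_per_amplicon(reads, SCHEMES[best], target_depth=20)
print(best, scores)
print({name: len(group) for name, group in sampled.items()})
```

Capping the depth per amplicon evens out coverage between amplicons that amplified well and those that amplified poorly, while amplicons below the target simply keep all of their reads.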
Once the data is sampled, the tool generates consensus sequences, which represent the best estimate of the virus's genome. It uses an iterative approach to refine these sequences, making adjustments based on the data it receives until a consistent sequence is produced.
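The idea of refining a consensus until it stops changing can be shown with a deliberately simplified loop. This sketch is an assumption-laden stand-in, not the algorithm Viridian uses: reads are placed without gaps near a known approximate start, bases are chosen by majority vote, and the helper names (`best_offset`, `polish_once`, `iterate_consensus`) are hypothetical.

```python
from collections import Counter

def best_offset(seq, consensus, approx_start, window=2):
    """Ungapped placement: the offset near approx_start with the fewest mismatches."""
    first = max(0, approx_start - window)
    last = min(len(consensus) - len(seq), approx_start + window)
    return min(
        range(first, last + 1),
        key=lambda off: sum(a != b for a, b in zip(seq, consensus[off:off + len(seq)])),
    )

def polish_once(consensus, reads, min_depth=2):
    """Re-place every read on the current consensus, then take a per-position majority vote."""
    columns = [Counter() for _ in consensus]
    for approx_start, seq in reads:
        off = best_offset(seq, consensus, approx_start)
        for i, base in enumerate(seq):
            columns[off + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if sum(col.values()) >= min_depth else "N"
        for col in columns
    )

def iterate_consensus(start, reads, max_rounds=10):
    """Repeat polishing until the sequence no longer changes (or max_rounds is reached)."""
    current = start
    for round_number in range(1, max_rounds + 1):
        updated = polish_once(current, reads)
        if updated == current:
            return current, round_number
        current = updated
    return current, max_rounds

reference = "ACGTTGCAAGCTTAGC"
truth = "ACGTTACAAGTTTAGC"  # toy sample: differs from the reference at indices 5 and 10
# Two error-free reads of length 8 from each of three overlapping windows.
reads = [(start, truth[start:start + 8]) for start in (0, 4, 8) for _ in range(2)]

final, rounds = iterate_consensus(reference, reads)
print(final, rounds)
```

Starting from the reference, the first round picks up both differences and the second round confirms that nothing else changes, so the loop stops. Real reads contain errors and indels, which is where repeated re-placement against an improving consensus matters more than in this toy.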
Evaluating the Performance of Viridian
To ensure that Viridian performed better than previous methods, it was tested against existing workflows. Researchers conducted three evaluations using both simulated and real data, including a comprehensive set of samples collected from various countries in Africa during the early Omicron variant outbreak.
The initial tests showed that Viridian successfully identified primer schemes with high accuracy. Furthermore, when compared to other popular assembly tools, it produced fewer errors in the final sequences. This result was particularly important as it indicated that Viridian could be a more reliable option for researchers working with SARS-CoV-2 data.
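One simple way to compare assemblies against a known truth genome is to count, position by position, how many bases are correct, masked, or wrong. The sketch below assumes the sequences are already aligned and of equal length (no insertions or deletions); the function name and the example sequences are hypothetical and are not taken from the evaluation in the source.

```python
def compare_to_truth(truth, consensus):
    """Classify each position of a consensus as correct, masked ('N'), or wrong."""
    if len(truth) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"correct": 0, "masked": 0, "wrong": 0}
    for true_base, called_base in zip(truth, consensus):
        if called_base == "N":
            counts["masked"] += 1
        elif called_base == true_base:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts

truth = "ACTTACGT"
assemblies = {
    "reference_filled": "ACGTACGT",  # dropout filled with reference bases
    "masked": "ACNNNCGT",            # dropout reported as unknown
}
report = {name: compare_to_truth(truth, seq) for name, seq in assemblies.items()}
for name, counts in report.items():
    print(name, counts)
```

Keeping masked and wrong positions in separate counts matters: an assembler that masks a doubtful position loses information, whereas one that calls the wrong base introduces an error that downstream analyses cannot detect.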
Analyzing Data from the Early Omicron Wave
For the evaluation, researchers analyzed over 12,000 samples spanning several SARS-CoV-2 variants, among them Alpha, Beta, and Delta, and capturing the emergence of the Omicron variant. These samples were processed with both Viridian and traditional methods to gauge the improvements.
The results were promising. Many systematic errors that had been present in the traditional analyses were absent in the Viridian assemblies. In essence, Viridian improved the accuracy of the sequences, which is crucial for understanding the virus's behavior and for making informed public health decisions.
The Global Sequencing Effort
As of early 2023, there were millions of SARS-CoV-2 raw sequence datasets available. However, many of these datasets lacked consistent information on the primer schemes and assembly techniques used. To tackle this, the team behind Viridian set out to process all publicly available sequencing runs, generating new consensus genomes that would serve as a valuable resource for the scientific community.
The aim was to create a comprehensive global phylogeny that minimized the need for error masking in the data. By using Viridian to assemble the sequences, the researchers hoped to provide a cleaner and more reliable dataset for further studies.
Building a High-Quality Phylogenetic Tree
One of the biggest achievements of this project was the construction of a high-quality phylogenetic tree based on the reprocessed sequences. Phylogenetic trees help visualize the relationships between different viral strains and track how they evolve over time. A clear and accurate tree is vital for understanding the dynamics of the virus and the effectiveness of interventions such as vaccines.
The first step in building the tree involved processing all the relevant SARS-CoV-2 datasets through Viridian. The results were then compared with existing datasets to assess the improvements in quality. Researchers found that the trees built from Viridian sequences had significantly fewer problematic areas that would typically need to be masked.
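One signal that a site is problematic is an excess of apparent reversions to the reference allele on the tree. The toy below counts such reversions at a single site; it is only a sketch under strong assumptions: the tree, node names, and alleles are invented, and the alleles of internal nodes are taken as already inferred.

```python
def count_reversions(parent, allele, reference_allele):
    """Count tree edges where the child carries the reference allele but its parent is derived."""
    reversions = 0
    for child, par in parent.items():
        if par is None:  # the root has no incoming edge
            continue
        child_is_reference = allele[child] == reference_allele
        parent_is_derived = allele[par] not in (reference_allele, "N")
        if child_is_reference and parent_is_derived:
            reversions += 1
    return reversions

# Hypothetical tree given as child -> parent; 'n1' is an internal node.
parent = {"root": None, "n1": "root", "s1": "n1", "s2": "n1", "s3": "n1", "s4": "root"}

# Alleles at one site. Sample s2 sits in a clade carrying T but shows the
# reference base G because a dropout was filled with the reference.
with_artefact = {"root": "G", "n1": "T", "s1": "T", "s2": "G", "s3": "T", "s4": "G"}
# The same site when the dropout is masked instead.
with_masking = dict(with_artefact, s2="N")

before = count_reversions(parent, with_artefact, "G")
after = count_reversions(parent, with_masking, "G")
print(before, after)
```

A site that accumulates many such reversions across the tree is a candidate for masking. When dropouts are masked at assembly time, the spurious reversion never appears in the first place.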
The Impact of High-Quality Data
The high-quality data generated from the Viridian assemblies had several implications for the scientific community. With fewer artefacts and systematic errors, researchers could conduct more accurate analyses of the virus's mutations and transmission patterns.
Additionally, the improved data quality led to better estimates of the number of unique SARS-CoV-2 introductions in different countries, reducing the occurrence of false positives that could skew public health decisions. Accurate data means better responses to outbreaks and more targeted public health strategies.
Conclusion
The COVID-19 pandemic brought forth unprecedented challenges in genomic surveillance and data collection. The rapid rise in SARS-CoV-2 infections highlighted the need for efficient data processing tools and robust error correction strategies. With the development of Viridian, researchers can now reprocess vast amounts of data to produce higher-quality sequences and phylogenetic trees.
By continuously improving the accuracy of genomic data, scientists hope to enhance their understanding of how the virus evolves and spreads. The goal is to ensure that the lessons learned from this pandemic inform future responses to emerging infectious diseases. In essence, building reliable datasets and maintaining rigorous quality control will be crucial for tackling public health challenges in the years to come.
Original Source
Title: Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny
Abstract: The SARS-CoV-2 genome occupies a unique place in infection biology - it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets) with fine scale information on sampling date and geography, and has been subject to unprecedented intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore over the last few years, many thousands of hours of researchers time has been spent in "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we therefore set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner, and build a cleaner phylogeny. Here we provide a global tree of 4,471,579 samples, built from a consistently assembled set of high quality consensus sequences from all available public data as of June 2024, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and mask the genome at low quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.
Authors: Zamin Iqbal, M. Hunt, A. S. Hinrichs, D. Anderson, L. Karim, B. L. Dearlove, J. Knaggs, B. Constantinides, P. W. Fowler, G. Rodger, T. L. Street, S. F. Lumley, H. Webster, T. Sanderson, C. Ruis, B. Kotzen, N. De Maio, L. N. Amenga-Etego, D. S. Amuzu, M. Avaro, G. A. Awandare, R. Ayivor-Djanie, T. Barkham, M. Bashton, E. M. Batty, Y. Bediako, D. De Belder, E. Benedetti, A. Bergthaler, S. A. Boers, J. Campos, R. A. A. Carr, Y. Y. C. Chen, F. Cuba, M. E. Dattero, W. Dejnirattisai, A. T. Dilthey, K. O. Duedu, L. Endler, I. Engelmann, N. M. Francisco, J. Fuchs, Gnimpieba
Last Update: Nov 5, 2024
Language: English
Source URL: https://www.biorxiv.org/content/10.1101/2024.04.29.591666
Source PDF: https://www.biorxiv.org/content/10.1101/2024.04.29.591666.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to bioRxiv for use of its open access interoperability.
Reference Links
- https://github.com/nebiolabs/VarSkip
- https://github.com/neherlab/hivwholeseq
- https://github.com/pysam-developers/pysam
- https://github.com/iqbal-lab-org/cylon
- https://github.c
- https://github.com/epi2me-labs/wf-artic
- https://github.com/iqbal-lab-org/covid-truth-eval
- https://github.com/iqbal-lab-org/covid-truth-datasets
- https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_id=2697049&fields=all&limit=10000000
- https://github.com/ncbi/sra-tools
- https://github.com/enasequence/enaBrowserTools
- https://github.com/ncbi/datasets
- https://www.ncbi.nlm.nih.gov/books/NBK179288/
- https://github.com/marting
- https://github.com/connor-lab/ncov2019-artic-nf
- https://github.com/joblib/joblib
- https://github.com/tqdm/tqdm