Introducing MANIAC: A New Tool for Viral Genomics

MANIAC improves ANI measurement for viral genome analysis.

Rafal J Mostowy, W. Ndovie, J. Havranek, J. Koszucki, J. Leconte, L. Chindelevitch, E. M. Adriaenssens

Average Nucleotide Identity (ANI) is a method used to measure how closely related different microorganisms, such as bacteria and viruses, are to each other. By comparing specific genetic sequences called orthologous genes, scientists can see how many of the nucleotides (the building blocks of DNA) are the same between two organisms. This measurement helps researchers understand the evolutionary distance between species, guide taxonomy (the classification of organisms), and aid in other areas of microbial research.

ANI works well for close relatives, but it does not always give accurate evolutionary distances for organisms that are only distantly related. Nevertheless, ANI has become a key tool in microbial research, where it is used for species classification, for detecting gene transfer events between organisms, and in metagenomics studies.

The Evolution of ANI Measurement Techniques

Initially, researchers used tools like BLAST for identifying orthologous genes, which involved aligning DNA sequences to determine genetic similarity. However, as next-generation sequencing technologies advanced, the number of microbial genomes available for study grew. As a result, traditional methods became less practical due to the immense amount of computational power they required.

New tools emerged, allowing scientists to perform pairwise calculations of ANI more efficiently. These new approaches can be divided into two main categories: alignment-based and alignment-free methods. Alignment-based methods still rely on searching sequences but have adopted updated tools like MUMmer that are quicker than BLAST, although they can be less sensitive. On the other hand, alignment-free methods utilize short sequences known as k-mers to estimate ANI directly or identify areas for local alignment. These methods are much more efficient and can handle larger datasets, but they may sacrifice some accuracy when dealing with distantly related genomes.
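To make the k-mer idea concrete, here is a minimal Python sketch of an alignment-free estimate. It computes the Jaccard index of two full k-mer sets and converts it to a Mash-style distance, D = -ln(2j / (1 + j)) / k. This is a simplification: real tools such as Mash work on MinHash sketches of canonical k-mers rather than full sets, and the sequences and the value k = 5 below are toy choices made for illustration only.

```python
import math

def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k):
    """Jaccard index of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j, capped at 1.0."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)

# Toy sequences (not real genomes); seq_b differs from seq_a by one substitution.
seq_a = "ATGGCGTACGTTAGCCTAGGCTAACGT"
seq_b = "ATGGCGTACGTTAGCGTAGGCTAACGT"
k = 5

j = jaccard(seq_a, seq_b, k)
approx_ani = 1.0 - mash_distance(j, k)  # rough identity estimate
print(f"Jaccard = {j:.3f}, approximate ANI = {approx_ani:.3f}")
```

A single substitution removes every k-mer that overlaps it, which is why estimates of this kind become noisier as genomes diverge and fewer k-mers are shared.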

Despite the popularity of ANI in studying bacteria, its use has been less common in viral research. However, in recent years, ANI has started to gain traction in viral genomics for tasks like identifying new viruses, removing bacterial DNA from viral sequences, assigning taxonomy to new viral strains, and examining genetic boundaries between viral populations.

Differences Between Bacterial and Viral Genomes

Currently, most tools for calculating ANI have been optimized for bacterial genomes, working best around the 95% ANI threshold used for species classification. Viral genomes present different challenges: their nucleotide sequences are more variable, they are shorter, and there are no genes shared across all viruses. These differences can make standard methods less effective for viruses.

Some methods specifically designed for analyzing viral genomes exist, but they do not provide a clear metric for the proportion of genetic similarity from aligned genomes. Recently, a new tool called VIRIDIC was proposed, but it relies heavily on BLAST, limiting its scalability for analyzing larger datasets.

This raises the need for a tool that assesses genetic relatedness in viruses and meets three requirements:

  1. Reports both ANI and alignment fraction (AF), so that similarity restricted to part of a genome is not mistaken for similarity across the whole genome.
  2. Measures ANI accurately at lower values, such as 70%.
  3. Scales to datasets with hundreds of thousands of viral genomes.

Introducing MANIAC for Viral Genomics

To address these challenges, a new tool called MANIAC (MMseqs2-based Average Nucleotide Identity Accurate Calculator) was developed. MANIAC efficiently measures both ANI and AF between pairs of viral genomes. It combines alignment-free searching with alignment-based techniques to balance sensitivity and speed.

The tool operates in three modes:

  1. Genome Mode: Analyzes complete genome sequences.
  2. Coding Sequence (CDS) Mode: Works with nucleotide sequences from predicted genes.
  3. Protein Mode: Focuses on amino acid sequences and calculates Average Amino Acid Identity (AAI).

This versatility allows researchers to choose the most relevant analysis for their needs.

How MANIAC Calculates ANI and AF

In Genome Mode, MANIAC splits genomes into smaller non-overlapping fragments and uses the MMseqs2 search module to identify similar sequences between these fragments and the full genomes. A set of parameters determines how the searching is conducted, including identity thresholds and coverage metrics.
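The fragmentation step can be pictured with a short Python sketch. This is not MANIAC's code: the function name `fragment_genome`, the fragment length of 10, and the choice to keep a shorter final fragment are all assumptions made for illustration, and the search of each fragment against full genomes (done by MMseqs2 in the real tool) is not shown.

```python
def fragment_genome(sequence, fragment_length):
    """Split a sequence into consecutive non-overlapping fragments.

    Returns (start, fragment) tuples. The final fragment may be shorter
    than fragment_length; how a real pipeline treats that remainder is
    not specified here.
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return [
        (start, sequence[start:start + fragment_length])
        for start in range(0, len(sequence), fragment_length)
    ]

# Toy 26-nucleotide "genome"; real fragments would be far longer.
genome = "ACGT" * 6 + "AC"
fragments = fragment_genome(genome, 10)
for start, frag in fragments:
    print(start, frag)
```

Because the fragments do not overlap, every nucleotide of the query genome belongs to exactly one fragment, which keeps the later bookkeeping of aligned length simple.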

For every pair of genomes analyzed, MANIAC calculates ANI as the average identity of aligned nucleotides. It considers both directions for each genome pair to obtain a single ANI value. Additionally, it computes the AF, which reflects the proportion of the genomes that were aligned during the analysis.
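The calculation described above can be sketched in Python under some stated assumptions. Here ANI is the length-weighted mean identity over aligned fragments, AF is the aligned length divided by the query genome length, and the two search directions are combined by a simple average. The averaging rule, the `Hit` record, and all numbers are illustrative assumptions, not MANIAC's confirmed formulas or output.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One aligned fragment: aligned length in nucleotides and fractional identity."""
    aligned_length: int
    identity: float

def ani_and_af(hits, query_length):
    """Length-weighted ANI and alignment fraction for one search direction."""
    aligned = sum(h.aligned_length for h in hits)
    if aligned == 0:
        return 0.0, 0.0  # nothing aligned in this direction
    ani = sum(h.aligned_length * h.identity for h in hits) / aligned
    af = aligned / query_length
    return ani, af

def combine_directions(a_to_b, b_to_a):
    """Assumption: report the mean of the two directional (ANI, AF) values."""
    return tuple((x + y) / 2 for x, y in zip(a_to_b, b_to_a))

# Toy hits: genome A (3000 nt) searched against B, and B (2000 nt) against A.
hits_ab = [Hit(1000, 0.90), Hit(500, 0.80)]
hits_ba = [Hit(1500, 0.85)]

ab = ani_and_af(hits_ab, 3000)
ba = ani_and_af(hits_ba, 2000)
ani, af = combine_directions(ab, ba)
print(f"ANI = {ani:.3f}, AF = {af:.3f}")
```

Weighting by aligned length stops a short, highly similar fragment from dominating the average, and reporting AF alongside ANI shows how much of each genome the ANI value actually describes.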

MANIAC's search parameters are chosen to favour sensitivity and accuracy, so that results stay reliable even on large datasets.

The Scalability of MANIAC

MANIAC is built to handle extensive genomic datasets, making it capable of processing millions of genome pairs efficiently. Initial benchmarks indicate that it can accurately estimate ANI and AF at the same level as established gold-standard methods, while also being faster and more adaptable to different types of viral genomes.

The tool balances speed and precision, allowing researchers to conduct large-scale analyses that were previously impractical. This capability is particularly crucial in the rapidly changing field of viral genomics, where new sequences are continuously being discovered.

Testing MANIAC's Performance

To validate its effectiveness, MANIAC's performance was compared to well-known tools like pyani, fastANI, and Mash using a dataset of phage genomes. The results showed that MANIAC had a very high correlation with pyani's estimates of ANI, outperforming other speed-focused alternatives.

Tests on simulated data also showed that MANIAC gave consistently accurate estimates, including for viral genomes with ANI below 80%. This indicates that it can be relied upon for both close and more distant genetic comparisons.

Applying MANIAC to Biological Questions

Having established its efficiency and precision, MANIAC was used to explore two key areas in viral research:

  1. Investigating the Existence of ANI Gaps in Phage Populations: The tool was used to analyze a large number of phage genomes and test whether an ANI gap exists. The analysis found one, suggesting significant evolutionary boundaries within viral populations.

  2. Taxonomic Classification of Viral Genera: Researchers evaluated how well ANI and AF together predict genus-level taxonomy, with the aim of making the classification of newly identified viruses more accurate and improving the categorization of viral diversity.
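The original preprint's abstract reports that a logistic regression model on ANI and AF was used to predict genus-level taxonomy. The sketch below shows what such a model looks like in plain Python, fitted by batch gradient descent. The training pairs, learning rate, and number of epochs are invented for illustration; they are not the authors' data, settings, or results.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, lr=0.5, epochs=5000):
    """Fit weights [bias, w_ani, w_af] by batch gradient descent on log-loss."""
    w = [0.0, 0.0, 0.0]
    n = len(features)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (ani, af), y in zip(features, labels):
            err = sigmoid(w[0] + w[1] * ani + w[2] * af) - y
            grad[0] += err
            grad[1] += err * ani
            grad[2] += err * af
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

def predict(w, ani, af):
    """Predicted probability that a genome pair belongs to the same genus."""
    return sigmoid(w[0] + w[1] * ani + w[2] * af)

# Toy (ANI, AF) pairs as fractions; label 1 = same genus, 0 = different genus.
features = [(0.95, 0.90), (0.85, 0.80), (0.80, 0.70), (0.90, 0.85),
            (0.70, 0.10), (0.72, 0.20), (0.90, 0.15), (0.68, 0.05)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w = fit_logistic(features, labels)
print(f"high ANI, high AF: {predict(w, 0.92, 0.88):.2f}")
print(f"low ANI, low AF:   {predict(w, 0.71, 0.12):.2f}")
```

The toy set deliberately includes one pair with high ANI but low AF labelled as a different genus, which is the kind of case where a model using ANI alone would go wrong.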

Observations from ANI Distributions

The analysis of ANI distributions among phage genomes revealed a multimodal pattern, with a distinct ANI gap located between 78% and 85%. This suggests evolutionary discontinuities similar to those found in bacterial populations, but shifted to lower ANI, likely because of virus-specific evolutionary processes such as recombination dynamics and mosaicism.
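One crude way to look for such a gap is to bin pairwise ANI values and find the emptiest bin lying between the two most populated bins. The Python sketch below does this on made-up numbers. The bin width, the ANI range, and the data are assumptions chosen for illustration; they are not the paper's method or results, and a real analysis would need a more careful treatment of the distribution.

```python
def ani_histogram(values, lo=70, hi=100, width=5):
    """Count ANI percentages falling in [lo, hi) using bins of the given width."""
    counts = [0] * ((hi - lo) // width)
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1
    return counts

def find_gap(counts, lo=70, width=5):
    """Return (start, end) of the emptiest bin between the two tallest bins.

    Returns None if the two tallest bins are adjacent, since there is
    then no interior bin to report.
    """
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    left, right = sorted(ranked[:2])
    if right - left < 2:
        return None
    gap = min(range(left + 1, right), key=lambda i: counts[i])
    return lo + gap * width, lo + (gap + 1) * width

# Made-up pairwise ANI percentages with two clusters and a sparse middle.
toy_ani = [71, 72, 73, 74, 74, 76, 77, 81, 86, 88, 91, 92, 93, 94, 96, 97]

counts = ani_histogram(toy_ani)
print(counts)
print(find_gap(counts))
```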

Furthermore, the presence of many high ANI but low AF pairs highlights the importance of considering both metrics in taxonomic classification, as genetic mosaicism can complicate straightforward assignments.
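A short Python sketch shows why both metrics are needed: it flags genome pairs whose aligned regions are nearly identical but cover only a small part of the genomes. The cut-offs (ANI at least 0.95, AF at most 0.5), the function name, and the phage names are hypothetical values chosen for illustration, not thresholds from the paper.

```python
def flag_mosaic_candidates(pairs, min_ani=0.95, max_af=0.5):
    """Return pairs with high ANI but low AF.

    Each input item is (genome_a, genome_b, ani, af), with ANI and AF as
    fractions. The default cut-offs are illustrative assumptions.
    """
    return [(a, b) for a, b, ani, af in pairs if ani >= min_ani and af <= max_af]

# Toy pairwise results with hypothetical genome names.
pairs = [
    ("phage_A", "phage_B", 0.98, 0.95),  # similar across most of the genome
    ("phage_A", "phage_C", 0.97, 0.20),  # similar only in a small shared region
    ("phage_B", "phage_D", 0.75, 0.60),  # moderately similar overall
    ("phage_C", "phage_D", 0.96, 0.35),  # another high-ANI, low-AF pair
]

flagged = flag_mosaic_candidates(pairs)
print(flagged)
```

Judged on ANI alone, the flagged pairs would look as closely related as the first pair, even though most of their genomes did not align.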

Conclusion

MANIAC represents a significant step forward in the field of viral genomics. By offering an efficient means to calculate ANI and AF, it allows researchers to probe deeper into the relationships between viral species. The tool's ability to handle vast datasets while providing precise estimates positions it as an essential resource for future research in virology and microbial genomics.

In summary, MANIAC not only enhances the study of viral genetics but also contributes to the broader understanding of how viral species are classified and related to one another. As ongoing efforts refine viral taxonomy, tools like MANIAC will play a crucial role in establishing clearer boundaries and classifications in the diverse world of viruses.

Original Source

Title: Exploration of the genetic landscape of bacterial dsDNA viruses reveals an ANI gap amidst extensive mosaicism

Abstract: Average Nucleotide Identity (ANI) is a widely used metric to estimate genetic relatedness, especially in microbial species delineation. While ANI calculation has been well optimised for bacteria and closely related viral genomes, accurate estimation of ANI below 80%, particularly in large reference datasets, has been challenging due to a lack of accurate and scalable methods. To bridge this gap, here we introduce MANIAC, an efficient computational pipeline optimised for estimating ANI and alignment fraction (AF) in viral genomes with divergence around ANI of 70%. Using a rigorous simulation framework, we demonstrate MANIAC's accuracy and scalability compared to existing approaches, even to datasets of hundreds-of-thousands of viral genomes. Applying MANIAC to a curated dataset of complete bacterial dsDNA viruses revealed a multimodal ANI distribution, with a distinct gap around 80%, akin to the bacterial ANI gap (~90%) but shifted, likely due to viral-specific evolutionary processes such as recombination dynamics and mosaicism. We then evaluated ANI and AF as predictors of genus-level taxonomy using a logistic regression model. We found that this model has strong predictive power (PR-AUC=0.981), but that it works much better for virulent (PR-AUC=0.997) than temperate (PR-AUC=0.847) bacterial viruses. This highlights the complexity of taxonomic classification in temperate phages, known for their extensive mosaicism, and cautions against over-reliance on ANI in such cases. MANIAC can be accessed under https://github.com/bioinf-mcb/MANIAC.

Importance: We introduce a novel computational pipeline called MANIAC, designed to accurately assess Average Nucleotide Identity (ANI) and alignment fraction (AF) between diverse viral genomes, scalable to datasets of over 100k genomes. Through the use of computer simulations and real data analyses, we show that MANIAC could accurately estimate genetic relatedness between pairs of viral genomes around 60-70% ANI. We applied MANIAC to investigate the question of ANI discontinuity in bacterial dsDNA viruses, finding evidence for an ANI gap, akin to the one seen in bacteria but around ANI of 80%. We then assessed the ability of ANI and AF to predict taxonomic genus boundaries, finding its strong predictive power in virulent, but not in temperate phages. Our results suggest that bacterial dsDNA viruses may exhibit an ANI threshold (on average around 80%) above which recombination helps maintain population cohesiveness, as previously argued in bacteria.

Authors: Rafal J Mostowy, W. Ndovie, J. Havranek, J. Koszucki, J. Leconte, L. Chindelevitch, E. M. Adriaenssens

Last Update: 2024-12-07 00:00:00

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2024.04.23.590796

Source PDF: https://www.biorxiv.org/content/10.1101/2024.04.23.590796.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to biorxiv for use of its open access interoperability.
