Simple Science

Cutting edge science explained simply


Comparing Genome Builds: GRCh37 vs. GRCh38

Researchers compare GRCh37 and GRCh38 genome builds, revealing key differences in variant detection.




Back in 2001, scientists published the first draft of the human genome, which is like our genetic instruction manual. Since then, they've found and fixed thousands of errors, mapped regions that vary among individuals, and drawn on DNA from a wider range of people. The result is a series of updated versions, or "builds," of the genome. But there's a catch: each build has its own coordinate system for numbering positions. Think of it like different editions of a book, each with its own page numbers.

While these new builds are generally more accurate, getting everyone to adopt them in research and medicine takes time. One big reason for this slow change is that it costs money and time to update the computer systems that work with this data. When researchers want to use new builds, they often need to realign all their sequencing data, which means storing a lot of raw data and running some pretty heavy calculations. To save time and money, scientists have created tools to change or "liftover" the genomic coordinates from one build to another, similar to converting a recipe from metric to imperial units.
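To make the liftover idea concrete, here is a minimal Python sketch of coordinate conversion. It assumes the mapping between builds can be described as a list of aligned blocks; the `Block` type, the `TOY_BLOCKS` values, and `lift_position` are all hypothetical names invented for illustration, not the format or API of any real liftover tool.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One aligned block: a stretch of the old build that maps onto the new build."""
    chrom: str
    old_start: int  # 0-based, inclusive
    old_end: int    # 0-based, exclusive
    new_start: int  # where old_start lands in the new build


# Hypothetical blocks, invented for illustration only.
TOY_BLOCKS = [
    Block("chr1", 0, 1_000, 0),          # unchanged region
    Block("chr1", 1_000, 5_000, 1_250),  # shifted by 250 bases in the new build
    # old positions 5_000..5_999 have no block: they cannot be lifted
    Block("chr1", 6_000, 9_000, 6_100),
]


def lift_position(chrom: str, pos: int, blocks: List[Block]) -> Optional[int]:
    """Return the new-build position, or None if the position cannot be lifted."""
    for block in blocks:
        if block.chrom == chrom and block.old_start <= pos < block.old_end:
            return block.new_start + (pos - block.old_start)
    return None


print(lift_position("chr1", 1_200, TOY_BLOCKS))  # 1450
print(lift_position("chr1", 5_500, TOY_BLOCKS))  # None
```

The second call shows the failure mode that matters for single variants: a position that falls outside every block has no counterpart in the new build, so it is lost in conversion.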

However, these handy tools were mainly designed for handling chunks of the genome that are larger than individual mutations. When they are used to shift single variants from one build to another, errors can pop up, and it's not always clear what kind of problems these errors cause, especially for complex changes within our genes.

The Great Variant Showdown: GRCh37 vs. GRCh38

To settle the score, researchers decided to compare two of the most popular genome builds: GRCh37 and GRCh38. They looked at DNA from 50 pairs of tumors and normal tissues, analyzing the data with the same tools and processes. By aligning the sequencing data to both builds, they could see what variants were detected on each one. After that, they converted the variants found in GRCh37 into GRCh38 and compared them.

They looked closely at four types of genetic changes: germline single nucleotide polymorphisms (SNPs), germline structural variants (SVs), somatic single nucleotide variants (SNVs) that occur only in tumor tissue, and somatic SVs.

What They Found: Germline vs. Somatic Variants

When they tallied the results, most germline variants (the inherited changes a person carries) were the same between the two builds, with over 93% overlap. However, they still uncovered around 166,700 variants in GRCh37 that didn't show up in GRCh38. For structural variants, the numbers were lower, with about 900 build-specific changes per individual. Overall, more of these variants were detected when the data were aligned to GRCh38 than to GRCh37.

For somatic variants, things got a bit trickier. Only about 82% of single nucleotide variants and 53% of structural variants showed up in both builds, leading to a lot of discrepancies. On average, researchers found over 3,600 unique somatic variants in GRCh37 that couldn’t be matched in GRCh38, while GRCh38 revealed more of these changes overall.

The Mystery of Discordance

To dig deeper, researchers calculated how often the variant calls from each build disagreed with each other. They examined three different measures of accuracy and found that disagreement was much lower for germline variants than for somatic ones. For example, only about 3.8% of germline SNPs showed disagreement, while the disagreement rate for somatic SNVs soared to nearly 26%.
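A disagreement rate of this kind can be sketched in a few lines of Python. The definition below (the share of all observed variants seen in only one of the two call sets) is an assumption made for illustration, since the study's exact formula is not given here; the `discordance_rate` function and the example variants are likewise invented.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def discordance_rate(lifted_from_37: Set[Variant], called_on_38: Set[Variant]) -> float:
    """Fraction of all observed variants seen in only one of the two call sets."""
    union = lifted_from_37 | called_on_38
    if not union:
        return 0.0
    shared = lifted_from_37 & called_on_38
    return 1.0 - len(shared) / len(union)


# Invented example calls, already expressed in GRCh38 coordinates.
lifted = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
native = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 77, "T", "C")}

rate = discordance_rate(lifted, native)
print(f"{rate:.0%} of variants are build-discordant")  # 50% of variants are build-discordant
```

Matching on the full tuple means a variant counts as shared only if its position and both alleles agree after conversion, which is a deliberately strict choice for this sketch.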

This discrepancy suggests that researchers who stick to GRCh37 for their analysis might be missing crucial somatic mutations, a bit like trying to find Waldo in the wrong edition of "Where's Waldo?" The researchers also noted that different types of structural variants had varying levels of disagreement. For instance, deletions and insertions were often in agreement, while duplications disagreed more often.

Variability Across the Genome

The researchers also asked whether the disagreements were spread evenly across the genome. They were not: some regions were clearly more problematic than others. One region in particular stood out, with 16,784 variants and a high rate of disagreement.

Other factors added to the complexity of understanding these results. For instance, discrepancies in somatic single nucleotide variants tended to be linked to lower quality scores but higher GC content. Researchers also noticed that the coverage level, which indicates how many times a particular part of the genome has been sequenced, influenced these disagreements.
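GC content, one of the features mentioned above, is simply the fraction of G and C bases in a stretch of DNA. The Python sketch below shows one way it might be computed around a variant; the function names, the window size, and the example sequence are assumptions for illustration, not the study's procedure.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases among the A/C/G/T bases in a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # nothing but ambiguous bases such as N
    return (seq.count("G") + seq.count("C")) / counted


def window_around(reference: str, pos: int, flank: int) -> str:
    """Slice the reference from pos - flank to pos + flank (0-based, clipped at the ends)."""
    start = max(0, pos - flank)
    end = min(len(reference), pos + flank + 1)
    return reference[start:end]


# Invented reference fragment with a GC-rich core.
reference = "ATATGCGCGCATAT"
print(gc_content(window_around(reference, 6, 2)))  # 1.0
print(gc_content(reference))
```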

False Positives and Validation

Many of the differences could be explained by errors in variant detection, so the researchers used targeted deep sequencing to check which calls were real. Variants found on both builds held up well, with a validation rate of over 93%. Build-specific variants fared worse: only about 34.6% of the GRCh37-specific variants and 51.3% of the GRCh38-specific variants were validated. Many build-specific calls are therefore likely false positives, but a substantial share are real variants that one build misses.

Introducing StableLift: A New Tool

In light of all these findings, scientists introduced a new tool called StableLift. This machine-learning approach uses many features of each variant to estimate the likelihood that it will be detected consistently across genome builds. The researchers trained StableLift on data from the same 50 tumor-normal pairs and then validated it on other sets of data.

StableLift performed well, especially with germline SNPs. The source abstract reports AUROCs of 0.934 ± 0.029 for predicting cross-build stability. The tool was able to discard many of the problematic variant calls, making the analyses cleaner and more reliable. The researchers also applied StableLift to structural variants and found similarly strong results.
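The source abstract summarizes this accuracy as an AUROC, which is the probability that a randomly chosen stable variant receives a higher score than a randomly chosen unstable one. The Python sketch below computes it directly from that definition; the `auroc` function and the example scores and labels are invented for illustration and are not StableLift's code or data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen stable variant outscores an unstable one.

    labels: 1 = variant was concordant across builds, 0 = discordant.
    Ties count as half a win. Compares every pair, which is fine for small examples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented stability scores for five variants and whether each was truly stable.
scores = [0.95, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0]
print(round(auroc(scores, labels), 3))  # 0.833
```

A value of 0.5 would mean the scores are no better than guessing, and 1.0 would mean every stable variant outranks every unstable one.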

Conclusion: A Call for Caution

This study sheds important light on how researchers handle data across different genome builds. Although adopting the latest build would be the simplest fix, many groups still use the older GRCh37, and moving variants between builds can lead to misleading conclusions.

As genomics moves from linear reference genomes toward more complex models, managing these discrepancies will become even more important. Tools like StableLift can help researchers navigate these challenges, reducing errors and improving our understanding of the genetic variation in our biological instruction manual.

So, the next time someone mentions the human genome, just remember: it’s a lot like cooking. You need the right recipe, the right ingredients, and sometimes, you need to know which edition of the cookbook you’re using!

Original Source

Title: StableLift: Optimized Germline and Somatic Variant Detection Across Genome Builds

Abstract: Reference genomes are foundational to modern genomics. Our growing understanding of genome structure leads to continual improvements in reference genomes and new genome "builds" with incompatible coordinate systems. We quantified the impact of genome build on germline and somatic variant calling by analyzing tumour-normal whole-genome pairs against the two most widely used human genome builds. The average individual had a build-discordance of 3.8% for germline SNPs, 8.6% for germline SVs, 25.9% for somatic SNVs and 49.6% for somatic SVs. Build-discordant variants are not simply false-positives: 47% were verified by targeted resequencing. Build-discordant variants were associated with specific genomic and technical features in variant- and algorithm-specific patterns. We leveraged these patterns to create StableLift, an algorithm that predicts cross-build stability with AUROCs of 0.934 ± 0.029. These results call for significant caution in cross-build analyses and for use of StableLift as a computationally efficient solution to mitigate inter-build artifacts.

Authors: Nicholas K. Wang, Nicholas Wiltsie, Helena K. Winata, Sorel Fitz-Gibbon, Alfredo E. Gonzalez, Nicole Zeltser, Raag Agrawal, Jieun Oh, Jaron Arbet, Yash Patel, Takafumi N. Yamaguchi, Paul C. Boutros

Last Update: Nov 3, 2024

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2024.10.31.621401

Source PDF: https://www.biorxiv.org/content/10.1101/2024.10.31.621401.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to biorxiv for use of its open access interoperability.
