Introducing deepKin: A New Method for Measuring Genetic Relatedness
deepKin improves how we assess genetic relationships using SNP data.
Understanding how individuals are related to each other is important in genetics and public health studies. It is especially crucial in genome-wide association studies (GWAS), where researchers examine many genetic markers across the whole genome, and when estimating risk for certain traits or diseases with a polygenic risk score (PRS). Traditionally, scientists used family trees (pedigrees) to estimate how closely related people are, which gives the expected genetic similarity. With genome-wide single nucleotide polymorphism (SNP) data, researchers can now calculate realized genetic relationships from the data itself, even when no pedigree is available.
This shift to SNP data brings some challenges. Different genotyping methods and quality-control procedures can add confusion, so inferring relationships from SNP data can be complicated.
Methods for Measuring Genetic Relationships
There are different ways to estimate how closely related people are using SNP data. Some methods use maximum likelihood approaches, while others use moments-based estimators. Although moments-based estimators may not be as precise, they are faster and easier to compute. Over the years, several factors that affect relatedness estimates have been studied. One study, for example, examined how realized relationships vary because of random genetic sampling and genetic linkage.
Currently, many researchers use SNP-based measures in population studies, but there has been much less focus on how much these measures vary from sample to sample. This sampling variation can significantly affect the power to distinguish closely related pairs from unrelated ones.
Static cut-off numbers are often used to decide if two samples are related. This can lead to mistakes, like false positives, when the variation in estimates is ignored. If researchers only rely on fixed cut-offs without considering how the data behaves, they might incorrectly label pairs as related.
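The effect of a fixed cut-off can be illustrated with a small simulation. This is a minimal sketch, not the deepKin procedure: it simulates unrelated individuals with independent markers, computes a standard genetic relationship matrix (GRM), and counts how many pairs exceed a fixed threshold. The threshold of 0.0884, the sample size, and the marker counts are assumptions chosen only for illustration.

```python
import numpy as np

def false_positive_rate(n_samples, n_markers, cutoff, seed=0):
    """Fraction of unrelated pairs whose GRM relatedness exceeds a fixed cutoff."""
    rng = np.random.default_rng(seed)
    # Allele frequencies and 0/1/2 genotypes for unrelated individuals.
    freqs = rng.uniform(0.2, 0.5, size=n_markers)
    geno = rng.binomial(2, freqs, size=(n_samples, n_markers)).astype(float)
    # Standardize each marker, then form the genetic relationship matrix.
    std = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    grm = std @ std.T / n_markers
    off_diag = grm[np.triu_indices(n_samples, k=1)]
    return float(np.mean(off_diag > cutoff))

CUTOFF = 0.0884  # example fixed cut-off (an assumption of this sketch)
fp_few = false_positive_rate(300, 200, CUTOFF)
fp_many = false_positive_rate(300, 5000, CUTOFF)
print(f"200 markers:  {fp_few:.4f} of unrelated pairs flagged as related")
print(f"5000 markers: {fp_many:.4f} of unrelated pairs flagged as related")
```

With few markers the estimates scatter widely around zero, so the same cut-off flags many unrelated pairs; with many markers it flags almost none. The cut-off only makes sense relative to the sampling variation of the estimate.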
Introducing DeepKin: A New Approach
The new method, called deepKin, offers a fresh way to measure relatedness using SNP data. It differs from earlier methods because it also quantifies the sampling variation that comes with estimating relatedness. With this information, deepKin can help researchers decide whether an estimated relatedness is statistically significant.
DeepKin focuses on three key concepts in estimating relatedness:
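- It sets a critical value that separates significant relatedness from non-significant relatedness. This is the value needed to reject the hypothesis that a pair is unrelated, which the authors call the deepest significant relatedness.
- It identifies the minimum effective number of genetic markers needed to detect a specific degree of relative.
- It shows how statistical power changes with the degree of relatedness being tested.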
The team behind deepKin tested it through simulations and real data, showing its effectiveness. They also made deepKin available to researchers as an R package.
Understanding the Methods of DeepKin
A core aim of this study is to define the level of variation for moments-based genetic relatedness. DeepKin uses an approach similar to that of the original KING method, but with different scaling factors. Researchers can create matrices to describe genetic relationships based on genotypic values.
The KING estimator reports a kinship coefficient, which is half of the expected relatedness (for example, 0.25 for full siblings, whose expected relatedness is 0.5). To make comparisons easier, researchers often double the KING estimates.
However, realized genetic similarity can take any value between 0 and 1 rather than a few discrete pedigree values. Estimates therefore fluctuate around their expectation, and understanding the sampling variance is crucial for interpreting them.
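The relationship between a doubled KING estimate and a moment estimator with a different scaling can be sketched as follows. The first function follows the published KING-homo formula; the second scales each marker by its own heterozygosity, which is an assumption about what "different scaling factors" means here and should be checked against the original paper. The function names, allele frequencies, and marker counts are illustrative.

```python
import numpy as np

def king_homo_relatedness(x_i, x_j, freqs):
    """Doubled KING-homo kinship: one scaling factor, the summed heterozygosity."""
    het = 2.0 * freqs * (1.0 - freqs)
    kinship = 0.5 - np.sum((x_i - x_j) ** 2) / (4.0 * np.sum(het))
    return 2.0 * kinship

def per_marker_relatedness(x_i, x_j, freqs):
    """Same squared-difference idea, but each marker is scaled on its own
    (assumed form for this sketch, not confirmed by the source)."""
    het = 2.0 * freqs * (1.0 - freqs)
    return 1.0 - np.mean((x_i - x_j) ** 2 / (2.0 * het))

rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 0.5, size=20000)
a = rng.binomial(2, freqs).astype(float)  # individual A
b = rng.binomial(2, freqs).astype(float)  # individual B, unrelated to A

print("unrelated pair:", king_homo_relatedness(a, b, freqs),
      per_marker_relatedness(a, b, freqs))
print("identical pair:", king_homo_relatedness(a, a, freqs),
      per_marker_relatedness(a, a, freqs))
```

Both estimators are on the relatedness scale: close to 0 for an unrelated pair and exactly 1 when the two genotype vectors are identical.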
Inferring Relatedness with DeepKin
DeepKin provides a method for researchers to test whether pairs of individuals are related. By treating each estimate statistically, deepKin can calculate z-scores and corresponding p-values from the sampling variance of the estimator. Once researchers set a significance level, deepKin can define a critical value for drawing conclusions about relatedness.
While relatedness scores can range continuously, it can be useful to group them into categories for easier analysis. DeepKin allows the assessment of an observed relationship against predefined degrees of relatedness using statistical tests.
The method involves two primary parameters: sample size and effective number of markers. Ultimately, deepKin aims to improve how genetic relationships are inferred by providing guidelines that help researchers make informed decisions.
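The test described in this section can be sketched with the Python standard library. This is a hedged illustration, not the R package's interface: it assumes the estimate for an unrelated pair is approximately normal with variance 2/me, and that the significance level is Bonferroni-corrected over all pairs in a sample of size n. Both assumptions, and all function names, are introduced here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def null_sd(me):
    """Standard deviation of the estimate for an unrelated pair
    (assumes a sampling variance of 2 / me)."""
    return (2.0 / me) ** 0.5

def relatedness_test(theta_hat, me, n_samples, alpha=0.05):
    """One-sided test of H0 'unrelated', Bonferroni-corrected over all pairs."""
    n_pairs = n_samples * (n_samples - 1) // 2
    z = theta_hat / null_sd(me)
    p_value = STD_NORMAL.cdf(-z)  # upper-tail probability of z
    # Critical value: smallest estimate that stays significant after correction.
    critical = STD_NORMAL.inv_cdf(1.0 - alpha / n_pairs) * null_sd(me)
    return z, p_value, critical, theta_hat > critical

for theta_hat in (0.03, 0.08):
    z, p, critical, significant = relatedness_test(theta_hat, me=20000, n_samples=3000)
    print(f"theta={theta_hat}: z={z:.2f}, p={p:.2e}, "
          f"critical={critical:.4f}, significant={significant}")
```

The critical value depends only on the sample size (through the number of pairs) and the effective number of markers, which matches the two parameters named above.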
Guidelines for Using DeepKin
Researchers can follow two key guidelines when using deepKin:
- Choose markers wisely: deepKin can pinpoint the minimum effective number of markers required to detect a specific degree of relationship. By focusing only on the necessary variants, researchers can save time and reduce costs.
- Understand statistical power: once the significance level is set, researchers can determine how much power is gained or lost with the number of markers available. Increasing the effective number of markers raises the chance of detecting more distant relationships.
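Both guidelines can be turned into a rough calculation. The sketch below assumes that a degree-d relative has expected relatedness 0.5^d and that the estimator is approximately normal with variance 2(1 - theta)^2 / me; these are assumptions made for illustration and should be checked against the original paper. Under them, the power of a one-sided test and the smallest me that reaches a target power follow from standard normal quantiles. The function names are hypothetical, and `alpha` is the per-pair significance level after any multiple-testing correction.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_for_degree(degree, me, alpha):
    """Power to detect a degree-d relative (degree >= 1) at per-pair level alpha."""
    theta = 0.5 ** degree
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    # Distance between the expected estimate and the critical value,
    # in units of the standard deviation under the alternative.
    return STD_NORMAL.cdf((theta * (me / 2.0) ** 0.5 - z_alpha) / (1.0 - theta))

def min_effective_markers(degree, alpha, power):
    """Smallest me that reaches the target power for a degree-d relative."""
    theta = 0.5 ** degree
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return 2.0 * ((z_alpha + z_beta * (1.0 - theta)) / theta) ** 2

ALPHA = 1e-8  # illustrative per-pair significance level
for degree in range(1, 6):
    me_min = min_effective_markers(degree, ALPHA, power=0.9)
    print(f"degree {degree}: minimum me of about {me_min:.0f}")
```

Each additional degree halves the expected relatedness, so the required effective number of markers grows by roughly a factor of four per degree in this approximation.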
The Importance of Effective Number of Markers
The effective number of markers, denoted "me", plays a central role in estimating relatedness with deepKin. It reflects the average correlation (linkage disequilibrium) between variants: the more correlated the markers are, the smaller the effective number becomes relative to the actual marker count. Researchers can compute this number directly, but doing so can be costly in terms of computing power.
To address this issue, two estimators are proposed. The first is a GRM-based estimator, which looks at off-diagonal elements of the genetic relationship matrix. The second is a randomization-based estimator, which improves efficiency by iterating through a set number of trials.
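The two kinds of estimator can be sketched in a few lines. The GRM-based version below uses the common approximation that me is the reciprocal of the variance of the off-diagonal GRM elements. The randomized version is one possible design, assumed here for illustration and not necessarily the one in the paper: it approximates the same quantity with random probe vectors (a Hutchinson-style trace estimate) so the n-by-n GRM is never formed. Function names and simulation settings are hypothetical.

```python
import numpy as np

def standardize(geno):
    """Center and scale each marker (column) to mean 0 and variance 1."""
    return (geno - geno.mean(axis=0)) / geno.std(axis=0)

def me_from_grm(std):
    """GRM-based estimate: reciprocal of the variance of off-diagonal elements."""
    n, m = std.shape
    grm = std @ std.T / m
    off_diag = grm[np.triu_indices(n, k=1)]
    return 1.0 / np.var(off_diag)

def me_randomized(std, n_iter=200, seed=0):
    """Randomized estimate that avoids building the GRM."""
    rng = np.random.default_rng(seed)
    n, m = std.shape
    diag = np.sum(std ** 2, axis=1) / m      # exact GRM diagonal
    trace_sq = 0.0
    for _ in range(n_iter):
        z = rng.standard_normal(n)
        kz = std @ (std.T @ z) / m           # GRM times z, without the GRM
        trace_sq += kz @ kz                  # one draw estimating trace(GRM^2)
    trace_sq /= n_iter
    n_off = n * (n - 1)
    mean_sq = (trace_sq - np.sum(diag ** 2)) / n_off
    mean = -np.sum(diag) / n_off             # centered columns: all GRM entries sum to 0
    return 1.0 / (mean_sq - mean ** 2)

rng = np.random.default_rng(2)
freqs = rng.uniform(0.1, 0.5, size=500)
geno = rng.binomial(2, freqs, size=(200, 500)).astype(float)
std = standardize(geno)
me_grm = me_from_grm(std)
me_rand = me_randomized(std)
print(f"GRM-based me: {me_grm:.0f}, randomized me: {me_rand:.0f}")
```

Because the simulated markers are independent, both estimates should land near the actual marker count of 500; with real, correlated SNPs the effective number would be smaller. More iterations make the randomized estimate more stable at a higher computing cost.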
In simulations, researchers validate deepKin's effectiveness using both estimators to demonstrate statistical precision.
Validating the Variance
The sampling variance used by deepKin was validated under both single-locus and multi-locus models. Researchers tested how well the expected results align with observed data under various scenarios to confirm the robustness of their findings.
Simulations demonstrate that the deepKin method effectively captures true relationships, ensuring reliability across different degrees of relatedness.
Real-World Applications: UK Biobank
In a practical application, researchers applied deepKin to two subsets of the UK Biobank: an Oxford subset of about 3,000 participants and a British White subset of about 430,000 participants. In the Oxford subset, they examined four SNP sets with different characteristics to understand the impact of marker choice.
The SNP set with the largest effective number of markers, and therefore the smallest expected sampling variance, gave the most powerful inference for distant relatives. This confirmed that deepKin classifies relationships more reliably as the effective number of markers increases.
In the British White subset, deepKin identified 212,120 pairs of significant relatives and classified them into six degrees. The ratios of significant relatives between the 19 assessment centers were geographically correlated, and within-center analyses indicated both an increase in close relatedness and a potential increase in diversity from north to south across the UK. This added depth to the understanding of how population structure can influence genetic relationships.
Key Findings and Conclusions
The main difference between deepKin and earlier methods such as KING is that deepKin accounts for the sampling variance of its estimates, which earlier methods leave out, and uses it for statistical inference. How well relatedness can be inferred depends directly on how well this sampling variance is understood.
Moreover, the effective number of markers plays a critical role, allowing researchers to fine-tune their analyses for optimal results. In turn, this can influence how researchers assess relationships, particularly when considering allele frequencies in SNP sets.
Researchers suggest further studies to refine the assumptions made in models and encourage the removal of low-frequency variants to avoid misleading results.
Overall, deepKin offers a fresh approach to genetic relationship analysis that can be used in various fields, including genetics and forensic applications. It brings a new level of precision and rigor to understanding how individuals are related based on genetic data.
Original Source
Title: DeepKin: precise estimation of in-depth relatedness and its application in UK Biobank
Abstract: Accurately estimating relatedness between samples is crucial in genetics and epidemiological analysis. Using genome-wide single nucleotide polymorphisms (SNPs), it is now feasible to measure realized relatedness even in the absence of pedigree. However, the sampling variation in SNP-based measures and factors affecting method-of-moments relatedness estimators have not been fully explored, whilst static cut-off thresholds have traditionally been employed to classify relatedness levels for decades. Here, we introduce the deepKin framework as a moment-based relatedness estimation and inference method that incorporates data-specific cut-off threshold determination. It addresses the limitations of previous moment estimators by leveraging the sampling variance of the estimator to provide statistical inference and classification. Key principles in relatedness estimation and inference are provided, including inferring the critical value required to reject the hypothesis of unrelatedness, which we refer to as the deepest significant relatedness, determining the minimum effective number of markers, and understanding the impact on statistical power. Through simulations, we demonstrate that deepKin accurately infers both unrelated pairs and relatives with the support of sampling variance. We then apply deepKin to two subsets of the UK Biobank dataset. In the 3K Oxford subset, tested with four sets of SNPs, the SNP set with the largest effective number of markers and correspondingly the smallest expected sampling variance exhibits the most powerful inference for distant relatives. In the 430K British White subset, deepKin identifies 212,120 pairs of significant relatives and classifies them into six degrees. 
Additionally, cross-cohort significant relative ratios among 19 assessment centers located in different cities are geographically correlated, while within-cohort analyses indicate both an increase in close relatedness and a potential increase in diversity from north to south throughout the UK. Overall, deepKin presents a novel framework for accurate relatedness estimation and inference in biobank-scale datasets. For biobank-scale application we have implemented deepKin as an R package, available in the GitHub repository (https://github.com/qixininin/deepKin).
Authors: Guo-Bo Chen, Q.-X. Zhang, D. Jayasinghe, S. H. Lee, H. Xu
Last Update: 2024-05-01 00:00:00
Language: English
Source URL: https://www.biorxiv.org/content/10.1101/2024.04.30.591647
Source PDF: https://www.biorxiv.org/content/10.1101/2024.04.30.591647.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to bioRxiv for use of its open access interoperability.