Improving Research with EHRs and Biobanks

Combining genetic data and advanced methods addresses missing data in health research.

Bhramar Mukherjee, M. Salvatore, R. Kundu, J. Du, C. R. Friese, A. M. Mondul, D. A. Hanauer, H. Lu, C. L. Pearce

EHRs and Biobanks: A Research Revolution. New methods tackle missing data in health studies.

Electronic health records (EHRs) are digital versions of patients' medical histories. They contain a wealth of information about people’s health, treatments, and outcomes, which researchers increasingly use to study health trends and improve healthcare.

One exciting aspect of EHRs is their connection to biobanks, which are collections of biological samples and related health information. Some biobanks now link genetic data to EHRs, giving researchers a broader range of information. This combination can lead to insights into both public health and individual patient care.

The Challenge of Missing Data

While EHRs provide valuable data, they also present challenges. One significant issue is missing data. When health information is not recorded, analyses can reach biased conclusions. Data can be missing for various reasons: a patient missed a follow-up appointment, a test was never performed, or data entry was flawed.

Researchers often use complete case analysis, which includes only patients with all necessary data. However, this approach can bias results if the data are not missing completely at random. For instance, if healthier patients are more likely to have complete records, the results will be skewed toward that group.

Missing data can fall into three categories:

  1. Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any characteristic of the participants, observed or unobserved.
  2. Missing at Random (MAR): The probability that a value is missing depends on observed data but not on the missing value itself.
  3. Missing Not at Random (MNAR): The probability that a value is missing depends on the unobserved value itself, which makes this the hardest case to handle.
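The three categories can be illustrated with a small simulation. The following is a minimal sketch, not the study's simulation design: the BMI and glucose distributions, the missingness probabilities, and the function name `complete_case_mean` are all assumptions chosen for illustration. It compares the true mean of a simulated glucose variable with the mean among records where glucose happens to be observed.

```python
import math
import random
import statistics


def complete_case_mean(mechanism, n=20000, seed=0):
    """Return (true mean, complete-case mean) of a simulated glucose variable."""
    rng = random.Random(seed)
    all_glucose, observed_glucose = [], []
    for _ in range(n):
        bmi = rng.gauss(27, 5)                        # hypothetical BMI distribution
        glucose = 70 + 1.0 * bmi + rng.gauss(0, 10)   # hypothetical BMI-glucose relation
        all_glucose.append(glucose)

        if mechanism == "MCAR":      # missingness ignores every variable
            p_missing = 0.4
        elif mechanism == "MAR":     # missingness depends on observed BMI only
            p_missing = 1 / (1 + math.exp(-(bmi - 27) / 3))
        elif mechanism == "MNAR":    # missingness depends on the glucose value itself
            p_missing = 1 / (1 + math.exp(-(glucose - 97) / 6))
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")

        if rng.random() >= p_missing:
            observed_glucose.append(glucose)
    return statistics.fmean(all_glucose), statistics.fmean(observed_glucose)


if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR", "MNAR"):
        truth, observed = complete_case_mean(mechanism)
        print(f"{mechanism}: true mean {truth:.1f}, complete-case mean {observed:.1f}")
```

Under these assumptions, the complete-case mean stays close to the truth for MCAR and drifts downward for MAR and MNAR, because records with higher BMI or higher glucose are more likely to be missing. Under MAR the drift could be corrected using the observed BMI; under MNAR the observed data alone do not contain enough information to do so.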

Handling Missing Data

There are methods to address missing data, with multiple imputation being a popular solution. This technique fills in each missing value several times with plausible values, creating several complete datasets. Researchers then analyze each dataset and combine the results, so that the final estimate and its uncertainty reflect the fact that the imputed values are not known exactly.
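The "combine the results" step is commonly done with Rubin's rules, which average the per-dataset estimates and add the between-dataset variability to the variance. The sketch below shows only that pooling step; the function name `pool_rubin` and the input numbers are hypothetical, and the source does not state which software or pooling details the authors used.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = statistics.fmean(estimates)        # average of the m estimates
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # sample variance between imputations
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical slope estimates and variances from m = 3 imputed datasets.
    est, se = pool_rubin([1.0, 1.2, 1.1], [0.04, 0.05, 0.045])
    print(f"pooled estimate {est:.3f}, pooled SE {se:.3f}")
```

The between-imputation term is what distinguishes multiple imputation from filling in each value once: it widens the standard error to account for uncertainty about the missing values.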

The success of these methods depends on the type of missingness. If data are MCAR or MAR and the imputation model is reasonable, the analyses can still yield reliable results. If data are MNAR, however, these methods may fail to provide accurate conclusions.

Genetic Data as a Tool

Biobanks often include genetic information. This can be particularly useful in handling missing data. Researchers can create "Polygenic Risk Scores" (PRS), which summarize genetic information relevant to specific traits or diseases. These scores can help researchers understand the relationships between health data and genetic predispositions.

By including PRS as predictors in the imputation model, researchers may be able to fill in missing values more accurately. This could lead to better estimates of how factors like body mass index (BMI) relate to health outcomes, such as glucose levels in the blood.
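One way to see why a PRS can help is to compare how well missing BMI values are predicted with and without it. The sketch below is a deliberately simplified illustration, not the authors' imputation procedure: it uses simulated data, a single predictor, and conditional-mean predictions rather than full multiple imputation draws. The PRS effect size, the 30% missingness rate, and the helper names `fit_line` and `imputation_rmse` are assumptions.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def imputation_rmse(n=5000, seed=1):
    """Return (RMSE of mean-based fill-in, RMSE of PRS-based fill-in) for missing BMI."""
    rng = random.Random(seed)
    prs = [rng.gauss(0, 1) for _ in range(n)]               # standardized PRS (simulated)
    bmi = [27 + 1.6 * g + rng.gauss(0, 4.7) for g in prs]   # hypothetical PRS effect on BMI
    missing = [rng.random() < 0.3 for _ in range(n)]        # 30% of BMI values missing (MCAR)

    # Fit the prediction model on records where BMI is observed.
    obs_prs = [g for g, m in zip(prs, missing) if not m]
    obs_bmi = [b for b, m in zip(bmi, missing) if not m]
    intercept, slope = fit_line(obs_prs, obs_bmi)
    mean_bmi = statistics.fmean(obs_bmi)

    # Score both fill-in rules against the true (simulated) missing values.
    err_mean, err_prs = [], []
    for g, b, m in zip(prs, bmi, missing):
        if m:
            err_mean.append((b - mean_bmi) ** 2)
            err_prs.append((b - (intercept + slope * g)) ** 2)
    return math.sqrt(statistics.fmean(err_mean)), math.sqrt(statistics.fmean(err_prs))


if __name__ == "__main__":
    rmse_mean, rmse_prs = imputation_rmse()
    print(f"RMSE without PRS: {rmse_mean:.2f}, with PRS: {rmse_prs:.2f}")
```

Because the PRS is measured for everyone in a genotyped biobank, it remains available as a predictor even when the trait itself was never recorded. The gain shown here is modest, which matches the fact that a PRS typically explains only part of the variation in a trait.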

Selection Bias in Biobanks

Another concern with biobanks is selection bias. This occurs when the individuals included in the study do not accurately represent the general population. For example, if researchers only recruit patients who are undergoing surgery, they may miss important data from otherwise healthy individuals.

To address selection bias, researchers can use weighting methods. These methods adjust for the over- or under-representation of certain groups within the study. For instance, if a group is underrepresented in the sample, researchers can assign higher weights to their observations in the analysis to reflect their importance.
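The weighting idea can be sketched with inverse probability weights, where each sampled person is weighted by one over their probability of being selected. This is an illustrative simulation under assumed numbers (group sizes, glucose means, and selection probabilities are all hypothetical), and in real biobanks the selection probabilities are not known and have to be estimated.

```python
import random
import statistics


def weighted_vs_unweighted(population_size=100000, seed=2):
    """Return (population mean, unweighted sample mean, weighted sample mean)."""
    rng = random.Random(seed)
    population, sample = [], []
    for _ in range(population_size):
        surgical = rng.random() < 0.2                      # 20% are surgical patients
        glucose = rng.gauss(105 if surgical else 95, 10)   # hypothetical group means
        p_select = 0.5 if surgical else 0.05               # surgical patients over-sampled
        population.append(glucose)
        if rng.random() < p_select:
            sample.append((glucose, 1 / p_select))         # weight = 1 / selection probability

    true_mean = statistics.fmean(population)
    unweighted = statistics.fmean([g for g, _ in sample])
    weighted = sum(g * w for g, w in sample) / sum(w for _, w in sample)
    return true_mean, unweighted, weighted


if __name__ == "__main__":
    truth, unweighted, weighted = weighted_vs_unweighted()
    print(f"population {truth:.1f}, unweighted {unweighted:.1f}, weighted {weighted:.1f}")
```

In this setup the unweighted sample mean is pulled toward the over-sampled surgical group, while the weighted mean lands close to the population value.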

Research Objectives

In this research, we aim to investigate whether combining PRS-informed multiple imputation and sample weighting can reduce biases due to missing data in association studies. Our objectives include:

  1. Evaluating if PRS-informed multiple imputation meaningfully reduces bias in the analysis.
  2. Assessing the combined effect of PRS-informed imputation and sample weighting on estimates of associations between BMI and glucose.

Methods Overview

To conduct our study, we performed simulations to test different missing data scenarios. We generated populations with various characteristics, created datasets, and manipulated missing data to see how different methods performed under these conditions.

We looked at different sample sizes, including small and large populations, and examined how bias and coverage rates changed across various approaches. This involved analyzing data both with and without PRS and applying weights based on selection probabilities.
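Bias and coverage are usually summarized across many simulated datasets. The sketch below uses common textbook definitions, which may differ in detail from the ones used in the study (for example, whether percent bias is reported as signed or absolute). The function names and the example numbers are hypothetical.

```python
def percent_bias(estimates, truth):
    """Signed percent bias of the average estimate relative to the true value."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100 * (mean_estimate - truth) / truth


def coverage_rate(estimates, std_errors, truth, z=1.96):
    """Share of replicates whose 95% confidence interval contains the true value."""
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return covered / len(estimates)


if __name__ == "__main__":
    # Hypothetical estimates and standard errors from four simulation replicates.
    estimates = [1.0, 1.2, 0.9, 1.1]
    std_errors = [0.1, 0.05, 0.02, 0.1]
    print(percent_bias(estimates, truth=1.0))
    print(coverage_rate(estimates, std_errors, truth=1.0))
```

A well-behaved method has percent bias near zero and a coverage rate near the nominal 0.95; coverage well below that level signals that the confidence intervals are too narrow or centered in the wrong place.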

Case Study: Michigan Genomics Initiative

We applied our methods to real-world data from the Michigan Genomics Initiative (MGI), a biobank that collects health and genetic data from a large cohort of participants. We specifically focused on adults aged 40 and older without a diabetes diagnosis.

In our MGI analysis, we evaluated the relationship between BMI and glucose levels. We analyzed people who identified as non-Hispanic White and non-Hispanic Black separately to see if there were differences in results.

Findings from Simulations

Our simulations showed that PRS-informed multiple imputation generally produced lower bias, especially when data were missing at random (MAR). In both naive and weighted analyses, multiple imputation maintained better coverage rates and reduced bias in most scenarios. However, performance suffered when data were missing not at random (MNAR).

In cases where both exposure and outcome data were missing, all methods had difficulty maintaining validity. While PRS-imputed analyses performed slightly better, they still struggled to achieve ideal results in MNAR conditions.

Findings from the Case Study

When we analyzed the MGI data, we compared estimates of the association between BMI and glucose levels across methods. Complete case analysis and multiple imputation produced different estimates. Importantly, incorporating sample weights brought the estimates closer to values reported in a national health survey benchmark.

For non-Hispanic White participants, the unweighted complete case estimate was lower than expected, but applying weights improved the estimate substantially. For non-Hispanic Black participants, we found small differences between methods, suggesting that selection bias played a larger role than missing data.

Implications and Recommendations

Our findings highlight the need for researchers to consider both missing data and selection biases when analyzing EHR-linked biobank data. While PRS-informed multiple imputation can enhance accuracy, especially in MAR scenarios, it is not a panacea for MNAR conditions.

Researchers should continue to explore various patterns of missingness and consider additional strategies, such as sensitivity analyses, to better understand the effects of missing data. Moreover, biobanks should provide PRS and appropriate weights for better representation, allowing for more reliable results in future studies.

Conclusion

Addressing missing data and selection bias is critical for the reliability of research using EHR-linked biobanks. By combining advanced imputation methods with genetic information and appropriate sampling weights, researchers can improve the accuracy of their findings and contribute to better healthcare outcomes. Further exploration of these methods will be essential in enhancing the quality of health research and informing public health strategies.

Original Source

Title: Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Abstract: Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.

Authors: Bhramar Mukherjee, M. Salvatore, R. Kundu, J. Du, C. R. Friese, A. M. Mondul, D. A. Hanauer, H. Lu, C. L. Pearce

Last Update: 2024-10-29 00:00:00

Language: English

Source URL: https://www.medrxiv.org/content/10.1101/2024.10.28.24316286

Source PDF: https://www.medrxiv.org/content/10.1101/2024.10.28.24316286.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to medrxiv for use of its open access interoperability.
