Simple Science


Synthetic Data Revolutionizes COVID-19 Risk Prediction for Veterans

Synthetic data aids in predicting COVID-19 risks among Veterans while ensuring privacy.



Figure: AI-powered COVID-19 risk models. Innovative models predict health risks using synthetic data.

Recent developments in Big Data and Artificial Intelligence (AI) have allowed researchers to work with complex medical data, particularly Electronic Health Records (EHR). However, issues around patient privacy and ethical AI use have made it hard to share this data widely. To get around these restrictions, scientists have turned to synthetic data, which replicates some characteristics of real data without revealing personal details. This method not only allows researchers to share their findings more easily but also helps reduce bias in research.

The precisionFDA Platform

The FDA has developed a platform called precisionFDA to support advancements in personalized medicine and inform regulatory science. This platform is secure and cloud-based, offering on-demand computing and data storage. It also provides access to reference data and spaces for collaboration. Since it was launched in 2015, precisionFDA has attracted over 6,000 members, including manufacturers, healthcare providers, and researchers. The platform encourages public engagement through forums, expert blogs, and community challenges.

The COVID-19 Risk Factor Modeling Challenge

As the COVID-19 pandemic unfolded, evidence accumulated on risk factors for severe illness, such as age, obesity, and pre-existing health conditions. Predictive models built on EHR data can help identify patients at higher risk, allowing for earlier and more aggressive treatment. Veterans, in particular, face unique health challenges and may need models tailored specifically to them. However, using data about Veterans raises privacy concerns. To address these issues, the FDA and the Veterans Health Administration (VHA) launched the COVID-19 Risk Factor Modeling Challenge to explore how useful synthetic data could be.

The first phase of the challenge ran in June 2020 and encouraged participants to use machine learning to develop models predicting health outcomes related to COVID-19 illness in Veterans. Because the synthetic data contained no real identities, researchers could analyze health outcomes without the usual security concerns. The challenge focused on five main outcomes: COVID-19 status, survival status, ventilation needs, hospitalization duration, and ICU duration.

Methodology

For the challenge, synthetic health records were created for 147,451 fictional patients using a tool called Synthea. These records included a range of medical conditions, treatments, and patient demographics. Participants received 80% of this data for training their models, while 20% was kept aside for testing. Each model’s ability to predict the five health outcomes was evaluated using standard metrics.
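The 80/20 split described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `split_records`, the fixed seed, and the use of integer IDs in place of full patient records are all hypothetical, and the summary does not say how the organizers actually partitioned the data.

```python
import random

def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out test_fraction of them."""
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Hypothetical integer IDs standing in for the 147,451 synthetic patient records.
patient_ids = range(147_451)
train_ids, test_ids = split_records(patient_ids)
print(len(train_ids), len(test_ids))
```

With 147,451 records, a 20% holdout comes to 29,490 test records, leaving 117,961 for training.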

As a follow-up, a second phase of the challenge validated the top models from Phase 1 on two additional datasets: a second synthetic dataset generated by different software, and a real dataset of Veterans’ health records. Participants adapted their models to fit these new datasets and were evaluated on the same metrics used in Phase 1.

Results from Phase 1

In total, 21 teams submitted 34 model entries in Phase 1. Participants used a variety of machine learning techniques, with many employing ensemble methods such as Gradient Boosting Machines and Random Forests. Models predicting severe outcomes, such as survival status, performed better than those predicting less severe outcomes. For example, models predicting whether a patient would need a ventilator were more accurate than those predicting COVID-19 status.

Performance varied across models, but among the top entries, those using Gradient Boosting Machines generally achieved the best results. Severe outcomes were likely easier to predict than mild ones because they are associated with more distinctive features.

Results from Phase 2

Phase 2 of the challenge focused on validating the top-performing models from Phase 1. These models continued to outperform random chance in predicting health outcomes. They were evaluated on three datasets (the original Synthea data, the second synthetic dataset, and the real Veterans’ health records), with the Synthea data yielding the highest accuracy. On both synthetic and real health records, the models could reliably forecast health outcomes.
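One common way to check whether a binary classifier beats random chance is the area under the ROC curve (AUROC), where 0.5 corresponds to guessing. The summary does not say which metrics the challenge used, so AUROC is an assumption here, and the labels and scores below are toy values rather than challenge data.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = outcome occurred, 0 = it did not; scores are model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc, auc > 0.5)   # 0.75 True
```

The pairwise form is quadratic in the number of records, which is fine for a sketch but would be replaced by a rank-based computation on a dataset the size of the challenge's.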

Across all datasets, the models trained on synthetic data generally performed similarly to those trained on real data. Importantly, the top performers identified several risk factors associated with COVID-19. These factors included common health conditions linked to higher severity, such as respiratory or cardiovascular issues.

Identifying Risk Factors

Throughout the challenge, the models were also good at pinpointing risk factors that could predict health outcomes. Participants identified pre-existing conditions, medications, and demographic details as important factors. Despite some differences between datasets, each model highlighted at least one risk factor that was also recognized in real Veteran health records.
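The overlap check described above, comparing a model's highest-ranked factors with those recognized in real records, amounts to a set intersection. The sketch below assumes each model yields a ranked list of factor names. The helper `shared_risk_factors` and every factor name in it are placeholders for illustration, not the factors reported by the challenge.

```python
def shared_risk_factors(model_ranking, reference_factors, top_k=5):
    """Return the model's top_k factors that also appear in the reference set, in rank order."""
    reference = set(reference_factors)
    return [factor for factor in model_ranking[:top_k] if factor in reference]

# Placeholder names only; these are not results from the challenge.
synthetic_model_ranking = ["age", "hypertension", "obesity", "medication_x", "smoking"]
real_data_factors = {"age", "obesity", "chronic_kidney_disease"}

overlap = shared_risk_factors(synthetic_model_ranking, real_data_factors)
print(overlap)   # ['age', 'obesity']
```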

The ability of these models to identify crucial risk factors suggests that synthetic data can be a useful tool for understanding health risks, particularly during urgent health crises like a pandemic.

Limitations and Future Directions

While the results were promising, there were limitations in the study. One concern was that the models trained on synthetic data showed inflated performance metrics compared to those trained on real data. This could indicate that working with real data is inherently more challenging due to its complexity and variability.
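To make "inflated" concrete, one could subtract each outcome's score on real data from its score on synthetic data and average the differences. The sketch below does that. The function name is hypothetical and the numbers are arbitrary illustrative values, not figures from the study.

```python
from statistics import mean

def inflation_gaps(synthetic_metrics, real_metrics):
    """Per-outcome gap (synthetic minus real) and the mean gap over shared outcomes."""
    shared = sorted(set(synthetic_metrics) & set(real_metrics))
    gaps = {name: synthetic_metrics[name] - real_metrics[name] for name in shared}
    return gaps, mean(gaps.values())

# Arbitrary illustrative scores; NOT results reported by the challenge.
synthetic = {"survival": 0.90, "ventilation": 0.85}
real = {"survival": 0.80, "ventilation": 0.80}

gaps, mean_gap = inflation_gaps(synthetic, real)
print(gaps, mean_gap)
```

A consistently positive gap across outcomes would be the kind of systematic inflation the paragraph describes.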

Though the challenge provided valuable insights into synthetic data’s potential, more research is needed to directly compare synthetic data to real data in practical settings. Additionally, the variety of machine learning techniques used was limited, which may not provide a full picture of how different algorithms perform with this data.

Conclusion

The COVID-19 Risk Factor Modeling Challenge showcased how machine learning and synthetic data can work together to address public health issues. By creating a platform for researchers to develop and share models, the challenge offered insights into the risks associated with COVID-19 among Veterans.

Accessible data is essential, especially during a health crisis, and synthetic data can bridge the gap when privacy concerns are high. Moving forward, improving synthetic data generation methods and expanding the range of machine learning algorithms used will be important for further research in this field.

Overall, the challenge highlighted the potential benefits of using synthetic data in medical research, which can help inform better healthcare decisions and improve patient outcomes.

Original Source

Title: Synthetic Health Data Can Augment Community Research Efforts to Better Inform the Public During Emerging Pandemics

Abstract: The COVID-19 pandemic had disproportionate effects on the Veteran population due to the increased prevalence of medical and environmental risk factors. Synthetic electronic health record (EHR) data can help meet the acute need for Veteran population-specific predictive modeling efforts by avoiding the strict barriers to access, currently present within Veteran Health Administration (VHA) datasets. The U.S. Food and Drug Administration (FDA) and the VHA launched the precisionFDA COVID-19 Risk Factor Modeling Challenge to develop COVID-19 diagnostic and prognostic models; identify Veteran population-specific risk factors; and test the usefulness of synthetic data as a substitute for real data. The use of synthetic data boosted challenge participation by providing a dataset that was accessible to all competitors. Models trained on synthetic data showed similar but systematically inflated model performance metrics to those trained on real data. The important risk factors identified in the synthetic data largely overlapped with those identified from the real data, and both sets of risk factors were validated in the literature. Tradeoffs exist between synthetic data generation approaches based on whether a real EHR dataset is required as input. Synthetic data generated directly from real EHR input will more closely align with the characteristics of the relevant cohort. This work shows that synthetic EHR data will have practical value to the Veterans health research community for the foreseeable future.

Authors: Amanda Lienau, A. Prasanna, B. Jing, G. Plopper, K. Krasnov Miller, J. Sanjak, A. Feng, S. Prezek, E. Vidyaprakash, V. Thovarai, E. Maier, A. Bhattacharya, L. Naaman, H. Stephens, S. Watford, W. J. Boscardin, E. Johanson

Last Update: 2023-12-13 00:00:00

Language: English

Source URL: https://www.medrxiv.org/content/10.1101/2023.12.11.23298687

Source PDF: https://www.medrxiv.org/content/10.1101/2023.12.11.23298687.full.pdf

Licence: https://creativecommons.org/publicdomain/zero/1.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to medrxiv for use of its open access interoperability.
