Using Technology to Improve Death Data Collection
This study investigates new ways to gather mortality information using online sources.
Mohammed Al-Garadi, Michele LeNoue-Newton, Michael E. Matheny, Melissa McPheeters, Jill M. Whitaker, Jessica A. Deere, Michael F. McLemore, Dax Westerman, Mirza S. Khan, José J. Hernández-Muñoz, Xi Wang, Aida Kuzucan, Rishi J. Desai, Ruth Reeves
― 8 min read
Table of Contents
- The Importance of Accurate Death Data
- Other Sources of Death Data
- Social Media: A New Hope?
- Our Study: Extracting Death Information Using Technology
- Collecting the Data
- Preparing to Train Our Tools
- Building the Technology
- Evaluating Our Results
- Findings: What Did We Learn?
- Addressing the Challenges
- Implications and Future Directions
- Conclusion
- Original Source
Mortality, or death, is an important aspect of healthcare research. Researchers study various reasons why people die and how often it happens to understand health trends and improve patient care. One of the most common ways to look at this is through all-cause mortality, which means considering all reasons for death. Knowing when and why someone dies is crucial for many types of health research, including clinical trials and safety monitoring of medical products.
The Importance of Accurate Death Data
Accurate details about deaths, like the time and cause, are essential for effective research. If researchers fail to capture this information, it can lead to big mistakes, like underestimating how many people die due to certain medical products. This can have serious consequences for public health.
Researchers have found that poor access to date of death and cause of death information is a major barrier to conducting thorough studies. For example, the US FDA has a system called the Sentinel Active Risk Identification and Analysis (ARIA) system to address regulatory questions. This system relies on accurate death data, and any gaps can lead to incomplete studies.
The best source of death data in the United States comes from vital statistics collected from death certificates filled out by coroners, medical examiners, or doctors. Once this information is compiled at the state level, it is sent to the Centers for Disease Control and Prevention (CDC) for further coding and analysis. However, there's a catch: it takes a long time, usually about nine months, before this information is released to the public.
Other Sources of Death Data
While death certificates are the “gold standard” for mortality data, there are other sources like claims databases and medical records. But there is a downside to these sources too. Claims databases might miss information about uninsured people, while medical records can vary widely between healthcare providers, making it hard to combine the data for analysis.
When it comes to claims databases, details about deaths are often incomplete or not recorded at all. Similarly, when we look at electronic health records, they often lack comprehensive death data, especially if patients weren’t under the care of that specific health system when they died. This missing information creates significant challenges for researchers who want to use these databases for studies on public health and care quality.
Social Media: A New Hope?
In recent years, social media has emerged as a potential new source of death-related information. People share news about deaths on platforms like Twitter, GoFundMe, and various memorial websites. While these may seem like unconventional sources, researchers are starting to tap into these online platforms to gather information that could be useful for healthcare research.
There’s growing interest in using social media for public health. User posts have been used to track illnesses, measure risky behaviors, find disease hotspots, and analyze medication usage. However, a big challenge remains: how to effectively extract date and cause of death information from all that noise. Although social media may offer quicker and broader coverage of mortality information, turning it into usable data comes with its own set of hurdles.
Our Study: Extracting Death Information Using Technology
In this study, we aimed to develop tools that could capture both the fact of a death and the cause of death from publicly available online sources. These tools would allow us to see whether social media and obituary data contain enough useful information to improve understanding of mortality trends. By combining this information with traditional sources, we hoped to enhance the quality of the data used in healthcare research.
We relied on a technology called Natural Language Processing (NLP) to sift through all the information on social media and other online platforms. NLP allows computers to understand and interpret human language, making it easier to extract relevant data.
Collecting the Data
We collected information from various online sources, including Twitter, GoFundMe, and multiple obituary websites from 2015 to 2022. We looked for posts containing keywords like “death” and “deceased.” To put it simply, we went hunting for anything that could help us gather details about mortality.
To gather data from Twitter, we used around 50 keywords, which led us to about 40 million tweets! Then, we applied the same strategy to GoFundMe and memorial websites, but with slightly different methods. For the obituary sources, we collected structured details like names, dates of birth, and dates of death.
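The keyword-based collection step can be pictured with a small sketch. The keyword list below is an assumption for illustration only; the study used roughly 50 terms, which are not enumerated in this summary.

```python
import re

# Hypothetical subset of death-related keywords (the study used ~50 terms).
KEYWORDS = ["death", "deceased", "passed away", "in loving memory"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def matches_death_keywords(post: str) -> bool:
    """Return True if the post mentions any death-related keyword."""
    return PATTERN.search(post) is not None

posts = [
    "Sadly, my grandfather passed away last night.",
    "Great concert tonight!",
]
filtered = [p for p in posts if matches_death_keywords(p)]
```

In practice a filter like this is only the first pass; it keeps anything that mentions a keyword, so later NLP steps must separate genuine death announcements from song lyrics, jokes, and other noise.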
Once we had all this information, we used NLP techniques to fill in any gaps or correct any mistakes. The idea was to maximize the data we could extract and build a comprehensive dataset on mortality.
Preparing to Train Our Tools
To train our NLP tools, we needed a gold-standard reference dataset. To achieve this, we created an annotated set of records using names, dates, and causes of death. We instructed the annotators on how to classify the data accurately. They categorized names, dates, and causes, ensuring that every detail was accounted for.
A total of 4,200 records were sampled for training, testing, and validation, with each record being scrutinized to ensure high quality. We even calculated agreement rates between the annotators to ensure everyone was on the same page.
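Agreement between annotators is typically measured with a chance-corrected statistic such as Cohen's kappa. The summary does not state which statistic the study used, so the sketch below, with made-up labels, is only one common way to do it.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators over six text spans.
annotator_1 = ["name", "date", "cause", "name", "date", "other"]
annotator_2 = ["name", "date", "cause", "name", "cause", "other"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1.0 indicates the annotation guidelines are being applied consistently; low values usually mean the guidelines need revision before training data is finalized.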
Building the Technology
We worked on two NLP tools in parallel to extract the necessary information from the online sources. We utilized deep learning methods, specifically transformer-based models, to handle the complex task of identifying names, dates, and causes of death.
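The study's extraction component is a fine-tuned transformer (RoBERTa) token classifier, which cannot be reproduced in a few lines. As a stand-in, the rule-based sketch below illustrates the shape of that component's job: take raw post text in, return labeled spans (here, just date-like spans) out.

```python
import re

# Simplified illustration only: the actual study fine-tuned a RoBERTa
# token classifier for this. A regex stands in for the model here.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

def extract_date_spans(text: str):
    """Return date-like spans that a token classifier would label as dates."""
    return DATE_RE.findall(text)

post = "John Smith passed away on March 3, 2021 after a long illness."
dates = extract_date_spans(post)
```

A learned model earns its keep where rules like this fail: misspellings, relative dates ("last Tuesday"), and names or causes that have no fixed surface pattern.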
For identifying causes of death, we relied on a technique called few-shot learning, which is a way to train models using only a small number of examples. This technique used a large language model that can understand context and produce accurate results.
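A few-shot setup works by showing the language model a handful of worked examples inside the prompt itself, then asking it to complete the pattern for a new input. The example texts, labels, and prompt wording below are all assumptions for illustration; the study's actual prompts are not given in this summary.

```python
# Hypothetical few-shot examples: (source text, primary cause of death).
FEW_SHOT_EXAMPLES = [
    ("She lost her battle with breast cancer on May 2, 2020.", "breast cancer"),
    ("He died peacefully after a long struggle with heart failure.", "heart failure"),
]

def build_prompt(text: str) -> str:
    """Assemble a few-shot prompt ending where the model should answer."""
    lines = ["Extract the primary cause of death from the text.", ""]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {example}")
        lines.append(f"Primary cause: {label}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Primary cause:")
    return "\n".join(lines)

prompt = build_prompt("He passed away from complications of COVID-19.")
```

The completed prompt would then be sent to an LLM, whose next tokens are taken as the predicted cause; no task-specific fine-tuning is needed, which is what makes the approach attractive when labeled data is scarce.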
Evaluating Our Results
After we developed our tools, we tested them on a new set of data to evaluate their performance. We wanted to see how well our NLP models could identify the causes of death compared to human annotators. Trained nurses reviewed the results to ensure everything met the established guidelines.
Our evaluations involved comparing the results from our models and the human annotators. This allowed us to measure how accurately both were able to identify the primary cause of death and all other relevant information.
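The abstract reports precision, recall, and a micro-averaged F1-score, which pool true positives, false positives, and false negatives across all entity types before computing the ratios. The counts below are invented purely to show the arithmetic; they are not the study's actual confusion counts.

```python
def micro_prf(tp: int, fp: int, fn: int):
    """Micro-averaged precision, recall, and F1 from pooled counts."""
    precision = tp / (tp + fp)          # of predicted spans, how many were right
    recall = tp / (tp + fn)             # of true spans, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts pooled over names, dates, and causes of death.
p, r, f1 = micro_prf(tp=880, fp=120, fn=120)
```

Micro-averaging weights every extracted span equally, so frequent entity types (like dates) influence the score more than rare ones; a macro average, by contrast, would weight each entity type equally.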
Findings: What Did We Learn?
After crunching the data, we found that our NLP tools performed quite well! One model, called RoBERTa, stood out, achieving a micro-averaged F1-score of 0.88 in extracting names, dates, and causes of death.
Interestingly, when we compared causes of death from our models with those identified by human annotators, we found that our automated system performed admirably. In some cases, the model was even better at identifying the primary cause of death than humans!
However, we did notice that our model struggled a bit with identifying additional causes of death, particularly in sources that listed multiple conditions.
Addressing the Challenges
As great as our results were, we did encounter a few hurdles along the way. One of the biggest challenges was that social media data doesn’t always represent the entire population fairly. Some segments of society might not be as active online, which could lead to gaps in the data.
Additionally, while our systems gathered a lot of accurate information, some of the details could be unclear. Extracting the cause of death from text can be tricky, especially when there are multiple conditions mentioned. While our methods performed well, there’s still room for improvement.
Implications and Future Directions
Automated extraction of mortality information from online sources holds great potential for healthcare research. Traditional mortality databases often suffer from delays in reporting and can be incomplete. By utilizing social media and online obituary data, we could quickly gather critical details about deaths that could contribute to better healthcare research.
Moreover, if we can validate these online sources, they could provide timely insights into emerging trends in mortality. This would be particularly valuable in tracking health crises, like pandemics or environmental disasters.
To effectively incorporate this online data into existing public health systems, collaboration between researchers and public health agencies will be essential. Developing protocols for integrating this data into surveillance systems could enhance public health responses.
Conclusion
In conclusion, our study demonstrated the promising potential of using advanced NLP techniques to extract critical mortality information from various online sources. By tapping into social media and obituaries, we can fill gaps in traditional mortality databases and provide more timely and comprehensive insights into health trends.
However, as we move forward, it’s important to acknowledge the limitations of online data and the need for further refinement of our tools. By continuing to improve our methods and validating our findings, we can ensure that this new approach to mortality data in healthcare research becomes even more valuable.
So, while we ponder the mysteries of life and death, we might just find that our laptops can help shed some light on the subject!
Title: Automated Extraction of Mortality Information from Publicly Available Sources Using Language Models
Abstract:

Background: Mortality is a critical variable in healthcare research, but inconsistencies in the availability of death date and cause of death (CoD) information limit the ability to monitor medical product safety and effectiveness.

Objective: To develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for the extraction of mortality information from publicly available online data sources, including social media platforms, crowdfunding websites, and online obituaries.

Methods: Data were collected from public posts on X (formerly Twitter), GoFundMe campaigns, memorial websites (EverLoved.com and TributeArchive.com), and online obituaries from 2015 to 2022. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then employed a few-shot learning (FSL) approach with LLMs to identify primary and secondary causes of death. Model performance was assessed using precision, recall, F1-score, and accuracy metrics, with human-annotated labels serving as the reference standard for the transformer-based model and a human adjudicator blinded to labeling source for the FSL model reference standard.

Results: The best-performing model obtained a micro-averaged F1-score of 0.88 (95% CI, 0.86-0.90) in extracting mortality information. The FSL-LLM approach demonstrated high accuracy in identifying primary CoD across various online sources. For GoFundMe, the FSL-LLM achieved 95.9% accuracy for primary cause identification, compared to 97.9% for human annotators. In obituaries, FSL-LLM accuracy was 96.5% for primary causes, while human accuracy was 99.0%. For memorial websites, FSL-LLM achieved 98.0% accuracy for primary causes, with human accuracy at 99.5%.

Conclusions: These findings highlight the potential of leveraging advanced NLP techniques and publicly available data to enhance the timeliness, comprehensiveness, and granularity of mortality surveillance.

Funding statement: This project was supported by Task Order 75F40123F19010 under Master Agreement 75F40119D10037 from the US Food and Drug Administration (FDA). FDA coauthors reviewed the study protocol, statistical analysis plan, and the manuscript for scientific accuracy and clarity of presentation. Representatives of the FDA reviewed a draft of the manuscript for the presence of confidential information and accuracy regarding the statement of any FDA policy. The views expressed are those of the authors and not necessarily those of the US FDA.
Authors: Mohammed Al-Garadi, Michele LeNoue-Newton, Michael E. Matheny, Melissa McPheeters, Jill M. Whitaker, Jessica A. Deere, Michael F. McLemore, Dax Westerman, Mirza S. Khan, José J. Hernández-Muñoz, Xi Wang, Aida Kuzucan, Rishi J. Desai, Ruth Reeves
Last Update: 2024-11-01 00:00:00
Language: English
Source URL: https://www.medrxiv.org/content/10.1101/2024.10.28.24316027
Source PDF: https://www.medrxiv.org/content/10.1101/2024.10.28.24316027.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to medrxiv for use of its open access interoperability.