Simple Science

Cutting edge science explained simply


Revolutionizing Prescription Data Extraction with PRESNER

PRESNER enhances analysis of prescription data using advanced NLP techniques.




Electronic health records (EHRs) are important for understanding health trends and treatment effects. They store a lot of information, including the prescriptions given to patients. By linking these records with biobank data, researchers can study how medications affect people and how different genes influence these effects. One such resource is the UK Biobank, which contains health information and biological samples from more than half a million volunteers.

What is the UK Biobank?

The UK Biobank collects detailed health records from individuals, including prescription data. This data gives researchers insight into how various medications are used and how they impact health. Since 2019, a large portion of this data has included information from the UK National Health Service (NHS), allowing researchers to access nearly 57 million prescription records.

Challenges with Prescription Data

Most prescription databases rely on specific coding systems to categorize drugs. Analyzing the data therefore often means extracting information manually, which is tedious and time-consuming. A better approach may be to pull information directly from the free text in the records, which gives more direct access to details such as drug names and dosages.
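As a rough illustration of what pulling details straight from prescription text can look like, the Python sketch below applies a regular expression to made-up entries. The entry format, the pattern, and the function name `parse_entry` are assumptions for illustration only; PRESNER itself relies on a trained NER model rather than hand-written patterns.

```python
import re

# Hypothetical pattern: a drug name, then a numeric strength with a unit,
# then an optional pharmaceutical form. Real entries vary far more than this.
ENTRY_PATTERN = re.compile(
    r"^(?P<drug>[A-Za-z][A-Za-z\- ]*?)\s+"
    r"(?P<strength>\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|%))"
    r"(?:\s+(?P<form>[A-Za-z ]+))?$",
    re.IGNORECASE,
)

def parse_entry(text):
    """Return a dict of extracted fields, or None if the pattern does not match."""
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        return None
    # Strip stray whitespace; fields that were absent stay as None.
    return {
        key: value.strip() if value else None
        for key, value in match.groupdict().items()
    }

print(parse_entry("Amoxicillin 500mg capsules"))
print(parse_entry("Hydrocortisone 1% cream"))
print(parse_entry("take one daily"))
```

Patterns like this break as soon as the wording changes, which is one reason a learned model can be more attractive for large, messy datasets.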

The Need for Accurate Data Extraction

In healthcare research, it is essential to correctly identify and categorize prescription drugs. This includes knowing the active ingredients, brand names, and whether the drug is for systemic use, like oral medications, or local use, like creams. Researchers also need to pay attention to details such as dosage and strength for their studies.

Advances in Data Extraction Technology

Natural Language Processing (NLP) is a technology that helps in extracting critical information from text. In the healthcare field, this technology has improved significantly, especially with the arrival of advanced models like BERT. These models help in identifying drug names and related information effectively.

Introducing PRESNER

PRESNER is a new tool designed to help researchers automatically extract and categorize prescription data from electronic health records. It uses advanced NLP techniques to identify drug names and other important information, such as route of administration, pharmaceutical form, strength, and dosage, and maps the drugs to the Anatomical Therapeutic Chemical (ATC) classification system.

How Does PRESNER Work?

PRESNER consists of various components that work together to analyze prescription data. It can recognize drug names and categorize them according to their potential effects on the body. This is important for researchers who need accurate data for their studies. The tool can also filter prescriptions based on different criteria, making it easier for users to find the information they need.

Building a Reliable Drug Dictionary

A significant feature of PRESNER is its built-in dictionary, which includes a comprehensive list of drug names and their respective classifications. This dictionary is updated regularly to ensure that researchers have access to the most recent information. It helps the pipeline match prescriptions to the right classifications, which is crucial for accurate data analysis.
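A dictionary lookup of this kind can be sketched in a few lines of Python. The tiny dictionary, the synonym table, and the function name `lookup_atc` are illustrative assumptions, not PRESNER's actual data structures; the three ATC codes are the standard codes for these substances.

```python
# Toy dictionary: ingredient name -> ATC code. A real dictionary would hold
# thousands of entries and be refreshed from a curated source.
ATC_BY_INGREDIENT = {
    "paracetamol": "N02BE01",
    "amoxicillin": "J01CA04",
    "simvastatin": "C10AA01",
}

# Brand names and synonyms pointing at the ingredient they contain.
SYNONYMS = {
    "calpol": "paracetamol",
    "acetaminophen": "paracetamol",
    "zocor": "simvastatin",
}

def lookup_atc(drug_name):
    """Normalize a drug name and return its ATC code, or None if unknown."""
    key = drug_name.strip().lower()
    key = SYNONYMS.get(key, key)  # resolve brand names to the ingredient
    return ATC_BY_INGREDIENT.get(key)

# "Newdrugozole" is a made-up name standing in for a drug missing from the dictionary.
for name in ["Paracetamol", "Zocor", "Newdrugozole"]:
    print(name, "->", lookup_atc(name))
```

Names that return `None` are the cases that would need manual review or a dictionary update.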

Data Sources Used

PRESNER uses prescription data from the UK Biobank, which is gathered from individuals receiving care through the NHS. This data provides a wealth of information about prescribed medications, including names, quantities, and usage dates. In addition, PRESNER utilizes another dataset known as the n2c2 corpus, which contains numerous annotated entities related to medications. This broadens the scope of the data available for training the model.

The NER Component

The core of PRESNER lies in its Named Entity Recognition (NER) capabilities. This function helps the system recognize and categorize drugs and their associated information from the text. NER is crucial as it allows for the automation of data extraction, resulting in faster and more reliable data processing.
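NER models commonly emit one label per token, and a small post-processing step groups those labels into entities. The sketch below assumes the widely used BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity); the tokens, tags, label names, and the function `decode_bio` are made up for illustration, and this summary does not say which scheme PRESNER uses.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; store the one that was open, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["Hydrocortisone", "1", "%", "cream", "apply", "twice", "daily"]
tags = ["B-DRUG", "B-STRENGTH", "I-STRENGTH", "B-FORM", "O", "B-DOSAGE", "I-DOSAGE"]
print(decode_bio(tokens, tags))
```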

Fine-Tuning the Model

To make PRESNER effective, its core model, the pre-trained transformer Bio-ClinicalBERT, was fine-tuned with both manually annotated UK Biobank prescription entries and the n2c2 corpus. Fine-tuning adjusts a general-purpose model so that it handles the specific wording and context found in prescription entries. Using both datasets exposes the model to a wider range of the language used in medical prescriptions.

Comparison with Other Methods

In testing, PRESNER outperformed a baseline that combined the state-of-the-art Med7 model with a dictionary-based approach built on the ChEMBL database, reaching a macro-average F1-score of 0.95 versus 0.71. While these prior methods were precise, they struggled to capture the full range of drug names and synonyms. PRESNER’s use of advanced machine learning techniques allowed it to overcome these challenges, successfully recognizing and categorizing more medications.
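The original paper reports this comparison as a macro-average F1-score, which is the plain average of the F1-scores computed separately for each entity type. The Python sketch below shows the arithmetic; the counts and the function names `f1_score` and `macro_f1` are made up for illustration and are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts_by_class):
    """Unweighted mean of the per-class F1-scores."""
    scores = [f1_score(*counts) for counts in counts_by_class.values()]
    return sum(scores) / len(scores)

# Made-up (tp, fp, fn) counts per entity type, for illustration only.
counts = {
    "DRUG": (90, 10, 10),
    "STRENGTH": (80, 20, 0),
    "DOSAGE": (50, 0, 50),
}
print(round(macro_f1(counts), 3))
```

Because every class counts equally, a model that does well on common entities but poorly on rare ones is penalized more than it would be under a simple overall accuracy.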

Classifying Drugs

After recognizing drug names, PRESNER can classify them based on their effects on the body. It distinguishes between systemic drugs, which enter the bloodstream, and non-systemic drugs, applied locally. By doing this, researchers can filter their data based on specific drug categories, aiding their studies.
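One simple way to picture this step is a rule keyed on the route of administration, followed by a filter. The route lists, the record layout, and the function name `classify_route` below are assumptions made for this sketch; the summary does not describe the rules PRESNER actually applies.

```python
# Hypothetical route lists; a real system would need many more cases.
SYSTEMIC_ROUTES = {"oral", "intravenous", "intramuscular", "subcutaneous"}
LOCAL_ROUTES = {"topical", "ophthalmic", "otic"}

def classify_route(route):
    """Label a route of administration as systemic, non-systemic, or unknown."""
    key = route.strip().lower()
    if key in SYSTEMIC_ROUTES:
        return "systemic"
    if key in LOCAL_ROUTES:
        return "non-systemic"
    return "unknown"

# Made-up prescription records for illustration.
prescriptions = [
    {"drug": "amoxicillin", "route": "oral"},
    {"drug": "hydrocortisone", "route": "topical"},
    {"drug": "chloramphenicol", "route": "ophthalmic"},
]

# Keep only the prescriptions classified as systemic.
systemic_only = [p for p in prescriptions if classify_route(p["route"]) == "systemic"]
print([p["drug"] for p in systemic_only])
```

The "unknown" label matters in practice, since ambiguous entries are the ones worth reviewing by hand.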

Results and Performance

PRESNER successfully processed a significant portion of the UK Biobank's prescription entries. The tool matched many of these entries to the appropriate drug classifications, giving researchers valuable insight into medication usage. Its performance was especially strong for important categories like drug strength and dosage, which are essential for accurate health analyses.

Limitations of PRESNER

Despite its strengths, PRESNER has some limitations. Not all drug names may be recognized or included in the dictionary, particularly newer medications or those with multiple brand names. There is also the challenge of ensuring the model consistently identifies drugs that could serve multiple purposes. Users are encouraged to manually review the output, particularly for drugs that were difficult to classify.

Future Directions

As the UK Biobank continues to expand and include more data, tools like PRESNER will be invaluable for processing this information swiftly. There is also potential for similar tools to be used with other databases, which could help streamline data extraction in various healthcare settings.

Conclusion

Access to prescription data linked with biobank information can pave the way for significant research in pharmacogenomics and other health studies. However, processing this data effectively is vital for yielding accurate results. Tools like PRESNER demonstrate how advanced technology can facilitate this process, making it easier for researchers to access structured information and insights from large datasets. Future improvements may focus on enhancing the recognition of drug names and expanding the dictionaries to include more comprehensive lists of medications.

Original Source

Title: Automated Extraction and Classification of Drug Prescriptions in Electronic Health Records: Introducing the PRESNER Pipeline

Abstract: Electronic health record (EHR) systems with prescription data offer vast potential in pharmacoepidemiology and pharmacogenomics. The large amount of clinical data recorded in these systems requires automatic processing to extract relevant information. This paper introduces PRESNER, a name entity recognition (NER) and classification pipeline for EHR prescription data. The pipeline uses the pre-trained transformer Bio-ClinicalBERT fine-tuned on UK Biobank prescription entries manually annotated with medication-related information (drug name, route of administration, pharmaceutical form, strength, and dosage) as the core NER system. Moreover, PRESNER also maps drugs to the Anatomical Therapeutic and Chemical (ATC) classification system and distinguishes between systemic and non-systemic drug products. It outperformed a baseline model combining the state-of-the-art Med7 and a dictionary-based approach from the ChEMBL database with a macro-average F1-score of 0.95 vs 0.71. In addition to UK Biobank prescription data, PRESNER can also be applied to other English prescription datasets, making it a versatile tool for researchers in the field.

Authors: Maria Herrero-Zazo, C. Colon-Ruiz, T. W. Fitzgerald, I. Segura-Bedmar, E. Birney

Last Update: 2023-10-05 00:00:00

Language: English

Source URL: https://www.medrxiv.org/content/10.1101/2023.10.04.23296481

Source PDF: https://www.medrxiv.org/content/10.1101/2023.10.04.23296481.full.pdf

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to medrxiv for use of its open access interoperability.
