New Dataset Advances Causal Relation Extraction in Biomedicine
CRED dataset enhances research on gene-disease causal relationships in biomedical literature.
Understanding how one thing can cause another is important in many fields. For example, in biomedicine, knowing how genes relate to diseases can help create better treatment plans. Instead of treating diseases based on associations, doctors can focus on genes that actually cause the diseases. This approach can lead to more effective treatments. In economics and social sciences, people often look for reasons behind past events to predict future occurrences. Similarly, in machine learning, recognizing the difference between causal features and mere correlations can improve the performance of models. In natural language processing (NLP), knowing causal relationships can lead to better results in tasks like summarizing text and answering questions.
However, manually finding these cause-effect relationships in huge volumes of text, such as the 35 million articles in PubMed, is impractical. That is why NLP methods are increasingly used to extract specific causal information from the literature automatically.
Causal Relation Extraction (CRE)
In NLP, extracting useful information from published articles is already well developed. Extracting causal relations, however, is still new. Some initial steps have been taken, but they are often limited to a single disease or to relations expressed within a single sentence.
One major challenge is finding cause-effect relationships that span multiple sentences and cover many diseases. This difficulty comes from the lack of sufficiently diverse datasets to train models on. Another obstacle is understanding how the models arrive at their conclusions: without that insight, it is hard to tell which words signal a causal relationship rather than a mere correlation.
Our Contributions
To help with these challenges, we created a new dataset called CRED. It contains information about disease-causing genes extracted from published biomedical abstracts. It is unique because it includes relationships stated within a single sentence as well as across multiple sentences, and it covers 500 genes and 284 diseases.
To see how useful CRED is, we trained several classifiers on it. A support vector machine (SVM) performed best, reaching an F1 score of 0.70 (F1 is the harmonic mean of precision and recall). We also checked whether the model was focusing on words that suggest causality rather than just on the names of the genes and diseases. It did attend to these causal words.
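For readers less familiar with the metric, the following minimal Python sketch shows how an F1 score is computed for the causal class. The function name and the toy labels are illustrative assumptions, not CRED data or the authors' evaluation code.

```python
def f1_for_class(y_true, y_pred, positive="causal"):
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels): 2 true positives, 1 false positive, 1 false negative.
y_true = ["causal", "causal", "causal", "non-causal", "non-causal"]
y_pred = ["causal", "causal", "non-causal", "causal", "non-causal"]
toy_f1 = f1_for_class(y_true, y_pred)
```

Because F1 combines precision and recall, a classifier cannot score well simply by labeling every pair as causal or every pair as non-causal.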
We then applied our model to real-world data, specifically PubMed abstracts on Parkinson's disease. The model identified known Parkinson's disease-causing genes from these abstracts. We also created a score that indicates how strongly the literature supports a causal link for a specific gene-disease pair.
Related Work on Causal Relation Extraction
This section looks at other datasets and methods related to causal relation extraction. Because there are only a few studies on this topic, much of the previous work has focused on other types of information extraction, especially in biomedicine.
Existing Datasets
There are a few datasets available for information extraction from biomedical literature. One popular dataset works with the relationship between chemicals and diseases, but it does not specifically focus on causal relationships. Other datasets, like GAD, also do not distinguish clearly between causal and non-causal relations. There are even datasets focusing on drug effects, but they don’t have a direct causal focus.
Given these existing datasets, we realized there was a need for a dataset specifically for causal relations in biomedicine. While some initial datasets have been developed, they often lack the ability to capture relationships across multiple sentences.
Creating Our Dataset: CRED
A main goal of our work was to build the CRED dataset of causal and non-causal gene-disease pairs. The resulting dataset contains 3553 pairs, including both types. To build it, we followed a systematic approach.
Selecting Abstracts
First, we gathered a list of gene-disease pairs from a known database. We then searched for abstracts mentioning those pairs in PubMed, focusing on the most relevant results. In total, we collected 267 abstracts covering a variety of genes and diseases.
Recognizing Gene-Disease Pairs
After selecting the abstracts, we used a tool to identify gene and disease names in the texts. This tool helped ensure that different representations of the same entity were correctly grouped.
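The grouping step can be pictured with a small Python sketch. The source does not name the tool or describe its internals, so the alias table and the function names below are purely hypothetical; a real entity normalizer would map mentions to database identifiers rather than use a hand-written dictionary.

```python
# Hypothetical alias table for illustration only (not the tool's actual dictionary).
ALIASES = {
    "lrrk2": "LRRK2",
    "park8": "LRRK2",
    "parkinson's disease": "Parkinson disease",
    "parkinson disease": "Parkinson disease",
    "pd": "Parkinson disease",
}


def normalize(mention):
    """Map a surface mention to a canonical name; unknown mentions are returned unchanged."""
    key = " ".join(mention.lower().split())  # lowercase and collapse whitespace
    return ALIASES.get(key, mention)


def group_mentions(mentions):
    """Group surface mentions that refer to the same canonical entity."""
    groups = {}
    for mention in mentions:
        groups.setdefault(normalize(mention), []).append(mention)
    return groups


grouped = group_mentions(["LRRK2", "PARK8", "Parkinson's disease", "PD"])
```

With grouping in place, a gene-disease pair is counted once per abstract even when the abstract refers to the gene or the disease under several names.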
Annotating Causality
Next, we carefully read each abstract to label the gene-disease pairs as either causal or non-causal. This step required clear guidelines to ensure consistency. If a causal relationship was not explicitly stated, we labeled the pair as non-causal.
Building the Dataset
In a second phase, intended to reduce class imbalance, we ran additional abstracts through our trained model to predict which gene-disease pairs might be causal. Our annotators then verified these predictions, adding more causal pairs to the dataset.
Evaluation and Results
We assessed our dataset's utility by training various classifiers on CRED and comparing their performance. We found that our best-performing model not only had a good balance of precision and recall but also did well on out-of-sample data. This testing showed the strength of our dataset in supporting further research in causal relation extraction.
Training the Classifier
To train the classifier, we used data augmentation techniques to improve its ability to distinguish between causal and non-causal relationships. We also cleaned and preprocessed the text so that the classifier would focus on the surrounding context rather than on specific gene and disease names.
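One common way to make a classifier focus on context rather than names is to replace entity mentions with generic placeholders before training. The sketch below illustrates that idea only. The placeholder tokens, function names, and example sentence are assumptions, and the source does not specify the exact preprocessing that was used.

```python
import re

# Placeholder tokens are an assumption chosen for illustration.
GENE_TOKEN = "@GENE$"
DISEASE_TOKEN = "@DISEASE$"


def mask_mentions(text, mentions, token):
    """Replace every whole-word occurrence of each mention with a placeholder token."""
    # Longest mentions first, so a longer name is not partially replaced by a shorter one.
    for mention in sorted(mentions, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
        # A function replacement avoids any special handling of the token string.
        text = re.sub(pattern, lambda _match: token, text, flags=re.IGNORECASE)
    return text


def mask_pair(text, gene_mentions, disease_mentions):
    """Mask the gene and disease mentions of one candidate pair."""
    text = mask_mentions(text, gene_mentions, GENE_TOKEN)
    return mask_mentions(text, disease_mentions, DISEASE_TOKEN)


sentence = "Mutations in LRRK2 cause Parkinson's disease."
masked = mask_pair(sentence, ["LRRK2"], ["Parkinson's disease"])
print(masked)  # Mutations in @GENE$ cause @DISEASE$.
```

After masking, the only cue left for the classifier in this sentence is the verb "cause", which is the kind of causal wording the model is meant to pick up on.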
Testing Performance
We used multiple methods to test the performance of our best model. The results showed that it performed well not only on the CRED dataset but also when tested on other datasets.
Real-world Applications
Our model has practical applications, especially in understanding connections between genes and diseases. For instance, when applied to all articles about Parkinson's disease, the model was able to identify many genes linked to the disease, including those not present in the training dataset.
Causality Scores
Additionally, the model can generate scores to determine how strongly a specific gene-disease pair is associated based on the number of times it is mentioned across different abstracts. This ability is crucial for establishing credibility in the findings.
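A count-based score of this kind can be sketched in a few lines of Python. The exact formula is not given here, so the scoring below (number and fraction of abstracts in which a pair is predicted causal) is an assumed stand-in. The function names, gene names, and predictions are hypothetical toy values, not model output.

```python
from collections import defaultdict


def causality_scores(predictions):
    """Aggregate per-abstract predictions into per-pair support counts.

    predictions: iterable of (abstract_id, gene, disease, is_causal) tuples.
    """
    mentioned = defaultdict(set)  # abstracts mentioning each pair
    causal = defaultdict(set)     # abstracts where the pair is predicted causal
    for abstract_id, gene, disease, is_causal in predictions:
        pair = (gene, disease)
        mentioned[pair].add(abstract_id)
        if is_causal:
            causal[pair].add(abstract_id)
    scores = {}
    for pair, abstracts in mentioned.items():
        n_causal = len(causal.get(pair, ()))
        scores[pair] = {
            "causal_abstracts": n_causal,
            "total_abstracts": len(abstracts),
            "causal_fraction": n_causal / len(abstracts),
        }
    return scores


def well_supported(scores, min_causal_abstracts):
    """Return pairs predicted causal in at least `min_causal_abstracts` abstracts."""
    return sorted(
        pair for pair, s in scores.items()
        if s["causal_abstracts"] >= min_causal_abstracts
    )


# Toy predictions with placeholder gene names.
toy_predictions = [
    ("a1", "GENE_A", "Parkinson's disease", True),
    ("a2", "GENE_A", "Parkinson's disease", True),
    ("a3", "GENE_A", "Parkinson's disease", False),
    ("a1", "GENE_B", "Parkinson's disease", False),
    ("a4", "GENE_B", "Parkinson's disease", True),
]
toy_scores = causality_scores(toy_predictions)
supported = well_supported(toy_scores, min_causal_abstracts=2)
```

Requiring support from several independent abstracts filters out pairs that rest on a single, possibly mistaken, prediction. The original abstract reports using a threshold of at least 50 abstracts for Parkinson's disease.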
Conclusion
This work established CRED as a dataset for extracting causal relationships between genes and diseases from scientific literature. With 3553 gene-disease pairs collected, we have taken a significant step towards a better understanding of these relationships in biomedical research. We demonstrated that our model could identify causal connections and quantify how strongly the literature supports them.
Through this new dataset, we hope to pave the way for future studies and improvements in how research investigates cause-effect relationships in biomedicine and beyond. The development of CRED shows the increasing importance of creating specialized datasets that address specific research needs, thus benefiting the scientific community.
Title: Beyond associations: A benchmark Causal Relation Extraction Dataset (CRED) of disease-causing genes, its comparative evaluation, interpretation and application
Abstract: Information on causal relationships is essential to many sciences (including biomedical science, where knowing if a gene-disease relation is causal vs. merely associative can lead to better treatments); and can foster research on causal side-information-based machine learning as well. Automatically extracting causal relations from large text corpora remains less explored though, despite much work on Relation Extraction (RE). The few existing CRE (Causal RE) studies are limited to extracting causality within a sentence or for a particular disease, mainly due to the lack of a diverse benchmark dataset. Here, we carefully curate a new CRE Dataset (CRED) of 3553 (causal and non-causal) gene-disease pairs, spanning 284 diseases and 500 genes, within or across sentences of 267 published abstracts. CRED is assembled in two phases to reduce class imbalance and its inter-annotator agreement is 89%. To assess CRED's utility in classifying causal vs. non-causal pairs, we compared multiple classifiers and found SVM to perform the best (F1 score 0.70). Both in terms of classifier performance and model interpretability (i.e., whether the model focuses importance/attention on words with causal connotations in abstracts), CRED outperformed a state-of-the-art RE dataset. To move from benchmarks to real-world settings, our CRED-trained classification model was applied on all PubMed abstracts on Parkinson's disease (PD). Genes predicted to be causal for PD by our model in at least 50 abstracts got validated in textbook sources. Besides these well-studied genes, our model revealed less-studied genes that could be explored further. Our systematically curated and evaluated CRED, and its associated classification model and CRED-wide gene-disease causality scores, thus offer concrete resources for advancing future research in CRE from biomedical literature.
Authors: Manikandan Narayanan, N. Bansal, S. D. R C, A. Pathak
Last Update: 2024-09-21 00:00:00
Language: English
Source URL: https://www.biorxiv.org/content/10.1101/2024.09.17.613424
Source PDF: https://www.biorxiv.org/content/10.1101/2024.09.17.613424.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to biorxiv for use of its open access interoperability.