Simple Science

Cutting edge science explained simply

Computer Science / Computation and Language

Improving Radiology Diagnoses with AI Language Models

AI models can enhance how radiologists access patient information for better diagnoses.

― 5 min read


AI in Radiology: A Game Changer. AI language models improve radiology diagnostics but face accuracy challenges.

Electronic Health Records (EHRs) hold a wealth of health information that can help doctors, especially radiologists, make better diagnoses. However, much of this information is unstructured, buried in lengthy free-text notes that are hard to sift through quickly. This makes it challenging for radiologists to gather the relevant patient history or evidence that could support a diagnosis.

The Challenge of Manual Review

Radiologists work under tight time constraints, and the sheer volume of notes attached to an individual patient makes manual review difficult. Vital information that exists within EHRs is often missed because reading through countless notes to find applicable evidence is inefficient and time-consuming. As a result, radiologists may not have a complete picture of a patient's medical history when interpreting imaging results.

The Role of Large Language Models

Recent advancements in technology, particularly in the field of artificial intelligence, have led to the development of Large Language Models (LLMs). These models can analyze unstructured data and potentially provide a solution to the challenges faced by radiologists in retrieving pertinent information from EHRs. They can summarize relevant evidence based on specific queries made by clinicians, allowing for a more efficient diagnostic process.

How Large Language Models Work

In our approach, we proposed using an LLM named Flan-T5 XXL. This model can evaluate whether a patient is at risk for or already has a specific condition based solely on text from clinical notes. If the answer is affirmative, the model then summarizes the evidence supporting this assessment. This might involve a simple question: "Is the patient at risk of [Condition]?" followed by summarizing why the model thinks so.
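To make this two-step prompting concrete, here is a minimal sketch using the Hugging Face transformers library. The prompt wording and the helper functions (`ask`, `screen_and_summarize`) are illustrative assumptions, not the authors' exact templates or code.

```python
# Minimal sketch of the two-step zero-shot query described above:
# (1) ask whether the notes suggest risk of a condition,
# (2) if yes, ask the model to summarize the supporting evidence.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-xxl"  # the model named in the article
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def ask(prompt: str, max_new_tokens: int = 128) -> str:
    """Run a single zero-shot query against Flan-T5 and decode the answer."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def screen_and_summarize(note_text: str, condition: str) -> str | None:
    """Return an evidence summary if the model answers 'yes', otherwise None."""
    answer = ask(f"{note_text}\n\nIs the patient at risk of {condition}? Answer yes or no.")
    if not answer.strip().lower().startswith("yes"):
        return None
    return ask(f"{note_text}\n\nSummarize the evidence that the patient is at risk of {condition}.")
```

In practice, a patient's full set of notes can exceed the model's context window, so some chunking or truncation strategy would be needed; that detail is omitted from this sketch.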

The Evaluation Process

To test this method, we enlisted radiologists to perform manual evaluations of the model's outputs. This was done to determine whether the LLM could provide information that is both accurate and useful compared to traditional retrieval methods. The results showed that the LLM-based approach generally performed better than standard retrieval methods, providing outputs that clinicians preferred more often.

Hallucinations in Outputs

Despite the promising results, a significant challenge emerged: the model sometimes generated fictitious or "hallucinated" evidence, producing plausible-sounding claims with no actual support in the patient records. Such claims could mislead clinicians, who must then verify the model's outputs against the actual notes. These errors can undo the gains in efficiency and safety that the model might otherwise offer.

Identifying Hallucinations

We investigated ways to determine when the model was hallucinating evidence. One approach involved assessing the model's confidence in its outputs. When the model was less certain about a response, it was more likely to hallucinate. By employing these confidence scores, clinicians might be able to filter out unreliable outputs, choosing not to act on uncertain information.
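One way to operationalize this, reusing the model and tokenizer from the earlier sketch, is to score each generation by the average log-probability of its tokens and flag low-scoring outputs for manual review. This is only an illustration: the article does not specify the exact confidence measure, and the threshold below is a placeholder.

```python
def generation_confidence(prompt: str, max_new_tokens: int = 128) -> tuple[str, float]:
    """Generate an answer and return it with an average token log-probability."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Per-token log-probabilities of the tokens the model actually chose.
    token_logprobs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return text, token_logprobs[0].mean().item()

summary, confidence = generation_confidence(
    "Note text...\n\nSummarize the evidence that the patient is at risk of hemorrhage."
)
if confidence < -1.0:  # placeholder threshold; would need tuning on held-out data
    summary = None     # discard, or route to manual review instead of showing it
```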

The Need for Contextual Evidence

In order to effectively assist with diagnosis, the model must retrieve two types of evidence from EHRs:

  1. Risk Evidence: This indicates whether a patient could develop a condition in the future.
  2. Current Evidence: This shows whether a patient is currently experiencing a condition.

For example, if a patient recently had a surgery and is on blood thinners, they may be at risk for a hemorrhage. Conversely, if imaging shows signs of bleeding, this would suggest they currently have a hemorrhage.
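As a small illustration of how these two query types might be phrased, the sketch below builds one question for each evidence type; the wording is an assumption, not the paper's actual templates.

```python
# Illustrative prompt templates for the two evidence types.
QUERY_TEMPLATES = {
    "risk":    "Is the patient at risk of {condition}? Answer yes or no.",
    "current": "Does the patient currently have {condition}? Answer yes or no.",
}

def build_queries(condition: str) -> dict[str, str]:
    """Build the risk-evidence and current-evidence questions for one condition."""
    return {kind: template.format(condition=condition)
            for kind, template in QUERY_TEMPLATES.items()}

# build_queries("hemorrhage") pairs a question about future risk (e.g., recent
# surgery plus blood thinners) with one about current findings (e.g., imaging
# that already shows bleeding).
```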

Privacy Considerations

When conducting this research, it was essential to consider patient privacy. We used models that could be operated in-house to comply with regulations, avoiding cloud-based systems that might expose sensitive data.

Evaluation of the Approach

The evaluation process involved collaboration with radiologists, who reviewed the outputs from both the LLM and traditional retrieval methods on a selection of patient notes. They were tasked with assessing if the evidence surfaced was accurate and relevant to specific diagnoses. Overall, the LLM outputs were found to be more useful and informative compared to what traditional methods provided.

Inter-radiologist Agreement and Time Cost

To measure consistency, different radiologists evaluated the same outputs so we could assess how closely their judgments aligned. Agreement was imperfect, reflecting differing views of what counts as useful evidence. Verification time also mattered: LLM outputs took longer to evaluate because each summary had to be checked carefully against the underlying notes.
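The article does not name the agreement statistic used; Cohen's kappa is a common choice for quantifying agreement between two raters, shown here on made-up ratings purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary judgments (1 = evidence judged useful, 0 = not useful)
# from two radiologists reviewing the same six outputs.
radiologist_a = [1, 0, 1, 1, 0, 1]
radiologist_b = [1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(radiologist_a, radiologist_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```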

Evidence Evaluation Metrics

To better understand how effective the model was, we categorized outputs based on their perceived usefulness. Radiologists rated evidence on a scale to capture its relevance to the initial query. This rating provided insight into how LLM outputs compared to traditional retrieval methods in a practical clinical context.

Weakly Correlated Evidence

One of the challenges identified during evaluation was that the model sometimes surfaced evidence that, while plausible, had weak connections to the patient’s diagnosis. While the model might have retrieved something that made sense from a general standpoint, it did not necessarily apply to the individual patient, thus limiting its utility.

Future Directions for Research

The findings highlight an area for future exploration: improving how LLMs can better support clinicians without leading to the fabrication of irrelevant or inaccurate information. Enhancing the model’s ability to distinguish between likely and unlikely scenarios could provide a pathway to mitigate the hallucination issue.

Final Thoughts

Overall, using LLMs to extract evidence from EHRs shows potential to support radiologists in their diagnostic work, but concerns about output accuracy and relevance must be addressed. Continued research can yield improvements that streamline clinician workflows and contribute to better patient care. The intersection of advanced technology and healthcare holds promise, but careful implementation and evaluation will be needed to ensure these tools fulfill their intended purpose.

Original Source

Title: Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

Abstract: Unstructured data in Electronic Health Records (EHRs) often contains critical information -- complementary to imaging -- that could inform radiologists' diagnoses. But the large volume of notes often associated with patients together with time constraints renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs as a mechanism to efficiently retrieve and summarize unstructured evidence in patient EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method using an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHR, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs strongly correlates with faithful summaries, offering a practical means to limit confabulations.

Authors: Hiba Ahsan, Denis Jered McInerney, Jisoo Kim, Christopher Potter, Geoffrey Young, Silvio Amir, Byron C. Wallace

Last Update: 2024-06-10

Language: English

Source URL: https://arxiv.org/abs/2309.04550

Source PDF: https://arxiv.org/pdf/2309.04550

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
