The Role of LLMs in Medical Diagnosis
Examining the potential of AI in patient illness prediction.
Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew Churpek, Majid Afshar
― 6 min read
Diagnosing a patient’s illness is not as easy as asking “What hurts?” It’s a complicated process that weighs many factors to work out what might be wrong. Doctors consider a range of possible diseases based on how a patient looks and what they report. They start by gathering basic information, which lets them estimate the chances of certain illnesses before running any tests; this initial estimate is called the pre-test probability. As test results come in, doctors update those estimates.
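To make that updating concrete, here is a minimal sketch of the pre-test-to-post-test calculation using Bayes’ rule. All of the numbers (the 20% suspicion, the test’s sensitivity and specificity) are hypothetical and purely illustrative; they are not taken from the study.

```python
def post_test_probability(pre_test_prob: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease after a positive test result."""
    # Convert the pre-test probability to odds, apply the positive
    # likelihood ratio of the test, then convert back to a probability.
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    positive_likelihood_ratio = sensitivity / (1 - specificity)
    post_test_odds = pre_test_odds * positive_likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Hypothetical example: a 20% suspicion of pneumonia before testing,
# and a test with 90% sensitivity and 80% specificity.
print(post_test_probability(0.20, sensitivity=0.90, specificity=0.80))  # ~0.53
```

A positive result from this hypothetical test roughly doubles the probability; a doctor’s pre-test estimate is what anchors the whole calculation.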
The Role of Doctors
Typically, doctors use their medical knowledge, pattern recognition skills, and experience to make quick judgments about what’s wrong with a patient. But sometimes those mental shortcuts, known as heuristics, introduce systematic errors called cognitive biases, and diagnoses go wrong when doctors lean on them instead of reasoning the situation through.
Quick, intuitive thinking is valuable, but analytic thinking, which involves carefully weighing the evidence, takes more time and is often impractical in busy hospitals. Doctors are trained to estimate how likely a diagnosis is and to update that estimate with test results. Even so, those quick judgments can lead to misjudging how likely a certain illness is, and that misjudgment can harm patients.
Can Technology Help?
Lately, there has been a lot of talk about using Large Language Models (LLMs) to help doctors with their decision-making. These are advanced computer programs that can generate human-like responses and even propose possible diagnoses based on the information they receive. Some recent models, like GPT-4, have performed comparably to physicians when it comes to suggesting what might be wrong with patients.
However, there’s a catch. While these models can suggest things like "the patient might have pneumonia," they often don’t say how likely that diagnosis is. This matters because a 20% chance of pneumonia means something very different from a 90% chance. Although the latest LLMs have shown some promise, in some cases estimating illness probabilities better than individual doctors, their estimates are still unreliable overall.
The Challenge of Uncertainty
LLMs work differently from doctors. They don’t assign probabilities to diagnoses the way a doctor might; they assign probabilities to sequences of words (tokens). This raises an important question: how can we turn the word-level outputs of these models into meaningful diagnostic probabilities that doctors can use? If we don’t solve this problem, there’s a risk that doctors will misinterpret a model’s suggestions or trust them blindly without understanding the uncertainty involved.
To make LLMs better at expressing uncertainty, researchers have turned to information theory, which quantifies how uncertain a predicted outcome is. There are techniques for assessing a model’s uncertainty, such as examining how likely each token (roughly, each word) is to come next in a sentence. The difficulty is that the answers these models produce don’t always reflect their internal confidence, which can lead to misleading conclusions.
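As one illustration of this kind of token-level uncertainty measure, the sketch below computes the entropy of a model’s next-token distribution: the flatter the distribution, the higher the entropy and the less certain the model is. The checkpoint name and prompt are placeholders, not the paper’s setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Based on the vital signs above, the most likely diagnosis is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]            # scores for the next token
probs = torch.softmax(logits, dim=-1)                 # next-token distribution
entropy = -(probs * torch.log(probs + 1e-12)).sum()   # higher = more uncertain
print(f"Next-token entropy: {entropy.item():.2f} nats")
```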
This study examines how well LLMs can estimate the likelihood of diseases from real patient data. The researchers evaluated two LLMs, Mistral-7B and Llama3-70B, on how well they could predict the chances of serious conditions in patients.
The Study Setup
The researchers worked with a large set of patient records from a medical center. The data included vital signs, lab results, and nursing assessments, and the focus was on three major conditions: sepsis, arrhythmia, and congestive heart failure (CHF).
The team compared the LLMs against a traditional machine learning model, eXtreme Gradient Boosting (XGB), which has a strong track record in clinical prediction. The goal was to see how well the LLMs could estimate diagnosis probabilities when given structured health records.
How Did They Do It?
The researchers tested several methods for extracting probability estimates from the LLMs. They started by converting structured data, such as the numbers and measurements in the medical records, into a simple text format the models could read.
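Here is a hypothetical sketch of what that serialization step might look like. The field names, units, and wording are illustrative; the paper’s actual format may differ.

```python
# A toy structured record: a handful of vitals and lab values.
record = {
    "age": 67,
    "heart_rate": 112,          # beats per minute
    "temperature_c": 38.6,      # degrees Celsius
    "white_blood_cells": 14.2,  # x10^9 per litre
    "lactate": 3.1,             # mmol per litre
}

def record_to_text(record: dict) -> str:
    """Turn structured fields into a plain-text block an LLM can read."""
    lines = [f"- {name.replace('_', ' ')}: {value}" for name, value in record.items()]
    return "Patient data:\n" + "\n".join(lines)

print(record_to_text(record))
```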
The first method asked the LLM a closed question: does this patient have a given condition, yes or no? A standard function called softmax was then applied to the scores the model assigned to the “yes” and “no” tokens, converting them into probabilities.
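The sketch below shows the general idea, assuming a Hugging Face causal language model: read the logits the model assigns to the "Yes" and "No" tokens at the end of the prompt and softmax over just those two. The checkpoint, prompt, and single-token handling are simplified placeholders, not the paper’s exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Patient data: heart rate 112, temperature 38.6 C, lactate 3.1 mmol/L.\n"
          "Does this patient have sepsis? Answer Yes or No:")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

# Take the first token of each answer word (simplification: assumes one token).
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]

# Softmax over just the two answer tokens to get a yes/no probability.
yes_prob, no_prob = torch.softmax(logits[[yes_id, no_id]], dim=-1).tolist()
print(f"P(yes) = {yes_prob:.2f}, P(no) = {no_prob:.2f}")
```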
Another approach asked the LLM a more open-ended question: “How likely is it that this patient has this diagnosis?” The model then answered with a percentage estimate, a verbalized probability for the diagnosis.
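A hedged sketch of that verbalized-probability approach is below: ask the model directly for a percentage and parse it out of the generated text. The prompt wording and the regular expression are illustrative, not the paper’s exact setup.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Patient data: heart rate 112, temperature 38.6 C, lactate 3.1 mmol/L.\n"
          "How likely is it that this patient has sepsis? "
          "Answer with a single percentage between 0% and 100%:")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Keep only the newly generated tokens, then pull out the first percentage.
answer = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
match = re.search(r"(\d{1,3})\s*%", answer)
probability = int(match.group(1)) / 100 if match else None
print(answer, "->", probability)
```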
The researchers also extracted features from the LLM itself, specifically the embeddings from its last hidden layer, and fed those into the XGB classifier to see whether this could improve predictions.
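A minimal sketch of that third approach follows, under the assumption of a Hugging Face model and the xgboost library: mean-pool the last hidden layer into one feature vector per patient text, then train an XGBoost classifier on top. The checkpoint name, pooling choice, and the two-record toy dataset are all placeholders.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)      # base model, no LM head

def embed(text: str) -> np.ndarray:
    """Mean-pool the last hidden layer into a single feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy data: serialized patient texts and whether the diagnosis was present.
texts = [
    "heart rate 112, temperature 38.6 C, lactate 3.1 mmol/L",
    "heart rate 72, temperature 36.8 C, lactate 0.9 mmol/L",
]
labels = [1, 0]
features = np.stack([embed(t) for t in texts])

clf = XGBClassifier(n_estimators=100)
clf.fit(features, labels)
print(clf.predict_proba(features)[:, 1])  # estimated probability of the diagnosis
```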
What Were the Results?
The results revealed some interesting trends. When the LLM embeddings were combined with the XGB classifier, the models showed promise in predicting the chance of sepsis. The stand-alone methods, the yes/no queries and the percentage estimates, did not perform as well, especially for rarer conditions.
When the researchers compared the LLMs’ predictions with the baseline results from the XGB classifier, the methods relying purely on LLM outputs showed weaker correlations with that baseline, indicating less consistent estimates. The method that combined LLM embeddings with XGB generally performed better, but overall the LLMs struggled to provide reliable probability estimates, particularly for less common illnesses.
The Role of Patient Demographics
Interestingly, the demographic information of patients, such as sex or race, affected how well these models performed. The models often showed bias, which means their predictions could vary unfairly based on patient characteristics. This is a significant concern, as it underscores the need for LLMs to be trained on a diverse range of data.
Conclusion: What’s Next?
In short, the study showed that while LLMs like Mistral and Llama may assist in medical diagnosis, they are not yet reliable enough to estimate illness probabilities on their own, and doctors cannot safely depend on them for those estimates.
To improve these systems, future research could explore ways to combine LLMs with other methods that can handle numbers and risk better. Addressing biases in these models is vital to ensure they provide fair predictions. Until then, it seems doctors will need to continue using their knowledge and experience, along with any helpful tech, to make the best decisions for their patients.
So, while LLMs may not yet be the superhero sidekicks of the medical world, they may one day help doctors fight the battle against illness with better and more reliable information. But for now, it looks like human intuition and experience still hold the crown in the diagnosis realm.
Title: Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability
Abstract: Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examined three current methods of extracting LLM probability estimations and revealed their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.
Authors: Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew Churpek, Majid Afshar
Last Update: 2024-11-07 00:00:00
Language: English
Source URL: https://www.medrxiv.org/content/10.1101/2024.11.06.24316848
Source PDF: https://www.medrxiv.org/content/10.1101/2024.11.06.24316848.full.pdf
Licence: https://creativecommons.org/licenses/by-nc/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to medrxiv for use of its open access interoperability.