Using Language Models to Summarize PET Reports
Study reveals language models can generate useful PET report impressions.
Radiologists create reports to explain the results of medical imaging tests. These reports are crucial for sharing important information about a patient's condition with other doctors and the healthcare team. Among imaging tests, whole-body PET scans produce some of the longest and most complex reports. In a PET report, the findings section lists the many observations from the scan, while the impression section summarizes the most important points. Because other doctors rely heavily on the impression for treatment decisions, it must be both accurate and complete. Writing these impressions, however, takes considerable time and is prone to error. Large language models (LLMs) offer a new way to speed up this process by automatically drafting impressions from the findings.
Background
While LLMs have been used to summarize findings from various imaging tests, they have not been widely applied to whole-body PET reports. PET reports are significantly longer than those for other tests, with findings sections often running 250 to 500 words. This length increases the chance of missing key information during impression generation. Moreover, individual physicians have distinct reporting styles that must be taken into account to produce personalized results. Adapting LLMs to summarize PET reports therefore poses specific challenges.
Evaluating how well LLMs produce these impressions is also tricky, since there can be many valid ways to summarize the same information. Expert review is considered the best way to assess quality, but it is not practical for physicians to review the output of every model. To tackle this, recent studies have developed evaluation metrics that measure how well models summarize medical documents. However, it has not been established how effective these metrics are for PET impressions or how closely they align with physicians' judgments.
The Study
The aim of this study was to determine whether LLMs trained on a large number of PET reports could accurately summarize findings and generate impressions suitable for practical use. The researchers trained 12 different language models on a dataset of PET reports and assessed their performance with a range of evaluation metrics. The best-performing model was then tested for its ability to produce clinically useful impressions.
Dataset Collection
A total of 37,370 PET reports collected from one hospital between 2010 and 2022 were used in the study. These reports were anonymized to protect patient information and then divided into training, validation, and test sets. An additional 100 reports from a different source were collected for external testing.
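To make the data-handling step concrete, the sketch below shows one way such a corpus could be partitioned into training, validation, and test sets. The file name, column layout, and split proportions are illustrative assumptions, not the study's actual setup.

```python
# A minimal sketch of splitting an anonymized report corpus.
# "pet_reports_anonymized.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

reports = pd.read_csv("pet_reports_anonymized.csv")  # hypothetical file

# Hold out a test set first, then carve a validation set from the remainder.
train_val, test = train_test_split(reports, test_size=0.1, random_state=42)
train, val = train_test_split(train_val, test_size=0.1, random_state=42)

print(len(train), len(val), len(test))
```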
Report Preprocessing
Two types of language models were tested: encoder-decoder models and decoder-only models. For the encoder-decoder models, the input was formatted so that its first lines carried details about the scan and the identity of the reading physician, followed by the findings. The decoder-only models instead received an instruction asking the model to generate the impression from the given report. The original clinical impressions served as the reference text for training and evaluation.
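The sketch below illustrates this kind of input formatting: a physician-identity token and scan details prepended for the encoder-decoder models, and an instruction-style prompt for the decoder-only models. The token format and instruction wording are assumptions for demonstration; the study's exact templates may differ.

```python
# Illustrative formatting of a single report for the two model families.
# The "[PHYSICIAN_n]" token and the instruction text are assumed examples.

def format_encoder_decoder(findings: str, scan_info: str, physician_id: int) -> str:
    """Prepend scan details and a physician-identity token to the findings."""
    return f"[PHYSICIAN_{physician_id}] {scan_info}\n{findings}"

def format_decoder_only(findings: str, physician_id: int) -> str:
    """Wrap the findings in an instruction-style prompt."""
    return (
        "Instruction: Generate the impression section of this PET report "
        f"in the style of physician {physician_id}.\n\n"
        f"Findings: {findings}\n\nImpression:"
    )
```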
Language Models for PET Reports
The study focused on summarization in which the models are expected to interpret the findings rather than simply repeat parts of the findings section. The researchers fine-tuned multiple encoder-decoder and decoder-only models to see which generated the most accurate impressions. The fine-tuned models were then compared using a range of evaluation metrics to select the single best model for expert assessment.
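As a rough illustration of how such fine-tuning can be set up, the sketch below uses the Hugging Face Transformers library with google/pegasus-large as an example base checkpoint. Hyperparameters, file names, and column names are assumptions rather than the study's actual training configuration.

```python
# A minimal fine-tuning sketch; values here are illustrative assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "google/pegasus-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical CSV with "input_text" (formatted findings) and "impression" columns.
train_ds = Dataset.from_pandas(pd.read_csv("train.csv"))

def tokenize(batch):
    # Formatted findings as input, the clinical impression as the target.
    return tokenizer(
        batch["input_text"],
        text_target=batch["impression"],
        max_length=1024,
        truncation=True,
    )

train_tok = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-pet-impression",
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Passing the reference impressions as labels trains the model with teacher forcing, the standard sequence-to-sequence setup described in the paper's abstract.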
Evaluating Performance
To determine which evaluation metrics best matched physician preferences, the researchers had two physicians rate a set of model-generated impressions and then measured how strongly each metric correlated with those ratings. The metrics with the strongest correlations were used to select the top-performing model.
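As a simple illustration, the agreement between an automatic metric and physician judgments can be quantified with Spearman's rank correlation, the statistic reported in the abstract. The scores below are made-up placeholders, not data from the study.

```python
# Correlate one automatic metric with physician ratings (placeholder values).
from scipy.stats import spearmanr

metric_scores = [0.62, 0.48, 0.71, 0.55, 0.80, 0.43]   # one score per impression
physician_ratings = [4, 3, 5, 3, 5, 2]                 # physician quality ratings

rho, p_value = spearmanr(metric_scores, physician_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```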
In the expert evaluation phase, three nuclear medicine physicians each reviewed 24 reports, 12 of their own and 12 dictated by other physicians. They rated the model-generated impressions against specific quality criteria and compared them with the impressions originally written in the clinic.
Results
The study found that two metrics, domain-adapted BARTScore and PEGASUSScore, correlated most strongly with physician preferences, and the fine-tuned PEGASUS model was identified as the top performer. When physicians reviewed impressions generated by PEGASUS in their own style, most were considered clinically acceptable, suggesting the model can produce useful output for real-world applications.
When physicians evaluated the impressions generated for their own reports, 89% were scored as clinically acceptable. The average utility score, however, was slightly lower than that of the impressions they had originally written, a gap attributed to shortcomings in areas such as factual correctness and clarity.
When evaluating impressions dictated by other physicians, their scores were again lower than for their own work, highlighting how strongly physicians prefer their own reporting styles. Despite these differences, the overall utility of PEGASUS-generated impressions was rated comparable to that of impressions from other physicians.
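The abstract notes that bootstrap resampling was used for the statistical comparison of utility scores. The sketch below shows a generic version of that kind of analysis; the score arrays are placeholder values, not the study's data.

```python
# Bootstrap comparison of mean utility scores (placeholder data).
import numpy as np

rng = np.random.default_rng(0)
model_scores = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 5, 4, 4])
physician_scores = np.array([4, 4, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4])

observed_diff = model_scores.mean() - physician_scores.mean()

# Resample each group with replacement and collect the difference in means.
diffs = []
for _ in range(10_000):
    m = rng.choice(model_scores, size=model_scores.size, replace=True)
    p = rng.choice(physician_scores, size=physician_scores.size, replace=True)
    diffs.append(m.mean() - p.mean())

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"Mean difference = {observed_diff:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```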
Challenges Faced by the Model
While the majority of the impressions generated by PEGASUS were acceptable, some recurring issues were identified. Factual inaccuracies were a frequent problem, including misinterpretations of the findings. The model also sometimes stated diagnoses with more confidence than the evidence supported, and its recommendations could be too vague for physicians to act on in clinical practice. These issues underline the need for physicians to review and edit model output before finalizing reports.
Limitations of the Study
Several limitations were identified. Because of limited computational resources, only a simple method of domain adaptation was used when fine-tuning some of the models. The study also manipulated only a single element of the input to control the style of the generated impressions, leaving other approaches unexplored.
External testing showed a noticeable drop in evaluation scores, suggesting that differences in reporting style between the internal physicians and the external physicians affected performance. Finally, since the dataset came from a single institution, future research should involve multiple institutions to strengthen the findings.
Conclusion
This study examined how large language models could automate the generation of impressions for whole-body PET reports. The results indicated that the best-performing model, PEGASUS, can create personalized and clinically useful impressions in most cases. Given its performance, the model could be integrated into clinical settings to help speed up PET reporting by automatically preparing initial impressions based on the findings available.
The study acknowledges support from various funding sources, while also making clear that the views expressed in the work are those of the authors and do not necessarily reflect the positions of any sponsoring organization.
In conclusion, while challenges remain, the potential for LLMs to improve the process of creating medical reports is promising and can lead to better efficiency in healthcare settings.
Title: Automatic Personalized Impression Generation for PET Reports Using Large Language Models
Abstract: In this study, we aimed to determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rank correlations (0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). In conclusion, personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
Authors: Xin Tie, Muheon Shin, Ali Pirasteh, Nevein Ibrahim, Zachary Huemann, Sharon M. Castellino, Kara M. Kelly, John Garrett, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
Last Update: 2023-10-17 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2309.10066
Source PDF: https://arxiv.org/pdf/2309.10066
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.