Simple Science

Cutting-edge science explained simply

Computer Science | Computation and Language

Synthetic Data Generation for Clinical Language Models

Using rephrased clinical notes to create synthetic data for healthcare models.

Jinghui Liu, Anthony Nguyen

― 6 min read


[Figure: Clinical language model data strategy, using rephrasing to build better healthcare training data.]

Clinical language models play a big role in healthcare by helping with tasks like decision support and understanding patient data. But developing these models requires access to a lot of clinical text, which can be hard to gather due to patient privacy rules. This study looks at how we can rephrase existing clinical notes using large language models (LLMs) to create synthetic training data. By doing this, we hope to help healthcare institutions develop better models without needing to rely solely on real clinical notes.

The Need for Clinical Data

In healthcare, language models are becoming more important as they can improve various applications. However, for these models to work well, they need to be trained with clinical data. This training process, called pretraining, helps the models adapt to the specific needs of healthcare. Unfortunately, privacy and compliance rules surrounding Electronic Health Records (EHRs) make it difficult to obtain sufficient clinical notes for this purpose.

While some large healthcare organizations can use their own EHR data for training, this isn’t an option for smaller institutions. The result is a slowdown in research geared towards better language models that could improve healthcare outcomes.

Exploring Synthetic Data

To tackle the scarcity of clinical data, researchers have looked into using synthetic data for various clinical tasks. Some existing methods work well but are mostly focused on specific tasks and not general training. One recent approach tried using ChatGPT to create clinical summaries based on patient profiles found in medical literature. While this method shows promise for generating synthetic clinical notes, it heavily relies on the LLM's existing knowledge, which can lead to inaccuracies.

Instead of starting from scratch, this study proposes taking real clinical notes and rephrasing them using LLMs. This method is inspired by previous work that showed how rephrasing web data can benefit general language models. By using existing EHR data, we can create a more reliable synthetic training dataset.

How Rephrasing Works

For our approach, we use various LLMs to rephrase clinical notes. The goal is to create pretraining data that can help models better understand clinical language. We developed three different prompts to guide how the LLMs should rephrase these notes, focusing on clarity, professionalism, and medical accuracy.

  1. Prompt 1: Asks the LLM to produce a diverse paraphrase in high-quality English, similar to what you would find on Wikipedia.
  2. Prompt 2: Similar to Prompt 1, but specifically requests a professional medical tone.
  3. Prompt 3: Builds on Prompt 2 by asking the LLM to explain any medical terms used.
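To make this concrete, here is a rough sketch of what the three prompt templates could look like in code. The wording below is our paraphrase of the descriptions above rather than the exact prompts used in the study, and the `{note_chunk}` placeholder is an assumption about how the note text is slotted in.

```python
# Illustrative paraphrases of the three rephrasing prompts; the study's exact
# wording may differ. {note_chunk} is replaced with a piece of the original note.
REPHRASE_PROMPTS = {
    "prompt_1_general": (
        "Rewrite the following clinical note excerpt as a diverse paraphrase "
        "in high-quality English, similar to what you would find on Wikipedia:\n\n{note_chunk}"
    ),
    "prompt_2_medical_tone": (
        "Rewrite the following clinical note excerpt as a diverse paraphrase "
        "using a professional medical tone:\n\n{note_chunk}"
    ),
    "prompt_3_explain_terms": (
        "Rewrite the following clinical note excerpt using a professional medical "
        "tone, and briefly explain any medical terms that appear:\n\n{note_chunk}"
    ),
}
```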

Using these prompts, we divide the clinical notes into manageable chunks for the LLMs to process. It’s important to keep these chunks reasonably small—around 300 tokens—to ensure the LLM doesn’t lose important information during rephrasing.
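In practice, that chunking step can be as simple as a fixed token window. The sketch below splits a note into roughly 300-token pieces with a Hugging Face tokenizer; the tokenizer checkpoint and the choice to split on raw token boundaries rather than sentence or section boundaries are our assumptions, not details from the study.

```python
from transformers import AutoTokenizer

def chunk_note(text: str, tokenizer, max_tokens: int = 300) -> list[str]:
    """Split a clinical note into consecutive chunks of at most max_tokens tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[start : start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]

# Illustrative usage; the tokenizer checkpoint is an arbitrary choice.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
note = "Patient is a 67-year-old male admitted with chest pain and shortness of breath. ..."
chunks = chunk_note(note, tokenizer)
```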

The LLMs Used for Rephrasing

We examined four smaller LLMs, all under 10 billion parameters, to see how well they could handle clinical text. These included Llama-3.1, Mistral-0.3, Qwen-2, and Gemma-2. We avoided larger models because they require more resources and were less efficient for our needs.
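As a rough illustration of how one of these models might be driven, the snippet below pushes a note chunk through the Hugging Face `transformers` text-generation pipeline. The checkpoint name, prompt wording, and decoding settings are illustrative assumptions; the study's exact model versions and generation parameters may differ.

```python
import torch
from transformers import pipeline

# Illustrative sub-10B instruction-tuned checkpoint; not necessarily the exact
# model version used in the study.
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def rephrase(note_chunk: str) -> str:
    # Prompt wording is a paraphrase of Prompt 2 above, not the exact prompt.
    prompt = (
        "Rewrite the following clinical note excerpt as a diverse paraphrase "
        f"using a professional medical tone:\n\n{note_chunk}"
    )
    messages = [{"role": "user", "content": prompt}]
    out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
    return out[0]["generated_text"][-1]["content"]  # the assistant's reply
```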

For our source data, we utilized discharge summaries from the MIMIC-III database. These summaries provide a comprehensive overview of patient care, making them a valuable resource for generating diverse and meaningful clinical data.
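For readers unfamiliar with MIMIC-III, the clinical notes sit in the NOTEEVENTS table, and discharge summaries are identified by their CATEGORY value. A minimal sketch for pulling them out is below; it assumes credentialed PhysioNet access and a local CSV export, and the file path is a placeholder.

```python
import pandas as pd

# NOTEEVENTS contains all free-text notes; discharge summaries are tagged
# with CATEGORY == "Discharge summary". The path below is illustrative.
notes = pd.read_csv("mimic-iii/NOTEEVENTS.csv", low_memory=False)
discharge = notes[notes["CATEGORY"] == "Discharge summary"]
texts = discharge["TEXT"].dropna().tolist()
print(f"Loaded {len(texts)} discharge summaries")
```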

Evaluating Perplexity

To see how well our rephrasing method worked, we measured the perplexity of the language models on the synthetic data they produced. Lower perplexity scores indicate better performance in understanding and generating language. Our results showed that the rephrasing method significantly outperformed previous synthetic data methods that did not use real clinical notes.
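Perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the text less surprising. A minimal sketch of how it can be computed for a causal language model with Hugging Face `transformers` is shown below; refinements such as sliding-window evaluation over long documents are omitted, and this illustrates the metric itself rather than the study's exact evaluation protocol.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts, max_length=1024, device="cpu"):
    """Perplexity = exp(mean negative log-likelihood per predicted token)."""
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])
            n_predicted = enc["input_ids"].size(1) - 1  # labels are shifted by one
            total_nll += out.loss.item() * n_predicted
            total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)

# Illustrative usage with an arbitrary small causal LM checkpoint.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(lm, tok, ["The patient was discharged in stable condition."]))
```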

Interestingly, we found different LLMs responded uniquely to the prompts. For instance, Qwen-2 performed better with medically focused prompts, while Mistral-0.3 did well with prompts designed for general paraphrasing.

Fine-Tuning with Real and Synthetic Notes

We then explored how encoder-based language models could be fine-tuned using both real and synthetic clinical notes. This helps bridge the gap where healthcare institutions might not have enough approved EHR data to train their models.
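The study's exact training recipe isn't spelled out here, but the general idea of continuing to train an encoder on a mix of real and synthetic notes can be sketched with the Hugging Face `Trainer` and a masked-language-modelling objective. The base model, hyperparameters, and the placeholder note lists below are all illustrative assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # illustrative encoder; the study's base model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpora: approved real notes plus LLM-rephrased synthetic notes.
real_notes = ["Pt admitted w/ CHF exacerbation, started on IV diuretics ..."]
synthetic_notes = ["The patient was admitted with an exacerbation of congestive heart failure ..."]

dataset = Dataset.from_dict({"text": real_notes + synthetic_notes})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-encoder",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```

After this step, the adapted encoder would be fine-tuned and evaluated on downstream tasks such as the NLI and NER benchmarks mentioned below.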

We tested our models on several clinical NLP tasks, like natural language inference and named entity recognition. The data revealed that models augmented with synthetic notes generally performed better than standard models, highlighting the benefits of our rephrasing strategy.

Promising Results

Through our experiments, we demonstrated that combining synthetic data generated by various prompts can lead to stronger performance. Interestingly, while some prompts hindered performance in perplexity tests, they boosted fine-tuning results. This suggests that certain prompts might be better suited for specific tasks.

Our approach is particularly exciting because it requires a much smaller resource and token budget than traditional methods while still achieving superior results.

Future Directions

While this study focused on the quantitative effectiveness of rephrasing, we recognize the importance of qualitative analysis as well. Understanding how well the rephrased notes retain the original meaning and structure will be essential for future research.

It's important to ensure that when LLMs rephrase clinical notes, they do not unintentionally change the meaning or introduce inaccuracies into the information. Future studies will look into how different prompts impact the quality of rephrasing and whether they lead to biases or inaccuracies in the generated text.

Additionally, we aim to expand our dataset by incorporating more types of clinical notes, which will help create stronger models for a variety of healthcare applications.

Conclusion

Our research highlights the potential of using LLMs to rephrase clinical notes for generating pretraining datasets for language models. By exploring this method further and scaling it up, we can improve the development of effective clinical language models that can enhance patient care and support healthcare professionals.

Samples of Rephrased Notes

For a glimpse into our process, we include examples rephrased by each of the four LLMs from real clinical text. Each model produced slightly different outputs, showcasing their individual strengths and styles. Some maintained the structure of the original note, while others were more succinct.

Understanding these stylistic differences will be crucial as we work to refine our methods and improve the quality of the synthetic data we produce.

The Future of Clinical Language Models

The landscape of healthcare is ever-changing, and the need for reliable, efficient tools to process clinical information continues to grow. As we advance our understanding and techniques for generating training data, the potential for improving healthcare outcomes becomes clearer.

By focusing on rephrasing existing clinical notes, we not only respect patient privacy but also create valuable resources that can help propel the next generation of clinical language models forward. The combination of real and synthetic data holds promise for more effective, scalable solutions that can meet the needs of healthcare professionals and support better patient care.

As we move forward with this research, we thank our reviewers for their insightful feedback, which helped enhance this work. We look forward to releasing larger datasets to further investigate these findings and contribute to the ongoing development of clinical language models in the healthcare field.
