Using LLMs to Enhance Reading Comprehension Datasets
This article discusses how LLMs can create new datasets for reading comprehension tasks.
Large Language Models (LLMs) have shown strong abilities across a wide range of language tasks. One interesting use of LLMs is to create synthetic datasets for reading comprehension tasks, which is especially useful when there is not enough data available. In this article, we look at how LLMs like GPT-4 can help improve reading comprehension datasets that have only a limited number of examples. These models can simplify the process of creating datasets, which usually takes a lot of time and effort.
The Importance of Reading Comprehension
Reading comprehension is the task of answering questions based on a given text. This ability matters in many areas, such as healthcare, customer service, and understanding policies. Previous models, especially BERT-based ones, have performed very well when trained on large datasets. However, their performance drops on subjects where there isn't enough data, such as emerging topics like COVID-19.
The Role of Data Augmentation
Data augmentation is a technique used to improve model performance when there isn't enough data. In the context of question answering, most data augmentation methods rely on finding unlabeled texts, such as those on Wikipedia, to create new context-question-answer pairs. However, this approach struggles in specialized areas where relevant texts are rare. LLMs can generate meaningful text that mirrors human writing, and this ability can be used to create both new contexts and the related questions and answers.
Our Approach
We use GPT-4 to enhance low-resource reading comprehension datasets. Our method focuses on generating new contexts, questions, and answers to add to existing training sets. We start by providing examples from the original datasets to GPT-4, allowing it to learn from these samples. This helps in producing data that closely reflects the original materials.
After generating the data, we apply a filtering technique to select the highest quality examples. We test our method on three specific low-resource datasets: CovidQA, PolicyQA, and TechQA. The results show that our approach improves performance on the CovidQA dataset by 23% and on the PolicyQA dataset by 5%.
Related Work
LLMs have been widely used to generate synthetic datasets for different language tasks. Earlier models, including GPT-2, have been applied to language understanding, dialogue generation, and reasoning. Recent models have greatly improved the quality of synthetic data, leading to better performance across a range of tasks.
Past work mainly focused on creating questions from passages found online, like those from Wikipedia. We are among the first to use LLMs to create full contexts, questions, and answers for low-resource reading comprehension tasks.
Low-Resource Datasets
In our study, we use three reading comprehension datasets:
- CovidQA: This dataset includes 2,019 question-answer pairs about COVID-19-related topics.
- PolicyQA: This dataset has 12,102 question-answer pairs revolving around U.S. immigration and travel policies.
- TechQA: This dataset consists of 1,808 examples focused on technical support issues in computing.
These datasets are well-suited for our experiments as they represent different fields while having small training sizes.
Methodology
We outline our methodology using PolicyQA as an example. Our data generation process follows two main steps:
1. Context Generation
In this step, we give GPT-4 one or two examples of contexts from the original training set. These examples help GPT-4 understand the style and content of the data. After this, we generate new contexts by prompting GPT-4 to write additional paragraphs.
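To make this step concrete, the sketch below shows how few-shot context generation might look with the OpenAI Python client. The prompt wording, the PolicyQA-style domain description, and the decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of few-shot context generation (illustrative, not the paper's exact prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_contexts(example_contexts, n_new=1):
    """Show GPT-4 one or two original passages and ask for new passages in the same style."""
    shots = "\n\n".join(f"Example passage:\n{c}" for c in example_contexts)
    prompt = (
        "Below are example passages from a reading comprehension dataset about "
        "U.S. immigration and travel policies.\n\n"
        f"{shots}\n\n"
        "Write one new passage in the same style and on a similar topic."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=n_new,
        temperature=0.9,  # higher temperature encourages variety across generated passages
    )
    return [choice.message.content for choice in response.choices]
```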
2. Question-Answer Generation
Next, we create synthetic question-answer pairs based on the new contexts. Again, we provide one or two examples from the original dataset to help GPT-4 grasp the format of the question-answer pairs. After that, we ask GPT-4 to generate questions and answers that relate to the synthetic contexts we created.
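A matching sketch for this step follows. The helper name, the prompt format, and the requirement that the answer be an exact span are our illustrative choices for an extractive QA setting; the paper's actual prompt may differ.

```python
# Minimal sketch of few-shot question-answer generation over a synthetic context.
from openai import OpenAI

client = OpenAI()

def generate_qa(example_triples, synthetic_context):
    """example_triples: (context, question, answer) tuples taken from the original dataset."""
    shots = "\n\n".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in example_triples
    )
    prompt = (
        f"{shots}\n\n"
        f"Context: {synthetic_context}\n"
        "Write one question about this context and answer it by copying an exact span "
        "from the context.\nQuestion:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # The reply is expected to contain the question followed by "Answer: <span>".
    return response.choices[0].message.content
```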
This two-step process allows us to generate datasets that maintain the characteristics of the original data. We create different amounts of synthetic data, ranging from one to ten times the size of the original datasets, to see how it affects performance.
Round Trip Filtering
To improve the quality of the generated question-answer pairs, we implement a technique called round trip filtering. After GPT-4 creates a question and answer, we feed the synthetic context and the question back to the model, without the answer, and ask it to answer again. We then check whether the new answer matches the original one. If they match, we keep the pair; if not, we discard it. This filtering helps us retain only the most reliable pairs.
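A sketch of how this check could be implemented is shown below. The paper does not spell out the matching criterion here, so this version assumes SQuAD-style answer normalization with exact match; a softer F1 threshold would slot in the same way.

```python
# Minimal sketch of round-trip filtering: re-ask the question without revealing the answer,
# then keep the pair only if the re-generated answer agrees with the original.
import string
from openai import OpenAI

client = OpenAI()

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (SQuAD-style normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def passes_round_trip(context, question, original_answer):
    prompt = (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer with an exact span copied from the context."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic re-answering
    )
    new_answer = response.choices[0].message.content
    return normalize(new_answer) == normalize(original_answer)  # assumed matching criterion

# Keep only the pairs that survive the round trip.
# filtered = [(c, q, a) for (c, q, a) in synthetic_pairs if passes_round_trip(c, q, a)]
```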
Training the Model
For our experiments, we train an extractive reading comprehension model based on RoBERTa-Base. We follow standard practices for setting the learning rate, batch size, and number of epochs. For every experiment, we report F1 and Exact Match (EM) scores.
As a baseline for question-answer generation, we use a T5-based model trained on the SQuAD dataset.
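As an illustration of the training and evaluation setup, the sketch below loads RoBERTa-Base as an extractive QA model with Hugging Face Transformers and computes SQuAD-style Exact Match and F1. The hyperparameter values are placeholders (the article only says standard practice is followed), and the span preprocessing and Trainer loop are omitted.

```python
# Minimal sketch: RoBERTa-Base extractive QA setup plus SQuAD-style EM/F1 scoring.
# Hyperparameters are placeholders; full preprocessing and the Trainer loop are omitted.
import evaluate
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("roberta-base")

args = TrainingArguments(
    output_dir="qa-augmented",
    learning_rate=3e-5,              # placeholder, not the paper's reported value
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
)

# Exact Match and F1 computed exactly as in SQuAD evaluation.
squad_metric = evaluate.load("squad")
scores = squad_metric.compute(
    predictions=[{"id": "0", "prediction_text": "14 days"}],
    references=[{"id": "0", "answers": {"text": ["14 days"], "answer_start": [42]}}],
)
print(scores)  # {'exact_match': 100.0, 'f1': 100.0}
```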
Experimental Results
In testing, we found that adding synthetic data from GPT-4 improved performance on the CovidQA dataset. Starting from the original training examples, both one-shot and two-shot synthetic examples boosted performance in terms of exact match and F1 scores. The best results came from one-shot data generation combined with the round trip filtering process.
For the PolicyQA dataset, the largest of our datasets, using one-shot synthetic data without filtering achieved the best performance. This approach improved scores compared to only using the original examples. The size of the PolicyQA dataset made high precision filtering less critical, allowing the model to benefit from the variety that synthetic data offered.
On the TechQA dataset, the smallest of the three, the results were less clear-cut. The baseline model performed well with just the original examples, while different configurations of synthetic data did not show consistent improvements. The small size of the dataset likely hindered effective generalization.
Conclusion
Our results indicate that large language models can effectively generate synthetic data to enhance reading comprehension tasks. In the CovidQA and PolicyQA areas, where moderate amounts of training data exist, augmenting with synthetic examples consistently led to better performance. This highlights the potential of LLMs in broadening datasets while minimizing the need for human labor in labeling.
However, challenges remain, especially in areas where data is extremely limited. In such cases, LLMs may struggle to produce useful examples. There is a pressing need for improvements in few-shot learning, as well as mechanisms for better filtering of synthetic data to ensure quality and diversity.
In summary, while LLMs like GPT-4 show promise in overcoming data limitations, future research must focus on refining these tools to make them effective across diverse scenarios. The field is evolving rapidly, and continued work will determine how well LLMs can support improved learning in language tasks with limited data.
Title: Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges
Abstract: Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide range of NLP tasks, demonstrating the ability to reason and apply commonsense. A relevant application is to use them for creating high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation processes has the potential to save large amounts of time, money and effort that goes into manually labelling datasets. In this paper, we evaluate the performance of GPT-4 as a replacement for human annotators for low-resource reading comprehension tasks, by comparing performance after fine-tuning, and the cost associated with annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges. Additionally, we release augmented versions of low-resource datasets, which will allow the research community to create further benchmarks for evaluation of generated datasets.
Authors: Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha
Last Update: 2024-07-09
Language: English
Source URL: https://arxiv.org/abs/2309.12426
Source PDF: https://arxiv.org/pdf/2309.12426
Licence: https://creativecommons.org/publicdomain/zero/1.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.