Improving NLP Model Robustness with LLM-TTA
A new approach to enhance NLP model performance on unseen data.
Machine learning models often perform well on familiar, in-distribution data but struggle with new, unseen inputs. Many methods for improving robustness to such out-of-distribution data require access to the model's internals. This is impossible when the model is effectively a black box, such as when its weights are frozen or it is only accessible through an API. Test-Time Augmentation (TTA) is a technique that improves predictions by aggregating a model's outputs over several altered versions of each test input. However, TTA has seen limited use in natural language processing (NLP) because generating effective text augmentations is difficult.
In this work, we introduce LLM-TTA, which uses augmentations generated by large language models (LLMs) as TTA's augmentation function. Our experiments show that LLM-TTA improves performance across a range of tasks without reducing the model's effectiveness on in-distribution data.
Purpose of Study
Text classification models deployed in the real world must handle familiar inputs well while remaining robust to unfamiliar ones. Robustness to new, unseen data is critical in sensitive areas like content moderation and healthcare. The complexity of natural language, along with the potential for adversarial examples, makes this a significant challenge.
Typically, improving robustness requires access to model weights or modifying the model itself. This is difficult when retraining is costly or when labeled examples of out-of-distribution data are scarce. In such settings, intervening on the model's inputs becomes the natural alternative.
Test-Time Augmentation
TTA improves predictions by aggregating multiple predictions over augmented versions of the test input. Choosing the right augmentation function is essential: the augmentations must be diverse while preserving the original meaning, a balance that conventional methods struggle to strike.
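The aggregation step can be sketched as a simple majority vote over the model's predictions on the original input plus its augmented copies. The function names below are illustrative, not from the paper's code:

```python
from collections import Counter

def tta_predict(model_predict, text, augment, n_aug=4):
    """Aggregate a black-box model's predictions over augmented
    copies of a test input via majority vote."""
    candidates = [text] + [augment(text) for _ in range(n_aug)]
    votes = [model_predict(t) for t in candidates]
    # Counter.most_common breaks ties by insertion order, so a tie
    # falls back to the label predicted for the original input.
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Toy demo with a stubbed classifier and augmenter.
model = lambda t: "positive" if "good" in t else "negative"
augment = lambda t: t + " indeed"
print(tta_predict(model, "a good movie", augment))  # positive
```

Note that this treats the model purely as a function from text to label, which is exactly the black-box setting the paper targets.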
Advancements in LLMs in areas like translation and paraphrasing make them suitable for creating high-quality text augmentations. In our study, we compare two methods: zero-shot paraphrasing, where the LLM generates new versions of the text without prior examples, and In-Context Rewriting (ICR), which involves rewriting the text to resemble provided examples.
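The two prompting strategies differ only in what the LLM is asked to do. A minimal sketch of how such prompts might be constructed (the exact wording is an assumption, not the paper's prompt):

```python
def zero_shot_paraphrase_prompt(text, n=4):
    """Ask the LLM for paraphrases with no examples provided."""
    return (f"Paraphrase the following text {n} different ways, "
            f"one per line, preserving its meaning:\n\n{text}")

def in_context_rewriting_prompt(text, id_examples):
    """Ask the LLM to rewrite the text in the style of
    in-distribution examples the classifier was trained on."""
    examples = "\n".join(f"- {ex}" for ex in id_examples)
    return ("Here are examples of the kind of text the classifier "
            "was trained on:\n" + examples +
            "\n\nRewrite the following text in that style, "
            "preserving its meaning:\n" + text)

prompt = in_context_rewriting_prompt(
    "ur take is way off base lol",
    ["The film was a delight from start to finish.",
     "A tedious plot undermined otherwise solid performances."],
)
```

The key design difference is that ICR steers out-of-distribution inputs back toward the distribution the task model already handles well, whereas zero-shot paraphrasing only adds diversity.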
Key Findings
LLM-TTA Enhances Robustness: ICR boosts a BERT classifier's accuracy on out-of-distribution data, with average gains of around 4.86% for sentiment analysis and 6.85% for toxicity detection, while minimally affecting in-distribution performance.
Conventional Methods May Hurt Performance: In contrast, TTA with conventional augmentation functions generally reduces performance on both in-distribution and out-of-distribution data.
Selective Augmentation Improves Efficiency: By augmenting only the inputs where the model's prediction entropy is high, we can cut the number of expensive LLM augmentations substantially while maintaining performance gains.
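The entropy gate can be sketched as follows: compute the entropy of the model's predicted class distribution, and only pay for LLM augmentations when it exceeds a threshold. The threshold value and function names here are illustrative assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_tta(predict_proba, text, augment_and_vote, threshold=0.5):
    """Run expensive LLM-TTA only when the model is uncertain."""
    probs = predict_proba(text)
    if entropy(probs) < threshold:
        # Confident prediction: return the argmax label directly,
        # skipping the LLM augmentation cost entirely.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Uncertain prediction: fall back to full LLM-TTA.
    return augment_and_vote(text)
```

Since most test inputs tend to receive confident predictions, gating on entropy concentrates the augmentation budget on the hard cases, which is how the paper reduces the average number of generated augmentations while keeping the robustness gains.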
Methodology
We assess LLM-TTA's impact on different NLP tasks, focusing on short-form text classification in a black-box setting. Our methodology explores several datasets across sentiment analysis, toxicity detection, and news topic classification.
For each task, we train models on in-distribution data and then evaluate how well they handle several out-of-distribution datasets. Using BERT and T5 architectures, we compare TTA with conventional augmentations against LLM-TTA.
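The evaluation protocol reduces to measuring accuracy on the in-distribution test set and on each out-of-distribution dataset, for each prediction strategy. A minimal sketch (names are illustrative):

```python
def accuracy(predict, dataset):
    """dataset: iterable of (text, label) pairs."""
    pairs = list(dataset)
    correct = sum(predict(text) == label for text, label in pairs)
    return correct / len(pairs)

def robustness_report(predict, id_data, ood_datasets):
    """Compare in-distribution accuracy against accuracy on each
    out-of-distribution dataset for one prediction strategy."""
    report = {"ID": accuracy(predict, id_data)}
    for name, data in ood_datasets.items():
        report[name] = accuracy(predict, data)
    return report
```

Running `robustness_report` once with the plain model, once with conventional TTA, and once with LLM-TTA (each wrapped as a `predict` callable) yields the comparisons reported in the findings above.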
Efficiency and Data Availability
We also examine whether LLM-TTA remains effective in both data-rich and data-scarce environments. LLM-TTA improves robustness even when few labeled examples are available, although the gains are smaller in low-resource settings than in high-resource ones. Overall, the method is effective across varying scales of training data.
Conclusions
In summary, LLM-TTA is an effective way to improve the robustness of NLP models without access to model weights or costly retraining. Using prediction entropy to focus augmentations on uncertain inputs further reduces cost while preserving performance gains. Though LLM-TTA provides clear benefits, further work is needed before models can fully adapt to shifts in data distribution.
Title: Improving Black-box Robustness with In-Context Rewriting
Abstract: Machine learning models for text classification often excel on in-distribution (ID) data but struggle with unseen out-of-distribution (OOD) inputs. Most techniques for improving OOD robustness are not applicable to settings where the model is effectively a black box, such as when the weights are frozen, retraining is costly, or the model is leveraged via an API. Test-time augmentation (TTA) is a simple post-hoc technique for improving robustness that sidesteps black-box constraints by aggregating predictions across multiple augmentations of the test input. TTA has seen limited use in NLP due to the challenge of generating effective natural language augmentations. In this work, we propose LLM-TTA, which uses LLM-generated augmentations as TTA's augmentation function. LLM-TTA outperforms conventional augmentation functions across sentiment, toxicity, and news classification tasks for BERT and T5 models, with BERT's OOD robustness improving by an average of 4.48 percentage points without regressing average ID performance. We explore selectively augmenting inputs based on prediction entropy to reduce the rate of expensive LLM augmentations, allowing us to maintain performance gains while reducing the average number of generated augmentations by 57.74%. LLM-TTA is agnostic to the task model architecture, does not require OOD labels, and is effective across low and high-resource settings. We share our data, models, and code for reproducibility.
Authors: Kyle O'Brien, Nathan Ng, Isha Puri, Jorge Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, Thomas Hartvigsen
Last Update: 2024-08-04 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2402.08225
Source PDF: https://arxiv.org/pdf/2402.08225
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://huggingface.co/facebook/wmt19-en-de
- https://huggingface.co/facebook/wmt19-de-en
- https://huggingface.co/princeton-nlp/sup-simcse-roberta-large
- https://huggingface.co/datasets/Kyle1668/LLM-TTA-Augmentation-Logs
- https://github.com/Kyle1668/In-Context-Domain-Transfer-Improves-Out-of-Domain-Robustness
- https://github.com/Kyle1668/LLM-TTA
- https://huggingface.co/collections/Kyle1668/