Improving NLP Model Robustness with LLM-TTA
A new approach to enhance NLP model performance on unseen data.
Machine learning models often perform well on familiar, in-distribution data but struggle with new, unseen inputs. Many methods for improving robustness to such out-of-distribution data require access to the model's internals. This is impossible when the model is effectively a black box, such as when its weights are frozen or it is only accessible through an API. Test-Time Augmentation (TTA) is a technique that improves predictions by aggregating a model's outputs over several altered versions of each test input. However, TTA has seen limited use in natural language processing (NLP) because generating effective text augmentations is difficult.
In this work, we introduce LLM-TTA, which uses augmentations generated by large language models (LLMs) as TTA's augmentation function. Our experiments show that LLM-TTA improves performance across a range of tasks without reducing the model's effectiveness on in-distribution data.
Purpose of Study
Text classification models deployed in the real world must handle familiar inputs well while remaining robust to unfamiliar ones. Robustness to new, unseen data is critical in sensitive areas like content moderation and healthcare. The complexity of natural language, along with the potential for adversarial examples, makes this a significant challenge.
Typically, improving robustness requires access to model weights or modifying the model itself. This is difficult when retraining is costly or when labeled examples of out-of-distribution data are scarce. In such settings, intervening on the model's inputs becomes the natural alternative.
Test-Time Augmentation
TTA improves predictions by aggregating multiple predictions over augmented versions of the test input. Choosing the right augmentation function is essential: the augmentations must be diverse while preserving the original meaning, a balance that conventional methods struggle to strike.
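The aggregation step can be sketched as a simple majority vote over the model's predictions on the original input plus its augmented copies. The function names below are illustrative, not from the paper's code:

```python
from collections import Counter

def tta_predict(model_predict, text, augment, n_aug=4):
    """Aggregate a black-box model's predictions over augmented
    copies of a test input via majority vote."""
    candidates = [text] + [augment(text) for _ in range(n_aug)]
    votes = [model_predict(t) for t in candidates]
    # Counter.most_common breaks ties by insertion order, so a tie
    # falls back to the label predicted for the original input.
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Toy demo with a stubbed classifier and augmenter.
model = lambda t: "positive" if "good" in t else "negative"
augment = lambda t: t + " indeed"
print(tta_predict(model, "a good movie", augment))  # positive
```

Note that this treats the model purely as a function from text to label, which is exactly the black-box setting the paper targets.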
Advancements in LLMs in areas like translation and paraphrasing make them suitable for creating high-quality text augmentations. In our study, we compare two methods: zero-shot paraphrasing, where the LLM generates new versions of the text without prior examples, and In-Context Rewriting (ICR), which involves rewriting the text to resemble provided examples.
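The two prompting strategies differ only in what the LLM is asked to do. A minimal sketch of how such prompts might be constructed (the exact wording is an assumption, not the paper's prompt):

```python
def zero_shot_paraphrase_prompt(text, n=4):
    """Ask the LLM for paraphrases with no examples provided."""
    return (f"Paraphrase the following text {n} different ways, "
            f"one per line, preserving its meaning:\n\n{text}")

def in_context_rewriting_prompt(text, id_examples):
    """Ask the LLM to rewrite the text in the style of
    in-distribution examples the classifier was trained on."""
    examples = "\n".join(f"- {ex}" for ex in id_examples)
    return ("Here are examples of the kind of text the classifier "
            "was trained on:\n" + examples +
            "\n\nRewrite the following text in that style, "
            "preserving its meaning:\n" + text)

prompt = in_context_rewriting_prompt(
    "ur take is way off base lol",
    ["The film was a delight from start to finish.",
     "A tedious plot undermined otherwise solid performances."],
)
```

The key design difference is that ICR steers out-of-distribution inputs back toward the distribution the task model already handles well, whereas zero-shot paraphrasing only adds diversity.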
Key Findings
LLM-TTA Enhances Robustness: ICR boosts a BERT classifier's accuracy on out-of-distribution data, with average gains of around 4.86% for sentiment analysis and 6.85% for toxicity detection, while minimally affecting in-distribution performance.
Conventional Methods May Hurt Performance: In contrast, TTA with conventional augmentation functions generally reduces performance on both in-distribution and out-of-distribution data.
Selective Augmentation Improves Efficiency: By augmenting only the inputs where the model's prediction entropy is high, we can cut the number of expensive LLM augmentations substantially while maintaining performance gains.
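The entropy gate can be sketched as follows: compute the entropy of the model's predicted class distribution, and only pay for LLM augmentations when it exceeds a threshold. The threshold value and function names here are illustrative assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_tta(predict_proba, text, augment_and_vote, threshold=0.5):
    """Run expensive LLM-TTA only when the model is uncertain."""
    probs = predict_proba(text)
    if entropy(probs) < threshold:
        # Confident prediction: return the argmax label directly,
        # skipping the LLM augmentation cost entirely.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Uncertain prediction: fall back to full LLM-TTA.
    return augment_and_vote(text)
```

Since most test inputs tend to receive confident predictions, gating on entropy concentrates the augmentation budget on the hard cases, which is how the paper reduces the average number of generated augmentations while keeping the robustness gains.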
Methodology
We assess LLM-TTA's impact on different NLP tasks, focusing on short-form text classification in a black-box setting. Our methodology explores several datasets across sentiment analysis, toxicity detection, and news topic classification.
For each task, we train models on in-distribution data and then evaluate how well they handle several out-of-distribution datasets. Using BERT and T5 architectures, we compare TTA with conventional augmentations against LLM-TTA.
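The evaluation protocol reduces to measuring accuracy on the in-distribution test set and on each out-of-distribution dataset, for each prediction strategy. A minimal sketch (names are illustrative):

```python
def accuracy(predict, dataset):
    """dataset: iterable of (text, label) pairs."""
    pairs = list(dataset)
    correct = sum(predict(text) == label for text, label in pairs)
    return correct / len(pairs)

def robustness_report(predict, id_data, ood_datasets):
    """Compare in-distribution accuracy against accuracy on each
    out-of-distribution dataset for one prediction strategy."""
    report = {"ID": accuracy(predict, id_data)}
    for name, data in ood_datasets.items():
        report[name] = accuracy(predict, data)
    return report
```

Running `robustness_report` once with the plain model, once with conventional TTA, and once with LLM-TTA (each wrapped as a `predict` callable) yields the comparisons reported in the findings above.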
Efficiency and Data Availability
We also examine whether LLM-TTA remains effective in both data-rich and data-scarce environments. LLM-TTA improves robustness even when few labeled examples are available, although the gains are smaller in low-resource settings than in high-resource ones. Overall, the method is effective across varying scales of training data.
Conclusions
In summary, LLM-TTA is an effective way to improve the robustness of NLP models without access to model weights or costly retraining. Using prediction entropy to focus augmentations on uncertain inputs further reduces cost while preserving performance gains. Though LLM-TTA provides clear benefits, further work is needed before models can fully adapt to shifts in data distribution.
Title: Improving Black-box Robustness with In-Context Rewriting
Abstract: Machine learning models for text classification often excel on in-distribution (ID) data but struggle with unseen out-of-distribution (OOD) inputs. Most techniques for improving OOD robustness are not applicable to settings where the model is effectively a black box, such as when the weights are frozen, retraining is costly, or the model is leveraged via an API. Test-time augmentation (TTA) is a simple post-hoc technique for improving robustness that sidesteps black-box constraints by aggregating predictions across multiple augmentations of the test input. TTA has seen limited use in NLP due to the challenge of generating effective natural language augmentations. In this work, we propose LLM-TTA, which uses LLM-generated augmentations as TTA's augmentation function. LLM-TTA outperforms conventional augmentation functions across sentiment, toxicity, and news classification tasks for BERT and T5 models, with BERT's OOD robustness improving by an average of 4.48 percentage points without regressing average ID performance. We explore selectively augmenting inputs based on prediction entropy to reduce the rate of expensive LLM augmentations, allowing us to maintain performance gains while reducing the average number of generated augmentations by 57.74%. LLM-TTA is agnostic to the task model architecture, does not require OOD labels, and is effective across low and high-resource settings. We share our data, models, and code for reproducibility.
Authors: Kyle O'Brien, Nathan Ng, Isha Puri, Jorge Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, Thomas Hartvigsen
Last Update: 2024-08-04 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2402.08225
Source PDF: https://arxiv.org/pdf/2402.08225
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://huggingface.co/facebook/wmt19-en-de
- https://huggingface.co/facebook/wmt19-de-en
- https://huggingface.co/princeton-nlp/sup-simcse-roberta-large
- https://huggingface.co/datasets/Kyle1668/LLM-TTA-Augmentation-Logs
- https://github.com/Kyle1668/In-Context-Domain-Transfer-Improves-Out-of-Domain-Robustness
- https://github.com/Kyle1668/LLM-TTA
- https://huggingface.co/collections/Kyle1668/