Aligning Open LLMs with Human Evaluation
A new method improves LLM performance in personalized evaluations with limited data.
Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
― 5 min read
Automatic evaluation using large language models (LLMs) is a prominent topic today. However, judgment and evaluation tasks are often subjective and influenced by many factors, which makes adaptation challenging. Many studies show that state-of-the-art proprietary LLMs perform well compared to human evaluators, yet they often struggle to adjust to an evaluator's preferences over time, an adjustment that personalized evaluation requires.
Numerous attempts have been made to apply open LLMs as evaluators, but many of them overlook the difficulty of working with limited data. Personalized judgment typically arises in settings with only a few data points, a situation common in real-world problems.
This paper proposes a data augmentation method that selects more effective samples from limited data in order to align an open LLM with human preferences. The results show roughly a 7% improvement in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama-3.1-8B-Instruct) on the mathematical reasoning evaluation task.
The human evaluation process is subjective and can vary greatly depending on the evaluator's mood. For example, grading students' papers can change from one semester to the next, reflecting the teacher's mood or situation. This variability must be considered when trying to model or mimic an evaluator's behavior.
Automatic evaluation often faces limitations because only a small amount of feedback is typically available. This makes it important to find effective training methods for assessment in limited-data situations.
This paper shows a way to align an open LLM with a reference evaluator in a data-starved setting, focusing on personalized judgment in tasks like math and general question-answering.
LLM-based evaluation has become a scalable and cost-effective way to assess both machine-generated and human-generated text. An LLM judge provides written feedback together with a score that indicates the quality of the response.
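As a rough illustration (not the paper's exact template), the sketch below shows how a judge prompt might be structured and how a numeric score could be parsed from the model's output; the prompt wording, the 1-5 scale, and the `generate` stub are assumptions made for the example.

```python
import re

# Placeholder for a real call to an open judge model (e.g., Llama-3.1-8B-Instruct
# served locally); it returns a canned reply so the sketch runs on its own.
def generate(prompt: str) -> str:
    return "The answer is correct but skips the justification step.\nScore: 4"

JUDGE_TEMPLATE = (
    "You are an evaluator. Read the question and the response, write short "
    "feedback, then give a score from 1 to 5 on the last line as 'Score: N'.\n\n"
    "Question: {question}\nResponse: {response}\n\nFeedback:"
)

def judge(question: str, response: str) -> tuple[str, int]:
    output = generate(JUDGE_TEMPLATE.format(question=question, response=response))
    # Extract the integer score; -1 signals a malformed judgment.
    match = re.search(r"Score:\s*(\d)", output)
    score = int(match.group(1)) if match else -1
    feedback = output.split("Score:")[0].strip()
    return feedback, score

print(judge("What is 2 + 2?", "4"))
```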
Previous studies using proprietary LLMs as evaluators have shown high correlation with human judgments, improved speed, and cost-effectiveness. These models tend to do well in static judgment, where scoring is based on fixed criteria. However, personalizing these models for specific evaluator preferences is challenging, and they often lack dynamic judgment.
Dynamic judgment refers to an evaluator's ability to learn from a few samples and adjust its evaluation policy over time, which is crucial for personalized evaluation. This work presents an effective way to align an open LLM with a reference evaluator in a limited-data setting.
The aim is to adjust the LLM's judgment so that it matches the human judge. The proposed method achieves approximately 9% and 7% higher Pearson correlation for the math and general question-answering evaluations, respectively, showing that selecting more effective data lets the approach outperform baseline methods.
Contributions
- Proposed a method that enables dynamic judgment for open LLMs, a challenge that has not been fully addressed before.
- Introduced a technique to augment data aimed at improving the reasoning ability of the judge model using the chain of thought (CoT) method.
- Introduced a method to select effective instances from reference judgments, focusing on reducing bias in the aligned model.
Related Works
Naive Data Creation
Different methods are used for preference data creation. The naive data creation approach uses direct feedback from a reference judge.
LLM as a Judge
Using LLMs as judges has gained attention because they can approach the accuracy of human evaluation. Many works use proprietary models such as GPT-4, which have shown strong agreement with human assessments.
Human Preference Alignment
LLMs excel at generating text but often struggle to follow instructions and align with human expectations. Supervised fine-tuning (SFT) has become a key method for this alignment, and several approaches based on reinforcement learning from human feedback (RLHF) have emerged.
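As a concrete example of the RLHF family, the minimal PyTorch sketch below shows the Direct Preference Optimization (DPO) loss computed from summed log-probabilities of chosen and rejected completions under the policy and a frozen reference model. This is a generic illustration of preference-based alignment, not necessarily the recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: scaled log-ratios of the policy vs. the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen completion's reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```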
Data Efficient Alignment
The size and quality of data have a major impact on LLM training time and cost. Efficient data use can reduce training iterations. Some studies focus on improving data quality by filtering out low-quality data.
Data-Efficient Judgment
This section presents an approach for aligning an LLM with a reference judge. While the focus is on machine-generated text, it could be extended to human text as well.
Data Curation and Augmentation
Assessment tasks require strong reasoning skills to ensure fair and accurate decisions. However, studies have shown that LLMs like Llama-3.1-8B-Instruct are not very effective as evaluators.
Seed for Preference Dataset
Starting from a dataset of questions and responses, feedback and scores are collected from a reference judge. This seed dataset is the starting point for improving the LLM's judgment performance.
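A plausible shape for one seed record is sketched below; the field names and the 1-5 scale are assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class SeedRecord:
    question: str
    response: str              # machine-generated answer to be judged
    reference_feedback: str    # written critique from the reference judge
    reference_score: int       # e.g., an integer on a 1-5 scale

seed = [
    SeedRecord(
        question="Solve 3x + 1 = 10.",
        response="x = 3",
        reference_feedback="Correct result, but the solution steps are missing.",
        reference_score=4,
    ),
]
```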
Naive Data Creation Approach
In this method, the base LLM generates feedback and scores for responses. The generated feedback is assumed to be of lower quality compared to feedback from the reference judge.
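One plausible reading of this setup is that each response yields a preference pair in which the reference judge's feedback is "chosen" and the base model's own feedback is "rejected"; the sketch below builds such pairs under that assumption, with illustrative key names.

```python
def build_naive_pairs(records):
    """Build preference pairs that prefer reference feedback over base-model feedback.

    Each record is a dict with keys 'question', 'response', 'reference_feedback',
    and 'base_feedback' (the base LLM's own critique). Key names are illustrative.
    """
    pairs = []
    for r in records:
        prompt = f"Question: {r['question']}\nResponse: {r['response']}\nFeedback:"
        pairs.append({
            "prompt": prompt,
            "chosen": r["reference_feedback"],   # assumed higher quality
            "rejected": r["base_feedback"],      # assumed lower quality
        })
    return pairs
```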
Pool of Feedback Approach
Multiple feedback-and-score pairs are generated for each response using the base LLM, leveraging its reasoning abilities. Drawing several candidates per response gives the LLM a better chance of producing high-quality feedback.
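The summary does not say how a feedback is picked from the pool, so the selection rule below, keeping the candidate whose score is closest to the reference judge's score, is only a guess; the sampling function is a stand-in for decoding several critiques at a nonzero temperature.

```python
import random

def sample_feedback_pool(question: str, response: str, k: int = 4):
    """Stand-in for sampling k feedback/score pairs from the base LLM."""
    return [(f"Candidate critique #{i}", random.randint(1, 5)) for i in range(k)]

def pick_from_pool(question: str, response: str, reference_score: int, k: int = 4):
    pool = sample_feedback_pool(question, response, k)
    # Keep the candidate whose score best matches the reference judge's score.
    return min(pool, key=lambda pair: abs(pair[1] - reference_score))

print(pick_from_pool("What is 2 + 2?", "4", reference_score=5))
```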
Efficient Sampling Approach
This method selects the more effective samples from the reference judge's feedback. Instead of using all of it, a subset is chosen based on similarity.
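The summary only says the subset is chosen "based on similarity", so the sketch below is one guess at what that could look like: embed the reference judge's feedback texts and greedily keep items that are not too similar to anything already kept, which reduces redundancy in the alignment set. The `embed` function is a stand-in for any sentence encoder.

```python
import numpy as np

def embed(texts):
    """Stand-in for a real sentence encoder; random unit vectors keep the
    sketch self-contained."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 32))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select_diverse(feedback_texts, max_items=100, threshold=0.9):
    """Greedily keep items whose cosine similarity to every kept item stays
    below `threshold`."""
    vecs = embed(feedback_texts)
    kept = []
    for i, v in enumerate(vecs):
        if all(float(v @ vecs[j]) < threshold for j in kept):
            kept.append(i)
        if len(kept) >= max_items:
            break
    return [feedback_texts[i] for i in kept]

print(select_diverse(["Good proof.", "Good proof!", "Missing a case analysis."]))
```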
Experiment Setup
This section describes the size of the created data and of the alignment datasets extracted from the feedback datasets. The resulting models show improved alignment with the reference judge.
Evaluation Setup
Evaluator LMs are assessed with Pearson, Spearman, and Kendall-Tau correlations against the reference evaluator. Results are compared across the three data creation methods, highlighting the importance of the chosen data sampling strategy.
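Computing these three correlations is straightforward with SciPy; the sketch below compares scores from an aligned judge against a reference judge's scores on the same responses (the score values are made up).

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

reference_scores = [5, 3, 4, 2, 5, 1, 4]   # made-up reference-judge scores
model_scores     = [4, 3, 5, 2, 5, 2, 3]   # made-up aligned-judge scores

print("Pearson:     %.3f" % pearsonr(reference_scores, model_scores)[0])
print("Spearman:    %.3f" % spearmanr(reference_scores, model_scores)[0])
print("Kendall-Tau: %.3f" % kendalltau(reference_scores, model_scores)[0])
```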
Results
Findings show that the proposed approach yields significant improvements in alignment with human evaluators. However, the study is limited by data availability and focuses on specific tasks, which may affect its broader applicability.
Conclusion
While LLMs have potential for automatic evaluation, personalizing them for subjective tasks in limited-data situations remains a challenge. The proposed methods show significant improvements and potential for better aligning LLMs with human evaluations. Future work could focus on expanding the range of tasks and increasing data diversity for greater generalizability.
Original Source
Title: Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
Abstract: Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, they often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited data scenarios, which are common in many real-world problems. Our work aims to present a data augmentation technique to select a more effective sample from limited data in order to align an open LLM with human preference. Our work achieves approximately 7% improvements in Pearson correlation with a reference judge over the baseline, and 30% improvement over the base model (Llama3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that selecting more effective preference data enables our approach to surpass baseline methods.
Authors: Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
Last Update: Dec 10, 2024
Language: English
Source URL: https://arxiv.org/abs/2412.07429
Source PDF: https://arxiv.org/pdf/2412.07429
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.