Aligning Open LLMs with Human Evaluation
A new method improves LLM performance in personalized evaluations with limited data.
Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
― 5 min read
Automatic evaluation using large language models (LLMs) is a prominent topic today. However, judgment and evaluation tasks are often subjective and influenced by many factors, which makes adaptation challenging. Many studies show that state-of-the-art proprietary LLMs perform well compared to human evaluators, yet they often struggle to adjust to an evaluator's preferences over time, an adjustment that personalized evaluation requires.
Numerous attempts have been made to apply open LLMs as evaluators, but many of them overlook the difficulty of working with limited data. Personalized judgment typically arises in settings with only a few data points, a situation common in real-world problems.
This paper proposes a data augmentation method that selects more effective samples from limited data in order to align an open LLM with human preferences. The results show roughly a 7% improvement in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama-3.1-8B-Instruct) on the mathematical reasoning evaluation task.
The human evaluation process is subjective and can vary greatly depending on the evaluator's mood. For example, grading students' papers can change from one semester to the next, reflecting the teacher's mood or situation. This variability must be considered when trying to model or mimic an evaluator's behavior.
Automatic evaluation often faces limitations because only a small amount of feedback is typically available. This makes it important to find effective training methods for assessment in limited-data situations.
This paper shows a way to align an open LLM with a reference evaluator in a data-starved setting, focusing on personalized judgment in tasks like math and general question-answering.
LLM-based evaluation has become a scalable and cost-effective way to assess both machine-generated and human-generated text. An LLM judge provides written feedback together with a score that indicates the quality of the response.
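As a rough illustration (not the paper's exact template), the sketch below shows how a judge prompt might be structured and how a numeric score could be parsed from the model's output; the prompt wording, the 1-5 scale, and the `generate` stub are assumptions made for the example.

```python
import re

# Placeholder for a real call to an open judge model (e.g., Llama-3.1-8B-Instruct
# served locally); it returns a canned reply so the sketch runs on its own.
def generate(prompt: str) -> str:
    return "The answer is correct but skips the justification step.\nScore: 4"

JUDGE_TEMPLATE = (
    "You are an evaluator. Read the question and the response, write short "
    "feedback, then give a score from 1 to 5 on the last line as 'Score: N'.\n\n"
    "Question: {question}\nResponse: {response}\n\nFeedback:"
)

def judge(question: str, response: str) -> tuple[str, int]:
    output = generate(JUDGE_TEMPLATE.format(question=question, response=response))
    # Extract the integer score; -1 signals a malformed judgment.
    match = re.search(r"Score:\s*(\d)", output)
    score = int(match.group(1)) if match else -1
    feedback = output.split("Score:")[0].strip()
    return feedback, score

print(judge("What is 2 + 2?", "4"))
```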
Previous studies using proprietary LLMs as evaluators have shown high correlation with human judgments, improved speed, and cost-effectiveness. These models tend to do well in static judgment, where scoring is based on fixed criteria. However, personalizing these models for specific evaluator preferences is challenging, and they often lack dynamic judgment.
Dynamic judgment refers to an evaluator's ability to learn from a few samples and adjust its evaluation policy over time, which is crucial for personalized evaluation. This work presents an effective way to align an open LLM with a reference evaluator in a limited-data setting.
The aim is to adjust the LLM's judgment so that it matches the human judge. The proposed method achieves approximately 9% and 7% higher Pearson correlation for the math and general question-answering evaluations, respectively, showing that selecting more effective data lets the approach outperform baseline methods.
Contributions
- Proposed a method that enables dynamic judgment for open LLMs, a challenge that has not been fully addressed before.
- Introduced a technique to augment data aimed at improving the reasoning ability of the judge model using the chain of thought (CoT) method.
- Introduced a method to select effective instances from reference judgments, focusing on reducing bias in the aligned model.
Related Works
Naive Data Creation
Different methods are used for preference data creation. The naive data creation approach uses direct feedback from a reference judge.
LLM as a Judge
Using LLMs as judges has gained attention because they can approach the accuracy of human evaluation. Many works use proprietary models such as GPT-4, which have shown strong agreement with human assessments.
Human Preference Alignment
LLMs excel at generating text but often struggle to follow instructions and align with human expectations. Supervised fine-tuning (SFT) has become a key method for this alignment, and several approaches based on reinforcement learning from human feedback (RLHF) have emerged.
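As a concrete example of the RLHF family, the minimal PyTorch sketch below shows the Direct Preference Optimization (DPO) loss computed from summed log-probabilities of chosen and rejected completions under the policy and a frozen reference model. This is a generic illustration of preference-based alignment, not necessarily the recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: scaled log-ratios of the policy vs. the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen completion's reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```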
Data Efficient Alignment
The size and quality of data have a major impact on LLM training time and cost. Efficient data use can reduce training iterations. Some studies focus on improving data quality by filtering out low-quality data.
Data-Efficient Judgment
This section presents an approach for aligning an LLM with a reference judge. While the focus is on machine-generated text, it could be extended to human text as well.
Data Curation and Augmentation
Assessment tasks require strong reasoning skills to ensure fair and accurate decisions. However, studies have shown that LLMs like Llama-3.1-8B-Instruct are not very effective as evaluators.
Seed for Preference Dataset
Starting from a dataset of questions and responses, feedback and scores are collected from a reference judge. This seed dataset is the starting point for improving the LLM's judgment performance.
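A plausible shape for one seed record is sketched below; the field names and the 1-5 scale are assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class SeedRecord:
    question: str
    response: str              # machine-generated answer to be judged
    reference_feedback: str    # written critique from the reference judge
    reference_score: int       # e.g., an integer on a 1-5 scale

seed = [
    SeedRecord(
        question="Solve 3x + 1 = 10.",
        response="x = 3",
        reference_feedback="Correct result, but the solution steps are missing.",
        reference_score=4,
    ),
]
```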
Naive Data Creation Approach
In this method, the base LLM generates feedback and scores for responses. The generated feedback is assumed to be of lower quality compared to feedback from the reference judge.
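One plausible reading of this setup is that each response yields a preference pair in which the reference judge's feedback is "chosen" and the base model's own feedback is "rejected"; the sketch below builds such pairs under that assumption, with illustrative key names.

```python
def build_naive_pairs(records):
    """Build preference pairs that prefer reference feedback over base-model feedback.

    Each record is a dict with keys 'question', 'response', 'reference_feedback',
    and 'base_feedback' (the base LLM's own critique). Key names are illustrative.
    """
    pairs = []
    for r in records:
        prompt = f"Question: {r['question']}\nResponse: {r['response']}\nFeedback:"
        pairs.append({
            "prompt": prompt,
            "chosen": r["reference_feedback"],   # assumed higher quality
            "rejected": r["base_feedback"],      # assumed lower quality
        })
    return pairs
```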
Pool of Feedback Approach
Multiple feedback-and-score pairs are generated for each response using the base LLM, leveraging its reasoning abilities. Drawing several candidates per response gives the LLM a better chance of producing high-quality feedback.
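The summary does not say how a feedback is picked from the pool, so the selection rule below, keeping the candidate whose score is closest to the reference judge's score, is only a guess; the sampling function is a stand-in for decoding several critiques at a nonzero temperature.

```python
import random

def sample_feedback_pool(question: str, response: str, k: int = 4):
    """Stand-in for sampling k feedback/score pairs from the base LLM."""
    return [(f"Candidate critique #{i}", random.randint(1, 5)) for i in range(k)]

def pick_from_pool(question: str, response: str, reference_score: int, k: int = 4):
    pool = sample_feedback_pool(question, response, k)
    # Keep the candidate whose score best matches the reference judge's score.
    return min(pool, key=lambda pair: abs(pair[1] - reference_score))

print(pick_from_pool("What is 2 + 2?", "4", reference_score=5))
```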
Efficient Sampling Approach
This method selects the more effective samples from the reference judge's feedback. Instead of using all of it, a subset is chosen based on similarity.
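The summary only says the subset is chosen "based on similarity", so the sketch below is one guess at what that could look like: embed the reference judge's feedback texts and greedily keep items that are not too similar to anything already kept, which reduces redundancy in the alignment set. The `embed` function is a stand-in for any sentence encoder.

```python
import numpy as np

def embed(texts):
    """Stand-in for a real sentence encoder; random unit vectors keep the
    sketch self-contained."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 32))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select_diverse(feedback_texts, max_items=100, threshold=0.9):
    """Greedily keep items whose cosine similarity to every kept item stays
    below `threshold`."""
    vecs = embed(feedback_texts)
    kept = []
    for i, v in enumerate(vecs):
        if all(float(v @ vecs[j]) < threshold for j in kept):
            kept.append(i)
        if len(kept) >= max_items:
            break
    return [feedback_texts[i] for i in kept]

print(select_diverse(["Good proof.", "Good proof!", "Missing a case analysis."]))
```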
Experiment Setup
This section describes the size of the created data and of the alignment datasets extracted from the feedback datasets. The resulting models show improved alignment with the reference judge.
Evaluation Setup
Evaluator LMs are assessed with Pearson, Spearman, and Kendall-Tau correlations against the reference evaluator. Results are compared across the three data creation methods, highlighting the importance of the chosen data sampling strategy.
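Computing these three correlations is straightforward with SciPy; the sketch below compares scores from an aligned judge against a reference judge's scores on the same responses (the score values are made up).

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

reference_scores = [5, 3, 4, 2, 5, 1, 4]   # made-up reference-judge scores
model_scores     = [4, 3, 5, 2, 5, 2, 3]   # made-up aligned-judge scores

print("Pearson:     %.3f" % pearsonr(reference_scores, model_scores)[0])
print("Spearman:    %.3f" % spearmanr(reference_scores, model_scores)[0])
print("Kendall-Tau: %.3f" % kendalltau(reference_scores, model_scores)[0])
```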
Results
Findings show that the proposed approach yields significant improvements in alignment with human evaluators. However, the study is limited by data availability and focuses on specific tasks, which may affect its broader applicability.
Conclusion
While LLMs have potential for automatic evaluation, personalizing them for subjective tasks in limited-data situations remains a challenge. The proposed methods show significant improvements and potential for better aligning LLMs with human evaluations. Future work could focus on expanding the range of tasks and increasing data diversity for greater generalizability.
Original Source
Title: Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
Abstract: Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, they often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited data scenarios, which are common in many real-world problems. Our work aims to present a data augmentation technique to select a more effective sample from limited data in order to align an open LLM with human preference. Our work achieves approximately 7% improvements in Pearson correlation with a reference judge over the baseline, and 30% improvement over the base model (Llama3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that selecting more effective preference data enables our approach to surpass baseline methods.
Authors: Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
Last Update: Dec 10, 2024
Language: English
Source URL: https://arxiv.org/abs/2412.07429
Source PDF: https://arxiv.org/pdf/2412.07429
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.