What does "Outcome-Based Reward Models" mean?
Outcome-Based Reward Models (ORMs) are auxiliary models that help a language model decide when to revise its answers. An ORM inspects only the final answer the language model produces and estimates whether that answer is correct. This estimate serves as a signal: when the ORM judges an answer likely wrong, the language model should try to refine or replace its response to improve accuracy.
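The refinement loop described above can be sketched in a few lines. This is a minimal illustration, not a real system: `generate`, `refine`, and `orm_score` are hypothetical placeholders standing in for an actual language model and a trained ORM.

```python
# Sketch of ORM-gated answer refinement. All functions here are
# illustrative stand-ins, not a real LM or reward-model API.

def generate(question: str) -> str:
    # Placeholder LM call: produce a first-draft answer.
    return "draft answer to: " + question

def refine(question: str, previous: str) -> str:
    # Placeholder LM call: revise a previous answer.
    return previous + " (revised)"

def orm_score(question: str, answer: str) -> float:
    # Placeholder ORM: estimated probability the final answer is correct.
    # A real ORM would be a trained classifier over (question, answer) pairs.
    return 0.9 if "revised" in answer else 0.4

def answer_with_refinement(question: str,
                           threshold: float = 0.8,
                           max_rounds: int = 3) -> str:
    answer = generate(question)
    for _ in range(max_rounds):
        if orm_score(question, answer) >= threshold:
            break  # ORM judges the outcome likely correct; stop refining.
        answer = refine(question, answer)
    return answer
```

The key design point is that the ORM only sees the final answer, so the same loop works regardless of how the model reasons internally.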
How ORMs Work
ORMs are trained on examples in which human feedback labels each answer as correct or incorrect. From these examples, an ORM learns to predict the correctness of answers it has not seen before. When the ORM signals that an answer is probably wrong, the language model can revisit its reasoning and produce a better solution.
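The training setup is ordinary supervised binary classification. The toy sketch below fits a one-feature logistic regression as a stand-in "reward model" on synthetic labeled data; real ORMs are typically fine-tuned language models scoring (question, answer) pairs, and the feature and data here are invented purely for illustration.

```python
# Toy stand-in for ORM training: logistic regression fit on
# (answer_feature, correct_label) pairs. The data is synthetic and
# the single scalar feature is hypothetical.
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic dataset: answers labeled correct (1) cluster at feature ~ +1,
# answers labeled incorrect (0) cluster at feature ~ -1.
data = ([(random.gauss(1.0, 0.5), 1) for _ in range(50)]
        + [(random.gauss(-1.0, 0.5), 0) for _ in range(50)])

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):  # stochastic gradient descent on logistic loss
    for x, y in data:
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

def orm_predict(x: float) -> float:
    """Estimated probability that an answer with feature x is correct."""
    return sigmoid(w * x + b)
```

Once trained, the predicted probability can be thresholded to decide whether an answer needs refinement, as in the loop above the training signal is exactly "was the final outcome correct?", with no credit assigned to intermediate reasoning steps.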
Benefits of ORMs
Using ORMs improves the performance of language models, especially on tasks that require multi-step reasoning, such as math and science questions. Because the model receives a signal about when its answer needs refinement, it produces accurate final responses more often. This kind of outcome-level feedback is crucial for improving the overall quality of the answers language models generate.