
Topics: Computer Science, Machine Learning, Artificial Intelligence, Computation and Language

Improving Memory Efficiency in Reinforcement Learning with Human Feedback

New methods enhance memory use and speed in language model training.




Reinforcement Learning with Human Feedback (RLHF) has changed how we train language models to better reflect what people want. But a key part of this process, called Proximal Policy Optimization (PPO), uses a lot of memory: it can require more than three times the memory of Supervised Fine-Tuning (SFT). This puts it out of reach for many practitioners. To address this problem, we looked closely at how much memory these methods use, how well they perform, and how long they take to train.

We introduced a new approach called Hydra-RLHF, which merges the models used in RLHF and dynamically switches LoRA weights off during training to save memory. Our tests showed two main things. First, using LoRA during PPO brings memory use below that of SFT while improving alignment with human preferences across four public benchmarks. Second, our Hydra-PPO approach reduces the per-sample latency of LoRA-PPO by up to 65% without losing performance. Together, these changes make it easier for more people to use RLHF in their work.

Since models like ChatGPT, GPT-4, and Llama-2 became popular, they have amazed users with how helpful they can be across various tasks. One crucial aspect of their success comes from using RLHF to align these models with human expectations. Training large language models gives them a lot of knowledge, but they often struggle to apply that knowledge correctly. This mismatch can lead to errors and potential harm. To manage this, alignment adjusts the models to behave in expected ways. It is now a vital part of making sure these models are safe and useful.

However, while RLHF does improve this alignment, it also presents challenges. It can be very complex and needs a lot of memory to run multiple models at the same time during PPO. Since RLHF is still a new area of research, there is a strong need to assess its different forms in terms of speed and effectiveness.

To meet that need, we focused on the training steps and structures of standard RLHF-PPO. We found significant opportunities to cut memory and computation costs by sharing weights among the Reference, Reward, Actor, and Critic models.

Our comparisons measured how much memory and time each method used when run on the same base model. We also gave a detailed breakdown of how many models each PPO variant must keep in memory, showing that Hydra-PPO keeps the fewest and is therefore the most efficient.

Stages of the RLHF Process

The RLHF method consists of three major stages:

  1. Supervised Fine-Tuning (SFT): This stage trains a language model on a dataset of prompts and desired responses so that it learns to follow instructions. There are two variants: one where all parameters are trained (Full Fine-Tuning) and one where LoRA is used so that only a small set of added parameters is trained.

  2. Reward Model (RM): Here, we repurpose the language model so that, instead of generating text, it outputs a score predicting which of two candidate responses a human would prefer, trained on a set of prompt and response pairs. After training, we make sure the reward this model produces is stable, which helps the later PPO step.

  3. PPO: In this final stage, we train an actor (the part that generates responses) and a critic (which estimates how good those responses are expected to be) using the previously trained reward model. During this training, at least four models are held in memory at once, including a frozen Reference model that keeps the actor from drifting too far and so keeps training stable. A sketch of what one such step involves follows this list.
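
To make the memory cost concrete, the sketch below (not the authors' code) lists the four models a standard RLHF-PPO step keeps in memory and computes the commonly used KL-penalized reward. The checkpoint name "gpt2", the prompt, and the beta coefficient are illustrative assumptions; a real setup would use the SFT model, a trained reward model, and a proper value head.

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

name = "gpt2"  # placeholder checkpoint, standing in for the SFT model
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token

# The four models held in memory during standard PPO:
actor = AutoModelForCausalLM.from_pretrained(name)                      # trained by PPO
reference = AutoModelForCausalLM.from_pretrained(name).eval()           # frozen SFT copy
reward = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1).eval()
critic = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)  # value estimator
for m in (reward, critic):
    m.config.pad_token_id = tok.eos_token_id

# Sample a response from the actor for one prompt.
prompt_ids = tok("Explain RLHF in one sentence:", return_tensors="pt").input_ids
with torch.no_grad():
    seq = actor.generate(prompt_ids, max_new_tokens=32,
                         do_sample=True, pad_token_id=tok.eos_token_id)

def seq_logprob(model, ids):
    """Total log-probability the model assigns to the token sequence.
    (For simplicity this scores prompt and response together; real
    implementations score only the response, token by token.)"""
    out = model(ids, labels=ids)           # out.loss = mean NLL per predicted token
    return -out.loss * (ids.shape[1] - 1)  # undo the mean to get the sum

with torch.no_grad():
    score = reward(seq).logits.squeeze()                                 # preference score
    kl_term = seq_logprob(actor, seq) - seq_logprob(reference, seq)      # drift from reference

beta = 0.1  # KL-penalty coefficient (assumed value)
# Standard RLHF reward: preference score minus a penalty for drifting away
# from the frozen reference model; PPO then updates the actor and critic.
ppo_reward = score - beta * kl_term
print(float(ppo_reward))
```

Keeping four full models (plus optimizer states for the trainable ones) resident at once is exactly where the extra memory goes.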

Introducing Hydra-RLHF

We propose Hydra-RLHF, which modifies traditional RLHF to save on memory during the PPO phase while maintaining performance.

  1. Hydra-SFT: This training method merges supervised fine-tuning and reward-model training into a single model by optimizing both objectives at once. It requires training data that includes paired comparisons (a preferred and a rejected response for each prompt) to work effectively.

  2. Dynamic LoRA: This approach saves memory by switching the LoRA weights off when they are not needed. Because the trainable model is just the frozen model plus its LoRA weights, turning LoRA off recovers the frozen model exactly, so a separate copy never has to be kept in memory while performance stays intact (a code sketch of this idea follows the list).

  3. Hydra-PPO: Combining the two ideas above, the actor and critic become separate sets of LoRA weights on top of a single shared base model, further reducing how many full models must be kept in memory during PPO.
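
Here is a minimal sketch of the dynamic-LoRA idea, written with the Hugging Face `peft` library rather than the authors' implementation: the trainable model is the frozen base plus LoRA weights, so temporarily disabling the adapter recovers the frozen model without storing a second copy. The checkpoint name, target modules, and LoRA hyperparameters below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

name = "gpt2"  # placeholder base checkpoint, standing in for the shared Hydra/SFT model
tok = AutoTokenizer.from_pretrained(name)
base = AutoModelForCausalLM.from_pretrained(name)

# Attach LoRA: the base weights stay frozen, only the small adapter is trainable.
lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                      lora_dropout=0.0, target_modules=["c_attn"])
actor = get_peft_model(base, lora_cfg)

ids = tok("RLHF aligns language models with human preferences.",
          return_tensors="pt").input_ids
with torch.no_grad():
    actor_logits = actor(ids).logits       # adapter on: this is the actor
    with actor.disable_adapter():          # adapter off: the frozen base reappears
        ref_logits = actor(ids).logits     # serves as the reference model

# One copy of the base weights plays both roles. The outputs match before any
# PPO updates (LoRA is initialized to zero) and diverge once training begins.
print(torch.allclose(actor_logits, ref_logits))
```

The same toggle can recover a frozen reward model from a LoRA-adapted critic, which is what removes the need to hold four separate full models at once.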

Results and Comparisons

We compared the methods head to head and found that, on average, our new methods outperform the traditional ones. Hydra-PPO showed better alignment than LoRA-PPO, likely due to its improved reward model.

In terms of speed, Hydra-PPO's advantage grew as the length of the generated text increased. Because it frees memory, it allows a larger training batch size, which yields a substantial decrease in the time taken per sample during PPO.

We also evaluated other datasets, such as StackExchange and Learning to Summarize, and found interesting patterns across the results. For instance, while standard models often perform well, PPO methods showed better recall but sometimes lagged in precision.

Challenges with Joined-Hydra-PPO

We also tested Joined-Hydra-PPO (J-Hydra-PPO), which uses a single set of LoRA weights for both the actor and critic. This variant saved some additional memory, but its performance did not match Hydra-PPO's. We believe this stems from the instability of training both roles through one set of weights.

Future Directions

Our research points to new pathways for improving RLHF. There is a need to balance the datasets used for SFT and RM training better. Further development could enhance the performance of methods like J-Hydra-PPO, as well as make other techniques for parameter-efficient fine-tuning more effective in RLHF settings.

Conclusion

Through our study, we have shown that it is possible to improve the efficiency of RLHF by saving memory during the PPO phase. Our Hydra-RLHF method combines models and adjusts training strategies to allow the use of larger batch sizes, leading to faster and more accessible training processes. We hope that our findings encourage wider adoption of RLHF and inspire further improvements in this exciting area of technology.

Original Source

Title: Efficient RLHF: Reducing the Memory Usage of PPO

Abstract: Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible to use for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-savings techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. Using LoRA during PPO reduces its memory usage to be smaller than SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.

Authors: Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

Last Update: 2023-09-01 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2309.00754

Source PDF: https://arxiv.org/pdf/2309.00754

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
