Evaluating Low-Rank Adaptation in Model Training
This article compares LoRA and full finetuning on performance and memory use.
― 4 min read
Low-Rank Adaptation (LoRA) is a method for finetuning large language models (LLMs) while saving memory. Instead of updating the entire model, it trains a small number of additional parameters, called adapters, alongside the frozen base weights. This makes it attractive for adapting models to target domains such as programming and mathematics. However, recent work shows that while LoRA does save memory, it often does not perform as well as full finetuning.
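To make the mechanism concrete, here is a minimal sketch of a LoRA layer in PyTorch: the pretrained weight is frozen, and only a low-rank update BA is trained on top of it. The class, its dimensions, and its hyperparameter defaults are illustrative assumptions, not code from the paper or from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # Only these two thin factors are trained: rank * (d_in + d_out) parameters.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init => no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scaling * x A^T B^T, i.e. the frozen output plus a rank-r correction
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(4, 512))  # base output plus the low-rank update
```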
In this article, we will look at how LoRA compares to full finetuning in different tasks. We will also explore how well LoRA maintains performance on tasks outside of the target domain.
Memory Efficiency in Finetuning
Finetuning large models places heavy demands on accelerator memory. Full finetuning updates every weight in the model, which means storing gradients and optimizer state for all parameters. LoRA instead trains only the small adapter matrices, so the gradients and optimizer state it must keep are a tiny fraction of the full model's, making training substantially lighter on memory.
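A quick back-of-the-envelope calculation shows where the savings come from: for a weight matrix of shape d_out × d_in, full finetuning trains d_out · d_in parameters, while a rank-r adapter trains only r · (d_in + d_out). The dimensions below are illustrative, not figures from the paper.

```python
d_in, d_out, rank = 4096, 4096, 16

full_params = d_out * d_in           # full finetuning updates the whole matrix
lora_params = rank * (d_in + d_out)  # LoRA trains B (d_out x r) and A (r x d_in)

print(f"full:  {full_params:,}")     # 16,777,216
print(f"lora:  {lora_params:,}")     # 131,072
print(f"ratio: {full_params / lora_params:.0f}x fewer trainable parameters")  # 128x
```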
Comparing Performance in Programming and Math
We conducted tests to see how LoRA performs against full finetuning in two target domains: programming and mathematics. For each domain, we used two training regimes: instruction finetuning (IFT) and continued pretraining (CPT). IFT trains on prompt-response pairs (roughly 100K in our setup), while CPT trains on large volumes of unstructured text (on the order of 20B tokens).
Our findings show that LoRA often does not perform as well as full finetuning. In programming tasks, the gap in performance was noticeable. However, in math tasks, LoRA's results were closer to those of full finetuning.
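The two regimes feed the model very different kinds of examples. The snippet below sketches what a single training example might look like in each; the field names and the prompt template are illustrative assumptions, not the paper's exact preprocessing.

```python
# Instruction finetuning (IFT): structured prompt-response pairs.
ift_example = {
    "prompt": "Write a Python function that reverses a string.",
    "response": "def reverse(s):\n    return s[::-1]",
}
ift_text = f"### Instruction:\n{ift_example['prompt']}\n### Response:\n{ift_example['response']}"

# Continued pretraining (CPT): raw, unstructured text with no prompt/response split.
cpt_text = (
    "The derivative of a product of two functions is given by the product rule: "
    "(fg)' = f'g + fg'. To see why, expand the difference quotient ..."
)

# Both are tokenized the same way; only the structure of the data differs.
```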
The Role of Regularization
LoRA has been noted for its ability to maintain the base model's performance on tasks outside the target domain. In this sense it acts as a regularizer: it keeps the finetuned model close to the base model, preventing it from forgetting what it learned before being adapted to the new task.
In our study, we found that LoRA provides stronger regularization against forgetting than common alternatives such as weight decay and dropout, techniques typically used to control overfitting. It also helps the model maintain more diverse generations.
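For contrast, here is how those baseline regularizers are typically switched on in a PyTorch training setup. Both act on the full parameter set, whereas LoRA constrains the update itself to be low rank. This is a generic sketch, not the paper's training code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.Dropout(p=0.1), nn.Linear(128, 128))

# Weight decay: L2-style shrinkage applied through the optimizer to every weight.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

# Dropout (the nn.Dropout above): randomly zeroes activations during training.
# Both act on all parameters. LoRA regularizes differently: the update
# W' - W = scaling * B @ A has rank at most r by construction.
```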
Learning and Forgetting Effects
When finetuning models, there is often a tradeoff between learning new tasks and retaining previous knowledge, known as the learning-forgetting tradeoff. In our tests, we observed that while LoRA learns less for new tasks, it also forgets less about earlier tasks.
This indicates that while LoRA might be less effective for learning new information, it does a better job at preserving knowledge from earlier training.
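One way to make this tradeoff measurable is to evaluate the same checkpoint on both the target domain and held-out benchmarks outside it, tracking how each moves relative to the base model. The helper function and the accuracy values below are hypothetical placeholders, not results from the paper.

```python
def learning_forgetting(base_scores: dict, tuned_scores: dict, target: str) -> tuple:
    """Learning: gain on the target task. Forgetting: average drop elsewhere."""
    learning = tuned_scores[target] - base_scores[target]
    others = [k for k in base_scores if k != target]
    forgetting = sum(base_scores[k] - tuned_scores[k] for k in others) / len(others)
    return learning, forgetting

# Hypothetical accuracies for a base model and a finetuned checkpoint.
base  = {"humaneval": 0.30, "gsm8k": 0.55, "winogrande": 0.74}
tuned = {"humaneval": 0.42, "gsm8k": 0.53, "winogrande": 0.71}

print(learning_forgetting(base, tuned, target="humaneval"))
# ≈ (0.12, 0.025): gained 12 points on code, lost ~2.5 points on average elsewhere
```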
Sensitivity to Hyperparameters
The performance of both LoRA and full finetuning depends heavily on hyperparameters, the settings that control the training process. We found LoRA to be more sensitive than full finetuning to the choice of learning rate and to which modules of the model receive adapters.
Careful selection of these hyperparameters led to noticeably better results with LoRA, although it still lagged behind full finetuning.
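In practice, this sensitivity means a small learning-rate sweep is worth running before committing to a long LoRA job. A minimal sketch follows; `train_and_eval` is a placeholder for your own training and validation loop, and its dummy scoring rule (peaking near 2e-4) is purely illustrative.

```python
import math

def train_and_eval(method: str, learning_rate: float, rank: int) -> float:
    """Placeholder for a real finetuning run; returns a dummy validation score here."""
    # Replace this body with your actual training + evaluation loop.
    return -abs(math.log10(learning_rate) - math.log10(2e-4))

lora_lrs = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3]
results = {lr: train_and_eval("lora", lr, rank=16) for lr in lora_lrs}
best_lr = max(results, key=results.get)
print(f"best LoRA learning rate in sweep: {best_lr}")
```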
Practical Recommendations for Using LoRA
From our findings, we recommend using LoRA mainly for instruction finetuning instead of continued pretraining. It's essential to choose the right learning rate, target all modules, and keep the rank low to achieve a good balance between performance and memory use. Training for at least four epochs tends to yield beneficial results.
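These recommendations map directly onto a configuration in a PEFT-style library. The sketch below uses Hugging Face's `peft` package as one possible implementation; the base model name and exact hyperparameter values are illustrative choices, not the paper's official recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base model

config = LoraConfig(
    r=16,                         # keep the rank low, per the recommendation above
    lora_alpha=32,
    target_modules="all-linear",  # adapt all linear modules, not just attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # sanity-check how few parameters will train
# Then run instruction finetuning for ~4 epochs with a carefully tuned learning rate.
```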
Conclusion
LoRA offers memory efficiency and mitigates forgetting, making it a valuable tool for training large models, especially when memory is a constraint. However, full finetuning still outperforms LoRA on many tasks, particularly programming. Understanding these tradeoffs and best practices helps practitioners make informed finetuning decisions, and as model sizes continue to grow, such methods will only become more important for researchers and developers alike.
Title: LoRA Learns Less and Forgets Less
Abstract: Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
Authors: Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham
Last Update: 2024-09-20 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2405.09673
Source PDF: https://arxiv.org/pdf/2405.09673
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.