
Evaluating Model Performance on Diverse Tasks

This article analyzes model performance across various tasks and datasets.


In this section, we look at how different models perform on various tasks and datasets. We break the results into clear parts and refer to the paper's figures where they help identify the key outcomes.

Evaluation on Different Datasets

We tested the LLaMa-2 7 billion parameter model on several tasks using the eval-harness evaluation tool. The results show that shrinking the model by removing blocks (a process called pruning) hurts some tasks far more than others, with GSM-8K affected the most.
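
To make the setup concrete, here is a minimal sketch of dropping blocks from a Hugging Face LLaMa-2 checkpoint before scoring it with eval-harness; the checkpoint name and block indices are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch (not the authors' exact pipeline): drop a few decoder blocks from
# a Hugging Face LLaMa-2 7B model, then score the pruned checkpoint with eval-harness.
# The checkpoint name and the block indices below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

blocks_to_drop = {21, 22, 23}  # hypothetical indices chosen by an influence metric

# LLaMa decoder blocks live in model.model.layers (a torch.nn.ModuleList).
kept = [blk for i, blk in enumerate(model.model.layers) if i not in blocks_to_drop]
for new_idx, blk in enumerate(kept):
    blk.self_attn.layer_idx = new_idx  # keep KV-cache indexing consistent after removal
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

model.save_pretrained("llama2-7b-pruned")
tokenizer.save_pretrained("llama2-7b-pruned")
# The pruned checkpoint can then be evaluated on GSM-8K, MMLU, etc. with the
# LM Evaluation Harness, e.g.:
#   lm_eval --model hf --model_args pretrained=llama2-7b-pruned --tasks gsm8k
```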

Similarly, we examined the Mistral 7 billion parameter model on the same tasks. The findings are consistent: pruning affects some tasks much more than others, especially GSM-8K.

We also evaluated the tasks listed on the OpenLLM leaderboard so that our tests can be reproduced. The tasks covered a variety of challenges: MMLU, GSM-8K, ARC (both the Easy and Challenge sets), BoolQ, HellaSwag, Lambada, PiQA, Toxigen, TruthfulQA, and Winogrande.

The results for both models are shown in dedicated figures for clarity. We only included results related to the influence of individual model parts and the performance lost through pruning.

From the results, it is clear that removing even a single block can noticeably reduce accuracy on tasks such as GSM-8K and ARC, even when the choice of block is guided by MMLU.

Results for LLaMa-2 7B

We compared different ways to measure the influence of individual layers in the LLaMa-2 7 billion parameter model, judging each by how well the pruned model does on a small validation set and on MMLU. We observed that the self-attention layers are more amenable to pruning than the feed-forward layers: they can be removed with a smaller loss in performance.
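
To give a rough picture of what pruning only the self-attention part of a block means, the sketch below zeroes the attention output projection of a few blocks so that each keeps only its feed-forward contribution; the indices are hypothetical, and this only emulates removal (actually deleting the modules is what saves KV-cache memory).

```python
# Sketch: "prune" only the self-attention sub-layer of chosen blocks by zeroing its
# output projection, so each such block contributes only its feed-forward (MLP) update.
# `model` is the LLaMa-2 model loaded in the earlier sketch; indices are hypothetical.
import torch

attention_layers_to_prune = [4, 5, 6]  # assumed indices, e.g. picked by an influence metric

with torch.no_grad():
    for i in attention_layers_to_prune:
        # With o_proj zeroed, the attention sub-layer adds nothing to the residual stream.
        # Truly removing the module (rather than zeroing it) is what saves KV-cache memory.
        model.model.layers[i].self_attn.o_proj.weight.zero_()
```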

Next, we looked at how linear adapters affect the LLaMa-2 7 billion parameter model. The adapters were trained with three different objectives: mean squared error loss, supervised fine-tuning, and logit distillation. The results are presented in the corresponding figures.
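
For a concrete picture of the three training signals, here is a hedged sketch; the function names and tensor shapes are illustrative, not the paper's exact implementation.

```python
# Hedged sketch of the three adapter-training objectives (illustrative names/shapes).
import torch
import torch.nn.functional as F

def mse_loss(adapter_out, teacher_block_out):
    # Regress the adapter output onto the hidden states the pruned block used to produce.
    return F.mse_loss(adapter_out, teacher_block_out)

def sft_loss(student_logits, labels):
    # Ordinary next-token cross-entropy on supervised fine-tuning data.
    return F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1)
    )

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between the pruned model's logits and the original model's logits.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```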

Comparing the linear adapters across tasks, we found that they help the model recover part of the performance lost through pruning.

Update Norms

We measured the norms of the updates that each block and layer applies to the hidden representation. This shows how strongly different parts of the model change the representation as it moves through the network. We report both block and layer update norms for the LLaMa-2 7 billion and Mistral 7 billion models; the results are shown in the figures.
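
One simple way to measure such update norms, assuming a Hugging Face-style model whose decoder blocks sit in model.model.layers, is to hook each block and compare its input and output hidden states:

```python
# Sketch: relative update norm ||f(x) - x|| / ||x|| for every decoder block,
# captured with forward hooks on a Hugging Face-style model (an assumed layout).
import torch

@torch.no_grad()
def block_update_norms(model, input_ids):
    inputs, outputs, hooks = {}, {}, []
    for i, blk in enumerate(model.model.layers):
        # Record the hidden states entering and leaving each block.
        hooks.append(blk.register_forward_pre_hook(
            lambda mod, args, i=i: inputs.__setitem__(i, args[0].detach())))
        hooks.append(blk.register_forward_hook(
            lambda mod, args, out, i=i: outputs.__setitem__(
                i, (out[0] if isinstance(out, tuple) else out).detach())))
    model(input_ids)
    for h in hooks:
        h.remove()
    return {
        i: ((outputs[i] - inputs[i]).norm(dim=-1) / inputs[i].norm(dim=-1)).mean().item()
        for i in inputs
    }
```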

Effect of Emulated Update

We explored how emulated updates affect the model's performance. These updates stand in for pruned blocks and offer a lightweight way to recover some of the lost performance. The findings are also shown in a figure.
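
As a sketch of the idea, assuming the emulated update is a constant additive bias equal to the average change a pruned block used to apply on a small calibration set:

```python
# Sketch of an emulated update: replace a pruned block with a constant additive bias
# equal to the mean update that block applied over a calibration set.
# The tuple return mimics the decoder-layer convention; details here are assumptions,
# and caching is assumed to be disabled for this stand-in module.
import torch

class EmulatedUpdate(torch.nn.Module):
    def __init__(self, mean_update):
        super().__init__()
        self.register_buffer("bias", mean_update)  # shape: (hidden_size,)

    def forward(self, hidden_states, **kwargs):
        return (hidden_states + self.bias,)

# mean_update could be estimated with the same hooks as in the update-norm sketch:
#   mean_update = (block_output - block_input).mean over calibration tokens
# and the pruned block swapped out with:
#   model.model.layers[i] = EmulatedUpdate(mean_update)
```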

We computed statistics of the emulated updates for both models and plotted their mean and standard deviation. For LLaMa-2 we zoomed in on the middle range of values, and we did the same for Mistral, whose values were smaller overall.

Low-Rank Linear Adapters

We evaluated how linear adapters of different ranks affect the LLaMa-2 7 billion and Mistral 7 billion models. We trained adapters with ranks of 8, 32, and 256 and report results across several metrics. The figures illustrate how each rank performed.
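
As an illustration, a low-rank linear adapter of this kind can be as simple as two small matrices applied as a residual update; the module below is a sketch under that assumption, with 4096 being LLaMa-2 7B's hidden size.

```python
# Sketch of a low-rank linear adapter standing in for a pruned block: a residual
# update B(A(x)) with A: d -> r and B: r -> d. B starts at zero so the adapter
# initially acts as an identity mapping. Details are illustrative assumptions.
import torch

class LowRankAdapter(torch.nn.Module):
    def __init__(self, hidden_size, rank):
        super().__init__()
        self.A = torch.nn.Linear(hidden_size, rank, bias=False)
        self.B = torch.nn.Linear(rank, hidden_size, bias=False)
        torch.nn.init.zeros_(self.B.weight)  # zero update at initialization

    def forward(self, hidden_states, **kwargs):
        return (hidden_states + self.B(self.A(hidden_states)),)

# e.g. the ranks studied here, for LLaMa-2 7B's 4096-dimensional hidden states:
adapters = {r: LowRankAdapter(4096, r) for r in (8, 32, 256)}
```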

For each rank, we present both absolute and relative results so that performance with and without the adapters can be compared directly.

Low-Rank Linear Adapter Loss Curves

We tracked the training loss curves of the low-rank linear adapters in both models for the different ranks. The curves are visualized in figures showing how the loss evolves over the course of training.

Cosine Block Influence During Training

In this section, we examined how the cosine block influence metric changes during training of the Pythia-2.8B model. We plotted these changes using darker colors for earlier (lower) blocks and lighter colors for later (higher) blocks.
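
The cosine influence used here treats a block as more important the further its output rotates away from its input; a minimal sketch of the metric:

```python
# Sketch of the cosine block-influence metric: influence = 1 - cos(input, output),
# averaged over tokens. Higher values mean the block changes the representation more.
import torch
import torch.nn.functional as F

@torch.no_grad()
def cosine_block_influence(block_input, block_output):
    cos = F.cosine_similarity(block_input, block_output, dim=-1)  # per-token similarity
    return (1.0 - cos).mean().item()
```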

Our findings indicate that the first block maintained high influence throughout training, while the second block's influence drifted downwards. Interestingly, the last block started with minimal influence but gained importance by the end of training. This pattern aligns with earlier findings on both LLaMa-2 and Mistral, emphasizing the special role of the first and last blocks.

Conclusion and Future Work

In summary, our evaluations reveal the nuanced ways that models respond to pruning and the introduction of linear adapters. The findings underscore how specific tasks and configurations can significantly impact performance. Future studies can build upon these insights to refine models and explore new training techniques.

Continuing this research means uncovering further improvements and better understanding how pruned models can adapt and recover. By comparing different recovery methods and influence metrics, this line of work points toward more efficient and reliable model designs and methodologies.

Through ongoing tests and adaptations, we aim to deepen our understanding of model behavior and performance, and we encourage further exploration and innovation in this space.

Original Source

Title: A deeper look at depth pruning of LLMs

Abstract: Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (significant reduction in costly maintenance of KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), which is either competitive or superior to the learning-based technique.

Authors: Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov

Last Update: 2024-07-23

Language: English

Source URL: https://arxiv.org/abs/2407.16286

Source PDF: https://arxiv.org/pdf/2407.16286

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
