Simple Science

Cutting edge science explained simply

Computer Science · Computation and Language · Artificial Intelligence

Optimizing Large Language Models with SlimPajama

A study on improving training efficiency for language models using SlimPajama dataset.

― 7 min read


SlimPajama's Impact on Language Models: Study Reveals the Benefits of Diverse Data in Language Model Training

The main goal of this study is to examine how different data sources affect the training of large language models using SlimPajama. SlimPajama is a dataset built from carefully selected and cleaned sources, with far fewer repeated entries than the larger RedPajama dataset it was refined from. Our work, named SlimPajama-DC, examines the key features of SlimPajama and effective ways to use it when training large language models.

Key Observations

During our research, we made two important observations:

  1. Global vs. Local Deduplication: We looked at how removing duplicates across all data sources (global deduplication) compares to removing duplicates within each single source (local deduplication), and how this affects model performance.

  2. Quality of Datasets: We studied how the mix of high-quality, highly deduplicated datasets impacts the overall training process. We created six different configurations of the SlimPajama dataset and trained a model on each one. Through our tests, we found that our best configuration performs significantly better than a model trained on RedPajama with the same number of training tokens.

Importance of Training Data

Large language models rely heavily on their training data. It's not just about having a lot of text; it's about having a variety of text from different sources. This ensures that models learn language well and understand a wide range of topics and perspectives. Various domains, like GitHub, Wikipedia, books, and web text, are crucial for the overall performance of these models.

In our study, we focused on two main areas: the effects of removing duplicates across different datasets and the effectiveness of various combinations of well-organized datasets. By using SlimPajama, we aimed to encourage models to learn from all sources without overlaps, while also figuring out how to balance and manage different sources of information.

Deduplication Process

What is Deduplication?

Deduplication removes repeated data points to help the model focus on unique information. This is vital for training efficiency. If a model keeps seeing the same information, it may take longer to learn and may not perform well on different tasks. By having a highly deduplicated dataset, we streamline the training and improve model performance.

Global vs. Local Deduplication

  • Global Deduplication: This method removes duplicate data across all datasets. It catches overlaps from different sources, ensuring that the model learns from a wide array of unique data.

  • Local Deduplication: This method only removes duplicates within each dataset. If two datasets share similar information, that overlap might still be present after processing.

Our observations indicated that global deduplication tends to produce better training outcomes, particularly when combining data from multiple sources.
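
To make the difference concrete, here is a minimal Python sketch of the two strategies using exact content hashes. It is an illustration only: the real SlimPajama pipeline relies on near-duplicate detection rather than exact matching, and the function names here are illustrative.

```python
import hashlib

def _fingerprint(text: str) -> str:
    # Hash whitespace-normalized, lowercased text. Exact-match fingerprints are
    # a simplification of the near-duplicate detection a real pipeline would use.
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()

def local_dedup(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Remove duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen, kept = set(), []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result

def global_dedup(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Remove duplicates across all sources, so cross-source overlap is caught too."""
    seen: set[str] = set()
    result: dict[str, list[str]] = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result
```

With local deduplication, a document that appears in both CommonCrawl and Wikipedia survives in both; with global deduplication, only its first occurrence is kept.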

The Role of Data Combinations

A model that trains on diverse and well-deduplicated data tends to generalize better across various tasks. For instance, if the data sources reflect different cultures and perspectives, the model may become more balanced and less biased. However, if sources are too alike, the model might amplify common biases.

Combining technical data with general news or other forms of text can give the model a broad understanding, applying detailed knowledge to various tasks. Quality matters more than quantity, so we aimed to highlight the importance of thoughtful combinations in SlimPajama.

Specialization vs. Generalization

Combining many specialized datasets involves a trade-off: the resulting general model may not be as skilled at a specific task as a model trained only on that task's specialized dataset. We explored this balance between specialization and generalization across the various configurations of our datasets.

Dataset Details

SlimPajama contains a total of 627 billion tokens gathered from multiple sources. This dataset is split into training, validation, and test sets. Each configuration we tested includes around 330 billion tokens after processing.

We used different sampling strategies for our sources. Some, like CommonCrawl, were passed through only once, while others, like Wikipedia and GitHub, were repeated for multiple passes to ensure thorough coverage.

Data Source Proportions

To balance the training data, we defined the proportion of each source in every dataset configuration, assigning different sampling weights to source types based on their importance and uniqueness.
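
As a sketch of how such weighting might be implemented, one can sample the source of each training document according to configured weights. The proportions below are placeholders to show the mechanism, not the actual SlimPajama-DC numbers.

```python
import random

# Hypothetical source weights for one configuration; the actual proportions
# used in SlimPajama-DC differ and are reported in the paper.
SOURCE_WEIGHTS = {
    "commoncrawl":   0.60,
    "c4":            0.15,
    "github":        0.05,
    "books":         0.05,
    "wikipedia":     0.05,
    "arxiv":         0.05,
    "stackexchange": 0.05,
}

def sample_source(weights: dict[str, float], rng: random.Random) -> str:
    """Pick the source to draw the next training document from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(SOURCE_WEIGHTS, rng)] += 1
print(counts)  # roughly proportional to the configured weights
```

Under such a scheme, small but valuable sources are effectively seen multiple times over a full training run, while very large sources may be seen at most once.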

Analyzing Dataset Similarity

To see how different datasets compare, we calculated the similarity between token distributions. We looked at various token types, including letters, numbers, and uncommon symbols, to understand how distinct or alike they were.

From our analysis, we found that while many datasets shared similarities, there were also clear distinctions in certain areas, such as non-alphanumeric tokens.
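
One common way to quantify this kind of comparison (the exact measure used in the study may differ) is to build a token frequency distribution per source and compute a similarity score between them, as in this sketch:

```python
from collections import Counter
import math

def token_distribution(docs: list[str]) -> Counter:
    """Count whitespace-separated tokens; a real pipeline would use the model tokenizer."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token frequency vectors (1.0 = identical mix)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

wiki = token_distribution(["The quick brown fox jumps over the lazy dog."])
code = token_distribution(["def quick_sort(arr): return sorted(arr)"])
print(f"similarity: {cosine_similarity(wiki, code):.3f}")
```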

Processing the Dataset

SlimPajama was created by filtering low-quality text and duplicates out of the original RedPajama dataset. We removed very short documents that lacked useful information, keeping the dataset robust and relevant.

Filtering Low-Quality Documents

We applied a filter to eliminate documents shorter than 200 characters. This step helped us avoid including short fragments that wouldn’t contribute meaningfully to training.
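
A minimal sketch of such a length filter, using the 200-character threshold mentioned above:

```python
MIN_CHARS = 200  # threshold described above

def keep_document(text: str) -> bool:
    """Keep only documents long enough to carry useful training signal."""
    return len(text.strip()) >= MIN_CHARS

corpus = ["short snippet", "x" * 500]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```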

Global Deduplication Process

Every dataset included in SlimPajama had duplicates, with the highest rates found in sources like CommonCrawl and GitHub. We performed global deduplication to ensure an efficient combination of data, which resulted in better training without unnecessary overlaps.
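
Near-duplicate detection at this scale is typically done with MinHash-style signatures rather than exact matching. The sketch below uses the open-source `datasketch` library to illustrate the idea; the n-gram size and similarity threshold are illustrative choices, not necessarily the values used for SlimPajama.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from lowercased word 3-grams (an illustrative choice)."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

# The Jaccard threshold of 0.8 is an illustrative value.
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def is_near_duplicate(key: str, text: str) -> bool:
    """Return True if a similar document was already indexed; otherwise index this one."""
    sig = minhash_of(text)
    if lsh.query(sig):
        return True
    lsh.insert(key, sig)
    return False

print(is_near_duplicate("doc-1", "The quick brown fox jumps over the lazy dog."))  # False
print(is_near_duplicate("doc-2", "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."))  # True
```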

Dataset Combinations for Training

We created and tested six configurations for SlimPajama to see how changes in data combination affected outcomes:

  1. CommonCrawl only
  2. CommonCrawl + GitHub
  3. CommonCrawl + GitHub + Books + Wikipedia
  4. CommonCrawl + GitHub (with adjusted sampling proportions)
  5. CommonCrawl + Wikipedia (with adjusted sampling proportions)
  6. RefinedWeb CommonCrawl only

Each configuration aimed to examine how varying data sources and proportions affected the model's performance.

Model Architecture and Training Setup

Cerebras-GPT Architecture

Our 1.3B models follow the Cerebras-GPT architecture, with Alibi positional embeddings and SwiGLU activations, and use the same attention mechanism in every layer rather than alternating dense and sparse attention as some models do. Each model was built to handle a maximum sequence length of 2,048 tokens.

Training Details

We used a tokenizer based on GPT-NeoX, trained each model for approximately 2.5 days, and optimized with AdamW.
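
As a rough sketch of this kind of setup, using a publicly available NeoX-style tokenizer from Hugging Face and placeholder AdamW hyperparameters (not necessarily the values used in the study):

```python
import torch
from transformers import AutoTokenizer

# A public GPT-NeoX tokenizer; the study's exact tokenizer checkpoint may differ.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

model = torch.nn.Linear(512, 512)  # stand-in for the 1.3B-parameter transformer

# AdamW with placeholder hyperparameters.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

batch = tokenizer(
    "SlimPajama is a deduplicated multi-source pretraining dataset.",
    return_tensors="pt",
    max_length=2048,
    truncation=True,
)
print(batch["input_ids"].shape)
```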

Evaluating Model Performance

Our analysis included looking at how models trained on different configurations performed on various benchmarks. We tested for reasoning, commonsense inference, multitask proficiency, and the model's reliance on inaccurate information.

The results indicated that our configurations often outperformed the original RedPajama models, with some configurations achieving top scores across specific benchmarks.

Random Guessing Score

To better understand model performance on tests like MMLU, we introduced a metric to measure how often predictions resembled random guessing. A higher score indicates a model's predictions are more reliable than chance.
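
The study defines its own measure; purely as an illustration of the idea of comparing multiple-choice accuracy against chance level, a hypothetical normalization could look like this:

```python
def above_chance_score(accuracy: float, num_choices: int) -> float:
    """
    Hypothetical normalization: 0.0 means the model does no better than random
    guessing on a num_choices-way benchmark, 1.0 means perfect accuracy.
    This is an illustration, not the paper's exact metric.
    """
    chance = 1.0 / num_choices
    return max(0.0, (accuracy - chance) / (1.0 - chance))

print(above_chance_score(0.25, 4))  # 0.0 -> indistinguishable from guessing
print(above_chance_score(0.40, 4))  # 0.2 -> modestly above chance
```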

Training Loss Analysis

We analyzed the loss curves for the training process of different configurations. Some key observations emerged:

  1. The configuration with the best average accuracy had the highest average loss, indicating that lower loss doesn't necessarily mean better results.
  2. A configuration mainly consisting of code data had the lowest training loss, showing a connection between data type and loss performance.

Large Batch-Size Training on the 7B Model

For a larger model of 7 billion parameters, we adjusted our data combinations to include more web text, while also incorporating additional sources to enhance diversity. We wanted to balance achieving high performance while ensuring efficient training.

Training Configuration for 7B Model

The architecture was modified to suit the larger model while maintaining a sequence length of 2,048 tokens. We used a different tokenizer and a different optimizer setup suited to this larger scale.

Fast Training with Large Batches

Training with larger batch sizes allowed us to achieve faster convergence, improving training efficiency. However, we also noted that larger batches could lead to overfitting in some cases. Therefore, we developed a new strategy that utilized weight decay to help mitigate these risks.

Progressive Training on Weight Decay

We introduced a new method called Progressive Training on Weight Decay (PTWD). This approach applied different levels of weight decay during various phases of training, resulting in improved convergence and better management of model performance.
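
As a loose sketch of what a phase-based weight-decay schedule could look like (the phase boundaries and values below are placeholders, not the schedule used in the study):

```python
def progressive_weight_decay(step: int, total_steps: int) -> float:
    """
    Phase-based weight decay, loosely in the spirit of PTWD as described above.
    The phase boundaries and decay values are placeholders, not the paper's
    actual schedule.
    """
    progress = step / total_steps
    if progress < 0.3:    # early phase
        return 0.5
    elif progress < 0.8:  # middle phase
        return 0.1
    else:                 # final phase
        return 0.0

# During training, the optimizer's weight decay would be updated each step, e.g.:
# for group in optimizer.param_groups:
#     group["weight_decay"] = progressive_weight_decay(step, total_steps)
```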

Results from Pre-training and Instruction Tuning

After our initial training, we conducted instruction tuning, which led to improved scores in some benchmarks but slightly lower performance in others. Overall, the average accuracy rose significantly after this additional tuning.

Related Work and Conclusions

Our work highlights the importance of using diverse and well-managed datasets in training large language models. By focusing on the effective combination of data sources and thorough deduplication, we have shown a path toward more efficient training processes. This can lead to better performance in various tasks and shape future research in data-centric methods for language model training.

Through SlimPajama-DC, we aim to inspire further exploration into how different data combinations can enhance the training efficiency of large language models.

Original Source

Title: SlimPajama-DC: Understanding Data Combinations for LLM Training

Abstract: This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We have termed our research as SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within the single source of dataset) deduplications affect the performance of trained models. (2) Proportions of highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations on SlimPajama dataset and train individual ones using 1.3B Cerebras-GPT model with Alibi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as increasing data diversity is crucial after global deduplication) on a 7B model with large batch-size training. Our SlimPajama-DC models are available at: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC and the separate SlimPajama-DC datasets are available at: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC.

Authors: Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing

Last Update: 2024-05-09

Language: English

Source URL: https://arxiv.org/abs/2309.10818

Source PDF: https://arxiv.org/pdf/2309.10818

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
