The Impact of Sub-Word Segmentation on Language Models
This article examines how sub-word segmentation affects language model performance.
― 6 min read
Language modeling is a core task in natural language processing (NLP) and has been explored with many architectures and techniques. However, there has been limited focus on how breaking words into smaller parts, known as sub-word segmentation, affects the performance of language models. This article discusses the impact of sub-word segmentation on language models, focusing on two popular architectures: GPT and BERT.
Sub-Word Segmentation
Sub-word segmentation refers to splitting words into smaller, meaningful pieces. This is especially important for processing languages with complex word structures. There are various techniques for sub-word segmentation. One common method is Byte-Pair Encoding (BPE), which builds sub-words by repeatedly merging the most frequent adjacent pairs of symbols. While BPE is effective, it can miss the deeper structure of a language, which can create challenges during training.
When languages are rich in morphology, like Finnish or Russian, simple word-based tokenization can cause problems. In a word-based approach, unknown words are replaced with a placeholder, which works only when unknown words are rare. This approach often fails in languages with many diverse word forms. At the other extreme, character-based tokenization treats each character as a separate token, which makes sequences much longer and harder to model.
Sub-word tokenization is often seen as a middle ground. It uses segmenters to break words into smaller units, falling between whole words and individual characters. Common segmenters include BPE, Morfessor, and StateMorph, each with its own method of breaking down words.
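The trade-off among the three granularities can be sketched in a few lines. This is a toy illustration with a made-up vocabulary and a hypothetical greedy sub-word matcher, not any of the segmenters used in the paper:

```python
# Sketch: three tokenization granularities for the same text, assuming a toy
# vocabulary and a hypothetical sub-word inventory (illustrative only).

def word_tokenize(text, vocab):
    # Word-level: out-of-vocabulary words collapse to a placeholder token.
    return [w if w in vocab else "<unk>" for w in text.split()]

def char_tokenize(text):
    # Character-level: no unknowns, but sequences grow much longer.
    return list(text.replace(" ", "_"))

def subword_tokenize(word, subwords):
    # Greedy longest-match over a given sub-word inventory.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

print(word_tokenize("the cats", {"the", "cat"}))   # ['the', '<unk>']
print(char_tokenize("the cats"))                   # one token per character
print(subword_tokenize("cats", {"cat", "s"}))      # ['cat', 's']
```

The sub-word variant keeps the vocabulary closed (like characters) while keeping sequences short (like words), which is exactly the middle ground the segmenters below aim for.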
BPE Algorithm
BPE is popular due to its effectiveness in reducing vocabulary size. It works by starting with individual characters and gradually merging the most frequent pairs of characters until the desired vocabulary size is achieved. While BPE is widely used in many current language models, it operates in a greedy manner and may not reflect the true linguistic structure of words.
For instance, BPE might split the word "baking" into "ba" and "king," which does not provide meaningful information about the word's structure. Instead, a more informed approach would recognize "bak" as the root and "ing" as the suffix.
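The greedy merge loop is simple enough to show in full. The following is a minimal re-implementation of the BPE training idea, not the tokenizer used in the paper; the toy corpus frequencies are invented for illustration:

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent pair.

    `corpus_words` maps each word to its frequency in the corpus.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    words = {tuple(w): f for w, f in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in words.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for syms, freq in words.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges, words

merges, words = bpe_train({"baking": 4, "baked": 3, "king": 5}, num_merges=3)
print(merges)  # frequency-driven merges, e.g. building up 'king'
```

Because "king" is frequent in this toy corpus, the merges assemble it as a unit even inside "baking", reproducing in miniature the kind of linguistically arbitrary split described above.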
Morphological Segmentation
Morphological segmentation, as performed by algorithms like Morfessor and StateMorph, aims to provide more meaningful word breakdowns by focusing on the actual structure of words. Morfessor operates under the principle of Minimum Description Length, which encourages shorter and more concise segments. It builds a lexicon of sub-words based on their frequency and relationships in the language.
StateMorph also focuses on morphological structure but uses a different approach by modeling the relationships between segments through a finite-state network. This allows it to learn to produce segments that are more aligned with the morphological components of words.
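The Minimum Description Length idea behind Morfessor can be illustrated with a toy two-part cost: bits to encode the morph lexicon plus bits to encode the corpus as a sequence of morphs. This is a heavily simplified sketch, not Morfessor's actual cost function:

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Toy two-part MDL cost (illustrative only):
    cost = lexicon cost (encoding each unique morph)
         + corpus cost (encoding the corpus under empirical morph frequencies).
    """
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    # Corpus cost: -log2 P(morph) summed over all morph tokens.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: a crude per-character charge for each unique morph.
    lexicon_cost = sum(len(m) + 1 for m in counts)  # +1 for a delimiter
    return corpus_cost + lexicon_cost

# Morph-like segmentation reuses "bak"/"walk"/"ing"/"ed" across words,
# so shared morphs amortize their lexicon cost and the total code is shorter.
morph = [("bak", "ing"), ("bak", "ed"), ("walk", "ing"), ("walk", "ed")]
arbitrary = [("ba", "king"), ("bake", "d"), ("wal", "king"), ("walk", "ed")]
print(description_length(morph) < description_length(arbitrary))  # True
```

Minimizing such a cost rewards segmentations whose pieces recur across many words, which is why MDL-style models tend to recover suffixes like "ing" and "ed" as units.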
Research Goals
This article sets out to examine four main questions regarding the use of morphological segmentation compared to BPE:
- Does morphological segmentation lead to lower perplexity in language models?
- Does it help language models learn more quickly?
- Does it result in similar or improved performance on practical tasks?
- Can smaller models using morphological segmentation perform as well as larger models using BPE?
Training Language Models
To conduct the experiments, language models were trained using various segmentation methods. The languages chosen for this analysis included Finnish, Russian, English, and Turkish. Each language presents its own unique challenges and characteristics, influencing the training process.
For Finnish, training data was sourced from major news outlets; the Russian data came from a separate corpus; English training data was drawn primarily from a large Wikipedia dump, supplemented with news data; and the Turkish data was gathered from another sizeable corpus.
The training was carried out with different configurations for each language model. The models were adjusted to a common vocabulary size to ensure a fair comparison, and preprocessing was kept consistent across settings, including lower-casing the text to maintain uniformity.
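A consistent preprocessing step of this kind can be as simple as the following sketch. The exact pipeline used in the paper is not specified beyond lower-casing, so the Unicode normalization and whitespace collapsing here are assumptions, and the Finnish sample string is made up:

```python
import unicodedata

def normalize(line):
    """Minimal preprocessing sketch: Unicode-normalize to fold compatibility
    characters (e.g. non-breaking spaces), lower-case, collapse whitespace."""
    line = unicodedata.normalize("NFKC", line)
    return " ".join(line.lower().split())

print(normalize("  Koneoppiminen\u00A0on  HAUSKAA "))
# -> 'koneoppiminen on hauskaa'
```

Applying one such function to every corpus before segmentation ensures that differences in results come from the segmenters themselves, not from inconsistent text cleanup.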
Experimental Results
Perplexity
The first area of focus was perplexity, which measures how well a language model predicts a sequence of words. Lower perplexity is indicative of a better-performing model. The results showed that models trained with morphological segmentation consistently achieved lower perplexity compared to their BPE counterparts. This suggests that the more informed structure of morphological segmentation helps models predict data more accurately.
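Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to held-out tokens. A minimal computation, with toy probabilities rather than any numbers from the paper:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.
    `token_log_probs` are natural-log probabilities the model assigned to
    each token of a held-out sequence.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 0.25 has perplexity 4:
logps = [math.log(0.25)] * 10
print(perplexity(logps))  # ≈ 4.0
```

One caveat when comparing segmenters: different segmentations produce different numbers of tokens for the same text, so comparisons must normalize consistently (e.g. at matched vocabulary sizes, as done in the study).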
Training Efficiency
The second focus was on the efficiency of training. It was noted that models using morphological segmentation often converged more quickly than those using BPE. This means they reached their optimal performance in fewer training steps, making them more resource-efficient.
Performance on Downstream Tasks
In addition to measuring perplexity, the study also aimed to evaluate how well the models functioned on practical tasks. These tasks included topic classification and part-of-speech tagging for Finnish, and similar classification tasks for Russian. The performance of the models using morphological segmentation was found to be comparable to, and in some cases better than, those using BPE.
Model Size and Sustainability
Finally, the research explored whether smaller models utilizing morphological segmentation could perform well alongside larger models trained with BPE. The findings indicated that smaller models with segmented vocabulary achieved competitive performance with their larger counterparts. This has significant implications for sustainability, as smaller models typically require less computational power, benefiting both training and inference phases.
Conclusions
In summary, this exploration showed that morphological segmentation positively impacts the performance of language models. Models trained using this approach tended to achieve lower perplexity, learned more efficiently, and displayed comparable or superior performance on practical tasks.
The results demonstrate the value of using more sophisticated methods for segmenting language, particularly for languages with rich morphology. While BPE remains a strong baseline, it's evident that more informed methods can lead to improved outcomes, especially for smaller models needing to balance performance with resource demands.
Future work aims to further investigate the effects of different segmentation techniques across various languages and tasks. This ongoing research is crucial in fine-tuning language models and enhancing their capabilities to process and understand the intricacies of human language.
In conclusion, the study highlights the importance of thoughtful segmentation strategies in developing effective language models, paving the way for advancements in natural language processing.
Title: Effects of sub-word segmentation on performance of transformer language models
Abstract: Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation -- Morfessor and StateMorph. We train the models for several languages -- including ones with very rich morphology -- and compare their performance with different segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show 4. that LMs of smaller size using morphological segmentation can perform comparably to models of larger size trained with BPE -- both in terms of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact the sustainability of LMs, since they reduce the model cost: size and computation time. While (2) reduces cost only in the training phase, (4) does so also in the inference phase.
Authors: Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber
Last Update: 2023-10-26
Language: English
Source URL: https://arxiv.org/abs/2305.05480
Source PDF: https://arxiv.org/pdf/2305.05480
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.