Simple Science

Cutting edge science explained simply


Advancing Somali Language Processing through Lemmatization

A system to improve Somali language technology via lemmatization.


Lemmatization normalizes text by reducing the different forms of a word to its base (root) form. It is a common preprocessing step for tasks such as text indexing, information retrieval, and other language technology. This work focuses on the Somali language, which has few resources for language processing. We have created a system that reduces Somali words to their root forms, setting the stage for better language technology for Somali speakers.

Importance of the Somali Language

Somali is spoken by over 22 million people, mainly in Somalia, Djibouti, Kenya, Ethiopia, the UK, the USA, and Europe. Using one's own language makes it easier to understand and access information. However, Somali faces big challenges in the digital world: very few datasets are available for tasks like translation, transcription, and language modeling. This makes it hard for Somali speakers to use language technology effectively.

Research Focus

In this work, we focus on lemmatization for the Somali language, especially the written dialect called "MAXAA TIRI." The goal is to develop a method to find meaningful root words from different forms of words. We designed a system that uses two methods: a dictionary-based approach and a rule-based approach. The dictionary-based approach looks words up directly, while the rule-based approach matches patterns in a word's form (how it begins and which endings follow) to determine its root.

Creating the Lexicon

The first step in our project was to build a dictionary of root words. We gathered words from different areas like news and social media. We consulted with language experts to ensure that our method for connecting root words and their forms was accurate. The dictionary consists of both verbs and nouns. For verbs, we can often return to the command form to find the root, while for nouns, we look at the singular form.

For example:

  • The verb "cabay" (drank) can be traced back to "cab" (drink).
  • The noun "dowladda" (the government) can be simplified to "dowlad" (government).

We created pairs of root words and their forms, storing them in a way that makes it easy to search. Our final collection includes over 8400 words made up of 1247 root words and 7173 related forms.
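One way to store such pairs for fast search is an inverted index from each surface form to its root. The sketch below is only an illustration: the names ROOT_TO_FORMS, build_form_index, and FORM_TO_ROOT are hypothetical, the storage format actually used by the authors is not described here, and the only entries are the two pairs given above.

```python
from typing import Dict, List

# Hypothetical layout: each root word maps to its known surface forms.
# Only the two pairs mentioned in the article are included.
ROOT_TO_FORMS: Dict[str, List[str]] = {
    "cab": ["cabay"],
    "dowlad": ["dowladda"],
}


def build_form_index(root_to_forms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the lexicon so every form (and every root) points to its root."""
    index: Dict[str, str] = {}
    for root, forms in root_to_forms.items():
        index[root.lower()] = root  # a root is its own lemma
        for form in forms:
            index[form.lower()] = root
    return index


FORM_TO_ROOT = build_form_index(ROOT_TO_FORMS)
print(FORM_TO_ROOT["dowladda"])  # dowlad
```

With this layout, a lookup costs one hash access no matter how many forms the lexicon holds.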

Developing the Rules

Along with the dictionary, we also created rules to help lemmatize words that are not in the dictionary. We looked for patterns in the way words are formed. For instance, if a word starts with a specific sequence followed by certain endings, we can build a rule around that to find the root word.

Because rules generalize beyond the listed entries, they let the system handle words that the dictionary does not contain, and new rules can be added as more patterns are identified.
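A pattern rule of this kind can be sketched with regular expressions. The two rules below are assumptions made for illustration: they are derived only from the "cabay" and "dowladda" examples and are not the authors' rule set. The names RULES and apply_rules are hypothetical, and the minimum stem length of three characters is an arbitrary guard against over-stripping.

```python
import re
from typing import Optional

# Illustrative rules only: a stem of at least three characters
# followed by one specific ending. Not the published rule set.
RULES = [
    (re.compile(r"^(?P<stem>\w{3,})ay$"), "ending -ay (as in cabay)"),
    (re.compile(r"^(?P<stem>\w{3,})da$"), "ending -da (as in dowladda)"),
]


def apply_rules(word: str) -> Optional[str]:
    """Return the stem from the first matching rule, or None if no rule fires."""
    lowered = word.lower()
    for pattern, _description in RULES:
        match = pattern.match(lowered)
        if match:
            return match.group("stem")
    return None


print(apply_rules("cabay"))     # cab
print(apply_rules("dowladda"))  # dowlad
print(apply_rules("shaqo"))     # None
```

Rules are tried in order, so more specific patterns should be listed before more general ones.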

How the Lemmatization Works

The process of lemmatization is done in two main steps. First, we check if a word is in our dictionary. If we find it, we return the root word. If it’s not found, we apply the rules we have built to try to find the root. If the word cannot be resolved by either method, we label it as unresolved.
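The lookup-then-rules flow can be sketched as a single function. Everything named here (lemmatize, UNRESOLVED, TOY_LEXICON, strip_da) is hypothetical; the toy lexicon and the single toy rule stand in for the real resources and only use the two word pairs from the article.

```python
from typing import Callable, Dict, Optional, Tuple

UNRESOLVED = "<unresolved>"  # assumed marker for words neither step can handle


def lemmatize(
    word: str,
    lexicon: Dict[str, str],
    rule_fn: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (lemma, source), where source says which step resolved the word."""
    key = word.lower()
    if key in lexicon:              # step 1: dictionary look-up
        return lexicon[key], "lexicon"
    guess = rule_fn(key)            # step 2: fall back to the rules
    if guess is not None:
        return guess, "rule"
    return UNRESOLVED, "unresolved"


# Toy stand-ins for the real lexicon and rule set.
TOY_LEXICON = {"cabay": "cab"}


def strip_da(word: str) -> Optional[str]:
    """Toy rule: drop a final '-da' when a stem of three or more letters remains."""
    if word.endswith("da") and len(word) >= 5:
        return word[:-2]
    return None


for token in ["cabay", "dowladda", "shaqo"]:
    print(token, lemmatize(token, TOY_LEXICON, strip_da))
```

Returning the source alongside the lemma makes it easy to count how many words each step resolves and how many remain unresolved.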

Before applying these methods, we also clean the text by removing unnecessary words (like common stop words) and punctuation so that we focus only on the important terms.

Testing the Method

We tested our lemmatization system on 120 documents of various lengths, including news articles, social media posts, and text messages. We measured accuracy: the number of words correctly lemmatized divided by the total number of words examined.
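As a minimal sketch of this metric, the function below compares predicted lemmas with reference lemmas position by position. The function name and the four-word sample are made up for illustration and are not taken from the evaluation data.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of words whose predicted lemma equals the reference lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up sample: the last word was left in its inflected form.
gold = ["cab", "dowlad", "cab", "dowlad"]
predicted = ["cab", "dowlad", "cab", "dowladda"]
print(f"{accuracy(predicted, gold):.2%}")  # 75.00%
```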

For short texts such as social media messages, accuracy was high at 95.87%. It dropped to 60.57% for news article extracts and to 57% for relatively long documents such as full news articles. In other words, the method currently works best on short texts.

Example

For instance, if we take the sentence "Waxaan kula taliyey inuu casriyeeyo xirfadihiisa shaqo," which translates to "I have advised him to update his job skills," the lemmatization process would first remove the common words and punctuation. The important words would then be lemmatized to their root forms using both our dictionary and rules, allowing us to achieve normalization of the text effectively.
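The cleaning step for this sentence can be sketched as follows. The stop-word set is an assumption made for illustration (the actual stop-word list is not given here), and the names ASSUMED_STOP_WORDS and clean are hypothetical. Lemmatizing the remaining words is left out because their roots are not stated in the text.

```python
import string
from typing import List, Set

# Assumed stop words for this one sentence; not the real list.
ASSUMED_STOP_WORDS: Set[str] = {"waxaan", "kula", "inuu"}


def clean(text: str, stop_words: Set[str]) -> List[str]:
    """Strip ASCII punctuation, lowercase, tokenize on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return [token for token in tokens if token not in stop_words]


sentence = "Waxaan kula taliyey inuu casriyeeyo xirfadihiisa shaqo,"
print(clean(sentence, ASSUMED_STOP_WORDS))
# ['taliyey', 'casriyeeyo', 'xirfadihiisa', 'shaqo']
```

Note that string.punctuation covers only ASCII punctuation, so typographic quotes or other Unicode marks would need separate handling.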

Challenges Faced

Creating the dictionary and rules was not without challenges. There is a lack of information about Somali language morphology, which made it difficult to ensure the quality of our work. However, our testing results show promise and indicate that we are on the right track.

We aim to gather more words and refine our rules further in future work. This could involve building an automatic system to create the dictionary as we collect more Somali language data.

Conclusion and Future Directions

This work marks the beginning of something important for the Somali language in terms of lemmatization. The system we developed, which combines a dictionary look-up method and a rule-based method, shows potential for improving how Somali is processed in technology.

As we move forward, our priority will be to expand our dictionary and refine the rules. We also plan to discover how to build the dictionary automatically from collected text, which could significantly enhance the resources available for Somali language processing.

The Path Ahead

The importance of this work cannot be overstated. With technology becoming more integral to communication and information, making tools available for under-resourced languages like Somali is crucial. We believe that with more resources and improvements in our methods, Somali speakers will be better equipped to engage in the digital world.

This foundational research opens the door for many future studies and applications in the field of natural language processing for Somali, which could lead to developing various tools that will benefit the Somali-speaking community across the globe.

Original Source

Title: Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

Abstract: Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms. It is used as a core pre-processing step in many NLP tasks including text indexing, information retrieval, and machine learning for NLP, among others. This paper pioneers the development of text lemmatization for the Somali language, a low-resource language with very limited or no prior effective adoption of NLP methods and datasets. We especially develop a lexicon and rule-based lemmatizer for Somali text, which is a starting point for a full-fledged Somali lemmatization system for various NLP tasks. With consideration of the language morphological rules, we have developed an initial lexicon of 1247 root words and 7173 derivationally related terms enriched with rules for lemmatizing words not present in the lexicon. We have tested the algorithm on 120 documents of various lengths including news articles, social media posts, and text messages. Our initial results demonstrate that the algorithm achieves an accuracy of 57% for relatively long documents (e.g. full news articles), 60.57% for news article extracts, and high accuracy of 95.87% for short texts such as social media messages.

Authors: Shafie Abdi Mohamed, Muhidin Abdullahi Mohamed

Last Update: 2023-08-03 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2308.01785

Source PDF: https://arxiv.org/pdf/2308.01785

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
