Advancing Sentiment Analysis for Nigerian Languages
New dataset enhances sentiment analysis for Nigerian movie reviews in five languages.
Africa has over 2,000 indigenous languages, and Nigeria is home to hundreds of them. However, most of these languages are under-represented in natural language processing (NLP) research, largely because datasets are lacking. This has left a gap in tools and resources for these languages, especially for tasks like sentiment analysis. Recent efforts have produced labeled datasets for some of them, but these usually cover a single domain and may not generalize to others.
In this study, we address the challenge of sentiment classification for Nigerian movie reviews. We developed a new dataset, called NollySenti, derived from Nollywood movie reviews and covering five widely spoken languages in Nigeria: English, Hausa, Igbo, Yoruba, and Nigerian Pidgin. We conducted extensive experiments using different machine learning methods, including traditional techniques and modern pre-trained language models.
Background
Sentiment analysis is a key task in NLP that involves determining the opinion or emotion expressed in a piece of text. Many well-established datasets exist for high-resource languages like English, allowing researchers to build effective sentiment analysis models. In contrast, datasets for Nigerian languages are scarce; the most notable is NaijaSenti, which is built from Twitter data in Hausa, Igbo, Nigerian Pidgin, and Yoruba. It is unclear how well models trained on this dataset carry over to other domains, such as movie reviews.
Dataset Creation
NollySenti is a sentiment classification dataset built from reviews of Nollywood movies, the output of Nigeria's film industry. Nollywood is often described as the second-largest film industry in the world, yet reviews written in indigenous languages are hard to find; most are available only in English. To create NollySenti, we collected 1,900 English-language reviews from popular movie review platforms such as IMDb and Rotten Tomatoes, and from Nigerian sites like Cinemapointer and Nollyrated.
To make the dataset multilingual, we hired professional translators to translate approximately 1,000 of these reviews into four Nigerian languages. The translations were quality-checked by native speakers to ensure they were accurate and reliable.
Languages in Focus
Alongside English, the dataset covers four languages widely spoken in Nigeria:
Hausa: A widely spoken Afro-Asiatic language with approximately 77 million speakers. It is prevalent in northern Nigeria and neighboring countries.
Igbo: A member of the Niger-Congo language family, spoken by around 31 million people. It is primarily found in southeastern Nigeria.
Yoruba: This language, also from the Niger-Congo family, has around 50 million speakers and is widely spoken in southwestern Nigeria and beyond.
Nigerian Pidgin (Naija): A creole language with over 120 million speakers. It serves as a common linguistic bridge among various ethnic groups in Nigeria.
Methodology
We conducted various experiments to evaluate the performance of our sentiment classification models. These experiments included:
Transfer Learning: We compared the effectiveness of adapting models trained on different domains. This included transferring knowledge from social media (Twitter) to movie reviews.
Cross-Lingual Adaptation: We explored transferring knowledge from English to the other Nigerian languages and assessed how well models trained in English performed when evaluated on these languages.
Machine Translation: To reduce the remaining gap between training and test data, we machine-translated the English reviews into the other Nigerian languages and trained on the translated text. We then measured whether this improved model performance over cross-lingual transfer alone.
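The three setups above differ only in where the training data comes from; the test set is always movie reviews in the target language. The sketch below spells that out. The `Setup` class, the `build_setups` helper, and the ISO 639-3 language codes are illustrative assumptions, not part of the released dataset or code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Setup:
    """One training configuration; evaluation is always on movie reviews."""
    name: str
    train_domain: str   # "twitter" or "movie"
    train_lang: str     # language of the original labeled training data
    test_lang: str      # language of the movie reviews used for evaluation
    machine_translated: bool = False  # training text translated into test_lang first


def build_setups(target_lang: str) -> List[Setup]:
    """Return the three setups compared for one target language (hypothetical helper)."""
    return [
        # Same language, different domain: Twitter data -> movie reviews.
        Setup("cross-domain", "twitter", target_lang, target_lang),
        # Same domain, different language: English reviews -> target language.
        Setup("cross-lingual", "movie", "eng", target_lang),
        # English reviews machine-translated into the target language before training.
        Setup("cross-lingual + MT", "movie", "eng", target_lang, machine_translated=True),
    ]


for setup in build_setups("hau"):
    print(setup)
```

Laying the setups out this way makes the comparison explicit: the first two rows trade a domain mismatch against a language mismatch, and the third tries to remove the language mismatch while keeping the domain fixed.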
Results
Our evaluations show that both cross-lingual transfer within the movie domain and machine translation improve sentiment classification for these low-resource languages.
Transfer Learning
Transferring from English movie reviews to a Nigerian language improved accuracy by more than 5% compared to transferring from Twitter data in that same language. This indicates that staying in the target domain matters more than staying in the target language: models trained on English movie reviews generalize well to movie reviews written in the other Nigerian languages.
Machine Translation
Training on reviews machine-translated from English into the target languages gave a further 7% improvement over cross-lingual transfer. While machine translation quality for low-resource languages is often inconsistent, our human evaluations showed that most translated sentences preserved the sentiment of the original English reviews.
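A minimal sketch of the translate-then-train idea follows, assuming the NLLB-200 distilled checkpoint `facebook/nllb-200-distilled-600M` loaded through the Hugging Face `transformers` translation pipeline. The text above does not say which MT system was used for each language, so the model choice, the helper names, and the language-code table are assumptions. Nigerian Pidgin is left out because this sketch does not assume the checkpoint covers it.

```python
from typing import Callable, List, Sequence, Tuple

# FLORES-200 codes for three of the target languages (assumed mapping for this sketch).
NLLB_CODES = {"hau": "hau_Latn", "ibo": "ibo_Latn", "yor": "yor_Latn"}


def nllb_code(lang: str) -> str:
    """Map an ISO 639-3 code to the NLLB language code, or fail loudly."""
    if lang not in NLLB_CODES:
        raise ValueError(f"no NLLB code configured for {lang!r}")
    return NLLB_CODES[lang]


def translate_reviews(reviews: Sequence[str], target_lang: str,
                      model_name: str = "facebook/nllb-200-distilled-600M") -> List[str]:
    """Translate English reviews with a Hugging Face translation pipeline."""
    # Imported lazily: transformers is a heavy dependency and downloads the model.
    from transformers import pipeline
    translator = pipeline("translation", model=model_name,
                          src_lang="eng_Latn", tgt_lang=nllb_code(target_lang))
    outputs = translator(list(reviews), max_length=256)
    return [out["translation_text"] for out in outputs]


def build_translated_training_set(
    labeled_reviews: Sequence[Tuple[str, str]],
    target_lang: str,
    translate: Callable[[Sequence[str], str], List[str]] = translate_reviews,
) -> List[Tuple[str, str]]:
    """Translate the text of each (review, label) pair and keep the label unchanged."""
    texts = [text for text, _ in labeled_reviews]
    labels = [label for _, label in labeled_reviews]
    return list(zip(translate(texts, target_lang), labels))
```

The key design choice is that labels are copied over untouched: the approach only works if translation preserves sentiment, which is what the human evaluation checks. Passing `translate` as a parameter lets a different MT system be swapped in per language.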
Human Evaluation
To further validate the machine translation quality, we hired native speakers of the focus languages to assess a sample of translated sentences. They rated two things: adequacy (how well the meaning was conveyed) and sentiment preservation (whether the emotional tone was maintained).
The results from human evaluations showed that around 90% of the translations preserved the original sentiment. This highlights the potential of machine translation as a supportive tool for creating resources for low-resource languages.
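A sentiment preservation rate of this kind can be computed as the share of sampled translations that annotators judged to keep the original sentiment. The sketch below assumes each judgment is recorded as a simple yes/no; the function name and the sample judgments are made up for illustration and are not the study's data.

```python
from typing import Iterable


def preservation_rate(judgments: Iterable[bool]) -> float:
    """Fraction of translations judged to preserve the source sentiment."""
    votes = list(judgments)
    if not votes:
        raise ValueError("need at least one judgment")
    return sum(votes) / len(votes)


# Made-up annotator judgments for ten translated sentences (True = preserved).
sample = [True, True, True, False, True, True, True, True, True, True]
print(f"{preservation_rate(sample):.0%} of sampled translations kept their sentiment")
```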
Challenges and Limitations
While we achieved encouraging results, there are challenges and limitations to consider:
Quality of Machine Translation: The effectiveness of machine translation systems can vary greatly. In some cases, translations may be inaccurate or nonsensical, which can lead to lost sentiment or misinterpretations.
Domain-Specific Language: The style and vocabulary used in Nollywood reviews may differ from those in other domains, such as social media. This could impact model performance when adapting across different contexts.
Need for Robust Datasets: The lack of extensive labeled datasets for Nigerian languages makes it challenging to build and validate models effectively. More resources and data collection efforts are needed in the future.
Future Directions
Looking ahead, we aim to extend the creation of sentiment classification datasets to more African languages. This will help broaden the scope of NLP research in underrepresented languages and enable the development of more effective NLP tools.
Additionally, addressing the gaps in machine translation quality for low-resource languages should be a priority to improve sentiment analysis and other NLP tasks. Collaborative efforts with language experts and technology stakeholders can foster better resources and improve the state of NLP for African languages.
Conclusion
In summary, our work highlights not only the need for greater representation of Nigerian languages in NLP but also the potential of transfer learning and machine translation to improve sentiment classification. By creating a new sentiment classification dataset based on Nollywood movie reviews, we have taken a meaningful step towards enhancing the tools available for Nigerian languages. The findings underline the importance of continued research, resource development, and collaboration to support low-resource languages in the field of natural language processing.
Title: NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification
Abstract: Africa has over 2000 indigenous languages but they are under-represented in NLP research due to lack of datasets. In recent years, there have been progress in developing labeled corpora for African languages. However, they are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross domain adaptation. We create a new dataset, NollySenti - based on the Nollywood movie reviews for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from Twitter domain, and cross-lingual adaptation from English language. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT to low-resource languages are often of low quality, through human evaluation, we show that most of the translated sentences preserve the sentiment of the original English reviews.
Authors: Iyanuoluwa Shode, David Ifeoluwa Adelani, Jing Peng, Anna Feldman
Last Update: 2023-08-22 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2305.10971
Source PDF: https://arxiv.org/pdf/2305.10971
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.