Advancing Sentiment Analysis for Bengali Texts
A new method improves sentiment analysis for Bengali language reviews.
Table of Contents
- Why Focus on Bengali?
- The Problem with Bengali Sentiment Analysis
- Our Approach: A New Algorithm
- Creating a Lexicon Data Dictionary
- The Bangla Sentiment Polarity Score (BSPS)
- Evaluating Our Approach
- Collecting Reviews: A Tough Task
- Data Processing Steps
- Addressing Missing and Duplicate Data
- Tokenization and Normalization
- Stop Word Removal
- How Does the BSPS Algorithm Work?
- Key Components of BSPS
- Sentiment Processing Flow
- Examples to Illustrate BSPS in Action
- Classification Process
- Nine Sentiment Categories
- Fine-Tuning with BanglaBERT
- Training BanglaBERT
- Performance and Results
- Performance of the BSPS Algorithm
- Performance of BanglaBERT
- Comparing the Two Models
- Future Directions
- Original Source
- Reference Links
Sentiment analysis, or SA for short, is a way to find out how people feel about something based on what they write. Imagine reading a review of a restaurant. If someone says, "The food was amazing!" you know they had a good time. But if they say, "The food was terrible," you know they were not pleased. This process looks at the emotional tone behind the words, making sense of feelings like happiness, anger, or sadness.
Why Focus on Bengali?
Even though sentiment analysis has been done a lot in languages like English, not much research has been focused on Bengali. Bengali is a beautiful language spoken by over 250 million people. It has its own unique twists and turns that make it special. That’s why we set out to improve how we analyze sentiment in Bengali texts, especially when it comes to understanding more complex feelings.
The Problem with Bengali Sentiment Analysis
When it comes to sentiment analysis in Bengali, we face a few challenges:
- Lack of Data: Unlike English, there aren’t many large datasets of Bengali texts with emotion labels. This means it’s hard to train models that can accurately understand how people feel.
- Basic Classifications: Most analyses tend to oversimplify emotions into just positive or negative. But people can feel many shades of emotions, and we want to capture all of them.
- Language Nuances: Bengali is rich and complex. Its unique grammar and vocabulary need special attention that many existing models don’t provide.
Our Approach: A New Algorithm
To tackle these challenges, we came up with a fresh approach combining traditional rule-based systems with modern pre-trained models. We created a dataset from scratch, made up of over 15,000 reviews. Yes, we rolled up our sleeves and gathered all that data ourselves!
Creating a Lexicon Data Dictionary
We built something called a Lexicon Data Dictionary (LDD). This is like a special dictionary that lists words along with their emotional weights. We divided the dictionary into two sections: positive words (like "fantastic" and "great") and negative words (like "bad" and "terrible"). Each word got a score based on how positive or negative it is.
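To make the idea concrete, here is a minimal sketch of what such a dictionary could look like in code. The words are the English stand-ins used in this article, and the weights and the `polarity` helper are made up for illustration; the actual LDD entries and scores are not listed here.

```python
# Hypothetical Lexicon Data Dictionary: word -> emotional weight.
# English stand-ins for Bengali entries; all weights are invented for illustration.
POSITIVE_LEXICON = {"fantastic": 3.0, "great": 2.0, "good": 1.0}
NEGATIVE_LEXICON = {"terrible": -3.0, "bad": -1.0}


def polarity(word: str) -> float:
    """Return the weight of a word, or 0.0 if it appears in neither section."""
    if word in POSITIVE_LEXICON:
        return POSITIVE_LEXICON[word]
    return NEGATIVE_LEXICON.get(word, 0.0)
```

Keeping the two sections separate mirrors the description above, while the sign of the weight already tells you which section a word came from.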
The Bangla Sentiment Polarity Score (BSPS)
Meet our star player, the Bangla Sentiment Polarity Score (BSPS). This is our carefully crafted algorithm designed to analyze Bengali texts. Instead of just saying a review is positive or negative, BSPS categorizes emotions into nine different classes, such as “extremely positive” or “considerably negative.” This helps in painting a clearer emotional picture.
Evaluating Our Approach
To see how well our BSPS works, we tested it against a pre-trained language model called BanglaBERT, which is like a supercharged brain for understanding Bengali. We compared the results to see which approach performed better. Spoiler alert: BSPS paired with BanglaBERT turned out to be the dream team!
Collecting Reviews: A Tough Task
To kick things off, we needed a large set of reviews for analysis. We decided to scour the Daraz Bangladesh website, a popular online shopping platform. This involved checking thousands of reviews and labeling them as positive or negative.
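The results? Out of 15,194 reviews, 13,344 were positive and 1,850 were negative. That is a heavily skewed mix: roughly 88% of the reviews are positive, which is worth keeping in mind when reading the accuracy figures later on.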
Data Processing Steps
After gathering the reviews, we focused on cleaning and preparing the data for analysis. Here’s what we did:
Addressing Missing and Duplicate Data
We carefully checked for any duplicate entries or missing information. Think of it as cleaning up your messy room—making sure everything is in order before you start sorting and analyzing.
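A cleanup pass like this might look as follows. This is only a sketch: the `text` and `label` field names and the rule of treating identical review text as a duplicate are assumptions, not details from the original pipeline.

```python
def clean_reviews(rows):
    """Drop rows with missing text or label, then drop repeated review texts."""
    seen = set()
    cleaned = []
    for row in rows:
        text = (row.get("text") or "").strip()
        label = row.get("label")
        if not text or label is None:
            continue  # missing information
        if text in seen:
            continue  # duplicate entry
        seen.add(text)
        cleaned.append({"text": text, "label": label})
    return cleaned
```

The first occurrence of each review is kept, so the order of the remaining reviews is unchanged.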
Tokenization and Normalization
Next, we took the text and split it up into individual words, a process called tokenization. We also cleaned it up by removing unnecessary punctuation, which could confuse our algorithm. After that, our reviews became easier to read!
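Here is a small sketch of that step, assuming whitespace-separated words and a hand-picked punctuation set (both are assumptions, not the original preprocessing code). It replaces punctuation with spaces instead of relying on the regex `\w` class, because Python's `\w` does not match the combining vowel signs used in Bengali script and would split words apart.

```python
import unicodedata

# Hypothetical punctuation set; it includes the Bengali danda "।".
PUNCTUATION = set("।,.!?;:\"()[]{}")


def tokenize(text):
    """Normalize Unicode, replace punctuation with spaces, split on whitespace."""
    text = unicodedata.normalize("NFC", text)
    cleaned = "".join(" " if ch in PUNCTUATION else ch for ch in text)
    return cleaned.split()
```

The NFC normalization makes sure the same word is always stored with the same code points, so later dictionary lookups do not miss because of encoding differences.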
Stop Word Removal
We also got rid of "stop words." These are common words that don’t add much meaning, like "is," "the," and "and." Removing these helped us focus on the important parts of the reviews.
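In code, this filter can be as simple as the sketch below. The stop-word set is a tiny English stand-in built from the examples above; the real Bengali list is not given here.

```python
# Hypothetical stop-word list, using the English examples from the text.
STOP_WORDS = {"is", "the", "and"}


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

One design point matters here: negation words like "not" and intensifiers like "very" should stay out of the stop-word list, since the scoring step depends on them.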
How Does the BSPS Algorithm Work?
The BSPS algorithm takes advantage of our Lexicon Data Dictionary and certain language rules to analyze the sentiment of each review. Here’s how it works:
Key Components of BSPS
- Positive Lexicons: Words that express positive feelings.
- Negative Lexicons: Words that express negative feelings.
- Negation Words: Words that flip the sentiment, like "not."
- Extreme Modifiers: Words that intensify emotion, such as "very."
Sentiment Processing Flow
- Tokenization: We break the input sentence into words.
- Stop Word Removal: Unimportant words are filtered out.
- Score Initialization: Start with a sentiment score of zero.
- Word Processing: Each word in the sentence is analyzed for its sentiment.
- Handling Negation: If a negation word is found, we reverse the sentiment.
- Final Calculation: We sum up scores and determine the final sentiment.
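The flow above can be sketched roughly as follows. This is not the BSPS implementation: the lexicon, the weights, and the rules for how negation and intensifiers combine are simplifying assumptions, and English stand-in words are used for readability.

```python
# All word lists and weights below are hypothetical stand-ins.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not"}
MODIFIERS = {"very": 2.0}
STOP_WORDS = {"the", "is", "was", "and"}


def bsps_like_score(sentence):
    """Score a sentence with a simplified lexicon-plus-rules pass."""
    # Tokenization and stop word removal
    tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
    score = 0.0      # score initialization
    negate = False   # set when a negation word precedes a sentiment word
    boost = 1.0      # multiplier collected from intensifiers
    for token in tokens:
        if token in NEGATIONS:
            negate = True
        elif token in MODIFIERS:
            boost *= MODIFIERS[token]
        elif token in LEXICON:
            value = LEXICON[token] * boost
            score += -value if negate else value
            negate, boost = False, 1.0  # reset after each sentiment word
    return score


print(bsps_like_score("the food was very good"))  # 2.0
print(bsps_like_score("the food was not good"))   # -1.0
```

In this simplified version, a negation simply flips the sign of the next sentiment word. The real algorithm may treat a negated intensifier such as "not very" more gently than a plain sign flip.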
Examples to Illustrate BSPS in Action
Let’s take a look at a few sample sentences to see how BSPS works:
- For the sentence "The food was not very good," our algorithm identifies the words and concludes that it implies the food is somewhat okay, rather than being outright bad.
- For the phrase "So good that it can't be believed," BSPS recognizes the phrase's intensity and assigns a high positive score.
In both examples, BSPS captures the emotion behind the words, which illustrates how it handles negation and intensity in Bengali.
Classification Process
With the sentiment scores ready, we categorized each review into one of our nine distinct classes. This classification allows us to understand not just if someone is happy or sad but to what extent!
Nine Sentiment Categories
- Extremely Positive
- Considerably Positive
- Positive
- Slightly Positive
- Neutral
- Slightly Negative
- Negative
- Considerably Negative
- Extremely Negative
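Mapping a numeric score onto these nine labels could look like the sketch below. The cut-off values (1, 2 and 3) are invented for illustration; the thresholds actually used by BSPS are not stated here.

```python
def classify(score):
    """Map a sentiment score to one of nine classes using hypothetical cut-offs."""
    if score == 0:
        return "Neutral"
    magnitude = abs(score)
    if magnitude >= 3:
        level = "Extremely"
    elif magnitude >= 2:
        level = "Considerably"
    elif magnitude >= 1:
        level = ""  # plain "Positive" or "Negative"
    else:
        level = "Slightly"
    polarity = "Positive" if score > 0 else "Negative"
    return f"{level} {polarity}".strip()
```

Using the same cut-offs on both sides of zero keeps the scale symmetric, so a score of 2.5 and a score of -2.5 land in mirrored classes.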
Fine-Tuning with BanglaBERT
Once we had our categories, we turned to BanglaBERT to see if we could achieve even better results. We trained and tested the model using a combination of learning rates and batch sizes to find the best fit.
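Trying out combinations like this is usually done as a small grid search. The sketch below shows the loop only; the candidate values are hypothetical, and `evaluate` is a placeholder for whatever function fine-tunes the model with the given settings and returns a validation score.

```python
from itertools import product

# Hypothetical candidate values; the ones actually tried are not listed here.
LEARNING_RATES = [1e-5, 2e-5, 3e-5]
BATCH_SIZES = [16, 32]


def grid_search(evaluate):
    """Return (learning_rate, batch_size, score) for the best-scoring combination.

    `evaluate(learning_rate, batch_size)` is a caller-supplied function that
    trains and validates a model and returns a score where higher is better.
    """
    best = None
    for learning_rate, batch_size in product(LEARNING_RATES, BATCH_SIZES):
        score = evaluate(learning_rate, batch_size)
        if best is None or score > best[2]:
            best = (learning_rate, batch_size, score)
    return best
```

Each call to `evaluate` is a full fine-tuning run, so in practice the grid is kept small.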
Training BanglaBERT
We divided our dataset into 80% for training and 20% for testing. Our goal was to ensure that BanglaBERT could effectively identify the sentiment classes based on the reviews.
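An 80/20 split can be sketched in a few lines. The shuffling and the fixed seed are assumptions added for reproducibility; the article does not say how the split was drawn.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With a dataset as skewed toward positive reviews as this one, a stratified split (sampling each class separately) would be a reasonable refinement, so that the test set keeps the same class mix.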
Performance and Results
As we evaluated our models, we looked at how well they performed using metrics like accuracy, precision, and recall. Here’s what we found:
Performance of the BSPS Algorithm
The BSPS model achieved an impressive accuracy of 93%, which shows it was pretty good at telling positive from negative sentiments.
Performance of BanglaBERT
BanglaBERT, on the other hand, managed to score 88%. While this is still decent, it suggests that our BSPS algorithm classified sentiments more accurately.
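For readers who want to see what sits behind these numbers, the metrics mentioned in this section (accuracy, precision and recall) can be computed as in the sketch below. It covers the two-class case only, and the label names are assumptions.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (accuracy, precision, recall) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Because about 88% of the collected reviews are positive, a model that always answers "positive" would already reach roughly 88% accuracy on two-class labels, which is why precision and recall are worth reporting alongside accuracy.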
Comparing the Two Models
When comparing the two models, we found that the combination of BSPS for classification and BanglaBERT for evaluation worked better than just using BanglaBERT alone. This hybrid approach allowed us to get a richer understanding of emotions, making it clear that two heads are better than one!
Future Directions
So, what’s next on our list? We’re looking to improve and experiment even more. We could try out different pre-trained models or combine outputs from both BSPS and BanglaBERT to create an even better analysis tool for Bengali sentiment.
In summary, we’ve made significant strides in improving sentiment analysis for Bengali texts by developing a hybrid approach. With our BSPS algorithm working hand in hand with BanglaBERT, we believe we’re paving the way for more accurate emotional insights in the Bengali language. And who knows? Maybe someday we'll have a friendly chatbot that can make us giggle with its witty comments about our favorite restaurants!
Title: Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT
Abstract: Sentiment analysis (SA) is a process of identifying the emotional tone or polarity within a given text and aims to uncover the user's complex emotions and inner feelings. While sentiment analysis has been extensively studied for languages like English, research in Bengali remains limited, particularly for fine-grained sentiment categorization. This work aims to bridge this gap by developing a novel approach that integrates rule-based algorithms with pre-trained language models. We developed a dataset from scratch, comprising over 15,000 manually labeled reviews. Next, we constructed a Lexicon Data Dictionary, assigning polarity scores to the reviews. We developed a novel rule-based algorithm, Bangla Sentiment Polarity Score (BSPS), an approach capable of generating sentiment scores and classifying reviews into nine distinct sentiment categories. To assess the performance of this method, we evaluated the classified sentiments using BanglaBERT, a pre-trained transformer-based language model. We also performed sentiment classification directly with BanglaBERT on the original data and evaluated this model's results. Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, precision, and nuanced classification across the nine sentiment categories. The results of our study emphasize the value and effectiveness of combining rule-based and pre-trained language model approaches for enhanced sentiment analysis in Bengali and suggest pathways for future research and application in languages with similar linguistic complexities.
Authors: Hemal Mahmud, Hasan Mahmud
Last Update: 2024-11-29 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2411.19584
Source PDF: https://arxiv.org/pdf/2411.19584
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.