Addressing Regional Discrimination in Vietnam's Social Media
A new system detects regional discrimination in Vietnamese online comments.
An Nghiep Huynh, Thanh Dat Do, Trong Hop Do
Regional discrimination is a serious issue in Vietnam, and it often shows up on social media. While many researchers have studied hate speech in Vietnamese, few have focused on regional discrimination specifically. It’s like trying to fix a car without checking the engine. This paper presents a new system that detects when people post discriminatory comments based on where someone is from.
The Big Picture
After years of conflict and division, regional discrimination has been on the rise in Vietnam. People often judge others based on where they're from, which can lead to division and hurt feelings. It’s kind of like having two rival football teams: they just can’t see eye to eye.
Social media has turned into a double-edged sword. While it connects people, it also provides a platform for spreading negativity. In December 2023, a popular news program highlighted the impact of regional discrimination on social media in Vietnam. They emphasized how this behavior could harm national unity.
Why This Matters
We live in an age where social media is everywhere. It can either bring people together or pull them apart. Negative comments hurt individuals, and they can also widen the divides between communities. It’s like trying to make a sandwich without the bread: it just doesn’t work.
This study aims to build a system that helps identify and process these discriminatory comments in real time. By doing so, we can gather data to improve our understanding of the situation and maybe even prevent it.
Related Work
There are other studies out there, especially on hate speech in Vietnamese. They often include careful data processing, such as converting everything to lowercase and removing unnecessary links. It's a bit like cleaning up your messy room before you invite friends over. A good example here is the PhoBERT-CNN model, which combines different techniques for analyzing text.
These approaches give us a starting point but also highlight gaps in practical applications. Instead of just creating models, we need to find ways to apply them in the real world, particularly on social networks.
Collecting Data
We’ve developed our own dataset called ViRDC, which includes around 17,000 comments collected from social media. The goal is to study how people express regional discrimination online. This dataset is our treasure chest of insights and will help us understand the language used in these contexts.
The comments are sorted into three categories:
- Other: Comments that neither discriminate nor defend, so they fall outside the other two categories.
- Discriminatory: Comments that directly insult or put down people based on where they’re from.
- Supportive: Comments that defend people from discrimination or show respect for different cultures.
This three-way division helps us capture the different tones and messages present in online interactions.
Preprocessing Data
Before we can analyze the data, we first have to tidy it up. This means prepping the raw text so it’s easier for the models to digest. It’s a bit like chopping vegetables before throwing them into a salad.
Here’s what we do:
- Convert everything to lowercase so "Hello" and "hello" are seen as the same.
- Remove links, tags, and icons because they just add noise.
- Eliminate extra spaces or repeated characters to keep things neat.
- Strip away punctuation, which can often confuse our models.
- Normalize the encoding for Vietnamese words to ensure consistency.
- Detect and decode teen phrases or slang to make sure we get the right meaning.
- Balance the three labels to ensure our model performs well across all categories.
After all that work, we end up with a clean dataset ready for training our models.
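The cleaning steps listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' actual pipeline: the `preprocess` function, the regular expressions, and the tiny `SLANG` dictionary are all hypothetical, "tags" is interpreted here as @mentions and #hashtags, and label balancing is left out because it works on the whole dataset rather than on a single comment.

```python
import re
import unicodedata

# Hypothetical slang map, for illustration only. The dictionary actually
# used for ViRDC is not described in this summary.
SLANG = {"ko": "không", "k": "không", "dc": "được"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
TAG_RE = re.compile(r"[@#]\w+")                 # assumed: @mentions and #hashtags
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")     # a letter repeated 3+ times
NON_TEXT_RE = re.compile(r"[^\w\s]")            # punctuation, emoji, symbols


def preprocess(comment: str) -> str:
    """Apply the cleaning steps to one raw comment and return the clean text."""
    # Normalize the encoding first so each accented Vietnamese letter
    # is a single code point (NFC form).
    text = unicodedata.normalize("NFC", comment)
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Collapse stretched letters such as "ngonnnn" into "ngon".
    text = REPEAT_RE.sub(r"\1", text)
    text = NON_TEXT_RE.sub(" ", text)
    # split() also removes extra whitespace; slang is replaced token by token.
    tokens = [SLANG.get(token, token) for token in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("NGONNNN!!! https://example.com @abc"))
```

The order matters in this sketch: links and tags are removed before punctuation is stripped, otherwise fragments of a URL would survive as ordinary words.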
Building the Model
Next comes the fun part: building the models that will help classify the comments. We tried several approaches, and here are some of the key players:
Random Forest: This method builds many decision trees and combines their results. It’s like asking a group of friends for their opinions and going with the majority. Random Forest is great because it can handle various types of data and doesn’t easily get confused.
Multinomial Logistic Regression: This technique looks at many possible outcomes and helps us figure out the chances of each one. It’s perfect for our multi-class problems.
Multinomial Naive Bayes: This model assumes that, once the category is known, the words in a comment occur independently of each other, which makes it a simple and solid choice for text classification. It’s like having a group of friends pick their favorite toppings for a pizza: everyone has their own taste, but they all contribute to the final pie.
Transfer Learning Models: These models, like PhoBERT, which is pre-trained on Vietnamese text, reuse previous knowledge to tackle new challenges. Imagine a student who learns math in one country and then moves to another: they don’t start from scratch. They can apply what they already know.
By mixing these models, we aim to create a system that can accurately spot discriminatory comments.
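To make one of these models concrete, here is a small from-scratch sketch of Multinomial Naive Bayes using only the Python standard library. It is meant to show the idea, not to reproduce the paper's setup: the `TinyMultinomialNB` class, the add-one smoothing choice, and the toy English training sentences (with placeholder tokens like `region_a`) are all assumptions made for illustration. A real experiment would more likely rely on an established machine-learning library and the ViRDC data.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of per-word log likelihoods: this sum
            # is where the word-independence assumption is used.
            score = math.log(n_docs / total_docs)
            for word in doc.split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data with placeholder tokens (not from ViRDC).
train_docs = [
    "region_a people are bad",
    "region_b people are lazy",
    "respect every region",
    "we respect all people",
    "nice weather today",
    "lunch was good today",
]
train_labels = [
    "Discriminatory", "Discriminatory",
    "Supportive", "Supportive",
    "Other", "Other",
]

model = TinyMultinomialNB().fit(train_docs, train_labels)

if __name__ == "__main__":
    print(model.predict("respect all region"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer comments.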
Conducting Experiments
Once we built our models, we had to see how well they worked. We put them through their paces and focused on two main scores: accuracy and F1-macro. Accuracy is the fraction of comments labeled correctly, while F1-macro computes an F1 score for each category separately and averages them with equal weight, so a model can’t score well just by getting the most common category right.
It’s like playing a video game and checking not just your overall score but also how well you did in different levels.
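The difference between the two scores is easiest to see on a small example. The sketch below computes both from scratch; the function names and the toy label lists are made up for illustration and are not results from the paper. The toy model predicts "Other" for every comment, which gives a decent accuracy on an imbalanced sample but a poor F1-macro.

```python
def accuracy(y_true, y_pred):
    """Fraction of comments whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up example: 8 "Other" comments, 1 "Discriminatory", 1 "Supportive",
# and a lazy model that always answers "Other".
y_true = ["Other"] * 8 + ["Discriminatory", "Supportive"]
y_pred = ["Other"] * 10

if __name__ == "__main__":
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
    print(f"f1_macro = {f1_macro(y_true, y_pred):.3f}")
```

Here the accuracy is 0.8, while the F1-macro is only about 0.296, because two of the three categories are never predicted correctly.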
Results and Findings
After testing, we found that Random Forest outperformed the other models. It’s very effective at finding patterns in the comments, which helps it differentiate between the "Discriminatory" and "Other" labels. However, it sometimes struggles with comments that don’t clearly show discriminatory language.
For instance, sentences that might sound bad but aren’t intended to discriminate can confuse the model. Misspellings, awkward phrasing, or common words that appear in different contexts also present challenges.
Streaming Data
One of the coolest features of our system is that it can process data in real time, thanks to streaming technology. This means that instead of waiting for a big batch of comments to analyze, we can examine each one as it comes in. It’s a bit like watching your favorite show live and being able to react immediately!
We use tools like Apache Kafka and Apache Spark Streaming to handle this flow of information. Here’s how it works:
- Data Collection: We gather comments from social media platforms like Facebook and TikTok.
- Processing: The comments pass through Kafka, where they get sorted and sent to be processed.
- Classification: The best-performing model analyzes each comment and categorizes it based on our predefined labels.
- Storage: The results are saved in a format that’s easy to visualize and understand.
We even created a user-friendly interface to show the results, complete with tables and charts!
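The four stages above can be imitated in a single process to show how the pieces fit together. To be clear about the assumptions: this sketch does not use Apache Kafka or Apache Spark at all. A standard-library `queue.Queue` stands in for a Kafka topic, a background thread stands in for the streaming job, a plain list stands in for storage, and `classify` is a hypothetical keyword rule rather than the trained model.

```python
import queue
import threading


def classify(comment: str) -> str:
    """Placeholder classifier: a hypothetical keyword rule, NOT the trained model."""
    text = comment.lower()
    if "insult" in text:
        return "Discriminatory"
    if "respect" in text:
        return "Supportive"
    return "Other"


def producer(comments, topic: queue.Queue) -> None:
    """Data collection: push each incoming comment onto the queue."""
    for comment in comments:
        topic.put(comment)
    topic.put(None)  # sentinel that tells the consumer the stream has ended


def consumer(topic: queue.Queue, store: list) -> None:
    """Processing, classification, storage: handle comments one at a time."""
    while True:
        comment = topic.get()
        if comment is None:
            break
        store.append({"comment": comment, "label": classify(comment)})


def run_pipeline(comments):
    topic = queue.Queue()   # stand-in for a Kafka topic
    store = []              # stand-in for the results store
    worker = threading.Thread(target=consumer, args=(topic, store))
    worker.start()
    producer(comments, topic)
    worker.join()
    return store


if __name__ == "__main__":
    for row in run_pipeline(["An insult about a region", "Nice weather"]):
        print(row)
```

The point of the design is that the consumer labels each comment as soon as it arrives instead of waiting for a batch; a real deployment would replace the queue and thread with Kafka and Spark Streaming.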
Conclusion and Future Work
In summary, we have developed a system to detect regionally discriminatory comments on Vietnamese social media. By creating the ViRDC dataset and experimenting with various machine-learning models, we’ve put together a reliable way to analyze and process these comments in real time.
But we’re not stopping here. Our future plans include integrating advanced natural language processing models to tackle different types of discrimination. We also want to improve our tagging process and explore deep learning methods for better performance.
Ultimately, we aim to create a system that’s easy to use and works well with existing social media platforms. We believe this effort will help in promoting understanding and acceptance among the diverse regions in Vietnam, one comment at a time!
Title: A Big Data-empowered System for Real-time Detection of Regional Discriminatory Comments on Vietnamese Social Media
Abstract: Regional discrimination is a persistent social issue in Vietnam. While existing research has explored hate speech in the Vietnamese language, the specific issue of regional discrimination remains under-addressed. Previous studies primarily focused on model development without considering practical system implementation. In this work, we propose a task called Detection of Regional Discriminatory Comments on Vietnamese Social Media, leveraging the power of machine learning and transfer learning models. We have built the ViRDC (Vietnamese Regional Discrimination Comments) dataset, which contains comments from social media platforms, providing a valuable resource for further research and development. Our approach integrates streaming capabilities to process real-time data from social media networks, ensuring the system's scalability and responsiveness. We developed the system on the Apache Spark framework to efficiently handle increasing data inputs during streaming. Our system offers a comprehensive solution for the real-time detection of regional discrimination in Vietnam.
Authors: An Nghiep Huynh, Thanh Dat Do, Trong Hop Do
Last Update: 2024-10-29 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2411.02587
Source PDF: https://arxiv.org/pdf/2411.02587
Licence: https://creativecommons.org/publicdomain/zero/1.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.