Simple Science

Cutting edge science explained simply

Tags: Computer Science, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

Detecting Toxic Content in Mixed Language Videos

A new approach to identify harmful speech in code-mixed Hindi-English videos.

― 9 min read



In today's fast-paced digital world, videos have become a popular way for people to share information and communicate. However, as more people create and share videos, the challenge of finding harmful or toxic content in them is growing, especially for speech that mixes two or more languages, such as Hindi and English. Although there has been significant work on detecting toxic text, code-mixed videos have received far less attention.

The aim of this work is to fill that gap by creating a unique video dataset and a system for detecting toxicity in video content. We assembled a set of videos containing code-mixed Hindi-English conversations and carefully labeled each part of every video for toxic language, severity, and sentiment. Our goal is to make online spaces safer by training a system that can recognize and categorize this kind of harmful content.

The Rise of Video Content

The way we communicate has changed dramatically over the last few years, with social media and video platforms enabling anyone to create and share information. By 2023, it is estimated that most internet traffic is made up of videos. YouTube has become a significant platform for users to share information, with billions of hours of video watched every day.

While this vast array of content can provide valuable insights and entertainment, it also allows toxic speech to spread quickly. Toxic speech can be defined as language that is rude, disrespectful, or unreasonable, often leading to heated discussions that people may want to leave. There are many topics covered in videos, with most content being harmless. However, some videos violate community guidelines and promote harmful ideas. The presence of toxic content can lead to hostile online environments and legal challenges for the platforms hosting this content.

The Need for Detection

Current methods for detecting toxic content have focused primarily on text; detection in video is far less developed. Identifying harmful content in videos requires combining information from multiple sources, including the visual and audio streams. Existing methods usually rely heavily on text and have mainly targeted English content. As more people communicate in code-mixed language, however, there is a growing need for detection systems that can handle these complexities.

In multilingual countries like India, it is common for people to mix Hindi and English in conversation, creating challenges for developing effective machine learning tools for detection. Although some research has looked at detecting toxic content in social media text, there is still a large gap in understanding how to handle the same issue in video format.

Our Contributions

This work aims to tackle these issues by developing a new approach for detecting toxic speech in video content. We create a dataset of code-mixed Hindi-English videos and a framework for detecting toxic speech, sentiment, and severity levels through the analysis of different video components.

  1. Dataset Creation: We introduce ToxCMM, a publicly accessible dataset that includes videos annotated for toxic speech. It contains 931 videos with 4021 utterances that are labeled for toxicity, sentiment, and severity. This dataset is designed to help researchers and developers build better systems for detecting toxic speech in code-mixed languages.

  2. Framework Development: We developed ToxVidLM, a framework that combines multiple methods for detecting toxic videos while also analyzing sentiment and severity. The framework consists of three main parts: an encoder module that processes different types of data, a module to synchronize this data, and a multitask module that performs the actual detection tasks. Using multiple modalities, including video, audio, and text, improves detection performance.

Dataset Creation

Data Collection

To build our dataset, we focused on YouTube, a popular platform for video sharing. Our target was videos that mixed Hindi and English conversations. We used the YouTube API to collect data from Indian web series and "roasted" videos. After initially collecting 1023 videos, we filtered them down to 931 to ensure that they were appropriate for our research.

We used a speech recognition model to create transcripts of the videos, improving their accuracy by manually correcting errors. Each video was broken into smaller clips to enable more detailed annotation.
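The segmentation step above can be sketched roughly as follows. The timestamped-transcript format and the `segment_utterances` helper are illustrative assumptions for this summary, not the authors' actual pipeline:

```python
# Illustrative sketch: split a timestamped ASR transcript into utterance
# clips by breaking at long silences. Field names and the gap threshold
# are assumptions, not the paper's actual method.

def segment_utterances(transcript, max_gap=1.0):
    """Group consecutive words into utterances, starting a new clip
    whenever the silence between words exceeds `max_gap` seconds."""
    utterances, current = [], []
    for word in transcript:  # each word: {"text": str, "start": float, "end": float}
        if current and word["start"] - current[-1]["end"] > max_gap:
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    # Return (start, end, text) triples for each clip.
    return [(u[0]["start"], u[-1]["end"], " ".join(w["text"] for w in u))
            for u in utterances]

transcript = [
    {"text": "yeh", "start": 0.0, "end": 0.4},
    {"text": "video", "start": 0.5, "end": 0.9},
    {"text": "dekho", "start": 1.0, "end": 1.4},
    {"text": "seriously", "start": 3.0, "end": 3.6},  # gap > 1 s: new clip
]
clips = segment_utterances(transcript)
```

Each resulting clip can then be annotated independently, which is what makes utterance-level labels possible.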

Data Annotation

For our annotation process, we trained a group of undergraduate students familiar with Hindi and English. Our expert annotators reviewed their work to ensure consistency and quality. We provided training samples to guide our annotators in categorizing each utterance based on toxicity, sentiment, and severity.

We established clear categories for each utterance: toxicity is classified as either "toxic" or "non-toxic," sentiment is labeled as "positive," "negative," or "neutral," and severity is ranked on a scale from "non-harmful" to "very harmful."
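The label scheme can be written down as a small validation helper. Note one assumption: the summary only gives the endpoints of the severity scale, so the middle level name below is a hypothetical placeholder:

```python
# Label schema for each ToxCMM utterance, as described in the text.
# The intermediate severity label is a hypothetical placeholder; the
# summary only states the scale runs "non-harmful" to "very harmful".

TOXICITY = ("toxic", "non-toxic")
SENTIMENT = ("positive", "negative", "neutral")
SEVERITY = ("non-harmful", "harmful", "very harmful")  # middle label assumed

def validate_annotation(ann):
    """Check that an utterance annotation uses only the defined labels."""
    return (ann["toxicity"] in TOXICITY
            and ann["sentiment"] in SENTIMENT
            and ann["severity"] in SEVERITY)

example = {"toxicity": "toxic", "sentiment": "negative", "severity": "harmful"}
```

A validator like this is a cheap way to catch annotator typos before they reach the released dataset.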

Through this rigorous training and review process, we achieved high reliability scores in our annotations, confirming our dataset's quality and trustworthiness.

Dataset Statistics

The ToxCMM dataset consists of 4021 utterances, with 1697 marked as toxic and 2324 as non-toxic. Each utterance has an average length of 8.68 words and lasts about 8.89 seconds. Notably, around 68% of the words used in the dataset are in Hindi, with the rest in English.

Problem Formulation

Our main goal is to identify whether a video contains toxic content and to classify its sentiment and severity. Each video is treated as a collection of frames, sound, and a text transcript. We will use deep learning methods to create a model capable of detecting these three aspects in the videos.
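The formulation above can be stated as a small typed sketch: each video is a triple of frames, audio, and transcript, and the model maps it to three labels. These structures are illustrative, not the paper's actual interfaces:

```python
# Problem formulation sketch: video in, three labels out.
# The NamedTuple interfaces here are illustrative assumptions.
from typing import NamedTuple

class Video(NamedTuple):
    frames: list       # sequence of image frames
    audio: list        # raw audio samples
    transcript: str    # code-mixed Hindi-English text

class Prediction(NamedTuple):
    toxicity: str      # "toxic" | "non-toxic"
    sentiment: str     # "positive" | "negative" | "neutral"
    severity: str      # scale from "non-harmful" to "very harmful"

video = Video(frames=[], audio=[], transcript="yeh kya bakwas hai")
pred = Prediction("toxic", "negative", "very harmful")
```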

ToxVidLM Framework

To clarify the detection process, we divided the ToxVidLM framework into three key parts:

Encoder Module

The first part of the framework is the encoder module. This section is responsible for processing audio, video, and text data separately. We used various state-of-the-art models designed for each type of data.

  • Audio Encoder: We experimented with multiple audio models to extract meaningful features from the audio signals. Our findings showed that one model consistently outperformed the others across various tests.

  • Video Encoder: For the video data, we tested models that are designed to capture both spatial and temporal information. Similar to the audio models, one of the video models consistently delivered the best results.

  • Text Encoder: For the text, we used models pre-trained on Hindi-English datasets. These models were optimized for handling code-mixed language, which further improved our detection accuracy.
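The three-branch design above can be sketched as an interface. The encoder bodies below are placeholder stubs, since this summary does not name the specific pretrained models; the point is only that each modality yields a feature vector through a common `encode` interface:

```python
# Interface sketch of the encoder module: one encoder per modality, each
# returning a fixed-size feature vector. The bodies are placeholder stubs;
# the actual pretrained models are not named in this summary.

class AudioEncoder:
    def encode(self, waveform):
        # Stand-in for a pretrained speech model's pooled features.
        return [sum(waveform) / max(len(waveform), 1)]

class VideoEncoder:
    def encode(self, frames):
        # Stand-in for a spatio-temporal video model's pooled features.
        return [float(len(frames))]

class TextEncoder:
    def encode(self, transcript):
        # Stand-in for a code-mixed Hindi-English language model.
        return [float(len(transcript.split()))]

def encode_all(video):
    return (AudioEncoder().encode(video["audio"]),
            VideoEncoder().encode(video["frames"]),
            TextEncoder().encode(video["transcript"]))

feats = encode_all({"audio": [0.1, 0.3], "frames": [None] * 24,
                    "transcript": "yeh toh bahut bad comment hai"})
```

Keeping the per-modality encoders behind one interface is what lets the later modules stay agnostic about which pretrained model sits in each branch.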

Cross Modal Synchronization Module

Since we are dealing with multiple data types, it is important to synchronize them to ensure they work together effectively. The synchronization module focuses on aligning the features extracted from different modalities. This allows us to create a unified representation of the data.

We employed a strategy that links audio, video, and text features, focusing more on the text because of its importance in detecting toxicity. Through a series of steps, we were able to create a cohesive representation space that allows for better integration of the different data types.
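A minimal sketch of that idea: project each modality's features into a shared space, then fuse them with a heavier weight on text. The dimensions, identity projections, and fusion weights here are illustrative assumptions, not the paper's learned components:

```python
# Sketch of cross-modal synchronization: align modality features in a
# shared space, then fuse with a heavier weight on text. Weights and
# projections here are illustrative assumptions.

def project(vec, matrix):
    """Linear projection: matrix (out_dim x in_dim) applied to vec."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def fuse(text, audio, video, w_text=0.6, w_audio=0.2, w_video=0.2):
    """Weighted sum of aligned features, emphasizing the text modality."""
    return [w_text * t + w_audio * a + w_video * v
            for t, a, v in zip(text, audio, video)]

# Toy features already in a shared 2-dim space (in practice, learned
# projections map each encoder's output here).
identity = [[1, 0], [0, 1]]
text_f = project([1.0, 0.0], identity)
audio_f = project([0.0, 1.0], identity)
video_f = project([0.5, 0.5], identity)
joint = fuse(text_f, audio_f, video_f)
```

Weighting text most heavily mirrors the observation, reported later in the experiments, that text is the strongest single signal for toxicity.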

Multitask Module

Finally, the multitask module processes the synchronized data to perform the detection tasks. It takes all the processed input and uses it to classify each video across three objectives: detecting toxicity, determining severity, and identifying sentiment.

We utilized a loss function to train our model effectively, which allows the system to learn the importance of each task. This design enables the model to have a comprehensive understanding of the video content, improving its ability to detect toxic behavior.
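One common way to realize such an objective is a weighted sum of per-task cross-entropy losses; whether the paper fixes or learns the weights is not specified in this summary, so the equal weights below are an assumption:

```python
import math

# Sketch of a multitask objective: per-task cross-entropy, combined with
# task weights. Equal weights are an assumption; the paper may learn them.

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_idx])

def multitask_loss(tox_p, sent_p, sev_p, targets, weights=(1.0, 1.0, 1.0)):
    losses = (cross_entropy(tox_p, targets[0]),    # toxicity
              cross_entropy(sent_p, targets[1]),   # sentiment
              cross_entropy(sev_p, targets[2]))    # severity
    return sum(w * l for w, l in zip(weights, losses))

# Toy predicted distributions and gold labels for one utterance.
loss = multitask_loss([0.9, 0.1], [0.2, 0.7, 0.1], [0.8, 0.15, 0.05],
                      targets=(0, 1, 0))
```

Because all three task losses share the same backbone, gradients from the auxiliary tasks (sentiment, severity) can improve the representation used for the primary toxicity task.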

Experimental Setup

All experiments were conducted on a high-performance machine equipped with powerful CPUs and GPUs. We divided our dataset into training, validation, and testing sets to ensure the model could generalize well. The training process was repeated multiple times with different random splits to ensure reliable results.
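The split-and-repeat protocol can be sketched as below; the 80/10/10 ratio is an illustrative assumption, since the summary does not state the exact proportions:

```python
import random

# Sketch of the evaluation protocol: shuffle with a seed, carve out
# train/val/test, and repeat with different seeds. The 80/10/10 ratio
# is an illustrative assumption.

def split_dataset(items, seed, train=0.8, val=0.1):
    rng = random.Random(seed)
    items = items[:]          # copy so the caller's list is untouched
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(n * train), int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(list(range(100)), seed=0)
```

Re-running with several seeds and averaging the metrics is what makes the reported numbers robust to a lucky or unlucky split.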

Baseline Models

To evaluate our framework's effectiveness, we compared it against several baseline models. These models were designed to process data in various ways, and we measured their performance based on their ability to detect toxicity, severity, and sentiment across different configurations.

Findings from Experiments

The results of our experiments provided valuable insights:

  1. We determined that text processing was crucial for detecting toxic content. Among the individual modalities, the text-based models performed significantly better than audio and video alone.

  2. Combining text and audio data produced better results than mixing text and video, or audio and video together.

  3. Our proposed model consistently outperformed baseline models, achieving higher accuracy across all tasks. This underscored the effectiveness of combining various data types for detection.

  4. When we compared single-task models to multitask models, the multitask versions showed improved performance in toxicity detection, severity assessment, and sentiment analysis.

Statistical Analysis

To ensure the reliability of our results, we conducted statistical tests comparing our proposed models against the baselines. The findings indicated that our results were statistically significant, affirming the effectiveness of the ToxVidLM framework.

Conclusion and Future Works

With the growing prevalence of videos, especially those containing mixed languages, our work is timely and necessary. The introduction of the ToxCMM dataset marks a significant step forward in the field of toxic content detection, providing a unique resource for researchers and developers.

Our ToxVidLM framework has shown promise through its ability to combine multiple modalities effectively, focusing on detecting toxicity in code-mixed videos. Beyond just identifying toxic content, our dataset also provides insights into sentiment and severity, allowing for deeper exploration of issues related to online behavior.

While this work lays the foundation for future research, there are limitations, including the exclusion of indirect toxicity and the need for substantial computational resources. Addressing these issues will be essential for the continued development of effective toxic content detection systems.

In summary, as video content continues to dominate online communication, developing tools to identify and mitigate toxic behavior will be vital for creating safer digital spaces. This research aims to pave the way for more effective detection methods, ultimately fostering a more respectful online environment.

Original Source

Title: ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos

Abstract: In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in low-resource code-mixed languages, remains a critical challenge. While substantial research has addressed toxic content detection in textual data, the realm of video content, especially in non-English languages, has been relatively underexplored. This paper addresses this research gap by introducing a benchmark dataset, the first of its kind, consisting of 931 videos with 4021 code-mixed Hindi-English utterances collected from YouTube. Each utterance within this dataset has been meticulously annotated for toxicity, severity, and sentiment labels. We have developed an advanced Multimodal Multitask framework built for Toxicity detection in Video Content by leveraging Language Models (LMs), crafted for the primary objective along with the additional tasks of conducting sentiment and severity analysis. ToxVidLM incorporates three key modules - the Encoder module, Cross-Modal Synchronization module, and Multitask module - crafting a generic multimodal LM customized for intricate video classification tasks. Our experiments reveal that incorporating multiple modalities from the videos substantially enhances the performance of toxic content detection by achieving an Accuracy and Weighted F1 score of 94.29% and 94.35%, respectively.

Authors: Krishanu Maity, A. S. Poornash, Sriparna Saha, Pushpak Bhattacharyya

Last Update: 2024-07-14

Language: English

Source URL: https://arxiv.org/abs/2405.20628

Source PDF: https://arxiv.org/pdf/2405.20628

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
