Simple Science

Cutting-edge science explained simply

Computer Science · Computation and Language · Computers and Society · Machine Learning

Assessing Toxicity in Multilingual Language Models

A new dataset evaluates how language models handle harmful content across cultures.

― 5 min read


Tackling Toxicity in Language Models: new dataset highlights challenges in harmful language detection across cultures.

Large language models (LLMs) and small language models (SLMs) are becoming popular in many areas, but concerns about their safety remain. One vital aspect of using these models is understanding how well they can recognize harmful language, especially across different languages and cultures. With the emergence of multilingual models, it is important to find out whether we can assess their safety as quickly as they are being released.

To address this, we have created a new dataset named RTP-LX. This dataset includes toxic prompts and their responses in 28 languages. RTP-LX was built with careful attention to cultural details, so that it can surface harmful language that might not be obvious at first.

We tested ten different S/LLMs to see how well they can identify harmful content across languages. Our findings show that while these models often perform well in terms of accuracy, they do not always agree with human judgments when evaluating the toxicity of a prompt holistically. They struggle particularly with recognizing harmful language in situations where context matters, such as subtle insults or biases.

The Need for Toxicity Evaluation

As LLMs and SLMs are being increasingly used in various applications, the risk of generating harmful content has grown. These models learn from data available on the internet, which can often include toxic language. As we develop more capable multilingual models, we need effective ways to detect toxic language in many languages.

In this paper, we introduce RTP-LX, a specially created dataset designed to evaluate how well models can recognize toxic language across different cultures and languages. The goal is to ensure that these models can be used safely while avoiding harmful content.

What is RTP-LX?

RTP-LX, short for "RTP-Language eXpanded," is a dataset consisting of toxic prompts and the responses generated from those prompts in 28 languages. This dataset was created by carefully assessing the toxicity of language and ensuring that culturally specific harmful language was included.

The creation of RTP-LX involved both human translation and annotation. We sought the expertise of native speakers to ensure that the dataset accurately represents the language and cultural nuances. By partnering with native speakers, we ensured that the dataset would effectively capture harmful content that may be overlooked by non-native speakers.
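To make this concrete, the sketch below shows what a single RTP-LX-style record could look like in code. The field names, severity scale, and aggregation method are illustrative assumptions for explanation, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from statistics import median

@dataclass
class ToxicityRecord:
    """One prompt/completion pair with per-annotator toxicity labels.
    Field names and the 1-3 severity scale are illustrative; the real
    RTP-LX schema may differ."""
    language: str                      # e.g. "es", "tr", "ja"
    prompt: str                        # potentially toxic prompt
    completion: str                    # model continuation of the prompt
    # Each annotator assigns a severity score per harm category.
    annotations: dict[str, list[int]] = field(default_factory=dict)

    def consensus(self, category: str) -> float:
        """Aggregate annotator scores for a category (median here)."""
        return median(self.annotations[category])

# Example record (content kept benign for illustration).
record = ToxicityRecord(
    language="es",
    prompt="Una frase que podría ser ofensiva...",
    completion="...continuación generada por el modelo",
    annotations={"Toxicity": [2, 2, 3], "Bias": [1, 2, 1]},
)
print(record.consensus("Toxicity"))  # -> 2
```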

How We Evaluated the Models

To evaluate the performance of the selected S/LLMs, we used the RTP-LX dataset and compared the models' outputs against the annotations provided by human judges. We wanted to see if the models could reliably identify harmful content, particularly in the context of different languages and cultures.

Our evaluation involved specific tasks where S/LLMs were asked to identify toxic content in the prompts they were given. We measured their performance both in terms of raw accuracy and in terms of agreement with the human judges. While the models scored well on accuracy, there were significant gaps in their nuanced understanding of harmful content.
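As a rough illustration of why accuracy alone can be misleading, here is a minimal sketch comparing model labels against human consensus labels for one harm category, using both plain accuracy and Cohen's kappa (a chance-corrected agreement measure). The numbers are invented for demonstration and are not taken from the paper.

```python
# Compare model-assigned severity labels with human consensus labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]   # human consensus severity (1-3)
model_labels = [1, 1, 1, 1, 1, 1, 1, 3, 1, 1]   # model-assigned severity

print("accuracy:", accuracy_score(human_labels, model_labels))
print("kappa:   ", cohen_kappa_score(human_labels, model_labels))
```

In this toy example the accuracy comes out around 0.7 while kappa is only about 0.17, mirroring the pattern described above: acceptable accuracy, but low agreement with human judges once chance is accounted for.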

Findings

General Performance

The results showed that the S/LLMs typically achieved acceptable levels of accuracy. However, there was notable disagreement with human judges when judging the overall toxicity of a prompt. This disagreement was especially pronounced in context-dependent situations where subtle forms of harm, such as microaggressions and bias, were present.

Our findings indicate that while models like GPT-4 Turbo and Gemma 7B performed the best overall, they still struggled with recognizing nuanced harmful language. Some models, especially the smaller ones like Gemma 2B, showed poorer performance in identifying toxicity.

Challenges in Detection

Detecting toxic language in a multilingual, culturally-sensitive context is complex. Many models demonstrated a tendency to overlook more subtle forms of harm. For instance, they were better at identifying clear instances of violence and sexual content but found it challenging to flag content that could be harmful in certain contexts, such as jokes or references that may offend specific groups.

This highlights a significant limitation in the current capabilities of S/LLMs. The models often assign higher toxicity labels than human judges consider warranted, so they tend to flag benign content as harmful while still missing subtler forms of harm.
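One simple way to see this trade-off is to binarize the severity labels (treating a score of 2 or higher as "harmful") and count how often a model flags content that human judges rated benign versus how often it misses content they rated harmful. The threshold and numbers below are illustrative assumptions only.

```python
# Count over-flagging and misses relative to human consensus (toy data).
human = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]   # human consensus severity (1-3)
model = [2, 1, 1, 2, 1, 1, 2, 3, 1, 2]   # model-assigned severity

over  = sum(h < 2 <= m for h, m in zip(human, model))   # benign flagged as harmful
under = sum(m < 2 <= h for h, m in zip(human, model))   # harmful content missed
print(f"over-flagged: {over}, missed: {under}")
```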

The Importance of Cultural Sensitivity

When evaluating language models, cultural sensitivity is crucial. The dataset RTP-LX was designed with this in mind, as many harmful expressions are deeply tied to cultural contexts. For example, a phrase that might seem harmless in one culture could be highly offensive in another due to historical or social reasons.

The process of creating RTP-LX involved gathering culturally relevant prompts that reflect the unique challenges of understanding toxicity in different languages. This ensured that the evaluation would accurately gauge each model's ability to understand these subtleties.

Future Directions

To improve how well models recognize toxic language, further research is needed. The RTP-LX dataset should be expanded to include more dialects and linguistic variations, broadening coverage of the linguistic contexts and features that influence how toxicity is perceived.

Moreover, there needs to be a focus on improving how models are trained, particularly their ability to handle subtle and context-sensitive language. Given the rapid development of these technologies, safety measures must keep pace to prevent harmful use.

Conclusion

RTP-LX serves as an important step towards addressing the challenges of toxic language detection in multilingual contexts. While the tested S/LLMs achieved reasonable accuracy levels, their struggles with nuanced content highlight gaps that still need to be addressed. Cultural sensitivity and language diversity must remain at the forefront of future model development and evaluations.

Our research provides valuable insights into how S/LLMs can better detect harmful content in a variety of languages and cultural settings. By continuing to refine our approaches and technologies, we can work towards safer deployment of language models and ultimately reduce the harmful impacts of toxic language in online spaces.

As we look to the future, it is clear that building more reliable systems for language understanding will be key to fostering healthier and more respectful online communication.

Original Source

Title: RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Abstract: Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when scoring holistically the toxicity of a prompt; and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.

Authors: Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Gören, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

Last Update: 2024-12-16 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2404.14397

Source PDF: https://arxiv.org/pdf/2404.14397

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
