Evaluating LLMs in Social Science Classification
This study assesses ChatGPT and OpenAssistant's effectiveness in classifying social media data.
Large Language Models (LLMs) are tools that can understand and generate human language, and they have shown great skill at following instructions and producing useful responses. However, training these models requires substantial resources, so they are often applied without any task-specific training, a setup known as the zero-shot setting. This article focuses on how well two popular LLMs, ChatGPT and OpenAssistant, classify social science data in this zero-shot setting, and on how different ways of phrasing the request, known as prompting, affect their accuracy.
Zero-shot Performance
In our study, we compare the performance of ChatGPT and OpenAssistant on six classification tasks related to social media. These tasks involve classifying posts and comments to gain insight into public opinions and behaviors. We examine how the way we ask the models to perform these tasks affects their success: by varying the complexity of the prompts we give them, we aim to see whether the models can maintain high accuracy.
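To make the zero-shot setup concrete, here is a minimal sketch of classifying a single tweet through the OpenAI Chat Completions API. The prompt wording, label set, and model name are illustrative assumptions, not the exact prompts or configuration used in the paper.

```python
# Minimal zero-shot classification sketch using the OpenAI Chat Completions API.
# The prompt text and label set are illustrative placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["complaint", "not complaint"]  # hypothetical label set for complaint detection

def classify_tweet(tweet: str) -> str:
    """Ask the model for a single label; temperature 0 keeps outputs more stable."""
    prompt = (
        f"Classify the following tweet as one of: {', '.join(LABELS)}.\n"
        f"Tweet: {tweet}\n"
        "Answer with the label only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_tweet("Still waiting on my refund after three weeks. Great service as always."))
```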
What We Tested
We evaluated six tasks to see how well the LLMs can classify social media content without any fine-tuning:
- Rumor Stance Detection: Identifying the stance of a reply towards a rumor, such as whether it supports, denies, questions, or comments on it.
- Sarcasm Detection: Determining if a tweet is meant to be sarcastic.
- Vaccine Stance: Predicting attitudes towards COVID-19 vaccination based on tweets.
- Complaint Detection: Identifying if a tweet expresses a complaint.
- Bragging Detection: Classifying whether a tweet involves bragging.
- Hate Speech Detection: Detecting harmful speech, such as racism or sexism, in social media posts.
Each task uses specific datasets that have been labeled by humans to allow us to measure accuracy.
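Because each dataset comes with human-assigned gold labels, the model outputs can be scored with standard classification metrics. A minimal sketch, assuming the model predictions have already been mapped onto the gold label set, might look like this (the label values are invented for illustration):

```python
# Score zero-shot predictions against human gold labels (illustrative data only).
from sklearn.metrics import accuracy_score, f1_score

gold = ["support", "deny", "query", "comment", "comment"]   # human annotations
pred = ["support", "comment", "query", "comment", "deny"]   # model outputs

print("Accuracy:", accuracy_score(gold, pred))
print("Macro-F1:", f1_score(gold, pred, average="macro"))
```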
Prompting Strategies
The way we prompt these models greatly influences their performance. We tried several prompting strategies:
- Basic Instruction: This involves a simple request without much detail about the task.
- Task and Label Description: Here, we provide more information, including descriptions of the task and labels.
- Memory Recall: We include titles of relevant papers to see if the models can recall them, potentially enhancing their responses.
- Using Synonyms: We substitute the original labels with similar words to determine if this helps the models perform better.
Changing the type of prompt often affects how well the LLMs classify information. For example, adding definitions or using synonyms might make the models' predictions more accurate.
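To illustrate the difference between these strategies, here is a hypothetical set of prompt templates for the sarcasm task, written in increasing order of complexity. The exact wording in the paper differs, so treat these purely as a sketch of the idea:

```python
# Illustrative prompt templates of increasing complexity (not the paper's exact wording).
tweet = "Oh great, another Monday. Just what I needed."

basic = f"Is the following tweet sarcastic? Answer yes or no.\nTweet: {tweet}"

with_definitions = (
    "Task: sarcasm detection. Sarcasm is a remark that means the opposite of what it "
    "literally says, often to mock or convey contempt.\n"
    "Labels: 'sarcastic' (the tweet is sarcastic), 'not sarcastic' (it is not).\n"
    f"Classify the tweet and answer with the label only.\nTweet: {tweet}"
)

with_synonyms = (
    "Is the following tweet ironic or literal? Answer 'ironic' or 'literal'.\n"
    f"Tweet: {tweet}"
)
```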
Findings
Our results revealed several clear trends. First, the LLMs struggled to match the performance of smaller models fine-tuned specifically for these tasks, such as BERT. Second, different prompting strategies produced significant differences in classification accuracy: in some cases, accuracy and F1 scores varied by more than 10% simply depending on how the models were asked to classify.
While the LLMs have impressive capabilities, we found that they generally performed better on simpler requests than on more complex ones. Detailed prompts did not always lead to better results and sometimes hurt performance, which means that designing effective prompts for these models can be quite challenging.
We also found that substituting synonyms for the class labels improved the models' performance on most tasks, suggesting that picking the right words is crucial for achieving better accuracy.
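One practical consequence is that model outputs phrased with synonyms still need to be mapped back onto the dataset's original labels before scoring. A small, assumed normalization step could look like this (the synonym table is a hypothetical example, not taken from the paper):

```python
# Map free-text model outputs (possibly synonyms) back to canonical dataset labels.
# The synonym table is a hypothetical example, not taken from the paper.
SYNONYM_TO_LABEL = {
    "ironic": "sarcastic",
    "literal": "not sarcastic",
    "sarcastic": "sarcastic",
    "not sarcastic": "not sarcastic",
}

def normalize(output: str, default: str = "not sarcastic") -> str:
    """Lowercase the model output and map it to a canonical label, falling back to a default."""
    return SYNONYM_TO_LABEL.get(output.strip().lower(), default)

print(normalize("Ironic"))  # -> "sarcastic"
```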
Implications for Social Media Analysis
The findings of this research have important implications for using LLMs in analyzing social media. As platforms like Twitter are filled with nuanced opinions and complex conversations, models that can classify text effectively can help organizations manage public sentiment, reduce misinformation, and address harmful content.
Our results show that while LLMs have potential, they should not fully replace human annotators. For example, in the vaccine stance task, the agreement between two human judges was around 62%, indicating the complexity involved in interpreting social media content. However, LLMs can significantly reduce the workload of human annotators by providing initial assessments, which can then be verified by humans.
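For context, simple percentage agreement between two annotators (the kind of figure quoted above) can be computed directly, and chance-corrected agreement such as Cohen's kappa is available in scikit-learn. The annotations below are invented purely for illustration:

```python
# Percentage agreement and Cohen's kappa between two hypothetical annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pro", "anti", "neutral", "pro", "anti", "pro"]
annotator_b = ["pro", "neutral", "neutral", "pro", "anti", "anti"]

agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Percentage agreement: {agreement:.2%}")
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```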
Future Directions
In the future, it will be valuable to investigate more ways to enhance the performance of LLMs in zero-shot settings. Further exploration of different prompt styles, fine-tuning techniques, and improving dataset diversity will be necessary to develop more reliable models for social science applications.
We also suggest testing these models in real-time scenarios to understand their practical value. Engaging with user feedback can clarify how well the LLMs interpret and classify real-world social media interactions.
Conclusion
In summary, while LLMs like ChatGPT and OpenAssistant show promise in understanding and classifying social media content, they still face challenges in zero-shot tasks. The way we prompt these models plays a crucial role in their performance, and there is a lot of room for improvement in how we design these prompts. As technology develops, we hope to see more refined LLMs that can assist researchers and organizations in analyzing social media data effectively and efficiently.
Title: Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
Abstract: Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
Authors: Yida Mu, Ben P. Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
Last Update: 2024-03-24 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2305.14310
Source PDF: https://arxiv.org/pdf/2305.14310
Licence: https://creativecommons.org/publicdomain/zero/1.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://platform.openai.com/docs/models/gpt-3-5
- https://platform.openai.com/docs/api-reference
- https://huggingface.co/OpenAssistant/oasst-sft-7-llama-30b-xor
- https://huggingface.co/bert-large-uncased
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection
- https://platform.openai.com/docs/api-reference/chat/create
- https://open-assistant.io/dashboard
- https://github.com/sohampoddar26/covid-vax-stance/tree/main/dataset
- https://platform.openai.com/docs/guides/rate-limits/overview