Evaluating LLMs in Social Science Classification
This study assesses ChatGPT and OpenAssistant's effectiveness in classifying social media data.
Large Language Models (LLMs) are tools that can understand and generate human language, and they have shown great skill at following instructions and producing useful responses. However, training these models requires substantial resources, so they are often applied without any task-specific training, a setup known as the zero-shot setting. This article focuses on how well two popular LLMs, ChatGPT and OpenAssistant, classify social science data in this zero-shot setting, and on how different ways of phrasing the request, known as prompting, affect their accuracy.
Zero-shot Performance
In our study, we compare the performance of ChatGPT and OpenAssistant on six classification tasks related to social media. These tasks involve classifying posts and comments to gain insight into public opinions and behaviors. We examine how the way we ask the models to perform these tasks affects their success: by varying the complexity of the prompts we give them, we aim to see whether the models can maintain high accuracy.
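To make the zero-shot setup concrete, here is a minimal sketch of classifying a single tweet through the OpenAI Chat Completions API. The prompt wording, label set, and model name are illustrative assumptions, not the exact prompts or configuration used in the paper.

```python
# Minimal zero-shot classification sketch using the OpenAI Chat Completions API.
# The prompt text and label set are illustrative placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["complaint", "not complaint"]  # hypothetical label set for complaint detection

def classify_tweet(tweet: str) -> str:
    """Ask the model for a single label; temperature 0 keeps outputs more stable."""
    prompt = (
        f"Classify the following tweet as one of: {', '.join(LABELS)}.\n"
        f"Tweet: {tweet}\n"
        "Answer with the label only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_tweet("Still waiting on my refund after three weeks. Great service as always."))
```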
What We Tested
We evaluated six tasks to see how well the LLMs can classify social media content without any fine-tuning:
- Rumor Stance Detection: Identifying the stance of a reply towards a rumor, such as whether it supports, denies, questions, or comments on it.
- Sarcasm Detection: Determining if a tweet is meant to be sarcastic.
- Vaccine Stance: Predicting attitudes towards COVID-19 vaccination based on tweets.
- Complaint Detection: Identifying if a tweet expresses a complaint.
- Bragging Detection: Classifying whether a tweet involves bragging.
- Hate Speech Detection: Detecting harmful speech, such as racism or sexism, in social media posts.
Each task uses specific datasets that have been labeled by humans to allow us to measure accuracy.
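Because each dataset comes with human-assigned gold labels, the model outputs can be scored with standard classification metrics. A minimal sketch, assuming the model predictions have already been mapped onto the gold label set, might look like this (the label values are invented for illustration):

```python
# Score zero-shot predictions against human gold labels (illustrative data only).
from sklearn.metrics import accuracy_score, f1_score

gold = ["support", "deny", "query", "comment", "comment"]   # human annotations
pred = ["support", "comment", "query", "comment", "deny"]   # model outputs

print("Accuracy:", accuracy_score(gold, pred))
print("Macro-F1:", f1_score(gold, pred, average="macro"))
```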
Prompting Strategies
The way we prompt these models greatly influences their performance. We tried several prompting strategies:
- Basic Instruction: This involves a simple request without much detail about the task.
- Task and Label Description: Here, we provide more information, including descriptions of the task and labels.
- Memory Recall: We include titles of relevant papers to see if the models can recall them, potentially enhancing their responses.
- Using Synonyms: We substitute the original labels with similar words to determine if this helps the models perform better.
Changing the type of prompt often affects how well the LLMs classify information. For example, adding definitions or using synonyms might make the models' predictions more accurate.
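To illustrate the difference between these strategies, here is a hypothetical set of prompt templates for the sarcasm task, written in increasing order of complexity. The exact wording in the paper differs, so treat these purely as a sketch of the idea:

```python
# Illustrative prompt templates of increasing complexity (not the paper's exact wording).
tweet = "Oh great, another Monday. Just what I needed."

basic = f"Is the following tweet sarcastic? Answer yes or no.\nTweet: {tweet}"

with_definitions = (
    "Task: sarcasm detection. Sarcasm is a remark that means the opposite of what it "
    "literally says, often to mock or convey contempt.\n"
    "Labels: 'sarcastic' (the tweet is sarcastic), 'not sarcastic' (it is not).\n"
    f"Classify the tweet and answer with the label only.\nTweet: {tweet}"
)

with_synonyms = (
    "Is the following tweet ironic or literal? Answer 'ironic' or 'literal'.\n"
    f"Tweet: {tweet}"
)
```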
Findings
Our results revealed several clear trends. First, the LLMs struggled to match the performance of smaller models fine-tuned specifically for these tasks, such as BERT. Second, different prompting strategies produced significant differences in classification accuracy: in some cases, accuracy and F1 scores varied by more than 10% simply depending on how the models were asked to classify.
While the LLMs have impressive capabilities, we found that they generally performed better on simpler requests than on more complex ones. Detailed prompts did not always lead to better results and sometimes hurt performance, which means that designing effective prompts for these models can be quite challenging.
We also found that substituting synonyms for the class labels improved the models' performance on most tasks, suggesting that picking the right words is crucial for achieving better accuracy.
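One practical consequence is that model outputs phrased with synonyms still need to be mapped back onto the dataset's original labels before scoring. A small, assumed normalization step could look like this (the synonym table is a hypothetical example, not taken from the paper):

```python
# Map free-text model outputs (possibly synonyms) back to canonical dataset labels.
# The synonym table is a hypothetical example, not taken from the paper.
SYNONYM_TO_LABEL = {
    "ironic": "sarcastic",
    "literal": "not sarcastic",
    "sarcastic": "sarcastic",
    "not sarcastic": "not sarcastic",
}

def normalize(output: str, default: str = "not sarcastic") -> str:
    """Lowercase the model output and map it to a canonical label, falling back to a default."""
    return SYNONYM_TO_LABEL.get(output.strip().lower(), default)

print(normalize("Ironic"))  # -> "sarcastic"
```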
Implications for Social Media Analysis
The findings of this research have important implications for using LLMs in analyzing social media. As platforms like Twitter are filled with nuanced opinions and complex conversations, models that can classify text effectively can help organizations manage public sentiment, reduce misinformation, and address harmful content.
Our results show that while LLMs have potential, they should not fully replace human annotators. For example, in the vaccine stance task, the agreement between two human judges was around 62%, indicating the complexity involved in interpreting social media content. However, LLMs can significantly reduce the workload of human annotators by providing initial assessments, which can then be verified by humans.
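For context, simple percentage agreement between two annotators (the kind of figure quoted above) can be computed directly, and chance-corrected agreement such as Cohen's kappa is available in scikit-learn. The annotations below are invented purely for illustration:

```python
# Percentage agreement and Cohen's kappa between two hypothetical annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pro", "anti", "neutral", "pro", "anti", "pro"]
annotator_b = ["pro", "neutral", "neutral", "pro", "anti", "anti"]

agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Percentage agreement: {agreement:.2%}")
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```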
Future Directions
In the future, it will be valuable to investigate more ways to enhance the performance of LLMs in zero-shot settings. Further exploration of different prompt styles, fine-tuning techniques, and improving dataset diversity will be necessary to develop more reliable models for social science applications.
We also suggest testing these models in real-time scenarios to understand their practical value. Engaging with user feedback can clarify how well the LLMs interpret and classify real-world social media interactions.
Conclusion
In summary, while LLMs like ChatGPT and OpenAssistant show promise in understanding and classifying social media content, they still face challenges in zero-shot tasks. The way we prompt these models plays a crucial role in their performance, and there is a lot of room for improvement in how we design these prompts. As technology develops, we hope to see more refined LLMs that can assist researchers and organizations in analyzing social media data effectively and efficiently.
Title: Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
Abstract: Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
Authors: Yida Mu, Ben P. Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
Last Update: 2024-03-24 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2305.14310
Source PDF: https://arxiv.org/pdf/2305.14310
Licence: https://creativecommons.org/publicdomain/zero/1.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://platform.openai.com/docs/models/gpt-3-5
- https://platform.openai.com/docs/api-reference
- https://huggingface.co/OpenAssistant/oasst-sft-7-llama-30b-xor
- https://huggingface.co/bert-large-uncased
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection
- https://platform.openai.com/docs/api-reference/chat/create
- https://open-assistant.io/dashboard
- https://github.com/sohampoddar26/covid-vax-stance/tree/main/dataset
- https://platform.openai.com/docs/guides/rate-limits/overview