Unraveling Truth in Social Media Claims
A competition aims to identify claims in social media posts accurately.
Soham Poddar, Biswajit Paul, Moumita Basu, Saptarshi Ghosh
― 7 min read
Social media is like a giant playground where everyone is shouting out their opinions and "facts". But, let's be real, not everything shared is true. In fact, some posts can be downright misleading or fake. That's where the big task of figuring out which claims are actually true comes into play. Just as we sort the toys that work from the ones that are broken, we need to sift through social media claims to figure out what's real and what's not.
With millions of posts popping up every day, it can be tough for humans to keep up. That's why we need machines to help us out. This brings us to a special competition focused on one specific mission: Claim Span Identification. In simpler terms, it's about finding the exact parts of a text that are making claims.
Claim Span Identification (CSI) Explained
Think of a claim like that friend who tells tall tales. You want to figure out what they really mean without getting tangled in their stories. The Claim Span Identification (CSI) task does just that by pinpointing the phrases in posts that claim to state facts. It's not as simple as just saying, "This is a claim" or "This isn't a claim." Instead, it requires diving deeper into the text and understanding its essence.
To illustrate this, if someone tweets, "I heard eating carrots can help you see in the dark," the claim here is "eating carrots can help you see in the dark." The task is to identify that specific phrase, just like finding the treasure chest on a pirate's map.
The Competition
This competition was organized for bright minds to tackle the CSI task. Participants were given a new set of data containing thousands of posts in two languages: English and Hindi. The goal was for teams to develop methods that would help identify claim spans from these posts.
The dataset consisted of about 8,000 posts in English and about 8,000 in Hindi, each labeled by human annotators who painstakingly marked which parts of the posts were making claims. Participants were tasked with coming up with solutions that could sort through these texts and pinpoint the claims accurately.
The Dataset
Imagine having a library filled with books where each book has a few sentences containing important claims. That's how the dataset was structured. It was designed to be useful and to include various kinds of claims so that the models trained on them could understand different scenarios.
The English part of the dataset included posts about COVID-19 vaccines, which are particularly sensitive given the misinformation swirling around vaccines. On the other hand, the Hindi side contained posts about fake news and hate speech, reflecting different but equally important social issues.
Experienced annotators, fluent in both languages, marked the claims in the posts. They were given training on how to spot claims and were paid for their work. The result? A carefully curated dataset that the competition participants could use to test their skills.
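To make the structure a bit more concrete, here is a small illustrative sketch of what one labeled record could look like. The field names (text, claim_spans) and the character-offset format are assumptions for illustration only; the actual schema is documented in the HECSI data repository.

```python
# Hypothetical example of one annotated post. The field names and the
# character-offset representation are illustrative, not the exact HECSI schema.
record = {
    "text": "I heard eating carrots can help you see in the dark, try it!",
    # (start, end) character offsets of the annotated claim span(s)
    "claim_spans": [(8, 51)],
}

# Recover the claim text from the character offsets.
for start, end in record["claim_spans"]:
    print(record["text"][start:end])
# -> eating carrots can help you see in the dark
```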
The Challenge of Claim Span Identification
Identifying a claim within a text isn't as easy as one might think. It's not just a matter of reading a sentence and making a judgment. The task is more complex, requiring attention to detail similar to a detective looking for clues.
The CSI task involves examining each word in a post. For example, if a post says, "Dogs can run faster than cats," the claim span is "Dogs can run faster than cats." However, if the post concludes with "but that's just what I heard," the challenge is to identify that earlier span without getting distracted by the qualifier at the end.
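One common way to frame this as a machine-learning problem is token classification: every word in the post gets a label saying whether it falls inside a claim span. The snippet below is a minimal sketch of that idea, assuming whitespace tokens and simple inside/outside labels; real systems typically use subword tokenizers and richer label sets.

```python
def label_tokens(text, claim_spans):
    """Label each whitespace token 1 if it overlaps any claim span, else 0
    (a deliberate simplification of how CSI systems prepare training data)."""
    labeled, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        inside = any(start < s_end and end > s_start
                     for s_start, s_end in claim_spans)
        labeled.append((token, 1 if inside else 0))
    return labeled

post = "Dogs can run faster than cats but that's just what I heard"
# The claim span covers "Dogs can run faster than cats" (characters 0-29).
print(label_tokens(post, [(0, 29)]))
```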
Overview of the Competition
The competition attracted teams from various regions, all eager to flex their problem-solving muscles. Participants were evaluated based on how well they could identify claim spans in both English and Hindi posts.
Teams had different approaches to tackle the challenge. Some focused more on the English posts, while others tried to balance their efforts across both languages. The evaluation criteria were strict, ensuring that the teams adhered to the guidelines and delivered the best possible solutions.
Different Competition Tracks
There were three tracks for the competition, each catering to different levels of resources and strategies:
- Constrained English Track: Teams could only use the English training and validation sets provided for the competition. This track emphasized understanding and working within a specific framework.
- Constrained Hindi Track: Similar to the English track, participants were limited to using only the Hindi training and validation sets for their models.
- Unconstrained Multilingual Track: Here, teams had the freedom to use any resources they wanted, making it more competitive and diverse.
Participants could choose to compete in one or more tracks, submitting solutions for each. This allowed teams to showcase their best work across different scenarios and languages.
Performance Evaluation
All the hard work culminated in performance evaluation based on certain metrics. Using scores like Macro-F1 and Jaccard metrics, teams were judged on how accurately they could predict claim spans.
Think of it like a game of darts; the closer you are to the bullseye with your predictions, the better your score. The final scores indicated how effectively each team could identify the claim spans from the provided posts.
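As a rough sketch of how such scores can be computed over per-token gold and predicted labels (the competition's exact evaluation script may differ), scikit-learn's f1_score and jaccard_score can be applied directly:

```python
from sklearn.metrics import f1_score, jaccard_score

# Gold and predicted labels for the tokens of one post:
# 1 = token is inside a claim span, 0 = outside.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("Jaccard :", jaccard_score(y_true, y_pred, average="macro"))
```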
Participating Teams
The competition saw participation from several teams, each bringing their unique approaches and solutions to the table. While most teams were from India, there was also involvement from teams in the USA and Bangladesh.
The organizers also contributed a baseline model to compare against. The baseline set a solid bar, but the participating teams rose to the challenge, trying to outpace it and each other.
Winning Solutions
Among the teams, a few stood out for their exceptional methods:
- JU NLP: This team nailed it with their preprocessing steps. They cleaned up the data before diving into the processing phase, which helped them achieve the best results in the English and Hindi tracks. They made sure to standardize everything, from URLs to user mentions, giving their models clean data to work with (a generic sketch of this kind of normalization appears after this list).
- FactFinders: This team didn't settle on just one model. They fine-tuned various models, mixing and matching to see what worked best for both the English and Hindi tracks. The creativity in their approach, particularly their use of additional datasets, helped them achieve high scores.
- DLRG: This team took a unique approach by using a 3-class BIO scheme for token classification, which meant they were breaking down the claims even further than just identifying them. This allowed them to provide more nuanced classifications and get solid results in the multilingual track.
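The kind of cleanup JU NLP relied on can be done with a few regular expressions. This is a generic sketch of such normalization (not the team's actual code), replacing links and mentions with placeholder tokens:

```python
import re

def normalize(post: str) -> str:
    """Replace URLs and @mentions with placeholders so the model sees
    consistent inputs (a generic sketch, not any team's actual code)."""
    post = re.sub(r"https?://\S+", "HTTPURL", post)  # standardize links
    post = re.sub(r"@\w+", "@USER", post)            # standardize user mentions
    return re.sub(r"\s+", " ", post).strip()         # collapse extra whitespace

print(normalize("Check this claim @john_doe https://example.com/post !!"))
# -> Check this claim @USER HTTPURL !!
```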
Analysis of Results
After the competition, the organizers analyzed the results and techniques used by the teams. It became clear that transformer models like BERT, RoBERTa, and XLM-RoBERTa were the go-to choices. These models have the amazing ability to grasp the context of language, which is crucial for tasks like claim identification.
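To give a feel for what such a setup involves, here is a minimal sketch (not any team's actual pipeline) that loads XLM-RoBERTa for token classification with three BIO labels using the Hugging Face transformers library; the training loop and data handling are omitted:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal sketch: a multilingual encoder with a 3-class token-classification
# head (B-claim, I-claim, O). Training and data loading are omitted.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

encoding = tokenizer("Dogs can run faster than cats", return_tensors="pt")
logits = model(**encoding).logits  # shape: (1, num_subword_tokens, 3)
```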
The findings showed that while the Unconstrained Multilingual track was a tough nut to crack, the structured English and Hindi tracks yielded better and more consistent results. The participants in the multilingual track struggled to beat even the baseline model.
Conclusion
The ICPR 2024 Competition on Multilingual Claim-Span Identification was a great step towards understanding how to verify claims in the vast jungle of social media. The challenges faced highlight the complexities involved in accurately identifying claims, proving that there's still a lot of work to be done in this field.
While the participants came up with a variety of methods and techniques, none could significantly outperform the baseline model, showcasing the ongoing need for innovation in the area of claim span identification.
The organizers hope that the publicly accessible dataset can motivate future researchers to continue tackling these challenges and contribute to the evolving landscape of misinformation management. After all, we all deserve to know what's true and what's, well, just a tall tale!
Title: ICPR 2024 Competition on Multilingual Claim-Span Identification
Abstract: A lot of claims are made in social media posts, which may contain misinformation or fake news. Hence, it is crucial to identify claims as a first step towards claim verification. Given the huge number of social media posts, the task of identifying claims needs to be automated. This competition deals with the task of 'Claim Span Identification' in which, given a text, parts / spans that correspond to claims are to be identified. This task is more challenging than the traditional binary classification of text into claim or not-claim, and requires state-of-the-art methods in Pattern Recognition, Natural Language Processing and Machine Learning. For this competition, we used a newly developed dataset called HECSI containing about 8K posts in English and about 8K posts in Hindi with claim-spans marked by human annotators. This paper gives an overview of the competition, and the solutions developed by the participating teams.
Authors: Soham Poddar, Biswajit Paul, Moumita Basu, Saptarshi Ghosh
Last Update: Nov 29, 2024
Language: English
Source URL: https://arxiv.org/abs/2411.19579
Source PDF: https://arxiv.org/pdf/2411.19579
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://sites.google.com/view/icpr24-csi/home
- https://icpr2024.org/
- https://sohampoddar26.github.io/
- https://amitykolkata.irins.org/profile/376094
- https://cse.iitkgp.ac.in/
- https://sites.google.com/view/aisome/aisome
- https://sites.google.com/view/irmidis-fire2022/irmidis
- https://www.cogitotech.com/
- https://github.com/sohampoddar26/hecsi-data
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard