Revolutionizing Feedback: A New Grading Approach
Discover how technology transforms student feedback with innovative grading methods.
Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya
― 8 min read
Table of Contents
- The Need for Personalized Feedback
- The MMSAF Problem
- What Is MMSAF?
- The MMSAF Dataset
- How Was the Dataset Created?
- Challenges in Traditional Grading
- The Role of Large Language Models (LLMs)
- Choosing the Right LLMs
- How Do LLMs Help?
- Evaluation of the LLMs
- Measuring Success
- Results of the Evaluation
- Correctness Levels
- Image Relevance
- Feedback Quality
- Expert Evaluation
- Who Came Out on Top?
- The Importance of Feedback in Learning
- Motivating Students
- Future Directions
- Expanding the Dataset
- Automating Image Annotations
- Ethical Considerations
- Conclusion
- Final Thoughts
- Original Source
- Reference Links
In education, giving students feedback is super important. It helps them learn and grow. But what happens when you have a classroom full of learners? How do you give each one the personal touch they need? Enter technology! With the help of intelligent systems, we can now offer personalized feedback to students. This article discusses a new approach to grading short answers given by students, especially when they also include images. It's like a teacher with superpowers!
The Need for Personalized Feedback
Imagine a classroom where everyone is working on their assignments. Some students ask questions, while others struggle in silence. Addressing their individual needs can be tricky for one teacher. This is where smart tools come into play. They aim to provide unique feedback based on each student’s answer, whether it’s in writing or with a picture.
Traditional assessment methods mostly focus on multiple-choice questions. These can be limiting, as they only allow students to pick answers without encouraging creativity. In contrast, open-ended questions let students express their thoughts freely. However, evaluating these answers can be tough! That's where Automatic Short Answer Grading (ASAG) comes in, but with a twist. We're now adding a new layer: feedback that recognizes images too!
The MMSAF Problem
Now, let’s dive into our main subject: the Multimodal Short Answer Grading with Feedback (MMSAF). This new approach allows teachers (and machines) to grade answers that include both text and images.
What Is MMSAF?
Think of MMSAF as a grading superhero. It takes a question, a reference answer (the "gold standard"), and the student's answer, any of which may include images, and gives a grade along with useful feedback. The goal is to help students understand where they went wrong and how they can improve.
This is particularly useful in subjects like science, where diagrams and images can really enhance understanding. For example, if a student draws a picture of a plant cell and explains its parts, the system grades not just the words, but also the image they provided.
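As a rough mental model, each MMSAF item can be thought of as a small record pairing the inputs with the expected outputs. The sketch below uses illustrative field names, not the authors' exact schema.

    # A minimal sketch of the MMSAF setup described above.
    # Field names are illustrative and not the authors' exact schema.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MMSAFInstance:
        question: str
        reference_answer: str                   # the "gold standard" answer
        student_answer: str
        reference_image: Optional[str] = None   # path to an image, if any
        student_image: Optional[str] = None     # e.g. the student's plant-cell diagram

    @dataclass
    class MMSAFJudgement:
        level_of_correctness: str        # "Correct" / "Partially Correct" / "Incorrect"
        image_relevance: Optional[str]   # "Relevant" / "Irrelevant", or None if no image
        feedback: str                    # constructive, student-facing explanation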
The MMSAF Dataset
To train our grading superhero, we needed a lot of data. We created a dataset consisting of 2,197 examples taken from high school-level questions in subjects like physics, chemistry, and biology.
How Was the Dataset Created?
We didn’t just pull this data out of thin air. We used textbooks and even some help from AI to generate example answers. Each entry in our dataset includes a question, a correct answer, a student answer, and information on whether their image (if provided) was relevant. This means that our superhero has a rich understanding of what good answers look like!
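To make this concrete, here is a small sketch of how such a dataset might be loaded and inspected. It assumes a hypothetical JSON Lines export with one record per line; the actual distribution format may differ.

    # Sketch only: assumes a hypothetical "mmsaf.jsonl" file, one JSON record per line.
    import json
    from collections import Counter

    def load_mmsaf(path: str) -> list[dict]:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    records = load_mmsaf("mmsaf.jsonl")
    print(len(records))                                # the full dataset has 2,197 entries
    print(Counter(r.get("subject") for r in records))  # physics / chemistry / biology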
Challenges in Traditional Grading
Grading open-ended questions comes with its own set of challenges. Many existing systems struggle when it comes to providing specific, insightful feedback. They might just say, "You did okay," without giving any real guidance. This can leave students feeling confused.
The MMSAF approach seeks to change all that. Not only does it evaluate the correctness of what students write, but it also considers how relevant their images are. It’s a more comprehensive way to evaluate both creativity and understanding.
The Role of Large Language Models (LLMs)
LLMs are like the brains behind our grading superhero. These models learn from vast amounts of data, allowing them to evaluate and provide feedback on both text and images.
Choosing the Right LLMs
We didn't just pick any model off the shelf. We selected four different LLMs to test our MMSAF approach: ChatGPT, Gemini, Pixtral, and Molmo. Each of these models has its own strengths, especially when it comes to understanding and reasoning through multimodal data, that is, text and images combined.
How Do LLMs Help?
Think of LLMs as very smart assistants that can read, write, and analyze. They can look at a student’s answer and compare it to a reference answer. They generate levels of correctness, comment on the relevance of images, and provide thoughtful feedback that addresses common errors. This saves time for teachers who might otherwise spend hours grading assignments.
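The paper's exact prompts are not reproduced here, but a grading call essentially packs the question, reference answer, and student answer into one instruction. The sketch below is a hypothetical prompt builder, with call_llm standing in for whichever multimodal chat API (ChatGPT, Gemini, Pixtral, or Molmo) is under test.

    # Hypothetical prompt builder; the authors' actual prompt wording may differ.
    def build_grading_prompt(question: str, reference_answer: str,
                             student_answer: str, has_image: bool) -> str:
        image_note = (
            "The student attached an image; judge whether it is relevant to the answer."
            if has_image else
            "No image was attached."
        )
        return (
            "You are grading a high-school short answer.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference_answer}\n"
            f"Student answer: {student_answer}\n"
            f"{image_note}\n"
            "Respond with: (1) Level of Correctness (Correct / Partially Correct / Incorrect), "
            "(2) Image Relevance if an image was attached, and (3) short, constructive feedback."
        )

    # judgement = call_llm(build_grading_prompt(q, ref, ans, has_image=True), images=[img])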
Evaluation of the LLMs
After setting up the MMSAF framework and dataset, we needed to see how well these LLMs performed. We randomly sampled 221 student responses and let our LLMs work their magic.
Measuring Success
We looked at how accurately each LLM predicted the level of correctness and the relevance of images. The main goal was to determine which model could provide the best feedback while remaining friendly and approachable, like a teacher, but with a little digital flair!
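In its simplest form, the headline metric is plain label accuracy over those sampled responses, computed separately for correctness and image relevance, as sketched below.

    # Plain label accuracy: the fraction of sampled responses where the model's
    # predicted label matches the expert (gold) label.
    def label_accuracy(gold: list[str], predicted: list[str]) -> float:
        assert len(gold) == len(predicted) and gold, "lists must be aligned and non-empty"
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    # Applied per model and per label type, e.g.:
    # correctness_acc = label_accuracy(gold_correctness, gemini_correctness)
    # relevance_acc   = label_accuracy(gold_relevance,  chatgpt_relevance)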
Results of the Evaluation
So, how did our LLM superheroes perform? It turned out that while some excelled in specific areas, others had certain shortcomings.
Correctness Levels
Gemini performed quite well when it came to predicting correctness levels. It reliably classified answers as correct, partially correct, or incorrect without much fuss. ChatGPT also did a good job but tended to label some incorrect answers as partially correct. Pixtral was lenient with its grading, giving some incorrect answers a pass as partially correct. On the other hand, Molmo didn’t fare as well, often marking everything as incorrect.
Image Relevance
When it came to the relevance of images, ChatGPT shone brightly. It was able to evaluate the images accurately in most cases. Meanwhile, Gemini struggled a bit, sometimes marking relevant images as irrelevant, which could leave students scratching their heads.
Feedback Quality
One of the most exciting aspects of our study was the quality of the feedback that each LLM generated. We wanted to ensure that the feedback was not only accurate but also constructive and encouraging.
Expert Evaluation
To get a better sense of how the feedback held up, we enlisted the help of subject matter experts (SMEs). These are real educators who know their subjects inside and out. They evaluated the feedback on several criteria, including grammar, emotional impact, correctness, and more.
Who Came Out on Top?
The experts rated ChatGPT as the best in terms of fluency and grammatical correctness, while Pixtral excelled in emotional impact and overall helpfulness. It turns out that Pixtral knew how to structure its feedback in a way that made it easy for students to digest.
The Importance of Feedback in Learning
Feedback is more than just a grade; it’s an opportunity for improvement. It can inspire students to dig deeper, ask questions, and truly engage with the material. In a world where students often feel overwhelmed, personalized feedback can be a game-changer.
Motivating Students
When students receive constructive feedback, it can ignite a spark of curiosity. They might think, “Hey, I never thought about it that way!” Effective feedback encourages students to learn from their mistakes and fosters a desire to keep exploring the subject matter.
Future Directions
While we’ve made great strides with the MMSAF framework and its evaluation methods, there’s still room to grow.
Expanding the Dataset
Currently, our dataset is primarily focused on high school subjects. In the future, we could expand it to include university-level courses and other subjects. This would create a more robust resource for educators and students alike.
Automating Image Annotations
Right now, some of the image-related feedback must be done manually. We could develop tools to automate this process, thus making it scalable and efficient.
Ethical Considerations
We’ve sourced our content from reputable educational resources to ensure that we meet ethical guidelines. It’s crucial to respect the boundaries of copyright and address issues of data privacy, especially when working with AI in education.
Conclusion
In summary, the MMSAF problem offers a fresh approach to assessing students’ short answers that include multimodal content. By leveraging the power of LLMs, we can help students receive valuable feedback that not only grades their work but also enhances their learning experience. With ongoing research and development, we can make educational experiences richer, more engaging, and, most importantly, more supportive for learners everywhere.
Final Thoughts
Education is more than just passing grades; it’s about nurturing curiosity and passion for learning. With tools like MMSAF and smart AI models, we stand on the brink of a new age in educational assessment. So, whether it’s a student’s text or a doodle of a cell, we’re ready to help them succeed, one grade at a time!
And who knows? Maybe one day, our grading superhero will help students learn from their homework mistakes while they laugh along the way. After all, learning should be fun!
Title: "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)
Abstract: Personalized feedback plays a vital role in a student's learning process. While existing systems are adept at providing feedback over MCQ-based evaluation, this work focuses more on subjective and open-ended questions, which is similar to the problem of Automatic Short Answer Grading (ASAG) with feedback. Additionally, we introduce the Multimodal Short Answer grading with Feedback (MMSAF) problem over the traditional ASAG feedback problem to address the scenario where the student answer and reference answer might contain images. Moreover, we introduce the MMSAF dataset with 2197 data points along with an automated framework for generating such data sets. Our evaluations on existing LLMs over this dataset achieved an overall accuracy of 55% on Level of Correctness labels, 75% on Image Relevance labels and a score of 4.27 out of 5 in correctness level of LLM generated feedback as rated by experts. As per experts, Pixtral achieved a rating of above 4 out of all metrics, indicating that it is more aligned to human judgement, and that it is the best solution for assisting students.
Authors: Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya
Last Update: Dec 27, 2024
Language: English
Source URL: https://arxiv.org/abs/2412.19755
Source PDF: https://arxiv.org/pdf/2412.19755
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://www.latex-project.org/help/documentation/encguide.pdf
- https://huggingface.co/
- https://platform.openai.com/docs/api-reference/introduction
- https://ai.google.dev/gemini-api/docs/api-key
- https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
- https://blog.google/technology/ai/google-gemini-ai/
- https://mistral.ai/news/pixtral-12b/
- https://molmo.allenai.org/blog
- https://aclweb.org/anthology/anthology.bib.gz
- https://www.ncrtsolutions.in/