AI Grading Handwritten Exams in Thermodynamics
A study on AI's role in grading thermodynamics exams reveals its strengths and weaknesses.
― 6 min read
Table of Contents
- The Challenge of Handwriting
- AI in Education
- Traditional Grading Methods
- The Potential of AI for Grading
- The Exam Setup
- Problems in the Exam
- Grading Process
- Understanding Large Language Models
- Using Cloud Infrastructure
- Exam Structure
- The Importance of Privacy
- Optical Character Recognition Challenges
- Using AI for Grading
- Different Grading Workflows
- Observations from Grading
- Results of the Study
- Recommendations for Future Exams
- Conclusion
- Original Source
In this study, we looked at how artificial intelligence (AI) can help grade handwritten exams in thermodynamics. We focused on a high-stakes exam with 252 participating students and four multipart problems. Our main challenge was getting handwritten answers into a format that the AI could read. We also found that the granularity of the grading criteria affected how well the AI performed.
The Challenge of Handwriting
One of the biggest hurdles was making handwritten answers machine-readable. Students use widely varying handwriting styles, which can make it hard for software to interpret what they mean. We found that grading complex answers, such as drawings and diagrams, was particularly hard for the AI. While the system was precise in identifying exams that met passing criteria, exams with failing grades still required human graders.
AI in Education
The rise of AI has opened up new options for education, including grading. Since AI systems, like language models, can analyze large amounts of data quickly, they show promise in giving feedback on open-ended responses. In past studies, AI systems showed some agreement with human graders, but those studies did not capture the full complexities of real exams.
Traditional Grading Methods
Traditionally, grading exams in physics requires detailed analysis. Teachers assess the final answers and the process students follow to get there. This includes evaluating logic, concepts, and math skills. While computers can help with grading some answers, human judgment is necessary for thorough evaluations, especially when students take different paths to solve a problem.
The Potential of AI for Grading
AI could offer a scalable way to provide feedback on exam papers. Recent advancements have made it possible for AI systems to analyze student answers and provide preliminary grades or classifications. However, many challenges still exist, particularly when it comes to understanding handwritten text. We explored different ways to use AI for grading and focused on how these methods performed in real-world situations.
The Exam Setup
The thermodynamics exam we studied included standard topics such as energy, entropy, and enthalpy. Students had 15 minutes to read the problems and two hours to complete them. They were allowed to use reference materials and calculators, and their answers had to be handwritten. With 252 of the 434 students agreeing to participate, we gathered a rich data set.
Problems in the Exam
The exam featured four problems, each with different parts. The first problem dealt with a reactor's steady-state operation. The second problem focused on the operation of an aircraft engine, while the third involved a hot gas and a solid-liquid system. The fourth problem centered on a freeze-drying process for food preservation. Each problem required students to provide detailed solutions, often including derivations and calculations.
Grading Process
We developed various workflows for grading. The first step was to scan the exams and convert them into a format that AI could interpret. We used Mathpix for optical character recognition (OCR) to transform handwriting into a machine-readable format. We then employed a language model, GPT-4, to analyze the transcribed text and propose grades.
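As a rough illustration of this two-stage pipeline, the sketch below sends one scanned page to an OCR service and passes the transcription to a language model for a preliminary grade. It assumes the Mathpix REST API and the OpenAI Python SDK; the request options, prompt wording, and helper names are illustrative, not the study's actual implementation.

```python
# Minimal sketch of the two-stage pipeline: OCR on a scanned page, then an LLM
# grading call. Assumes the Mathpix REST API and the OpenAI Python SDK; the
# request options and prompt wording are illustrative.
import requests
from openai import OpenAI

def ocr_page(image_path: str, app_id: str, app_key: str) -> str:
    """Send one scanned page to Mathpix OCR and return the recognized text."""
    with open(image_path, "rb") as f:
        response = requests.post(
            "https://api.mathpix.com/v3/text",
            headers={"app_id": app_id, "app_key": app_key},
            files={"file": f},
            data={"options_json": '{"formats": ["text"]}'},
        )
    response.raise_for_status()
    return response.json()["text"]

def grade_solution(transcript: str, rubric: str) -> str:
    """Ask the language model for a preliminary grade of one transcribed solution."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a thermodynamics grader. Apply the rubric "
                        "and report the points awarded for each item."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nStudent solution:\n{transcript}"},
        ],
    )
    return completion.choices[0].message.content
```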
Understanding Large Language Models
Large language models create responses based on probabilities. They produce text token by token, similar to how people construct sentences one word at a time. However, responses can vary widely depending on sampling settings, which can make output either more predictable or more creative, though not necessarily more correct. For this study, we maintained a standard configuration but adjusted it for specific tasks, like grading.
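The best-known such setting is the sampling temperature. The toy function below, which is illustrative and not tied to any particular model, shows how temperature reshapes a probability distribution over candidate tokens: low values make the most likely token dominate, while high values flatten the distribution.

```python
# Toy illustration (not tied to any particular model): sampling "temperature"
# rescales a model's raw scores before they become probabilities. Low values
# make the likeliest token dominate; high values flatten the distribution.
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over temperature-scaled logits, returning sampling probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 0.2))  # sharply peaked: near-deterministic output
print(apply_temperature(logits, 1.5))  # flatter: more varied, not more correct
```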
Using Cloud Infrastructure
We accessed OpenAI models through a service that ensured processing was done in Swiss data centers. This setup was crucial for maintaining data privacy and reliability. We evaluated the exams based on various criteria to ensure a fair and thorough grading process.
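The study does not spell out the exact client configuration, but a region-pinned deployment such as Azure OpenAI hosted in a Swiss region is one common way to meet this requirement. The sketch below assumes that service; the endpoint, deployment name, and API version are placeholders, not the study's actual setup.

```python
# Hypothetical configuration: a region-pinned Azure OpenAI resource is one
# common way to keep processing inside Swiss data centers. The endpoint,
# deployment name, and API version below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # Switzerland-hosted resource
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-gpt4-deployment>",  # Azure uses deployment names, not raw model IDs
    messages=[{"role": "user", "content": "Grade this solution step against the rubric..."}],
)
print(response.choices[0].message.content)
```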
Exam Structure
In the thermodynamics exam, we set parameters to guide the grading. Students had to provide clear, comprehensive solutions, and each exam problem was assigned two teaching assistants to ensure rigorous grading. The grading was based on a point system, with detailed criteria set to reflect basic understanding of the subject matter.
The Importance of Privacy
To maintain student privacy, we separated consent forms from exam submissions. This allowed for a blind grading process, which helped to avoid bias. However, this also introduced complexities since the graders were unaware of which students had consented to be part of the study.
Optical Character Recognition Challenges
The OCR process presented its own set of difficulties. Students wrote on various types of paper, some decorated with logos and headers that complicated the recognition process. The quality of handwriting also varied widely; some students had neat writing, while others were more difficult to read. This inconsistency impacted the grading accuracy.
Using AI for Grading
After converting exam papers into a machine-readable format, we used AI to grade the answers. Our approach involved a fine-grained grading rubric that assigned points for each step in a student's answer. This level of detail added complexity to the grading process and occasionally led to errors.
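To make the idea concrete, a fine-grained rubric can be represented as a list of step descriptions with point values, rendered into the grading prompt. The structure and the thermodynamics items below are hypothetical, not the rubric used in the study.

```python
# Hypothetical representation of a fine-grained rubric: each solution step has a
# description and a point value, rendered as numbered lines for the grading
# prompt. The field names and thermodynamics items are illustrative.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # what the step should contain
    points: float      # credit if the step is present and correct

rubric = [
    RubricItem("States the first law for an open, steady-state system", 2.0),
    RubricItem("Looks up the correct inlet and outlet enthalpies", 2.0),
    RubricItem("Solves for the heat duty with consistent units", 1.0),
]

def rubric_as_prompt(items: list[RubricItem]) -> str:
    """Render rubric items as numbered lines for inclusion in a prompt."""
    return "\n".join(
        f"{i + 1}. ({item.points} pts) {item.description}"
        for i, item in enumerate(items)
    )

print(rubric_as_prompt(rubric))
```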
Different Grading Workflows
We explored four distinct workflows in grading:
- Workflow 1: Used detailed grading rubrics.
- Workflow 2: Graded by problem parts, making it easier for the AI to keep track.
- Workflow 3: Assessed responses by giving a total grade for the whole problem, which reduced accuracy.
- Workflow 4: Graded without detailed rubrics, which led to greater variability.
The first workflow often resulted in bookkeeping errors, while the second workflow showed a better correlation between AI and human grading. Each method had its strengths and weaknesses, leading us to valuable conclusions about AI's capabilities.
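A minimal sketch of the per-part strategy from Workflow 2 appears below: each problem part is graded in its own model call against its own short rubric, so the model has less bookkeeping to do at once. The prompt wording, part labels, and helper names are assumptions for illustration.

```python
# Sketch of the per-part strategy (Workflow 2): each problem part gets its own
# model call and its own short rubric, so the model tracks less state at once.
# Prompt wording, part labels, and helper names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_part(transcript: str, rubric: str) -> str:
    """Grade one problem part against its own rubric."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Grade only this problem part against the rubric given."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nAnswer:\n{transcript}"},
        ],
    )
    return completion.choices[0].message.content

def grade_problem_by_parts(parts: dict[str, str],
                           rubrics: dict[str, str]) -> dict[str, str]:
    """Grade each part independently and collect the per-part results."""
    return {label: grade_part(text, rubrics[label]) for label, text in parts.items()}

grades = grade_problem_by_parts(
    {"a": "...transcribed answer to part (a)...",
     "b": "...transcribed answer to part (b)..."},
    {"a": "1. (2 pts) Sets up the energy balance ...",
     "b": "1. (3 pts) Sets up the entropy balance ..."},
)
```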
Observations from Grading
When we graded the exams, we found that problems with complex diagrams were often misunderstood by AI. The AI’s descriptions of these graphical responses were vague and could not be relied upon for accurate grading. For mathematical derivations, however, AI showed promise in assessing student work with reasonable accuracy.
Results of the Study
Overall, while AI provided promising results in identifying which students passed, the tools were not ready to fully replace human graders. High-stakes exams still require human oversight to ensure fair evaluations. The AI struggled with complex cases and often needed verification on low-scoring exams.
Recommendations for Future Exams
To improve the grading process in future exams, several changes could be made:
- Use plain paper to minimize confusion during the OCR process.
- Provide specific exam sheets with clear headers to assist with processing.
- Encourage students to write more detailed answers to capture their thought processes.
- Discourage students from scribbling out mistakes, since crossed-out work reduces OCR accuracy.
Conclusion
The exploration of AI in grading handwritten thermodynamics exams revealed valuable insights into its potential and limitations. While AI can assist in the grading process, it is clear that human evaluators remain essential. The learning from this study can guide future efforts in education technology to better integrate AI in grading systems, helping to create more effective and reliable evaluation processes.
By addressing the challenges encountered and implementing recommendations, we can work towards more efficient grading that benefits both students and educators in the long run.
Title: Grading Assistance for a Handwritten Thermodynamics Exam using Artificial Intelligence: An Exploratory Study
Abstract: Using a high-stakes thermodynamics exam as sample (252 students, four multipart problems), we investigate the viability of four workflows for AI-assisted grading of handwritten student solutions. We find that the greatest challenge lies in converting handwritten answers into a machine-readable format. The granularity of grading criteria also influences grading performance: employing a fine-grained rubric for entire problems often leads to bookkeeping errors and grading failures, while grading problems in parts is more reliable but tends to miss nuances. We also found that grading hand-drawn graphics, such as process diagrams, is less reliable than mathematical derivations due to the difficulty in differentiating essential details from extraneous information. Although the system is precise in identifying exams that meet passing criteria, exams with failing grades still require human grading. We conclude with recommendations to overcome some of the encountered challenges.
Authors: Gerd Kortemeyer, Julian Nöhl, Daria Onishchuk
Last Update: 2024-06-25 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.17859
Source PDF: https://arxiv.org/pdf/2406.17859
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.