AI Grading Handwritten Exams in Thermodynamics
A study on AI's role in grading thermodynamics exams reveals its strengths and weaknesses.
― 6 min read
Table of Contents
- The Challenge of Handwriting
- AI in Education
- Traditional Grading Methods
- The Potential of AI for Grading
- The Exam Setup
- Problems in the Exam
- Grading Process
- Understanding Large Language Models
- Using Cloud Infrastructure
- Exam Structure
- The Importance of Privacy
- Optical Character Recognition Challenges
- Using AI for Grading
- Different Grading Workflows
- Observations from Grading
- Results of the Study
- Recommendations for Future Exams
- Conclusion
- Original Source
In this study, we looked at how artificial intelligence (AI) can help grade handwritten exams in thermodynamics. We focused on a high-stakes exam with 252 participating students and four multipart problems. Our main challenge was getting handwritten answers into a format that the AI could read. We also found that the granularity of the grading criteria affected how well the AI performed.
The Challenge of Handwriting
One of the biggest hurdles was making handwritten answers machine-readable. Students use widely varying handwriting styles, which can make it hard for software to interpret what they mean. We found that grading complex answers, such as drawings and diagrams, was particularly hard for the AI. While the system was precise in identifying exams that met passing criteria, exams with failing grades still required human graders.
AI in Education
The rise of AI has opened up new options for education, including grading. Since AI systems, like language models, can analyze large amounts of data quickly, they show promise in giving feedback on open-ended responses. In past studies, AI systems showed some agreement with human graders, but those studies did not capture the full complexities of real exams.
Traditional Grading Methods
Traditionally, grading exams in physics requires detailed analysis. Teachers assess the final answers and the process students follow to get there. This includes evaluating logic, concepts, and math skills. While computers can help with grading some answers, human judgment is necessary for thorough evaluations, especially when students take different paths to solve a problem.
The Potential of AI for Grading
AI could offer a scalable way to provide feedback on exam papers. Recent advancements have made it possible for AI systems to analyze student answers and provide preliminary grades or classifications. However, many challenges still exist, particularly when it comes to understanding handwritten text. We explored different ways to use AI for grading and focused on how these methods performed in real-world situations.
The Exam Setup
The thermodynamics exam we studied included standard topics such as energy, entropy, and enthalpy. Students had 15 minutes to read the problems and two hours to complete them. They were allowed to use reference materials and calculators, and their answers had to be handwritten. With 252 of the 434 students agreeing to participate, we gathered a rich data set.
Problems in the Exam
The exam featured four problems, each with different parts. The first problem dealt with a reactor's steady-state operation. The second problem focused on the operation of an aircraft engine, while the third involved a hot gas and a solid-liquid system. The fourth problem centered on a freeze-drying process for food preservation. Each problem required students to provide detailed solutions, often including derivations and calculations.
Grading Process
We developed various workflows for grading. The first step was to scan the exams and convert them into a format that AI could interpret. We used Mathpix for optical character recognition (OCR) to transform handwriting into a machine-readable format. We then employed a language model, GPT-4, to analyze the transcribed text and propose grades.
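As a rough illustration of this two-stage pipeline, the sketch below sends one scanned page to an OCR service and passes the transcription to a language model for a preliminary grade. It assumes the Mathpix REST API and the OpenAI Python SDK; the request options, prompt wording, and helper names are illustrative, not the study's actual implementation.

```python
# Minimal sketch of the two-stage pipeline: OCR on a scanned page, then an LLM
# grading call. Assumes the Mathpix REST API and the OpenAI Python SDK; the
# request options and prompt wording are illustrative.
import requests
from openai import OpenAI

def ocr_page(image_path: str, app_id: str, app_key: str) -> str:
    """Send one scanned page to Mathpix OCR and return the recognized text."""
    with open(image_path, "rb") as f:
        response = requests.post(
            "https://api.mathpix.com/v3/text",
            headers={"app_id": app_id, "app_key": app_key},
            files={"file": f},
            data={"options_json": '{"formats": ["text"]}'},
        )
    response.raise_for_status()
    return response.json()["text"]

def grade_solution(transcript: str, rubric: str) -> str:
    """Ask the language model for a preliminary grade of one transcribed solution."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a thermodynamics grader. Apply the rubric "
                        "and report the points awarded for each item."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nStudent solution:\n{transcript}"},
        ],
    )
    return completion.choices[0].message.content
```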
Understanding Large Language Models
Large language models create responses based on probabilities. They produce text token by token, similar to how people construct sentences one word at a time. However, responses can vary widely depending on sampling settings, which can make output either more predictable or more creative, though not necessarily more correct. For this study, we maintained a standard configuration but adjusted it for specific tasks, like grading.
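The best-known such setting is the sampling temperature. The toy function below, which is illustrative and not tied to any particular model, shows how temperature reshapes a probability distribution over candidate tokens: low values make the most likely token dominate, while high values flatten the distribution.

```python
# Toy illustration (not tied to any particular model): sampling "temperature"
# rescales a model's raw scores before they become probabilities. Low values
# make the likeliest token dominate; high values flatten the distribution.
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over temperature-scaled logits, returning sampling probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 0.2))  # sharply peaked: near-deterministic output
print(apply_temperature(logits, 1.5))  # flatter: more varied, not more correct
```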
Using Cloud Infrastructure
We accessed OpenAI models through a service that ensured processing was done in Swiss data centers. This setup was crucial for maintaining data privacy and reliability. We evaluated the exams based on various criteria to ensure a fair and thorough grading process.
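The study does not spell out the exact client configuration, but a region-pinned deployment such as Azure OpenAI hosted in a Swiss region is one common way to meet this requirement. The sketch below assumes that service; the endpoint, deployment name, and API version are placeholders, not the study's actual setup.

```python
# Hypothetical configuration: a region-pinned Azure OpenAI resource is one
# common way to keep processing inside Swiss data centers. The endpoint,
# deployment name, and API version below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # Switzerland-hosted resource
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-gpt4-deployment>",  # Azure uses deployment names, not raw model IDs
    messages=[{"role": "user", "content": "Grade this solution step against the rubric..."}],
)
print(response.choices[0].message.content)
```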
Exam Structure
In the thermodynamics exam, we set parameters to guide the grading. Students had to provide clear, comprehensive solutions, and each exam problem was assigned two teaching assistants to ensure rigorous grading. The grading was based on a point system, with detailed criteria set to reflect basic understanding of the subject matter.
The Importance of Privacy
To maintain student privacy, we separated consent forms from exam submissions. This allowed for a blind grading process, which helped to avoid bias. However, this also introduced complexities since the graders were unaware of which students had consented to be part of the study.
Optical Character Recognition Challenges
The OCR process presented its own set of difficulties. Students wrote on various types of paper, some decorated with logos and headers that complicated the recognition process. The quality of handwriting also varied widely; some students had neat writing, while others were more difficult to read. This inconsistency impacted the grading accuracy.
Using AI for Grading
After converting exam papers into a machine-readable format, we used AI to grade the answers. Our approach involved a fine-grained grading rubric that assigned points for each step in a student's answer. This level of detail added complexity to the grading process and occasionally led to errors.
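To make the idea concrete, a fine-grained rubric can be represented as a list of step descriptions with point values, rendered into the grading prompt. The structure and the thermodynamics items below are hypothetical, not the rubric used in the study.

```python
# Hypothetical representation of a fine-grained rubric: each solution step has a
# description and a point value, rendered as numbered lines for the grading
# prompt. The field names and thermodynamics items are illustrative.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # what the step should contain
    points: float      # credit if the step is present and correct

rubric = [
    RubricItem("States the first law for an open, steady-state system", 2.0),
    RubricItem("Looks up the correct inlet and outlet enthalpies", 2.0),
    RubricItem("Solves for the heat duty with consistent units", 1.0),
]

def rubric_as_prompt(items: list[RubricItem]) -> str:
    """Render rubric items as numbered lines for inclusion in a prompt."""
    return "\n".join(
        f"{i + 1}. ({item.points} pts) {item.description}"
        for i, item in enumerate(items)
    )

print(rubric_as_prompt(rubric))
```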
Different Grading Workflows
We explored four distinct workflows in grading:
- Workflow 1: Used detailed grading rubrics.
- Workflow 2: Graded by problem parts, making it easier for the AI to keep track.
- Workflow 3: Assessed responses by giving a total grade for the whole problem, which reduced accuracy.
- Workflow 4: Graded without detailed rubrics, which led to greater variability.
The first workflow often resulted in bookkeeping errors, while the second workflow showed a better correlation between AI and human grading. Each method had its strengths and weaknesses, leading us to valuable conclusions about AI's capabilities.
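A minimal sketch of the per-part strategy from Workflow 2 appears below: each problem part is graded in its own model call against its own short rubric, so the model has less bookkeeping to do at once. The prompt wording, part labels, and helper names are assumptions for illustration.

```python
# Sketch of the per-part strategy (Workflow 2): each problem part gets its own
# model call and its own short rubric, so the model tracks less state at once.
# Prompt wording, part labels, and helper names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_part(transcript: str, rubric: str) -> str:
    """Grade one problem part against its own rubric."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Grade only this problem part against the rubric given."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nAnswer:\n{transcript}"},
        ],
    )
    return completion.choices[0].message.content

def grade_problem_by_parts(parts: dict[str, str],
                           rubrics: dict[str, str]) -> dict[str, str]:
    """Grade each part independently and collect the per-part results."""
    return {label: grade_part(text, rubrics[label]) for label, text in parts.items()}

grades = grade_problem_by_parts(
    {"a": "...transcribed answer to part (a)...",
     "b": "...transcribed answer to part (b)..."},
    {"a": "1. (2 pts) Sets up the energy balance ...",
     "b": "1. (3 pts) Sets up the entropy balance ..."},
)
```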
Observations from Grading
When we graded the exams, we found that problems with complex diagrams were often misunderstood by AI. The AI’s descriptions of these graphical responses were vague and could not be relied upon for accurate grading. For mathematical derivations, however, AI showed promise in assessing student work with reasonable accuracy.
Results of the Study
Overall, while AI provided promising results in identifying which students passed, the tools were not ready to fully replace human graders. High-stakes exams still require human oversight to ensure fair evaluations. The AI struggled with complex cases and often needed verification on low-scoring exams.
Recommendations for Future Exams
To improve the grading process in future exams, several changes could be made:
- Use plain paper to minimize confusion during the OCR process.
- Provide specific exam sheets with clear headers to assist with processing.
- Encourage students to write more detailed answers to capture their thought processes.
- Discourage students from scribbling out mistakes, since crossed-out work reduces OCR accuracy.
Conclusion
The exploration of AI in grading handwritten thermodynamics exams revealed valuable insights into its potential and limitations. While AI can assist in the grading process, it is clear that human evaluators remain essential. The learning from this study can guide future efforts in education technology to better integrate AI in grading systems, helping to create more effective and reliable evaluation processes.
By addressing the challenges encountered and implementing recommendations, we can work towards more efficient grading that benefits both students and educators in the long run.
Title: Grading Assistance for a Handwritten Thermodynamics Exam using Artificial Intelligence: An Exploratory Study
Abstract: Using a high-stakes thermodynamics exam as sample (252 students, four multipart problems), we investigate the viability of four workflows for AI-assisted grading of handwritten student solutions. We find that the greatest challenge lies in converting handwritten answers into a machine-readable format. The granularity of grading criteria also influences grading performance: employing a fine-grained rubric for entire problems often leads to bookkeeping errors and grading failures, while grading problems in parts is more reliable but tends to miss nuances. We also found that grading hand-drawn graphics, such as process diagrams, is less reliable than mathematical derivations due to the difficulty in differentiating essential details from extraneous information. Although the system is precise in identifying exams that meet passing criteria, exams with failing grades still require human grading. We conclude with recommendations to overcome some of the encountered challenges.
Authors: Gerd Kortemeyer, Julian Nöhl, Daria Onishchuk
Last Update: 2024-06-25 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.17859
Source PDF: https://arxiv.org/pdf/2406.17859
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.