Simple Science

Cutting-edge science explained simply

# Computer Science # Computers and Society # Software Engineering

Using AI to Improve Coding Education with Synthetic Data

Research shows LLMs can generate useful synthetic code for teaching.

Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, Arto Hellas

― 6 min read


AI in Coding Education: LLMs generate synthetic code to aid learning.

In the world of teaching computing, having data is as important as having a good cup of coffee on a Monday morning. It's essential for figuring out how students learn, improving support systems, and creating better assessment tools. But here's the kicker: not much data is shared openly. This is often due to privacy rules and the stress of ensuring that student identities remain hidden.

Synthetic Data and Large Language Models

Now, there is good news on the horizon! Large language models (LLMs) like GPT-4o might just be the superheroes we need. These models can generate large amounts of fake but realistic data that maintains student privacy. This kind of data can help researchers tackle issues in computing education and test new learning tools without the risk of revealing anyone's secrets.

Creating Synthetic Buggy Code

Our aim was to use LLMs to create synthetic buggy code submissions for beginners in programming. We compared how often these synthetic submissions failed test cases with how often actual student submissions from two different courses did. The goal was to see how well the synthetic data mimics the real student experience.

The results showed that LLMs can create synthetic code that isn't too different from real student data when it comes to how often the code fails tests. This means that LLMs could be a valuable tool for researchers and educators, allowing them to focus on teaching while worrying less about protecting student data.

Broadening Horizons in Computing Education

With the rise of LLMs, computing education is changing in ways we didn't think possible. These models are fantastic at handling simple programming tasks and have recently demonstrated their ability to tackle more complex issues too. While it’s impressive that they can generate correct solutions, what’s even more interesting is that they could also be used to create incorrect code on purpose.

The Importance of Incorrect Code

Generating incorrect code might sound counterintuitive, but it holds promise. Wrong code can be used in debugging exercises, which research shows helps students learn better. Furthermore, creating mixed sets of code with both correct and incorrect solutions could help educators prepare better datasets for assessing students' work.
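As a rough sketch of what such a mixed dataset might look like in practice, the snippet below packages an instructor's reference solution together with LLM-generated buggy variants into a single debugging-exercise record. The class name and fields are hypothetical illustrations; the study does not prescribe any particular format.

```python
# Sketch of how synthetic buggy code might be packaged into a debugging
# exercise. The structure and field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DebuggingExercise:
    prompt: str                    # the task statement shown to students
    reference_solution: str        # a known-correct, instructor-written solution
    buggy_variants: list[str] = field(default_factory=list)  # LLM-generated incorrect code

    def add_variant(self, code: str) -> None:
        """Add one synthetic buggy submission for students to find and fix bugs in."""
        self.buggy_variants.append(code)
```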

However, creating such datasets is tough. Programming education datasets are scarce, largely because of strict privacy rules. That's where LLMs step in, offering a fresh way to generate the kind of data that researchers can use without compromising anyone's privacy.

Investigating Prompting Strategies

To get the best results from LLMs, we looked into various strategies for asking them to generate code. Our research focused on identifying which prompts would guide these models to create code submissions that best resemble real student work.

We targeted a set of beginner programming problems to see how well the generated code matched what actual students had done. This study used two programming languages: C and Dart.

Context and Data Collection

C Programming Context

First, we gathered data from a six-week intro to C programming course at a university in New Zealand. Students worked on coding tasks individually in a lab, receiving immediate feedback from an automated system. After the course, we analyzed the final project submissions to see how many passed all tests and how many failed.
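The article does not detail the course's autograder, but the general shape of such a pipeline, compiling a C submission with a test harness and recording which tests pass, can be sketched roughly as follows. The file names, compiler flags, and harness convention below are assumptions for illustration, not the course's actual system.

```python
# Rough sketch: compile a C submission with a test harness and record
# which tests pass. File names and the harness protocol are assumptions;
# the course's real autograder is not described in the article.
import subprocess
import tempfile
from pathlib import Path

def run_c_tests(submission_c: str, tests_c: str, test_names: list[str]) -> dict[str, bool]:
    """Compile submission + tests, run each named test, return pass/fail per test."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "submission.c").write_text(submission_c)
        (tmp / "tests.c").write_text(tests_c)
        binary = tmp / "run_tests"

        compile_result = subprocess.run(
            ["gcc", "-std=c11", "-o", str(binary),
             str(tmp / "submission.c"), str(tmp / "tests.c")],
            capture_output=True,
        )
        if compile_result.returncode != 0:
            # A submission that does not compile fails every test.
            return {name: False for name in test_names}

        results = {}
        for name in test_names:
            try:
                # Assumed convention: the harness takes a test name and exits 0 on pass.
                run = subprocess.run([str(binary), name], capture_output=True, timeout=5)
                results[name] = run.returncode == 0
            except subprocess.TimeoutExpired:
                results[name] = False  # infinite loops count as failures
        return results
```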

Dart Programming Context

Next, we examined ten exercises from an online course platform at a university in Finland. This included both introductory and advanced courses, with programming tasks that ranged from simple to complex. We collected the submissions to get insights into the students' performance.

Prompting the Models

When we asked the LLM to generate incorrect code, we provided specific instructions to ensure the generated solutions would contain bugs. We didn't want the models to produce code that simply wouldn't work; we wanted code that looked almost right but contained some errors.

We created three types of prompts: one straightforward prompt, one that included specific test cases, and another that helped the model understand the frequency of test case failures. These prompt variations aimed to see how well the model could align its outputs with actual student errors.
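To make the three variants concrete, here is a minimal sketch of how such prompts might be phrased and sent to GPT-4o with the openai Python library. The example exercise, test cases, failure profile, and prompt wording are illustrative placeholders, not the prompts used in the study.

```python
# Minimal sketch of three prompt variants for generating buggy submissions.
# The exercise, tests, and prompt wording are illustrative, not the study's.
from openai import OpenAI  # requires the `openai` package and an API key

client = OpenAI()

EXERCISE = "Write a function `average(xs)` that returns the mean of a list of numbers."
TEST_CASES = "average([1, 2, 3]) == 2.0; average([5]) == 5.0; average([]) raises ValueError"
FAILURE_PROFILE = "Roughly 40% of real submissions fail only the empty-list test."

PROMPTS = {
    # 1) Straightforward: just ask for a plausible incorrect solution.
    "basic": f"Write an incorrect beginner solution to this exercise. "
             f"It should look almost right but contain a bug.\n\n{EXERCISE}",
    # 2) Test-aware: include the unit tests the submission will be graded against.
    "with_tests": f"Write an incorrect beginner solution to this exercise. "
                  f"It must fail at least one of these tests.\n\n{EXERCISE}\n\nTests: {TEST_CASES}",
    # 3) Failure-frequency-aware: describe how often real students fail each test.
    "with_failure_rates": f"Write an incorrect beginner solution to this exercise, "
                          f"matching how real students fail.\n\n{EXERCISE}\n\n"
                          f"Tests: {TEST_CASES}\nObserved failures: {FAILURE_PROFILE}",
}

def generate_buggy_submission(variant: str) -> str:
    """Ask GPT-4o for one synthetic buggy submission using the chosen prompt variant."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPTS[variant]}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_buggy_submission("with_tests"))
```

Looping a generator like this over every exercise and prompt variant would yield a pool of synthetic submissions analogous to the ones analyzed in the study.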

Analyzing Results

After generating 1500 synthetic submissions, we compared the results. We focused particularly on how often each piece of code passed or failed the unit tests. This analysis allowed us to measure the similarities and differences between the real student submissions and the synthetic submissions from the model.
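As a rough illustration of this kind of comparison, the sketch below tallies how often each unit test fails across the real and synthetic submission sets and runs a chi-squared test on the two distributions. The toy data and the choice of a chi-squared test are assumptions for illustration; this summary does not spell out the study's exact statistical procedure.

```python
# Sketch: compare per-test failure counts between real and synthetic submissions.
# The counts and the chi-squared comparison are illustrative assumptions.
from collections import Counter
from scipy.stats import chi2_contingency

# Each submission is represented by the set of test cases it failed (toy data).
real_failures = [{"test_empty"}, {"test_empty"}, {"test_single", "test_empty"}, {"test_mean"}]
synthetic_failures = [{"test_empty"}, {"test_mean"}, {"test_single"}, {"test_empty"}]

TESTS = ["test_empty", "test_single", "test_mean"]

def failure_counts(submissions):
    """Count how many submissions fail each test case."""
    counts = Counter()
    for failed in submissions:
        counts.update(failed)
    return [counts[t] for t in TESTS]

real_counts = failure_counts(real_failures)
synthetic_counts = failure_counts(synthetic_failures)

# Chi-squared test of whether the two failure distributions differ.
chi2, p_value, _, _ = chi2_contingency([real_counts, synthetic_counts])
print(f"real: {real_counts}, synthetic: {synthetic_counts}, p = {p_value:.3f}")
```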

Findings

We found some fascinating trends. For certain exercises, the model struggled to generate bugs that only partially failed tests. In contrast, real student submissions often showcased more nuanced errors. This suggests that while the LLMs can generate faulty code, they don't always capture the subtlety of real student mistakes.

Surprisingly, when comparing the different prompt strategies, we didn’t see much difference in the outputs for Dart. This means that no matter how we asked the model, the results were quite similar. For C, however, different prompts led to varied results, indicating that the model might need more help to generate code that is closer to actual student submissions.

The Effectiveness of Different Prompts

Interestingly, the prompts that provided the LLM with information about test cases and failure frequencies did not significantly improve the quality of the generated code for Dart. However, the same prompts did make a noticeable difference for C submissions. This reveals that the effectiveness of prompting strategies can depend on the particular programming context.

Common Issues

While we learned a lot about generating synthetic code, we faced some challenges. Our focus on incorrect code meant we missed the chance to see whether the model could produce correct and realistic code too. Since many student submissions pass all tests, our research only touched on a portion of their submissions.

Another issue was that the tests for some Dart exercises were not very thorough. This could mean that some bugs didn’t get caught by the tests, making our analysis a bit incomplete.

Conclusion: What’s Next?

In summary, our research shows that generative AI can create synthetic code submissions that are similar to actual student mistakes, particularly concerning test case failures. This opens doors for educators to use synthetic data in various ways, such as for preparing debugging exercises.

However, we need to explore further how well LLMs can mimic the nuances of real student code. Looking into correct code generation and other factors that make real submissions unique will offer deeper insights into improving computing education.

With the right approaches, we could see a future where educators wield the power of AI to enhance student learning experiences while keeping everyone's secrets safe. It's like giving teachers a magic wand: no more worrying about data privacy as they sprinkle AI-generated coding tasks across their classrooms!

Original Source

Title: LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education

Abstract: There is a great need for data in computing education research. Data is needed to understand how students behave, to train models of student behavior to optimally support students, and to develop and validate new assessment tools and learning analytics techniques. However, relatively few computing education datasets are shared openly, often due to privacy regulations and issues in making sure the data is anonymous. Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data, which can be used to explore various aspects of student learning, develop and test educational technologies, and support research in areas where collecting real student data may be challenging or impractical. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data. Our findings suggest that LLMs can be used to generate synthetic incorrect submissions that are not significantly different from real student data with regard to test case failure distributions. Our research contributes to the development of reliable synthetic datasets for computing education research and teaching, potentially accelerating progress in the field while preserving student privacy.

Authors: Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, Arto Hellas

Last Update: 2024-10-31 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2411.10455

Source PDF: https://arxiv.org/pdf/2411.10455

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
