Advancing Mutation Testing with Language Models
Using LLMs to enhance mutation testing effectiveness and software quality.
― 6 min read
Mutation testing is a method for assessing the quality of a test suite. It works by deliberately introducing small faults, called mutations, into a program and checking whether the test suite catches them. Doing so reveals how sensitive the tests are to real mistakes and helps improve the testing process.
Traditionally, mutation testing uses a limited set of changes to create these mistakes. For example, it might replace a plus sign with a minus sign or remove the content of a function. However, this approach can miss some real-world bugs because it doesn't cover every possible mistake. Some mistakes, like calling the wrong method on an object, won't be caught because current tools don’t account for such errors.
This article describes a new way of generating mutations using a Large Language Model (LLM): the model is prompted to suggest faults that could be injected into the code. The technique is implemented in LLMorpheus, a mutation testing tool for JavaScript, and evaluated on a variety of real-world projects.
What is Mutation Testing?
Mutation testing rests on the observation that many buggy programs differ only slightly from correct ones. The underlying assumption, often called the coupling effect, is that a test suite able to detect simple mistakes is also likely to detect larger, more complex faults.
In typical mutation testing:
- Small changes are made to the code.
- The modified code is run through the test suite.
- The tool checks if the tests pass or fail.
If a test fails, the mutation was detected and the mutant is said to be killed, which is the desired outcome. If all tests pass, the mutant survived: the change went unnoticed, which indicates a weakness in the test suite.
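To make this concrete, here is a minimal illustrative example in JavaScript (not taken from the paper) of a traditional arithmetic-operator mutation and how a test kills or misses it:

```javascript
// Original function under test.
function add(a, b) {
  return a + b;
}

// Mutant: a traditional operator replaces "+" with "-".
function addMutant(a, b) {
  return a - b;
}

// This assertion kills the mutant, since addMutant(2, 3) returns -1:
console.assert(add(2, 3) === 5);

// A weaker assertion lets the mutant survive, because 1 + 0 === 1 - 0:
console.assert(add(1, 0) === 1);
```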
Limitations of Traditional Mutation Testing
Most current mutation tools rely on a small set of operators to create mutations. While these tools have been effective, they don't account for the complexity of real-world code. For instance, some changes that lead to bugs, such as altering method calls or changing variable references, are not included.
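For instance, a realistic wrong-method-call bug might look like the following (a hypothetical JavaScript example; standard operator sets that only swap arithmetic or relational operators never generate this change):

```javascript
// Original: remove and return the first element of a queue.
function dequeue(queue) {
  return queue.shift();
}

// Realistic bug: the wrong method is called, removing the *last*
// element instead of the first.
function dequeueMutant(queue) {
  return queue.pop();
}

// A test using a single-element queue cannot tell the two apart;
// only a test with two or more elements detects the fault.
```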
Adding more mutation types could make testing more thorough. However, doing so can also increase the time and resources needed to run the tests. This is significant because real-world software projects already face challenges with runtime efficiency.
Introducing Large Language Models
The new approach leverages LLMs to generate more diverse mutations without needing extensive training. The idea is to replace a fragment of the source code with a placeholder and ask the model to propose various code snippets that could take its place. This method capitalizes on the vast amount of programming knowledge that LLMs have been trained on.
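A minimal sketch of the idea, assuming a hypothetical buildPrompt helper and placeholder token (LLMorpheus's actual prompt template and wording may differ):

```javascript
// Build a prompt that shows the model source code in which one
// fragment has been masked, and asks for candidate replacements.
function buildPrompt(sourceWithPlaceholder) {
  return [
    'You are performing mutation testing. Suggest three replacements',
    'for <PLACEHOLDER> in the code below that change the program\'s',
    'behavior, and briefly explain the effect of each.',
    '',
    sourceWithPlaceholder,
  ].join('\n');
}

// Example: mask the loop condition of a function under test.
const prompt = buildPrompt(`
function sum(xs) {
  let total = 0;
  for (let i = 0; <PLACEHOLDER>; i++) {
    total += xs[i];
  }
  return total;
}`);
```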
How It Works
The new mutation testing tool works in a few key steps:
1. Prompting the LLM: The tool generates prompts consisting of the source code, with specific fragments replaced by a placeholder, plus background information on mutation testing.
2. Generating Mutants: The LLM is asked to suggest replacements for the placeholder, each accompanied by an explanation of how the change affects the program's behavior.
3. Evaluating the Mutants: The suggested changes are applied by a modified mutation testing tool, which classifies each mutant as detected (killed), not detected (survived), or timing out during the tests; a sketch of this step follows the list.
4. Presenting the Results: Finally, the results are presented in an interactive format that allows users to inspect the generated mutations.
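For the evaluation step, here is a minimal sketch that runs a project's test suite against a mutated copy of the code and classifies the outcome (the helper name, npm test command, and default time budget are assumptions, not LLMorpheus's actual code):

```javascript
const { execSync } = require('child_process');

// Run the project's tests against a directory containing one applied
// mutant, and classify the outcome.
function classifyMutant(mutantDir, timeoutMs = 30000) {
  try {
    execSync('npm test', { cwd: mutantDir, timeout: timeoutMs, stdio: 'ignore' });
    return 'survived'; // all tests passed: the change went unnoticed
  } catch (err) {
    if (err.signal) {
      return 'timeout'; // the run was killed after exceeding the budget
    }
    return 'killed';    // at least one test failed: mutant detected
  }
}
```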
Experimenting with Diverse Mutations
To test the effectiveness of using LLMs, experiments were conducted on real-world JavaScript applications. The goal was to see how many new types of mutations could be created and how effectively they were detected by existing tests.
Key Questions Addressed
Several important questions guide the evaluation process:
- How many mutants does the tool create?
- How many of the surviving mutants are equivalent to the original code?
- What is the effect of different temperature settings on the LLM's output?
- How does the prompting strategy impact the number of generated mutations?
- What is the cost of running the tool on different applications?
Results from Experiments
Number of Mutations Created
The tool was able to generate a substantial number of mutants across the subject projects. Each run produced a range of suggestions, many of which resulted in new mutants. It is worth noting, however, that not all generated changes were syntactically valid.
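Syntactically invalid suggestions can be filtered out cheaply before any tests run, for example by attempting to parse each candidate. Below is a sketch using the acorn parser; the paper's actual validity check may differ:

```javascript
const acorn = require('acorn');

// Returns true if the mutated source parses as valid JavaScript.
function isSyntacticallyValid(source) {
  try {
    acorn.parse(source, { ecmaVersion: 'latest' });
    return true;
  } catch {
    return false;
  }
}

console.assert(isSyntacticallyValid('let x = 1 + 2;') === true);
console.assert(isSyntacticallyValid('let x = 1 +;') === false);
```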
Evaluating Equivalence of Mutants
A significant challenge in mutation testing is determining whether a mutant behaves the same as the original code. Some mutations might look different but behave identically under the conditions that matter. In the evaluation, surviving mutants were manually analyzed and classified as equivalent, near-equivalent, or not equivalent.
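A classic illustration of an equivalent mutant (a textbook-style example, not one of the paper's subjects):

```javascript
function visitAll(xs, visit) {
  // Original condition: i < xs.length
  // Mutant condition:   i !== xs.length
  // Because i starts at 0 and increases by exactly 1, it can never
  // skip past xs.length, so the mutant behaves identically to the
  // original and no test can ever kill it.
  for (let i = 0; i !== xs.length; i++) {
    visit(xs[i]);
  }
}
```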
Understanding Temperature Settings
The LLM has a setting called temperature that controls how varied its output is. A lower temperature yields more predictable, stable outputs, while a higher temperature produces more diverse results. The experiments showed that as the temperature increases, the number of distinct mutants increases as well, but more of the generated mutants may be syntactically invalid.
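As an illustration, temperature is simply a request parameter. The sketch below assumes an OpenAI-style chat client and a placeholder model name; it is not the paper's actual harness:

```javascript
const OpenAI = require('openai');
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Request suggestions for the same prompt at several temperatures to
// compare how varied the outputs become.
async function sampleAtTemperatures(prompt, temperatures = [0.0, 0.5, 1.0]) {
  const results = {};
  for (const temperature of temperatures) {
    const completion = await client.chat.completions.create({
      model: 'gpt-3.5-turbo', // placeholder model name
      messages: [{ role: 'user', content: prompt }],
      temperature, // higher values produce more varied output
    });
    results[temperature] = completion.choices[0].message.content;
  }
  return results;
}
```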
Impact of Prompting Strategy
The way prompts are structured can significantly impact the number of mutants generated. Different variations of prompts were tested to see how they affect results. Using comprehensive prompts tended to produce better outcomes compared to minimal ones.
Comparing Performance of Different LLMs
The choice of LLM also influences mutation generation. Several models were tested: one model may consistently generate more mutants than another, but the number of surviving mutants also varies from model to model.
Cost of Running the Tool
Finally, the tool's cost-effectiveness is an essential consideration. The expenses incurred when using the commercial version of the LLM were calculated based on the number of tokens used for both prompts and outputs. The total cost for running the tool across several projects turned out to be reasonable compared to traditional methods.
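As a rough sketch of how such a cost is computed from token counts (the per-token rates below are hypothetical; see the OpenAI pricing page linked in the references for current figures):

```javascript
// Estimate the dollar cost of a run from token counts. Rates are
// hypothetical and expressed in dollars per 1,000 tokens.
function estimateCost(promptTokens, completionTokens,
                      promptRate = 0.0005, completionRate = 0.0015) {
  return (promptTokens / 1000) * promptRate +
         (completionTokens / 1000) * completionRate;
}

// e.g. 2,000,000 prompt tokens and 500,000 completion tokens:
// estimateCost(2_000_000, 500_000) -> 1.75 dollars
```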
Conclusion
In summary, leveraging LLMs for mutation testing presents a promising advancement in software testing practices. The ability to generate diverse and relevant mutants could lead to better testing processes and higher quality software. Despite the challenges faced, such as determining mutant equivalence and managing runtime efficiency, this method opens new doors for improvement in the field of software testing. Future work will focus on refining the model's ability to generate meaningful mutations and further exploring its practical applications in various programming environments.
Future Directions
Future research will look into:
- Reducing the number of equivalent mutants created by the model.
- Finding ways to generate tests that can detect surviving mutants.
- Evaluating how well the generated mutants couple with real-world faults.
This approach has the potential to enhance mutation testing practices significantly and improve software reliability overall.
Title: LLMorpheus: Mutation Testing using Large Language Models
Abstract: In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-" or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique where a Large Language Model (LLM) is prompted to suggest mutations by asking it what placeholders that have been inserted in source code could be replaced with. The technique is implemented in LLMorpheus, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by LLMorpheus, demonstrating its practicality.
Authors: Frank Tip, Jonathan Bell, Max Schäfer
Last Update: 2024-04-15 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2404.09952
Source PDF: https://arxiv.org/pdf/2404.09952
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://github.com/githubnext/llmorpheus
- https://stryker-mutator.io/docs/mutation-testing-elements/supported-mutators/
- https://stryker-mutator.io/docs/General/faq/
- https://openai.com/pricing
- https://docs.arcmutate.com/docs/extended-operators.html
- https://github.com/infusion/Complex.js/tree/d995ca105e8adef4c38d0ace50643daf84e0dd1c
- https://github.com/manuelmhtr/countries-and-timezones/tree/241dd0f56dfc527bcd87779ae14ed67bd25c1c0e
- https://gitlab.com/autokent/crawler-url-parser/tree/202c5b25ad693d284804261e2b3815fe66e0723e
- https://github.com/quilljs/delta/tree/5ffb853d645aa5b4c93e42aa52697e2824afc869
- https://gitlab.com/demsking/image-downloader/tree/19a53f652824bd0c612cc5bcd3a2eb173a16f938
- https://github.com/felixge/node-dirty/tree/d7fb4d4ecf0cce144efa21b674965631a7955e61
- https://github.com/rainder/node-geo-point/tree/c839d477ff7a48d1fc6574495cbbc6196161f494
- https://github.com/jprichardson/node-jsonfile/tree/9c6478a85899a9318547a6e9514b0403166d8c5c
- https://github.com/swang/plural/tree/f0027d66ecb37ce0108c8bcb4a6a448d1bf64047
- https://github.com/pull-stream/pull-stream/tree/29b4868bb3864c427c3988855c5d65ad5cb2cb1c
- https://github.com/kriskowal/q/tree/6bc7f524eb104aca8bffde95f180b5210eb8dd4b
- https://gitlab.com/cptpackrat/spacl-core/tree/fcb8511a0d01bdc206582cfacb3e2b01a0288f6a
- https://github.com/maugenst/zip-a-folder/tree/d2ea465b20dc33cf8c98c58a7acaf875c586c3e1