Assessing ChatGPT's Role in Python Security Code Review
Exploring ChatGPT's effectiveness in identifying vulnerabilities in Python code.
― 8 min read
Artificial intelligence (AI) has become increasingly important in many areas of life. One area where it can help is security code review: checking code for security issues. AI tools are now being explored as a way to find problems in code, and one that has attracted particular attention is ChatGPT, known for its ability to follow instructions and provide detailed answers. This paper examines whether ChatGPT can be used to find security problems in Python code.
With the rise of technology, the amount of code being written has also increased. For example, the number of projects on GitHub doubled in just a few years, growing from 100 million in 2018 to 200 million in 2022. More code means more chances for security issues to arise. Research shows that the number of vulnerabilities in software has also been rising, which underscores the importance of finding and fixing these problems. Vulnerabilities are weaknesses that can be exploited, leading to data leaks or even service outages.
To find vulnerabilities in code, many tools use a method called static source code analysis. This method examines the code without actually running it, allowing testers to find security issues efficiently. Common tools for this purpose include Bandit, Semgrep, and SonarQube. While these tools are helpful, they do have some limitations, such as producing a lot of false positives (incorrect alerts) and false negatives (missing actual issues). When tools generate too many false alarms, it can take a lot of time and effort to check everything. On the other hand, missing real problems can have serious consequences.
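To make this concrete, here is a short, deliberately insecure Python snippet of the kind these tools flag. The file name and the Bandit rule IDs mentioned in the comments are illustrative assumptions; exact output depends on the tool and its version.

```python
# vulnerable_example.py -- deliberately insecure code, for illustration only
import subprocess

import yaml


def run_ping(user_input: str) -> None:
    # Concatenating untrusted input into a shell command is a classic
    # injection risk; Bandit typically reports shell=True usage (check B602).
    subprocess.call("ping " + user_input, shell=True)


def load_config(raw: str) -> dict:
    # yaml.Loader can construct arbitrary Python objects from tagged input;
    # Bandit typically flags this (check B506) and suggests yaml.safe_load.
    return yaml.load(raw, Loader=yaml.Loader)
```

Running `bandit vulnerable_example.py` would be expected to report both functions; the usual fixes are `subprocess.run(["ping", host])` without a shell and `yaml.safe_load(raw)`.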
In recent years, Machine Learning and deep learning have made great strides in many fields, including understanding human language. Since code is similar to natural language, researchers are interested in using deep learning methods for tasks related to finding vulnerabilities in code. Machine learning models can learn from data and identify patterns that may indicate security issues. Studies have suggested that these models can produce fewer false positives compared to traditional tools. Some research has even shown that deep learning models outperform several existing open-source tools in finding issues in C/C++ code.
Now, ChatGPT, which is based on AI and uses natural language processing, has attracted attention for its potential in business and other areas. It can automate tasks that usually need human effort, saving both time and resources. The model has been trained on a large dataset up until 2021, which equips it with knowledge about various patterns, including those found in code. This paper evaluates how well ChatGPT can identify security vulnerabilities in Python code compared to popular security tools like Bandit, Semgrep, and SonarQube.
Python has become one of the most popular programming languages, often ranking among the top three according to different surveys. It's widely used in many areas, not just in machine learning and data science, but also in web development with frameworks like Django and Flask. Because of its popularity, ensuring the security of Python applications is critical.
This paper is organized into several sections. The first section provides a brief review of previous research in this field. The following section discusses the datasets used for testing. After that, we explain the details of the experiments performed with ChatGPT. Next, we present the evaluation of the results obtained. Lastly, there is a discussion on the factors that could impact the validity of the results, followed by a conclusion.
Previous Research
In the past, many studies have focused on finding vulnerabilities using different AI models. Most of these studies have followed a method called supervised learning, in which machine learning models use features such as the number of lines in a piece of code or its complexity. Research has shown that models based on the code's text often perform better than those that rely mainly on such code features.
More recently, the focus has shifted towards deep learning. Researchers have explored different deep learning models, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Some researchers have even experimented with different types of code property graphs. A few studies have looked specifically at how deep learning models work in finding vulnerabilities, showing that fine-tuning models can lead to better performance.
One study suggested that transformer-based models perform better than graph-based models. For example, a model called VulBERTa was developed using the RoBERTa model to identify vulnerabilities in C/C++ code. Other studies have explored the use of BERT architecture for detecting code vulnerabilities, finding that transformer models can be more effective than traditional deep learning models.
Recently, there has been research evaluating ChatGPT for finding vulnerabilities in Java code. However, there's a gap in studies comparing ChatGPT with existing security tools specifically for Python, which this paper aims to fill.
Dataset Used for Testing
Our testing dataset consists of 156 Python code files. Of these, 130 files are taken from a security evaluation dataset that covers 75 different vulnerability types. The remaining 26 files come from a project focused on detecting vulnerabilities in Python. A security expert reviewed the files to mark the specific lines of code that contained vulnerabilities.
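The paper does not specify the format of these annotations, so the structure below is only a hypothetical sketch of how such expert labels could be stored for later comparison with tool and model output.

```python
# Hypothetical ground-truth store: one entry per file, listing the
# expert-marked vulnerable lines and, where known, the vulnerability type.
ground_truth = {
    "app/views.py": [
        {"line": 42, "type": "SQL Injection"},
        {"line": 57, "type": "Command Injection"},
    ],
    "utils/config.py": [
        {"line": 12, "type": "Hardcoded Password"},
    ],
}
```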
Using ChatGPT API for Detection
For our experiments, we used ChatGPT's gpt-3.5-turbo model, which allows more advanced interaction than previous versions. The model can process a series of messages and retain context, making it straightforward to ask questions about specific code files. We conducted four types of experiments with this model.
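As an illustration, a call to this model through the OpenAI Python client of that period (the pre-1.0 `openai` package) might look like the sketch below. The prompt wording and parameters are our assumptions, not the paper's exact setup.

```python
import openai  # legacy (pre-1.0) OpenAI Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

with open("vulnerable_example.py") as f:
    source = f.read()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are a security reviewer for Python code."},
        {"role": "user",
         "content": "Does the following code contain security "
                    "vulnerabilities? If so, report the line numbers.\n\n"
                    + source},
    ],
    temperature=0,  # assumption: deterministic output eases comparison
)

print(response["choices"][0]["message"]["content"])
```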
In the first experiment, we provided the model with vulnerable files and asked whether they contained security issues, without specifying any known vulnerability types. The goal was to determine whether the model could identify vulnerabilities unaided, indicating only the line numbers of the affected code.
In the second experiment, we provided a list of known vulnerability types and asked the model to identify which types were present in the vulnerable code files. The responses were formatted in JSON to facilitate comparison with results from other tools.
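The paper does not publish the JSON schema it used, so the comparison sketch below assumes a simple list of findings, each with a line number and a vulnerability label.

```python
import json

# Hypothetical model reply, assuming the prompt requested this schema.
reply = """
[
  {"line": 42, "vulnerability": "SQL Injection"},
  {"line": 57, "vulnerability": "Command Injection"}
]
"""

model_findings = {(f["line"], f["vulnerability"]) for f in json.loads(reply)}

# SAST tool output normalized to the same (line, label) pairs.
tool_findings = {(42, "SQL Injection"), (60, "Insecure Deserialization")}

print("agreed:", model_findings & tool_findings)
print("model only:", model_findings - tool_findings)
print("tool only:", tool_findings - model_findings)
```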
The third experiment involved giving the model labels produced by existing security tools and asking it to confirm whether the specific vulnerabilities were present in each file. The model was also allowed to report any additional vulnerabilities it identified.
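A minimal sketch of how such a confirmation prompt could be assembled from SAST output follows; the helper function and its wording are our assumptions, not the paper's prompt.

```python
def build_confirmation_prompt(source: str, sast_findings: list) -> str:
    """Hypothetical helper: ask the model to confirm or reject each
    SAST-reported vulnerability, optionally adding new findings."""
    listed = "\n".join(
        f"- line {f['line']}: {f['vulnerability']}" for f in sast_findings
    )
    return (
        "A static analysis tool reported these potential vulnerabilities:\n"
        f"{listed}\n\n"
        "For each item, state whether it is a real vulnerability. You may "
        "also report additional vulnerabilities you find. Respond as a "
        "JSON list of objects with keys line, vulnerability, confirmed.\n\n"
        f"Code:\n{source}"
    )
```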
In the final experiment, we did not provide the model with any labels and asked it to identify vulnerabilities based on its knowledge. The responses followed the same JSON format.
The choice of prompts used to interact with the model was critical, as it could greatly affect the results. We adjusted the way we presented the prompts to optimize the model's performance.
Evaluating Results
To assess the effectiveness of ChatGPT relative to the established tools, we calculated metrics based on the accuracy of vulnerability identification: the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) the model produced, from which measures such as precision and recall follow. We then compared ChatGPT's results with those from Bandit, Semgrep, and SonarQube.
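For reference, here is a minimal sketch of how these metrics are computed from the four counts; these are the standard definitions, not code from the paper.

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of reported vulnerabilities that are real.
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    # Fraction of real vulnerabilities that were reported.
    return tp / (tp + fn) if tp + fn else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Example: 30 true positives, 10 false alarms, 5 missed issues.
print(precision(30, 10), recall(30, 5), f1(30, 10, 5))
```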
In the first experiment, ChatGPT did not outperform the other tools in either precision or recall. In the second experiment, even with the provided vulnerability labels, the model's results were only comparable to those of the SAST tools.
In the third experiment, where ChatGPT acted as an assistant to the SAST tools, the results showed notable improvement. In particular, when we disregarded new labels introduced by the model, ChatGPT's results were significantly better than those of the existing tools.
In the final experiment, when we let the model rely on its own knowledge without any input labels, its performance remained similar to that of the SAST tools.
Challenges and Limitations
Several factors could influence the outcomes of our study. The main challenge was in selecting appropriate prompts for ChatGPT, which could greatly affect its performance. The size and diversity of the dataset also play a role, as does the coverage of different vulnerabilities. We only compared ChatGPT with three security tools for Python, and additional tools might provide different insights.
Finally, we focused on the GPT-3.5 version of ChatGPT; it is possible that newer versions could yield even better results in future studies.
Conclusion
In summary, we conducted various experiments to test ChatGPT's ability to identify security vulnerabilities in Python code. While the results showed that ChatGPT can contribute valuable insights, especially when used as an assistant to existing security tools, it is not yet a replacement for traditional methods. The findings of our study, while limited, indicate potential for future work in using advanced AI models for improving code security. As newer models are released, further research could help to achieve even better results in identifying vulnerabilities across different programming languages.
Title: Using ChatGPT as a Static Application Security Testing Tool
Abstract: In recent years, artificial intelligence has had a conspicuous growth in almost every aspect of life. One of the most applicable areas is security code review, in which a lot of AI-based tools and approaches have been proposed. Recently, ChatGPT has caught a huge amount of attention with its remarkable performance in following instructions and providing a detailed response. Regarding the similarities between natural language and code, in this paper, we study the feasibility of using ChatGPT for vulnerability detection in Python source code. Toward this goal, we feed an appropriate prompt along with vulnerable data to ChatGPT and compare its results on two datasets with the results of three widely used Static Application Security Testing tools (Bandit, Semgrep and SonarQube). We implement different kinds of experiments with ChatGPT and the results indicate that ChatGPT reduces the false positive and false negative rates and has the potential to be used for Python source code vulnerability detection.
Authors: Atieh Bakhshandeh, Abdalsamad Keramatfar, Amir Norouzi, Mohammad Mahdi Chekidehkhoun
Last Update: 2023-08-28
Language: English
Source URL: https://arxiv.org/abs/2308.14434
Source PDF: https://arxiv.org/pdf/2308.14434
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.