Improving Intrusion Detection with Feature Selection Methods

This article examines feature selection techniques for enhancing intrusion detection systems.

Cybersecurity is crucial for protecting data and systems from attacks. Intrusion Detection Systems (IDS) are tools that help identify and prevent these threats. These systems analyze computer and network data to find signs of malicious activity. Recently, machine learning (ML) and deep learning (DL) techniques have been used to improve IDS models. Popular methods include Random Forest (RF) and deep neural networks (DNN).

One important aspect of building effective IDS models is Feature Selection, which involves choosing the most relevant data points to use in the analysis. By selecting the right features, models can run faster and yield more accurate results. This article compares three different feature selection techniques: RF information gain, correlation feature selection using a Bat Algorithm, and correlation feature selection using the Aquila Optimizer.

Our research shows that Bat Algorithm-based feature selection (CFS-BA) is the most efficient of the three methods: its model builds in 55% of the time required by the best RF information gain model while achieving 99.99% of that model's accuracy. As cyber threats continue to rise, finding effective and efficient methods for intrusion detection is critical.

Cybersecurity Overview

Cybersecurity is an expanding area of focus due to the growing number of cyber threats. For example, in 2022, there were more than 1.3 billion malware programs identified. Additionally, data breaches can be very costly; the average expense of a data breach is around $4.24 million. A significant part of cybersecurity is threat detection, which identifies harmful activities. Network-based IDS (NIDS) aims to monitor network connections for signs of malicious traffic. Given that many serious attacks target organizations through their networks, developing NIDS is an important area of research.

Types of Intrusion Detection Systems

Intrusion detection systems can generally be categorized into two types: signature-based and anomaly-based systems. Signature-based IDS look for known attack patterns. They create a model based on past data and use that model to identify current threats, similar to how antivirus software works. However, these systems can struggle with new or unknown attacks.

In contrast, anomaly-based IDS identify unusual patterns in the data. This method can be more effective in revealing novel attacks, especially when dealing with large datasets that don't have clear correlations. Hybrid systems combine both approaches to improve overall performance.
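To make the contrast concrete, here is a minimal sketch of the two detection styles. It is an illustration only, not how any particular IDS works: the signature strings, the `signature_alert` and `anomaly_alert` helpers, the z-score rule, and the traffic numbers are all assumptions made up for this example.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signature list: substrings associated with known attacks.
KNOWN_SIGNATURES = ("' OR 1=1", "../../etc/passwd")


def signature_alert(payload: str) -> bool:
    """Signature-based check: flag a payload containing a known attack pattern."""
    return any(signature in payload for signature in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], z_threshold: float = 3.0) -> bool:
    """Anomaly-based check: flag a value far from the baseline mean (z-score rule)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Made-up baseline of requests per second observed during normal operation.
baseline_rates = [98.0, 101.0, 100.0, 99.0, 102.0]

known_attack = signature_alert("GET /?id=' OR 1=1")   # matches a stored pattern
traffic_spike = anomaly_alert(250.0, baseline_rates)  # far outside the baseline
```

The signature check can only catch what is already in its list, while the anomaly check can flag something never seen before, at the cost of also flagging unusual but harmless traffic.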

Data Sources for Research

In our research, we utilized real or simulated network data to test the various IDS models. Some common datasets include NSL-KDD, KDD-Cup'99, UNSW-NB15, and CSE-CIC-IDS2018. Our focus was on the CSE-CIC-IDS2018 dataset, as it contains a wide range of attacks, including zero-day attacks that often occur in newly set-up networks. This dataset is valuable for research due to its variety and recent updates.

Machine Learning Techniques

To build efficient intrusion detection systems, machine learning and deep learning techniques are employed. Machine learning focuses on statistical methods that derive patterns from known behaviors. Within this scope, classification methods are essential for determining whether a user is attempting an attack and identifying the nature of the attack. Since the data is often unbalanced, we chose to use Random Forest for our analysis.

Random Forest works by creating multiple decision trees that classify data points based on specific decision boundaries. It balances low variance and low bias, making it a useful method for our purposes.
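The idea behind ranking features by information gain can be sketched with a single decision boundary. The snippet below assumes an entropy-based definition of information gain and a simple threshold split; the function names and toy values are hypothetical, and a real Random Forest aggregates such gains over many splits and trees.

```python
from collections import Counter
from math import log2
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[str], threshold: float) -> float:
    """Entropy reduction from splitting the samples at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Toy feature (for example, packets per second) with made-up values.
packet_rate = [1.0, 2.0, 8.0, 9.0]
labels = ["benign", "benign", "attack", "attack"]

good_split = information_gain(packet_rate, labels, threshold=5.0)
poor_split = information_gain(packet_rate, labels, threshold=1.0)
```

A feature whose splits yield large gains across the forest ends up with a high importance score, which is what an embedded selection method uses to decide which features to keep.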

Deep Neural Networks aim to model complex relationships by connecting layers of nodes through activation functions. They are well suited to training on large datasets and often deliver strong performance compared to traditional machine learning techniques.
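As a rough illustration of "layers of nodes connected through activation functions", the sketch below runs one forward pass through a tiny two-layer network. The layer sizes, the weights, and the choice of ReLU and sigmoid activations are assumptions for this example and are not taken from the study.

```python
from math import exp
from typing import Callable, List, Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + exp(-x))


def dense(inputs: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float], activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: weighted sum per node, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Made-up weights: 3 input features -> 2 hidden nodes -> 1 output node.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]


def predict(features: Sequence[float]) -> float:
    """Return a score between 0 and 1 (read here as 'probability of attack')."""
    hidden = dense(features, HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUTPUT_W, OUTPUT_B, sigmoid)[0]


score = predict([1.0, 2.0, 3.0])
```

Training adjusts the weights so that the output matches labelled examples; this sketch only shows how an input flows through the layers once the weights are fixed.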

Feature Selection Methods

Feature selection is critical for improving the performance of intrusion detection systems. By narrowing down the features fed into the model, we can enhance speed and effectiveness. There are three major types of feature selection methods: filter methods, wrapper methods, and embedded methods.

Filter methods apply predefined criteria to assess the usefulness of features. Wrapper methods involve building and comparing many models based on subsets of features. Embedded methods train a model that then determines which features are valuable.

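In our study, we focused on two filter methods (CFS-BA and CFS-AO) and one embedded method (RF information gain). Correlation feature selection (CFS) scores a candidate subset of features by its correlations; CFS-BA and CFS-AO search for a high-scoring subset using the Bat Algorithm and the Aquila Optimizer, respectively.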
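A correlation-based subset score can be sketched with the classic CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-to-class correlation, and r_ff the mean feature-to-feature correlation. The version below uses absolute Pearson correlation, which is an assumption; the study's exact correlation measure is not stated here, and the feature names and values are made up.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def cfs_merit(features: Dict[str, Sequence[float]], target: Sequence[float],
              subset: Sequence[str]) -> float:
    """CFS merit: rewards correlation with the class, penalizes redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = mean(abs(pearson(features[name], target)) for name in subset)
    pairs = list(combinations(subset, 2))
    r_ff = mean(abs(pearson(features[a], features[b])) for a, b in pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy data: the second feature is a rescaled copy of the first.
features = {
    "pkt_rate": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "pkt_rate_copy": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "duration": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
target = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

merit_single = cfs_merit(features, target, ["pkt_rate"])
merit_redundant = cfs_merit(features, target, ["pkt_rate", "pkt_rate_copy"])
```

Adding the redundant copy leaves the merit unchanged, which shows why this kind of score favors small subsets of features that are informative but not duplicates of each other.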

Bat Algorithm

The Bat Algorithm is a metaheuristic optimization technique based on how bats use echolocation to hunt. This algorithm works in two main phases: exploration, which aims to cover a wide range of possible solutions, and exploitation, which focuses on finding the best solution within a specific area.

In our study, we applied the Bat Algorithm to find the best subset of features based on their correlation with the target variable. This method provided excellent results when tested with the CSE-CIC-IDS2018 dataset.
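The search over feature subsets can be sketched as a simplified binary Bat Algorithm, where each bat is a 0/1 mask over the features. This is a loose illustration and not the study's implementation: the flip rule based on `tanh`, the parameter defaults, and the toy fitness (a stand-in for a correlation-based subset score) are all assumptions.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def binary_bat_search(fitness: Callable[[Sequence[int]], float], n_features: int,
                      n_bats: int = 10, n_iter: int = 50, f_min: float = 0.0,
                      f_max: float = 2.0, alpha: float = 0.9, gamma: float = 0.9,
                      seed: int = 0) -> Tuple[List[int], float]:
    """Search for a 0/1 feature mask that maximizes `fitness`."""
    rng = random.Random(seed)
    bats = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_bats)]
    velocity = [[0.0] * n_features for _ in range(n_bats)]
    scores = [fitness(bat) for bat in bats]
    loudness = [1.0] * n_bats
    pulse_rate = [0.0] * n_bats
    best_idx = max(range(n_bats), key=scores.__getitem__)
    best, best_score = bats[best_idx][:], scores[best_idx]

    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # Exploration: a random frequency scales the pull toward the best mask.
            freq = f_min + (f_max - f_min) * rng.random()
            candidate = bats[i][:]
            for d in range(n_features):
                velocity[i][d] += (bats[i][d] - best[d]) * freq
                # Larger |velocity| means a higher chance of flipping this bit.
                if rng.random() < abs(math.tanh(velocity[i][d])):
                    candidate[d] = 1 - candidate[d]
            # Exploitation: sometimes try a one-bit change of the best mask instead.
            if rng.random() > pulse_rate[i]:
                candidate = best[:]
                flip = rng.randrange(n_features)
                candidate[flip] = 1 - candidate[flip]
            cand_score = fitness(candidate)
            # Accept improvements with probability equal to the bat's loudness.
            if cand_score > scores[i] and rng.random() < loudness[i]:
                bats[i], scores[i] = candidate, cand_score
                loudness[i] *= alpha
                pulse_rate[i] = 1.0 - math.exp(-gamma * t)
            if scores[i] > best_score:
                best, best_score = bats[i][:], scores[i]
    return best, best_score


# Hypothetical per-feature relevance; each selected feature costs 0.3.
RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02]


def toy_fitness(mask: Sequence[int]) -> float:
    return sum(r for r, m in zip(RELEVANCE, mask) if m) - 0.3 * sum(mask)


best_mask, best_score = binary_bat_search(toy_fitness, n_features=len(RELEVANCE))
```

In a real pipeline the fitness function would score each mask on the training data, so the cost of one evaluation largely determines how many bats and iterations are affordable.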

Aquila Optimizer

The Aquila Optimizer is a newer metaheuristic algorithm that aims to outperform previous methods in speed and efficiency. While it may take longer to converge on the best solution, it has shown strong results in feature selection across various benchmarks.

In this research, we compared the performance of the Aquila Optimizer against the Bat Algorithm to evaluate their effectiveness in selecting features for intrusion detection systems.

Assessment Metrics

To measure the success of our intrusion detection models, we analyzed a set of performance metrics. These included accuracy, precision, F1 score, and the false alarm rate (FAR). For binary classification, we used a confusion matrix to determine how well our models performed in predicting malicious versus benign activity.
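These metrics all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, with "attack" as the positive class and FAR taken as the false positive rate, FP / (FP + TN); the counts are made up, and recall is included only because F1 is built from it.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common IDS metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # False alarm rate: share of benign traffic wrongly flagged as malicious.
        "far": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Made-up counts: 100 attack flows and 100 benign flows.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these counts, 185 of 200 flows are classified correctly and 5 of the 100 benign flows raise a false alarm.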

For multi-class classification, we calculated metrics by treating each class individually and determining overall accuracy. The goal was to obtain a thorough understanding of how well each model performed using different subsets of features.
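Treating each class individually usually means a one-vs-rest view: for a given class, every other class counts as "negative". The sketch below assumes that reading; the `per_class_metrics` helper and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_metrics(y_true: Sequence[str],
                      y_pred: Sequence[str]) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Return overall accuracy and one-vs-rest metrics for each class."""
    pairs = Counter(zip(y_true, y_pred))  # (true label, predicted label) -> count
    total = len(y_true)
    report: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        tn = total - tp - fp - fn
        report[label] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "far": fp / (fp + tn) if (fp + tn) else 0.0,
        }
    accuracy = sum(n for (t, p), n in pairs.items() if t == p) / total
    return accuracy, report


# Toy labels in which a DoS flow and a brute-force flow are confused with each other.
y_true = ["benign", "benign", "dos", "dos", "brute", "brute"]
y_pred = ["benign", "benign", "dos", "brute", "brute", "dos"]

accuracy, report = per_class_metrics(y_true, y_pred)
```

Per-class numbers like these make it visible when overall accuracy is high but one attack type is regularly mistaken for another.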

Data Preparation

We used the CSE-CIC-IDS2018 dataset, which was created to simulate network data for intrusion detection system research. The dataset includes simulated attacks over ten days and contains numerous numerical inputs.

Before analysis, we cleaned the data by removing irrelevant features and normalizing the remaining predictors. We selected a 50/50 train-test split to ensure we had enough data for thorough testing and validation.
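The normalization and split steps can be sketched as follows. Min-max scaling and a seeded random shuffle are assumptions for this example, since the specific normalization and splitting procedure are not detailed here; the helper names and toy rows are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to the range [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_half(rows: Sequence[Sequence[float]], labels: Sequence[str],
               seed: int = 0) -> Tuple[list, list, list, list]:
    """Shuffle the sample indices and split them 50/50 into train and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    train_idx, test_idx = indices[:half], indices[half:]
    return (
        [rows[i] for i in train_idx], [labels[i] for i in train_idx],
        [rows[i] for i in test_idx], [labels[i] for i in test_idx],
    )


# Made-up rows with two numeric predictors each.
raw_rows = [[0.0, 10.0], [5.0, 10.0], [10.0, 10.0], [2.5, 10.0]]
raw_labels = ["benign", "attack", "attack", "benign"]

scaled_rows = min_max_normalize(raw_rows)
x_train, y_train, x_test, y_test = split_half(scaled_rows, raw_labels)
```

A stricter variant would compute the column minimums and maximums on the training half only and reuse them for the test half, so that no information from the test data leaks into preprocessing.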

Results and Analysis

After running our models on the reduced feature subsets, we found that both the Bat Algorithm and RF information gain methods significantly outperformed models using the full set of features. The Bat Algorithm subset also cut model build time to 55% of that of the best RF information gain model while reaching 99.99% of its accuracy.

In terms of performance, the Random Forest model achieved the highest accuracy with the fewest features. The deep neural network model also performed well but faced some challenges with specific types of attacks.

Confusion matrices revealed patterns of misclassification between certain types of attacks, such as denial-of-service and brute force attacks, indicating areas where models could improve.

Conclusion

This research demonstrated that feature selection methods, particularly the Bat Algorithm and RF information gain, provide meaningful benefits for intrusion detection systems. The models that incorporated these methods significantly reduced the number of features while improving classification performance.

As cybersecurity threats continue to evolve, employing efficient and effective IDS models is essential. Future research may further explore different feature selection methods, neural network architectures, and assessment metrics to enhance the performance and explainability of intrusion detection systems. With continued advancements, we can better safeguard our digital environments against emerging threats.

Original Source

Title: Feature Reduction Method Comparison Towards Explainability and Efficiency in Cybersecurity Intrusion Detection Systems

Abstract: In the realm of cybersecurity, intrusion detection systems (IDS) detect and prevent attacks based on collected computer and network data. In recent research, IDS models have been constructed using machine learning (ML) and deep learning (DL) methods such as Random Forest (RF) and deep neural networks (DNN). Feature selection (FS) can be used to construct faster, more interpretable, and more accurate models. We look at three different FS techniques; RF information gain (RF-IG), correlation feature selection using the Bat Algorithm (CFS-BA), and CFS using the Aquila Optimizer (CFS-AO). Our results show CFS-BA to be the most efficient of the FS methods, building in 55% of the time of the best RF-IG model while achieving 99.99% of its accuracy. This reinforces prior contributions attesting to CFS-BA's accuracy while building upon the relationship between subset size, CFS score, and RF-IG score in final results.

Authors: Adam M. Lehavi, Seongtae Kim

Last Update: 2023-03-22 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2303.12891

Source PDF: https://arxiv.org/pdf/2303.12891

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
