DATA MINING FOR NETWORK INTRUSION DETECTION AND CYBER THREAT INTELLIGENCE
ABSTRACT
With the growing number of network and system breaches, it has become critical to be security conscious and to put measures in place to curtail these attacks. Knowing what an attack is and how it is perpetrated is as important as knowing how to prevent it. Intrusion detection systems monitor networks to find any vulnerabilities, while threat intelligence sharing covers ways of disseminating information about these vulnerabilities. This work follows the Cross Industry Standard Process for Data Mining (CRISP-DM) data science approach. Three machine learning algorithms, namely Support Vector Machine, Random Forest and K-Nearest Neighbor, were used to model the intrusion detection system, with the Kaggle Network Intrusion Detection dataset used for training and evaluation. When an attack is noticed, the algorithm with the highest accuracy is used to detect it, and the resulting information can then be sent to the concerned parties; this aspect constitutes cyber threat intelligence sharing.
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
As is well known, one of the main issues in the modern world is cybersecurity. There could be a variety of causes, including obsolete software and default usernames and passwords that encourage cybercrimes. As technology advances, cybercrimes get increasingly complex. It is essential to be trained in suitable cyber security in an era where cyberspace is recognized as one of the five domains of warfare.
An intrusion detection system (IDS) is a software or hardware monitor that examines data to find any attacks on a network or system. When big data is involved, conventional intrusion detection techniques make the framework more complex and less efficient, since the analysis process is difficult and time-consuming. The lengthy analysis leaves the system vulnerable to harm for some time before any alerts are issued (Tchakoucht & Ezziyyani, 2018).
A company employs cyber threat intelligence (CTI) to comprehend all dangers and lessen their effects. Sensitive data is protected from cyber threats by using information from a cyber threat intelligence database. Cyber threat intelligence sharing is the distribution of this information to organisations, either in real time or remotely, as a form of alert system. This sharing is important because it helps these organisations put strategies in place to mitigate attacks as they come.
Intrusion detection systems play a significant role in the defensive architecture of contemporary networks. By enabling the early detection of threats, they provide the chance to counteract them. They can detect numerous threats, including Man-in-the-Middle (MitM) and Denial of Service (DoS) attacks, and they can track and record network activity as needed. Because IDSs can provide information on attacks as they happen, security administrators can better understand what transpired and can adjust the security of the system to recognize and stop similar attacks in the future.
There are two types of intrusion detection systems: host intrusion detection systems (HIDS) and network intrusion detection systems (NIDS) (Checkpoint, 2022):
Network Intrusion Detection Systems: These IDSs are strategically positioned throughout the network. To determine whether the network is being used for nefarious purposes, NIDS examines the network's overall traffic. NIDS is a vital part of the security of the majority of business networks and aids in the detection of assaults from your own hosts.
Host Intrusion Detection Systems: Most of the client machines (hosts) in the network contain these detection systems. In contrast to NIDS, a HIDS examines a single host's traffic and activity, and if it notices any unusual behavior, it sounds an alarm.
Intrusion detection systems alone are no longer adequate for businesses wanting to fend off attacks, and IDS is gradually being replaced by "intrusion prevention systems" (IPS) in more and more scenarios. In addition to the detection capabilities of an IDS, an IPS has active components that block attacks before they succeed; an IPS frequently consists of IDS controls placed on a firewall. In contrast to an IDS, an IPS is deployed in-line, which allows it to continuously watch over the traffic that passes through it. Because of this, an IPS needs to be fast and powerful enough to minimize network latency issues that could reduce network performance for its users.
False-positive attack detection is one of the key drawbacks of many IPSs. False positives are merely frustrating for an IDS, but for an IPS they can result in denial of service, because valid traffic is blocked. Additionally, since intrusion prevention systems, particularly Network Intrusion Prevention Systems (NIPSs), create a single point of failure for the network, they must be extremely stable and attack-resistant (Vanin et al., 2022).
1.2 Problem Statement
The intrusion detection system (IDS) is an essential instrument used in cyber security to track and identify intrusion attacks. Machine learning (ML) uses statistical modeling methods to identify patterns in historical data and then makes predictions on fresh information. As a result, IDSs have adopted machine learning algorithms employing an anomaly-based methodology. The challenge at hand is to build a model that provides high precision with a small false alarm rate (Kok et al., 2019).
This research seeks to assess contemporary IDS research using ML techniques, including Support Vector Machine, Random Forest and K-Nearest Neighbor, with an emphasis on datasets, ML algorithms, and metrics. Dataset selection is essential to ensuring that the model is appropriate for IDS application, and the structure of the dataset may also affect how effectively a machine learning technique works; the structure of the dataset thus influences the selection of the ML method. Metrics then provide a numerical evaluation of the machine learning techniques for a given dataset.
To improve the exchange of cyber threat intelligence and research that makes use of effective intrusion detection systems, it is necessary to first examine the IDS domains and identify significant traits. To enhance the network and reduce security concerns, it is also crucial to analyze, refine, and organize all information that comes from the network.
CHAPTER TWO
LITERATURE REVIEW
2.1 Definition of Cyber Intelligence
Threat intelligence, also referred to as "Cyber Threat Intelligence" (CTI), is data acquired from a variety of channels about existing or possible attacks against a corporation. The collected information is then arranged, cleaned up, and evaluated in order to minimize and address cybersecurity concerns.
Threat intelligence's primary goal is to inform businesses about the numerous hazards they encounter from outside adversaries, such as "zero-day threats" and "advanced persistent threats (APTs)". Threat intelligence provides in-depth knowledge and insight regarding particular threats, notably the identity of the adversary, their tactics and goals, and the indicators of compromise (IOCs). With this knowledge, businesses can choose the most effective course of action to guard against the most serious breaches.
Information used by a business to comprehend all dangers and lessen their effects is known as cyber threat intelligence. The CTI data is used to spot potential cyber threats and stop them from exploiting the sensitive data. It can be interpreted as knowledge and information based on skills and experiences about physical and cyber risks, threat evaluations, and agents that aid in reducing harmful incidents and prospective cyberattacks (Goel, 2022).
2.1.1 Characteristics of Cyber Threat Intelligence
According to Figure 2.1, actionable cyber threat intelligence must be complete, accurate, prioritized, just in time, effective in its response, and relevant.
Figure 2.1: Characteristics of CTI
Complete: The gathered information should contain everything relating to the threat.
Accurate: The information should be the correct and exact description of the threat.
Prioritized: There should be a way to show which threat is more harmful so that threats can be tackled accordingly.
Just in Time: This relates to remote updates of threats as they occur.
Effective Response: There should be a successful response for the prevention of each threat.
Relevant: It is important to attach a level of importance to each threat to ensure that time is spent on it appropriately.
2.1.2 Levels of CTI
According to Recorded Future (2020), CTI has four main levels: Strategic, Tactical, Operational and Technical Intelligence.
Strategic CTI
The "who" and "why" of threat agents, or their motives, as they relate to the contemporary danger scenario, are emphasized. It is non-technical while showing the motives and objectives underlying the attacks, and aims to pinpoint the individual responsible for cyber-attacks and threats as well as their intended targets. The most common forms of strategic CTI production are white papers, briefings, and reports.
Tactical CTI
It aids in figuring out the specifics of attacks by determining the "how" of the attacks and revealing the threat agent. This contributes to the identification of incident severity as well as to preparation and prevention efforts. It employs indicators of compromise, that is, information that can be read by machines, such as domain names, URLs, IP addresses, file names, and others. Indicators of compromise typically become out of date in a matter of hours; however, it is also crucial to keep in mind that relying on declining indicators is not good practice, even though they can occasionally remain functional for a prolonged period and still endanger an organization.
Operational CTI
It addresses the exact moment and type of an attack. It looks at cyber threat agents' channels of communication and foresees impending assaults using sources that are both public and private, including social networks, dark networks, chats, and other online forums.
Technical CTI
This CTI concentrates on signs that an attack is about to occur. Along with reconnaissance, weaponization, and delivery, these warning signs include spear phishing, baiting, and social engineering. Attacks involving social engineering can be largely prevented with technical intelligence. This type of intelligence must be adaptable because hackers frequently alter their tactics to take advantage of current affairs and ruses. It is occasionally combined with operational threat intelligence.
2.1.3 Sharing Threat Intelligence
Organizations can assist one another in defending their assets against attacks by exchanging threat intelligence. Various information-sharing formats, including the Incident Object Description Exchange Format (IODEF), Structured Threat Information Expression (STIX), and OpenIOC, are examined in threat intelligence, and sharing services allow for the exchange of information. The idea of a platform for sharing threat intelligence is not particularly new; the NATO Communications and Information Agency proposed the "Cyber Security Data Exchange and Collaboration Infrastructure (CDXI)" with the goal of facilitating information exchange, automated processes, and the generation, improvement, and verification of data. Its designers identified a number of high-level requirements for managing cyber security data, one of which was allowing independent data models due to the absence of community standards. This could be accomplished by using "independent topic ontologies" for every single data model to enable the connection of data components from various models.
There are many channels for sharing threat intelligence today. An evaluation of twenty-two such platforms conducted in 2016 (Clemens, 2017) revealed that different platforms' definitions of threat intelligence differ, that the majority of platforms place a heavy emphasis on exchanging IOCs, and that they prioritize data collection over data analysis. The study also made the important discovery that "STIX" has become the accepted standard for describing threat intelligence.
2.2 Intrusion Detection System (IDS)
The first paper to introduce the idea of an intrusion detection system (IDS) was written by James Anderson in 1980 and was titled "Computer Security Threat Monitoring and Surveillance" (Anderson, 1980). Anderson served on the U.S. Air Force's Defense Science Board Task Force on Computer Security. In his work, he outlined how audit trail analysis could help find illicit or malicious activity; for instance, examining file access logs offers crucial information to ascertain whether there has been anomalous use. The Host Intrusion Detection System (HIDS) was built on the foundation of this paper, and shortly thereafter the first IDS was created, detecting threats by matching activity against a database of well-known attacks.
Many systems administrators began utilizing intrusion detection systems at the end of the 1980s. These systems required an enormous amount of resources to constantly monitor the network, and they were prone to zero-day attacks and unable to stop them.
To counter the increasing frequency of attacks, a novel detection technique was investigated in the 1990s. This technique, known as "anomaly detection", searched for anomalous activity or behavior in a system in order to sound an alarm. However, there were many false alerts as a result of the uneven nature of networks during the 1990s and 2000s, and due to this instability, many administrators abandoned IDS (Vanin et al., 2022).
2.2.3 IDS Performance Evaluation
Machine learning-based IDSs are evaluated using performance measures. How a classification model, also known as a "classifier", performs on a collection of test data for which the actual values are known is described in a table called a confusion matrix. The matrix compares the actual target values with those predicted by the machine learning algorithm.
The terms used to describe a confusion matrix are:
True Positive (TP): An attack sample is correctly identified as an attack.
True Negative (TN): A normal sample is correctly identified as normal traffic.
False Positive (FP): A normal sample is incorrectly labeled as an attack.
False Negative (FN): An attack sample is incorrectly classified as normal traffic.
An intrusion detection system must have a small percentage of false positives in order to avoid erroneous alerts that disrupt the network. A low rate of false negatives is likewise necessary to stop unwanted access to the network.
Using the terms above and the matrix in Table 2.1, the metrics typically employed to measure an intrusion detection system's performance can be computed. The following is a rundown of the primary measurements that researchers employ:
Table 2.1: Performance Metrics of IDS

                              Predicted Class
                              Normal        Attack
Actual Class      Normal      TN            FP
                  Attack      FN            TP
Precision
This relates to the proportion of correctly predicted attack samples to all instances that were predicted as attacks.
Precision = TP / (TP + FP)
Recall
This relates to the proportion of correctly predicted attack records to all records that correspond to an attack. The Detection Rate is another name for this measurement.
Recall = TP / (TP + FN)
False Alarm Rate
This represents the proportion of normal samples incorrectly predicted as attacks to all normal samples. This measurement is sometimes referred to as the false-positive rate.
False Alarm Rate = FP / (FP + TN)
True Negative Rate
This reflects the proportion of correctly predicted normal samples to all normal samples.
True Negative Rate = TN / (TN + FP)
Accuracy
This represents the proportion of correctly classified samples across all samples. Assuming the dataset has a balanced class distribution, this statistic is frequently used to assess the effectiveness of an IDS.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F-Measure
This represents the harmonic mean of precision and recall. By balancing the two metrics, it helps to better evaluate the system and determine whether the outcome is reasonable. The F1-Score or F-Score are other names for this statistic.
F = 2 x (Precision x Recall) / (Precision + Recall)
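As an illustrative sketch only, the following Python snippet shows how these metrics can be computed from a confusion matrix with scikit-learn; the label arrays are hypothetical placeholders, not results from this study.

from sklearn.metrics import confusion_matrix

# Hypothetical label arrays; in practice these come from the test split
# and a trained classifier (0 = normal, 1 = attack).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # predicted attacks that are real attacks
recall = tp / (tp + fn)                      # attacks actually detected (detection rate)
false_alarm_rate = fp / (fp + tn)            # normal traffic flagged as attack
true_negative_rate = tn / (tn + fp)          # normal traffic correctly passed
accuracy = (tp + tn) / (tp + tn + fp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, false_alarm_rate, true_negative_rate, accuracy, f_measure)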
2.3 Review of Related Works
1. A Study of Network Intrusion Detection Systems Using Artificial Intelligence/Machine Learning (Vanin et al., 2022)
This article introduces the idea of IDS and offers a framework for the classification of machine learning techniques. The primary metrics for evaluating an IDS are discussed, and an overview of current IDS that used machine learning is given, describing the benefits and drawbacks of each strategy. Following a description of the characteristics of the numerous datasets used in the research, the validity of the conclusions from the reviewed study is next discussed. Observations, research roadblocks, and potential upcoming patterns are then examined.
2. Intrusion detection model using machine learning algorithm on Big Data environment (Othman et al., 2018)
This study proposed the "Spark-Chi-SVM" intrusion detection model. "ChiSqSelector" was applied in this model to choose the features, and the support vector machine (SVM) classifier was utilized to build an intrusion detection model through the Apache Spark Big Data platform. For training and assessing the model, they utilized KDD99. "Chi-SVM and Chi-Logistic Regression" classifiers were contrasted in the experiment. The results of this research demonstrated that the Spark-Chi-SVM model performs well, takes less time to train, and is effective for big data.
3. A Review of Intrusion Detection System using Machine Learning Approach (Kok et al., 2019)
This study found that soft computing approaches are receiving a lot of attention, since many researchers have used them. Many researchers are also concentrating on IDS classification, which is useful in identifying known intrusion threats, although it can be challenging to identify unusual intrusions, which may involve fresh or modified intrusion attacks. Several researchers continue to employ the nearly 20-year-old dataset "KDDCup99" and its derivative "NSL-KDD", while intrusion attacks continue to change in tandem with new technology and user behaviors; this ongoing tendency could result in stagnant advancement in IDS and eventually render IDS outdated as a cyber security technology. True-Positive Rate (TPR), Accuracy, and False-Positive Rate (FPR) are the three metrics most frequently utilized for IDS performance evaluation, which is understandable given that these metrics offer cues that are crucial to IDS performance.
4. A Novel Network Intrusion Detection System Based on CNN (Chen et al., 2020)
An additional IDS built on a CNN was suggested by Lin et al. Their solution consists of two parts. The first is trained offline with a convolutional neural network, where their model starts with an input layer of 9 by 9 and reduces it through additional convolution and max-pooling layers to an output layer of 1 by 1. In the second, online detection stage of their system, they used "Suricata", a freely accessible IDS, to capture the traffic. After the packets have been pre-processed, the trained model is applied to the network traffic to generate the detection result. They tested the model on the "CICIDS2017" dataset, running tests on both the feature dataset and the unprocessed traffic dataset. The authors attained accuracies of 96.55% and 99.56% respectively, demonstrating that their model performs better with actual traffic data than with a generated feature set.
5. An Ensemble Approach for Intrusion Detection System Using Machine Learning Algorithms (Gautam & Doegar, 2018)
An ensemble approach to intrusion detection was suggested by Rohit et al. They ran three tests to demonstrate how their suggested strategy improved outcomes. They first normalized the KDDCup99 dataset and then used a correlation approach for feature selection. Lastly, they adopted an ensemble approach that combines three algorithms: AdaBoost, PART and Naive Bayes, with information gain used as a deciding criterion in this process. The outcome is then determined by averaging the outputs of the several algorithms or by majority vote, and they also employ the bagging technique to lessen variance error. Using their method, they achieved an accuracy rate of 99.97% on the KDDCup99 dataset.
6. Dynamic detection of malicious intrusion in wireless network based on improved random forest algorithm (Chen & Yuan, 2022)
A wireless intrusion detection system built around the random forest algorithm was developed by Yiping et al. To capture the key aspects of signals, they first developed a model for signal identification. Next, they developed a model to identify intrusive nonlinear scrambling signals. The best identification of malicious traffic in a wireless network was carried out employing a static feature fusion and reinforcement learning method after the spectral features of the harmful signal were extracted using an enhanced random forest algorithm. Their average accuracy was 96.93%.
7. A Multiple-Layer Representation Learning Model for Network-Based Attack Detection (Zhang et al., 2019)
To identify attacks, Zhang et al. suggested a multi-layer model. They integrated the gcForest and CNN machine learning algorithms in their solution. gcForest is a random forest method that creates decision trees in a cascade framework. Their model is divided into two primary components. In the first, they use a CNN algorithm to analyze the input data and distinguish various threats from regular traffic, using a modified version of GoogLeNet dubbed GoogLeNetNP. The second step entails expanding the subclasses of the attacks by utilizing a deep forest model; this second layer, which divides the unusual classes into N-1 subclasses, increases the precision of their solution. The subsequent layer employs the gcForest cascade idea, but XGBoost is used in place of random forest. XGBoost is similar to a random forest, but the trees are built one at a time until the desired objective function is optimized. They used the "UNSW-NB15 and CICIDS2017" datasets to evaluate their solution, achieving an aggregate precision of 99.24%, higher than the accuracy of the algorithms utilized separately.
CHAPTER THREE
MATERIALS AND METHODS
3.1 Research Methodology
This study employed the Cross Industry Standard Process for Data Mining (CRISP-DM) approach, a process model on which the data science procedure is built. It goes through six phases in order:
1. Business understanding – What is the need of the business research?
2. Data understanding – Do we need data? What data is available? Is the data clean?
3. Data preparation – For modeling, how can the data be organized?
4. Modeling – Which machine learning technique(s) can serve?
5. Evaluation – Which of the models provides the best accuracy?
6. Deployment – Can the model be deployed? How?
1. Business Understanding
This is the first stage of the CRISP-DM methodology. We first had to set the objectives of this project in such a way that it could be completed effectively and in due time. Three major objectives were set: to evaluate intrusion detection using three machine learning algorithms, to compare the efficiency of these algorithms against one another, and to remotely send the data obtained to concerned institutions as a form of threat intelligence sharing. Three machine learning algorithms are earmarked for this work: Random Forest, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). These algorithms suit our goal perfectly, which is to evaluate intrusion detection using machine learning algorithms.
Figure 3.1: CRISP-DM Lifecycle
2. Data Understanding
The goal of this phase is to locate, gather, and examine the datasets that will enable achievement of the project's objectives. Kaggle provided the dataset for this study, which is known as "Network Intrusion Detection".
The Network Intrusion Detection Dataset
The dataset was obtained from Kaggle and was published by Sampada Bhosale in 2018. A large range of attacks recreated in a military network setting make up the dataset provided for auditing. By mimicking a standard US Air Force Local Area Network (LAN), it produced an environment for obtaining raw TCP/IP "dump data" for a network. The LAN was focused like a real environment and was attacked by numerous threats. Data travels across networks from a source Internet Protocol (IP) address to a target IP address under a specified protocol, and a connection is a series of TCP packets that begin and stop at a specific time interval. Every connection is additionally labeled either as normal or as an attack with a single, distinct attack type. About 100 bytes make up each connection record.
For each TCP/IP connection, 41 quantitative and qualitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features). The class variable has two categories:
Normal
Anomalous
The dataset is made up of 41 feature columns and 22,544 rows (records).
3. Data Preparation
Here, we choose the data that will be the subject of our analysis. The dataset is used to create labeled rows (GeneratedLabelledFlows.zip) and CSV files (MachineLearningCSV.zip) for machine learning. "Data cleaning" is the process of fixing or removing inaccurate, corrupted, improperly formatted, duplicated, or incomplete data from the dataset. When combining data from several sources, there are many ways in which records can be duplicated or mislabeled, and if the data is incorrect, the outcomes and algorithms are unreliable even though they may appear correct. There is no standardized sequence of steps for this cleaning process, because the procedure varies from dataset to dataset; nevertheless, it is important to establish a template for the cleaning process so that it is done properly each time. In most cases, a dataset's columns are not all utilized, so we remove unneeded columns, empty cells, and empty rows to concentrate on the important sections.
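As a rough sketch of this cleaning step (not the project's exact code), the snippet below uses pandas to drop duplicate and incomplete rows and remove an unneeded column; the file name, column name, and label column are assumptions made for illustration.

import pandas as pd

# Load the Kaggle "Network Intrusion Detection" CSV (assumed file name).
df = pd.read_csv("Train_data.csv")

df = df.drop_duplicates()          # remove duplicated records
df = df.dropna(how="any")          # remove fragmentary rows with missing values

# Drop columns judged not useful for modeling (hypothetical example list).
unneeded = [c for c in ["num_outbound_cmds"] if c in df.columns]
df = df.drop(columns=unneeded)

print(df.shape)
print(df["class"].value_counts())  # label column assumed to be named "class"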
4. Data Modeling
At this stage, three machine learning algorithms are used: K-Nearest Neighbor (KNN), Support Vector Machine and Random Forest. The dataset is modeled with each of these algorithms and evaluated based on the results obtained for intrusion detection. These algorithms are preferred because they are supervised machine learning models suited to outlier detection, classification and regression.
First, the dataset was split into training and test datasets, with eighty (80) percent of the dataset assigned as the training dataset. After that, the models are built using each of the algorithms and the Python programming language. Packages such as NumPy, scikit-learn and matplotlib (pyplot) are used to support the modeling of the dataset. A summary and performance analysis is performed to determine which algorithm works best for intrusion detection, and an interface is then created for sharing useful intelligence data with the institutions involved.
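The following sketch illustrates the 80/20 split and model construction described above using scikit-learn; the file name, column names, and preprocessing choices are assumptions for illustration, not the project's exact code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the prepared dataset (assumed file name).
df = pd.read_csv("Train_data.csv")

# One-hot encode the qualitative features (assumed column names).
df = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])

X = df.drop(columns=["class"])
y = LabelEncoder().fit_transform(df["class"])   # normal / anomaly -> 0 / 1

# 80% training, 20% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),          # feature scaling would normally be added for SVM and KNN
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))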
K-Nearest Neighbor Algorithm
K-Nearest Neighbors (KNN) is a supervised learning classifier that uses distance to generate classifications or forecasts about the grouping of an individual data point. Although it may be used for both classification and regression problems, it is most frequently used as a classification technique, since it is predicated on the assumption that similar points are found adjacent to one another.
For classification, a class label is selected by a plurality vote, meaning the label that occurs most frequently among the neighbors of a given data point is adopted. Although this is technically "plurality voting," the literature more usually uses the term "majority vote." The distinction between the two expressions lies in the fact that "majority voting" implies more than fifty percent of the votes, which usually applies only when there are two options. Whenever there are several classes, for instance four categories, one does not necessarily require fifty percent of the votes to reach a conclusion about a class; a class label may be selected with a vote that exceeds twenty-five percent.
It is frequently employed in a variety of applications, including basic systems for recommendation, recognizing patterns, data mining, stock market forecasting, detection of intrusions, and others.
The algorithm performs the following steps (a code sketch follows the list):
1. Load the dataset
2. Set K to the number of neighbors you choose.
3. For every example in the data:
3.1 Calculate the distance between the query example and the current example.
3.2 Add the example's distance and index to an ordered collection.
4. Sort the collection of distances and indices in ascending order of distance, from smallest to largest.
5. Choose the first K entries in the sorted collection.
6. Obtain the labels of the chosen K entries.
7. For regression, return the average of the K labels.
8. For classification, return the mode of the K labels.
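As an illustrative sketch of these steps (not the project's actual implementation), a minimal from-scratch distance-based KNN classifier could look as follows; the training points and labels are hypothetical.

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Steps 3.1 / 3.2: compute the Euclidean distance to every training example
    distances = []
    for i, example in enumerate(train_X):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(example, query)))
        distances.append((dist, i))
    # Steps 4 / 5 / 6: sort ascending and keep the labels of the first K entries
    distances.sort(key=lambda pair: pair[0])
    k_labels = [train_y[i] for _, i in distances[:k]]
    # Step 8: return the mode of the K labels (classification)
    return Counter(k_labels).most_common(1)[0][0]

# Tiny hypothetical example with two features and labels "normal" / "attack".
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["normal", "normal", "attack", "attack"]
print(knn_predict(train_X, train_y, query=(0.85, 0.75), k=3))  # -> "attack"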
Support Vector Machine Algorithm
The support vector machine is another supervised machine learning framework for classification, regression, and outlier detection. The SVM's primary goal is to find the hyperplane that best separates the data in the dataset's multidimensional feature space.
The algorithm of a support vector machine can be illustrated in the following steps (a code sketch follows the list):
1. n features are taken from the dataset.
2. Let the values of the coordinates equal the values of the features.
3. Plot each piece of data as a point in an n-dimensional space.
4. To perform classification, locate the hyperplane that best separates the classes.
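As a minimal sketch of these steps with scikit-learn (synthetic data stands in for the intrusion dataset; the kernel choice is an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", C=1.0)     # finds the separating hyperplane in feature space
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))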
Random Forest Algorithm
Random forest is a supervised machine learning method widely used in regression and classification problems. It builds decision trees on numerous samples and uses their mean for regression and the majority vote for classification. Random forest employs an ensemble method called bagging: the technique of building training subsets by sampling with replacement from the training dataset and basing the outcome on a majority vote.
The Random Forest algorithm proceeds as follows (a code sketch follows the list):
Step I: From a data collection of j records, i records are randomly selected.
Step II: A decision tree is built for each sample.
Step III: Each decision tree produces an outcome.
Step IV: For regression, the mean of the outcomes is used; for classification, the majority vote is used.
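As a minimal sketch of these steps with scikit-learn (synthetic data stands in for the intrusion dataset; the number of trees is an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 bagged decision trees
rf.fit(X_train, y_train)
print("Random forest test accuracy:", rf.score(X_test, y_test))  # majority vote over trees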
5. Evaluation
The Evaluation phase looks more broadly at which model best meets the objectives and what to do next. After evaluating each model, the one with the best rating for detecting intrusions is adjudged the algorithm of best fit. This result, together with other important data from the evaluation, is then shared with other businesses that are stakeholders in the research. This threat intelligence sharing offers a huge boost to the security of these businesses, as they can prepare beforehand by knowing which model best suits their networks and the likely areas from which attacks can come. The system will allow sharing of these data only through authorized email addresses to avoid security breaches.
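A minimal sketch of this selection step is shown below; the accuracy figures are hypothetical placeholders, not results from this study.

# Hypothetical evaluation results; real figures would come from the evaluation above.
results = {"KNN": 0.97, "SVM": 0.95, "Random Forest": 0.99}

best_model = max(results, key=results.get)                      # algorithm of best fit
report = "\n".join(f"{name}: {acc:.2%}" for name, acc in results.items())
print(f"Best-fit algorithm: {best_model}\n{report}")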
6. Deployment
In our case, the deployment phase is the phase where reports are generated for future purposes or for reference to forestall attacks. These reports are stored in the database and also shared with all concerned businesses, as cyber threat intelligence sharing, via email.
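A minimal sketch of how such a report could be emailed with Python's standard library follows; the SMTP server, credentials, addresses, and message content are placeholders, not values from this work.

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Cyber Threat Intelligence Report"
msg["From"] = "ids-alerts@example.org"                      # placeholder sender
msg["To"] = "security-team@example.org"                     # placeholder authorized recipient
msg.set_content("Best-fit algorithm: Random Forest.\nSee the attached evaluation report for details.")

with smtplib.SMTP("smtp.example.org", 587) as server:       # placeholder SMTP server
    server.starttls()
    server.login("ids-alerts@example.org", "app-password")  # placeholder credentials
    server.send_message(msg)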