@ronaldseoh
Last active December 8, 2022 22:41
CS536 Fall 2022 Group Project

Analyzing Practical Effectiveness of Neural NIDS solutions: Reproducing “Enhancing Robustness Against Adversarial Examples in Network Intrusion Detection Systems”

Members: Samuel Englert, Pinaki Mohanty, Ronald Seoh, Akshaj Uppala, and Han Zhu

Introduction

In this project, we examined the practical effectiveness and applicability of deep learning-based network intrusion detection systems (NIDS). While there have been significant advances in neural NIDS recently, it is still unclear whether and how they achieve superiority over earlier approaches such as signature-based NIDS. More specifically, we need more insight into their potential drawbacks and into how well these methods could fit into real-life networking scenarios. Hence, we chose a state-of-the-art neural NIDS model and the evaluation results from Hashemi and Keller 2020 for our analysis: Hashemi and Keller introduced Reconstruction from Partial Observation (RePO), a technique that leverages denoising autoencoders for greater robustness against adversarial examples.

What is RePO?

Kitsune, one of the earlier SOTA models for NIDS, trained an autoencoder with respect to a reconstruction loss, which can lead to over-generalization on the data. Reconstruction from Partial Observation (RePO) instead uses a denoising autoencoder, forcing the model to reconstruct a given input from a randomly masked version of it. Because the model must not only reproduce the visible parts of the input but also generate the hidden parts, it generalizes better.

The loss function is defined as:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - f(x_i \odot r_i) \right\|^2$$

where $N$ is the number of samples, $x_i$ is the feature vector of the $i$-th sample, $f$ is the denoising autoencoder, $r_i$ is a random vector of 1s and 0s with the same length as $x_i$, and $\odot$ denotes element-wise multiplication.
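As a minimal sketch of this objective, with NumPy standing in for the actual TensorFlow model (`repo_loss` and the `reconstruct` callable are illustrative stand-ins, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)

def repo_loss(x, reconstruct):
    """Mean squared reconstruction error over N samples.

    `reconstruct` stands in for the denoising autoencoder: it only sees
    the masked input x * r, yet must output the full, unmasked x.
    """
    n = x.shape[0]
    r = rng.integers(0, 2, size=x.shape)  # random 0/1 mask, same length as each x_i
    x_hat = reconstruct(x * r)            # reconstruct from partial observation
    return float(np.sum((x - x_hat) ** 2) / n)

x = rng.random((8, 16))                   # 8 samples, 16 features each
# identity "model": the loss is exactly the energy of the masked-out features
loss = repo_loss(x, lambda masked: masked)
```

An oracle that always recovers the full input would drive this loss to zero, which is exactly what training pushes the autoencoder toward on benign traffic.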

At evaluation time, a score is calculated the same way, and we expect malicious packets to produce a higher reconstruction error. A threshold is placed on the output score with the false positive rate fixed at 0.01: if the score exceeds the threshold, the packet is classified as malicious. The threshold therefore controls how often the model is allowed to misidentify a normal packet as malicious.
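One plausible way to calibrate such a threshold is to take the (1 − FPR) quantile of reconstruction errors on held-out benign traffic; a sketch with synthetic scores (the score distribution here is made up for illustration):

```python
import numpy as np

def calibrate_threshold(benign_scores, fpr=0.01):
    # choose the score that only `fpr` of benign traffic exceeds,
    # i.e. the (1 - fpr) quantile of benign reconstruction errors
    return float(np.quantile(benign_scores, 1.0 - fpr))

rng = np.random.default_rng(0)
benign = rng.normal(0.02, 0.005, 10_000)     # synthetic benign reconstruction errors
tau = calibrate_threshold(benign, fpr=0.01)
observed_fpr = float(np.mean(benign > tau))  # close to 0.01 by construction
```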

Our Methodology

We test both RePO and the updated RePO+ by Hashemi and Keller against simulated attacks in a Mininet environment. We first start by recreating the network topology seen in the packet captures of the CIC-IDS-2017 dataset, which were used to train and test the NIDS models in the original paper.

Mininet Topology

In a nutshell, we have two adversaries on an external network attacking twelve different machines. In a real-world attack, the attacking and target servers would be separated by many intermediate connections; however, using P4SH in Mininet, it is acceptable to connect the switches directly. Thus, a `linear,2,12` topology was used to represent the system.
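Assuming a standard Mininet install, this topology can be brought up directly from the command line using Mininet's built-in linear topology (k switches in a line, n hosts per switch):

```shell
# 2 switches in a line, 12 hosts attached to each switch
sudo mn --topo linear,2,12
```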


We then integrated the inference/prediction logic for the RePO TensorFlow model inside our P4 Runtime API-based controller. We also implemented the entirety of the feature extraction logic for the RePO model within the incoming packet processing code.
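The per-packet path in the controller can be sketched as follows; `extract_features`, `classify`, and the feature keys are hypothetical stand-ins for our actual extraction and model code, not the real RePO implementation:

```python
# Hypothetical sketch of the controller's packet-in path: extract a
# fixed-length feature vector, score it, and compare with the threshold.
THRESHOLD = 0.0375  # our empirical RePO threshold under the Normal setting

def extract_features(pkt: dict) -> list:
    # toy stand-in: the real extractor computes Kitsune-style traffic statistics
    return [float(pkt.get(k, 0.0)) for k in ("length", "inter_arrival", "proto")]

def classify(pkt: dict, score_fn) -> str:
    # score_fn plays the role of the RePO model's reconstruction-error score
    score = score_fn(extract_features(pkt))
    return "malicious" if score > THRESHOLD else "benign"
```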

Reproduction Results and Discussion

To evaluate the performance of the model, we calculate the True Positive Rate (a.k.a. Recall or Hit Rate) for each attack, gauging the model's ability to correctly flag malicious packets at a False Positive Rate (a.k.a. Fall-Out) of 0.0100. Just as in the paper, we exclude Web Attacks (#8, Thursday) from our analysis, since our feature extraction does not inspect or use packet payloads. We do so for both RePO and RePO+. In the Adversarial setting, each attack takes more than 4 hours to produce two True Positive values on Mininet, and we have a total of 11 attacks to analyze. To save time, we therefore limit our analysis to the Normal setting at an FPR of 0.0100. During our analysis, we found empirical thresholds of 0.0375 for RePO and 0.1428 for RePO+.

[Table: per-attack True Positive Rates for RePO and RePO+ (Normal setting, FPR = 0.0100)]

RePO+ beats RePO for every kind of attack except SSH-Patator. The improvement is most pronounced for Slowhttptest. However, certain attacks such as Heartbleed and Botnet are impervious to the improved model. Considering the stochastic nature of neural network training, we successfully replicate the results for the RePO and RePO+ packet-based models under the Normal setting, as in the paper.

However, we also would like to highlight potential issues with the model, which have been swept under the rug.

[Figure: confusion matrices for RePO and RePO+]

Looking at the confusion matrices for RePO and RePO+, benign packets are classified equally well by both models. Fortunately, RePO+ produces more True Positives than RePO when classifying adversarial packets. In both cases, however, a large share of adversarial packets is misclassified as benign.

Clearly, accuracy is not a good measure of performance here, as the data is heavily skewed. It therefore becomes important to quantify the False Negative Rate (a.k.a. Miss Rate) and the overall correlation between predicted and true labels (the Matthews Correlation Coefficient, MCC).

[Table: FNR, MCC, and Critical Success Index for RePO and RePO+]

Looking at the table above, the FNR is fairly high for both models despite the advanced strategy: the models misclassify at least 67% of the malicious traffic as benign. Though not strictly a performance metric, MCC, as a measure of association, can still be used to evaluate the models. Even though the correlation between the labels is mild, RePO+ improves considerably over RePO. Finally, the Critical Success Index (a.k.a. Threat Score) combines hits, misses, and false alarms; it is a fair metric here because it disregards True Negatives. Performance on this metric is very weak, but, as with MCC, RePO+ does considerably better.
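All three metrics follow directly from the confusion-matrix counts; a quick sketch (the counts below are made up for illustration):

```python
import math

def miss_mcc_csi(tp, fn, fp, tn):
    """False Negative Rate, Matthews Correlation Coefficient, and
    Critical Success Index from confusion-matrix counts."""
    fnr = fn / (tp + fn)                       # miss rate
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    csi = tp / (tp + fn + fp)                  # ignores true negatives entirely
    return fnr, mcc, csi

# made-up counts: a detector that misses half of the malicious packets
fnr, mcc, csi = miss_mcc_csi(tp=50, fn=50, fp=1, tn=99)
```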

Conclusion

Overall, the model has merits, which are highlighted in the paper and which our analysis confirms. On closer inspection, however, we notice flaws, some dismissible and some glaring, that point to a less promising side. In this report we have taken this conversation further, and in doing so we also suggest future lines of work:

  • Shorter training time for the Adversarial setting: 4 hours to replicate the results for each attack is impractical. Future work could target bringing down the time for crafting adversarial examples and training.
  • High FNR: The heart of the model is learning from 'normal' examples and, at test time, predicting how far a packet deviates from normalcy. A traditional supervised machine learning approach, where the training data mixes benign and malicious packets, could help; however, that requires collecting more malicious examples to prevent biased predictions.
  • Better feature engineering and model interpretability: Finding features that discern good from bad traffic, which usually requires domain knowledge, could help. Additionally, the model analysis can be taken further with interpretability/explainability techniques.
  • Multiclass classification: RePO strictly performs binary classification: a packet is adversarial or not. However, the attacks vary (12 in total) and differ in potency, so answering only whether a packet is malicious is not enough. Good multiclass performance is also considerably harder to achieve. As an add-on, web-based attacks could be included in the analysis by extracting and utilizing packet payloads.
  • Improving flow-based RePO: RePO performs barely above a vanilla NIDS in the flow-based setting. Attempts could be made to counter this.