
MLTK addon by Ali

Link to code and test data: https://drive.google.com/drive/folders/1u1e7SFrW8shPVdttCeKViH4O08ro9-Gu?usp=sharing
Link to future code repo: http://tfs.tsl.telus.com/tfs/telus/BT-GIT/_git/ID-Cust-ci-batch?path=%2Fsplunk-machine-learning

Splunk MLTK algorithms added:

  • Rule_based_detection.py
    • Working, no dependencies needed. By far the most important.
  • CBRW_based_detection.py
    • Hard-disabled until its dependencies are installed; low priority.
  • Ensemble_based_detection.py
    • Hard-disabled until its dependencies are installed; low priority.

Installation

This is a modified fork of github.com/splunk/mltk-algo-contrib with 3 detectors added. Unneeded detectors are commented out in algos.conf; since you will copy and paste my folder, you do not have to deal with that.

  1. Install the Python for Scientific Computing add-on
You must install the Python for Scientific Computing add-on before installing the Machine Learning Toolkit. Download and install the appropriate version here:
Linux 64-bit: https://splunkbase.splunk.com/app/2882/
Windows 64-bit: https://splunkbase.splunk.com/app/2883/
To install an app within Splunk Enterprise:
Log into Splunk Enterprise.
Next to the Apps menu, click the Manage Apps icon.
Click Install app from file.
In the Upload app dialog box, click Choose File.
Locate the .tar.gz or .tar file you just downloaded, then click Open or Choose.
Click Upload.

Note: due to the size of this app, installing it via the web installer/deployer may fail with a timeout error. An alternative method is to copy it to your $SPLUNK_HOME/etc/apps folder (don't forget to restart Splunk).

  2. Install the MLTK: https://splunkbase.splunk.com/app/2890/#/details

  3. Copy and paste my MLTK add-on into the equivalent of C:\Program Files\Splunk\etc\apps\; the folder will have a name similar to "epic_mltk_addon_by_Ali".

  4. [Optional - needed for full scale] Change the default limits (see Changes below). We will only be running the algorithms during off hours (3-6 AM EST).

URL end points:

/en-US/app/Splunk_ML_Toolkit/algorithm
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Rule_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Ensemble_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=CBRW_based_detection

Changes (these MLTK limits live in the toolkit's mlspl.conf):

max_inputs = 10000000          // large int; 10 million
max_memory_usage_mb = 10000    // one algo is multithreaded; I manage CPU & memory cost via batch processing, so 5000 should be fine
max_fit_time = 10000           // ~2.8 hours
  5. [Optional - only for a local deployment of Splunk] Get test data
  • See the Drive link to get all_actions_all_notMissing.csv
  • Upload it to Splunk, changing the URL as needed: http://127.0.0.1:8000/en-US/manager/search/adddata
  • Note that the data is old, so you will need to widen the time range Splunk searches to get any data, as the default is 24 hours
  6. [Optional] Machine learning, which is currently not scheduled to be used in Prod

There are two ways to resolve the dependency issues faced when deploying this Splunk ML app:

  • (Temporarily) install Python on the Splunk machine (OS level)
    • Run pip install suod pyod coupled-biased-random-walks -t "C:\Program Files\Splunk\Python-3.7\Lib\site-packages" to get the dependencies (3 packages, which have their own dependencies)
  • Copy, paste, and replace with my \Python-3.7\Lib\site-packages
    • ~100 MB stored on Drive (lots of the default packages updated)

Packages used: suod, pyod, coupled-biased-random-walks

Code

This is now working (once the dependencies above are installed):

  • CBRW_based_detection.py
  • Ensemble_based_detection.py

Usage

MLTK Syntax

Resources

For the sake of example

How parameters work
fit LocalOutlierFactor <fields>    // Columns/features/variables to pass in
[n_neighbors=<int>]                // A parameter
[p=<int>]                          // A parameter: Minkowski distance. 1 is Manhattan distance (MAE-like), 2 is Euclidean distance (MSE-like)
[contamination=<float>]            // A parameter in [0 .. 0.5], the proportion of anomalies expected in our data
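
For intuition, here is a minimal Python sketch of what these parameters mean in scikit-learn's LocalOutlierFactor, which the MLTK algorithm is based on; the toy data is made up purely for illustration.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data: 200 normal points plus 5 obvious outliers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(5, 2))])

lof = LocalOutlierFactor(
    n_neighbors=20,      # how many neighbours define "local" density
    p=2,                 # Minkowski distance: 1 = Manhattan, 2 = Euclidean
    contamination=0.05,  # expected fraction of anomalies, in (0, 0.5]
)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
print((labels == -1).sum(), "points flagged as outliers")
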
Verbose explanation of full search
// Select index and get logs with an `action`, then filter  
index=cii_pingfederate action=* requester!= "nascent" requester != "-1" action != "paneView"  action != "move" action != "change" action != "preview" action != "view" action != "rowView" action != "Next+time" NOT "@ci-qa.com" NOT "@telusinternal.com"

// Keep only the fields explicitly asked for; this removes many internal (parsed) time/date fields
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time

// Derive [Country, City, Region, lat, lon] from requesterIp  
| iplocation requesterIp

// Fit on all Fields (note we used `Table`), 2 parameters
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30

// (Ab)use `Table` to order
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon" 

// Sort by time; note that this is debatable, the natural order might be better for analysis
| sort _time

Map

Bring up a Map

source="all_actions_all_notMissing.csv"
| table requesterIp
| iplocation requesterIp locallimit=20 | geostats count by Country

Rule_based_detection

Uses (hardcoded) rules to detect anomalous traffic.

  1. Best performing detector, and its dependencies are included in MLTK and its prerequisite (Python for Scientific Computing)

Two approaches are used (see the pandas sketch after the example below):

  • One email being accessed from many IPs
    • Get the most common emails
    • Flag an email as suspicious if, for that email, the number of "unique locations" divided by the "total number of requests" is >= 0.4
  • One IP attempting to log in to a variety of emails
    • The IP has generated at least 4 events, with at least 3 requesters
    • Get the most common IPs; mark an IP as malicious if all statuses are NOTFOUND, OR no logins are successful
    • If an IP successfully logs into many emails, mark those emails as compromised
    • Compute every IP that interacted with a compromised email
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Rule_based_detection *
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon" | sort _time

CBRW_based_detection

Machine learning based detector. Second best, but Rule_based_detection is by far better.

  • Coupled Biased Random Walks (CBRW) identifies outliers in categorical data with diversified frequency distributions and many noisy features.
    • Features (fields) used are: action, status, adapter, serviceType, requester, requesterIp, City, Country

Parameters

  1. contamination = 0.20; use 0.15 if there are over 1,000,000 data points

Example

source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit CBRW_based_detection * contamination=0.20
| table "index", "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort "index"

Ensemble_based_detection

Machine learning based detector. Third best; more expensive than CBRW_based_detection.

  • A multithreaded Ensemble (group) of ~17 detectors

Parameters

  1. contamination = 0.25
  2. n_estimators = 30
  • [hardcoded] 3 workers (threads)

Example

source="all_actions_all_notMissing.csv" 
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort _time

Analysis and investigation

Output of Rule_based_detection will be (pseudo)sorted by offending IP/email, at least until final fine-tuning is done and trust is ultimately built.

To verify correctness, we need to check every flagged IP and email. We can manually search for each offending IP or email:

source="all_actions_all_notMissing.csv" cool_sara321@hotmail.com | iplocation requesterIp
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Country", "tid", "Region", "lat", "lon"

Or we can use Ali's Splunk SDK script to programmatically query n emails or IPs and output an Excel workbook with n tabs. This would alleviate the need to manually create browser tabs and to actively wait on Splunk searches.
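
The gist of that workflow, assuming the standard splunk-sdk (splunklib) plus pandas/openpyxl, might look like the sketch below. The host, credentials, and email list are placeholders; this is not Ali's actual SDK script.

import pandas as pd
import splunklib.client as client
import splunklib.results as results

# Placeholder connection details and search targets
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")
emails = ["cool_sara321@hotmail.com"]  # n emails or IPs to investigate

with pd.ExcelWriter("investigation.xlsx") as writer:
    for email in emails:
        query = (f'search source="all_actions_all_notMissing.csv" "{email}" '
                 '| iplocation requesterIp '
                 '| table _time, action, status, requester, requesterIp, City, Country')
        reader = results.JSONResultsReader(
            service.jobs.oneshot(query, output_mode="json"))
        rows = [dict(r) for r in reader if isinstance(r, dict)]
        pd.DataFrame(rows).to_excel(writer, sheet_name=email[:31], index=False)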

Python & Machine Learning Hello world

Quickly get up to speed

Software setup

Use Python 3.8.*. There is no need for 3.9, though it should be fine.

Run the commands below to test whether the install is successful and to install some important packages. Please follow the errors as they come; one will ask you to install the C++ build tools if they are not already installed (see below). This will take several minutes, as the dependencies are a few GB. You must be off VPN, or set a proxy, as shown below.

Here's what to do:
Download the Microsoft Visual C++ Build Tools from this link: https://visualstudio.microsoft.com/downloads/
Run the installer.
Select: Workloads → Visual C++ build tools.
Install options: the only necessary component is "Windows 10 SDK".

// If you wish to be on VPN, set env variables 
http_proxy = http://142.174.134.33:8080
https_proxy = https://142.174.134.33:8080
pip install TwitterApi

// Or every time 
pip install --proxy=https://t954349@198.161.14.25:8080 numpy 

// run these to confirm the install 
python --version 
pip --version 
python -m pip install --upgrade pip 
pip install pandas numpy matplotlib requests scikit_learn scipy splunk_sdk xmltodict pyod suod hdbscan jupyterlab notebook urllib3  geoip2 coupled_biased_random_walks

Note that coupled_biased_random_walks will throw a compatibility error; it shouldn't matter, and you can ignore it, even in production.

Run jupyter-lab . to launch JupyterLab. The . can be your project folder.

[optional] It's easier to copy the path from Windows Explorer and cd to the directory in CMD, or you can navigate folders in the web app. I use Git Bash, which lets me click "Git Bash here" in the Explorer context menu to open a shell in any folder.

In a notebook, you can execute a shell command via a !, examples: !pip install foobar and !ping 8.8.8.8

Learning python and data science

If you don't know Python, then watch the videos; otherwise skip to the code handouts.

I always speed up YouTube videos with this extension.

Day 1 @ mental prime - 60 + 20 mins:

Watch, you don't have to follow along:

Day 2 @ anytime - 120 + 10 + 20 mins:

These two videos are optional; the handouts are recommended.

Linear algebra - necessary

Linear algebra - unneeded; overkill

Day 3 @ anytime

Now you can (almost) read Python fluently!

If this has been way too fast, this YouTuber might be helpful

Do these quick exercises

NOW it's time to start Machine learning

Locally running code

All of this code is outdated, but it is expected to work.

Code

The quality of code will decrease as you descend this list.

  • ML_based_detection.py - Script for machine learning based detection. Outputs an .xlsx. See Algorithms below

  • rule_based_detection_driver.py - Script for rule based detection. Outputs a .xlsx

  • rule_based_detection_core.py - Has 3 core functions, which call utility functions

  • Build action dataset.ipynb - Aggregates many CSVs, preprocesses, fills errors with -1, and saves a single large CSV file to actions_data\preprocessed\ all_action_all {DATA TIME}, then moves the inputs to .\_hide

  • attacks over different dates and geo plots.ipynb - for generating the tables I used in the presentation. Columns are datasets, rows are countries, and cells are evil/all authentication attempts. This is a partial mirror of the file Experiments - Plotting and other.ipynb. The folders "input_visualization" and "output_visualization" are used. Table generation is 100% automated; however, plotting is manual and may require reloading data into memory, due to variable scope being limited to the main loop

  • Experiments - Plotting and other.ipynb - Plots and visualization. Computes the probability of an event being malicious given its geolocation.

  • Experiments - Anomaly Detection.ipynb - Plots of coupled_biased_random_walks, individual PYOD detectors including the following neural networks: Single-Objective Generative Adversarial Active Learning and Variational Autoencoder

  • Experiments - Deep Learning.ipynb - A Categorical Variational Autoencoder using Gumbel-Softmax, and some other experiments

  • _All_audit.ipynb - Deprecated portion of the project relating to audit logs. The output of this file would go into a file known as run AD.ipynb, which is now in the folder called _garbage.

List of folders - Anomaly Detection

  • Action_data - folder to hold data before and after preprocessing

  • GeoLite2-City - folder to house GeoLite2-City.mmdb, the database used to map IP addresses to geolocation

  • input_visualization and output_visualization - The I/O for the program Experiments - Plotting and other.ipynb

List of folders

  • _garbage - Folder for manual storage of past files
  • Anomaly Detection - Where the core code is contained. Has nested readme.md
  • Docker_image - Docker image for a generic python environment with persistent storage
  • utils - Splunk search URL parser and other miscellaneous scripts
  • webhook - Webhook to listen for input; ngrok.exe maps a dev domain to localhost
  • fetch data
    • Alert - Set up a Splunk email alert, and manually copy and paste the search ID (SID) from a particular link in the email, of the form scheduler__t954349__CII__RMD556aaa6af93d5a498_at_1603137600_75868_D38856AD-91C4-4C02-AA8B-2BDDEC646C9E
    • Current Time - Every 5 minutes, run a query, e.g. search index=cii_pingfederate action=* earliest=-5m
    • Parse raw audit logs is my parser for audit logs.
  • Full App - driver - a 5-minute loop drives the FetchFromSplunk object and analyzes by calling child_process.py in a new process, which relies on rule_based_detection_core.py.

Algorithms

  1. PYOD - Models are indirectly used
  2. Coupled_biased_random_walk
  3. Combo - ensemble (Simple Detector Aggregator) of PYOD models
  4. SUOD - ensemble ("Scalable Unsupervised Outlier Detection") of PYOD models
    • SUOD | Docs | Paper.
    • SUOD is an acceleration framework for large-scale unsupervised heterogeneous outlier detector training and prediction.
    • PROs - Multithreading built in
    • CONs - Weird compatibility issue: an "index error" with threads/n_jobs. bps_flag=False appears to avoid the issue, at the cost of "balanced parallel scheduling". Notes: if there is a RAM issue, decrease n_jobs in SUOD. The input CSV should be about 200 MB; however, I have tested with an input CSV of 775 MB on a laptop [i7 vPro, 32 GB RAM]

If there is a need to reduce computational expense, I would run Coupled_biased_random_walk and, optionally as a supplement, put the models of Combo into SUOD. Have both the new SUOD (which is both multithreaded and running low-cost detectors) and the Coupled_biased_random_walk "vote", and output 1 (or 3) sheets, depending on whether the optional step is used.

Notes on data engineering and data usage

The following cases cause a row of data to be removed (a pandas sketch follows the list):

The code that does this is marked for deprecation, as it is made redundant by the Splunk query below.

  1. data['requester'] == "nascent"
  2. data['action'] == any element in the set {"paneView", "move", "change", "preview", "view", "rowView", "Next+time"}
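
A minimal pandas equivalent of these removal rules (an illustrative sketch, not the deprecated code itself):

import pandas as pd

DROP_ACTIONS = {"paneView", "move", "change", "preview", "view", "rowView", "Next+time"}

def clean(data: pd.DataFrame) -> pd.DataFrame:
    """Drop rows matching the removal rules above."""
    keep = (data["requester"] != "nascent") & (~data["action"].isin(DROP_ACTIONS))
    return data[keep]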

Splunk Query

    index=cii_pingfederate action=*
    requester!= "nascent" requester != "-1" action != "paneView"  action != "move" action != "change" action != "preview" action != "view" action != "rowView" action != "Next+time" NOT "@ci-qa.com" NOT "@telusinternal.com" 
    | fields action, status, adapter, serviceType, _time, requester, requesterIp, tid
    Note: you must run `Build action dataset.ipynb` to get the geolocation-related fields, which I derived in a different manner than Splunk does (with its iplocation command)