@AndreiMoraru123
Created January 7, 2023 13:30
Simple decision tree implemented in scikit-learn, trained on dummy malware data
file_id file_name size_mb extension is_executable label
1 file1.exe 10 exe 1 1
2 file2.exe 20 exe 1 1
3 file3.exe 30 exe 1 0
4 file4.exe 40 exe 1 0
5 file5.exe 50 exe 1 1
6 file6.exe 60 exe 1 0
7 file7.exe 70 exe 1 0
8 file8.doc 5 doc 0 0
9 file9.doc 10 doc 0 0
10 file10.doc 15 doc 0 0
11 file11.doc 20 doc 0 0
12 file12.doc 25 doc 0 0
13 file13.doc 30 doc 0 0
14 file14.docx 5 docx 0 0
15 file15.docx 10 docx 0 0
16 file16.docx 15 docx 0 0
17 file17.docx 20 docx 0 0
18 file18.docx 25 docx 0 0
19 file19.docx 30 docx 0 0
20 file20.xlsx 5 xlsx 0 0
21 file21.xlsx 10 xlsx 0 0
22 file22.xlsx 15 xlsx 0 0
23 file23.xlsx 20 xlsx 0 0
24 file24.xlsx 25 xlsx 0 0
25 file25.xlsx 30 xlsx 0 0
26 file26.pdf 5 pdf 0 0
27 file27.pdf 10 pdf 0 0
28 file28.pdf 15 pdf 0 0
29 file29.pdf 20 pdf 0 0
30 file30.pdf 25 pdf 0 0
31 file31.pdf 30 pdf 0 0
32 micro 15 exe 1 1
33 zepto 20 exe 1 1
34 cerber 25 exe 1 1
35 locky 30 exe 1 1
36 cerber3 35 exe 1 1
37 cryp1 40 exe 1 1
38 mole 45 exe 1 1
39 onion 50 exe 1 1
40 axx 55 exe 1 1
41 osiris 60 exe 1 1
42 crypz 65 exe 1 1
43 crypt 70 exe 1 1
44 locked 75 exe 1 1
45 odin 80 exe 1 1
46 ccc 85 exe 1 1
47 cerber2 90 exe 1 1
48 sage 95 exe 1 1
49 globe 100 exe 1 1
50 exx 105 exe 1 1
51 file51.txt 5 txt 0 0
52 file52.txt 10 txt 0 0
53 file53.txt 15 txt 0 0
54 file54.xls 5 xls 0 0
55 file55.xls 10 xls 0 0
56 file56.xls 15 xls 0 0
57 file57.xls 20 xls 0 1
58 file58.xls 25 xls 0 1
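The table above is shown whitespace-separated, while `pd.read_csv` expects commas by default. The stored gist file may well be comma-separated, in which case the default call works unchanged. If the data is saved exactly as displayed, a separator has to be passed explicitly. A minimal sketch, using three rows copied from the table and an in-memory string in place of the real file:

```python
import io

import pandas as pd

# Three rows copied from the table above, kept whitespace-separated as displayed.
raw = """file_id file_name size_mb extension is_executable label
1 file1.exe 10 exe 1 1
8 file8.doc 5 doc 0 0
32 micro 15 exe 1 1
"""

# sep=r"\s+" splits on any run of whitespace instead of on commas.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.dtypes)
```

This only works because no value in the table contains a space; a file name with spaces would need a real delimiter such as a comma or tab.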
"""
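A simple decision tree classifier built with scikit-learn.
Trained on fabricated malware data.
Columns: file id, name, size in MB, extension and executability.
The file id and name are dropped before training, so the model only sees size, extension and executability.
Labels: malware (1) or not (0).
Output from one run (the tree has no fixed random_state, so results may vary):
Accuracy: 0.75
Predictions: [1 1 1 1 1 1 0 0 0 0 0 0]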
"""
# ---------------- Importing Libraries ----------------
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# -----------------------------------------------------
# Load the data
data = pd.read_csv('malware_data.csv')
# Split the data into features and labels
X = data.drop(columns=['label', 'file_id', 'file_name'])
y = data['label']
# One-hot encode the categorical 'extension' column so every feature is numerical
X = pd.get_dummies(X)
y = y.astype('int')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the decision tree model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Test the model on the testing data
accuracy = model.score(X_test, y_test)
print('Accuracy: ', accuracy)
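# Predict on new, unseen data
new_data = pd.read_csv('new_malware_data.csv')
new_data = new_data.drop(columns=['file_id', 'file_name'])
# One-hot encode, then align to the training columns so that a missing or
# unseen extension in the new file does not break model.predict()
new_data = pd.get_dummies(new_data).reindex(columns=X.columns, fill_value=0)
predictions = model.predict(new_data)
print('Predictions: ', predictions)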
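The accuracy reported by the script comes from a single 80/20 split, which leaves only about a dozen test rows, so the number can swing noticeably with a different `random_state`. One way to get a steadier estimate is stratified k-fold cross-validation. The sketch below is an illustration under stated assumptions: the frame is made-up data shaped like `malware_data.csv` (not the real file), and the fold count of 4 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Made-up rows with the same feature columns as malware_data.csv (assumption).
df = pd.DataFrame({
    "size_mb": [10, 20, 30, 40, 50, 60, 5, 10, 15, 20, 25, 30],
    "extension": ["exe"] * 6 + ["doc"] * 3 + ["pdf"] * 3,
    "is_executable": [1] * 6 + [0] * 6,
    "label": [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})
X = pd.get_dummies(df.drop(columns=["label"]), dtype=int)
y = df["label"]

# Stratification keeps the malware/benign ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Fixing `random_state` on both the splitter and the classifier makes the run repeatable; the scores themselves depend entirely on the data passed in.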
file_id file_name size_mb extension is_executable
1 crypt.exe 10.1 exe 1
2 sage.exe 7.2 exe 1
3 locker.exe 5.5 exe 1
4 axx.exe 9.0 exe 1
5 new_file1.exe 3.5 exe 1
6 new_file2.exe 2.7 exe 1
7 new_file3.doc 1.2 doc 0
8 new_file4.pdf 0.9 pdf 0
9 new_file5.txt 0.6 txt 0
10 new_file6.xls 1.8 xls 0
11 new_file7.docx 1.1 docx 0
12 new_file8.xlsx 1.3 xlsx 0
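Because the training data and the new data are one-hot encoded in separate `pd.get_dummies` calls, their columns only line up when both files contain the same set of extensions. A minimal sketch of aligning the new frame to the training columns with `DataFrame.reindex`, using small made-up frames (assumptions, not the gist's files) in which the new data has an extension the model never saw:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training frame with the same feature columns as malware_data.csv.
train = pd.DataFrame({
    "size_mb": [10, 20, 5, 15, 30, 25],
    "extension": ["exe", "exe", "doc", "pdf", "exe", "xls"],
    "is_executable": [1, 1, 0, 0, 1, 0],
    "label": [1, 1, 0, 0, 1, 0],
})
X_train = pd.get_dummies(train.drop(columns=["label"]), dtype=int)
y_train = train["label"]
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# New rows: "txt" was never seen in training, and doc/pdf/xls are absent.
new = pd.DataFrame({
    "size_mb": [3.5, 0.6],
    "extension": ["exe", "txt"],
    "is_executable": [1, 0],
})
X_new = pd.get_dummies(new, dtype=int)

# Keep exactly the training columns, in training order: columns missing from
# the new data are filled with 0, and unseen ones (extension_txt) are dropped.
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

predictions = model.predict(X_new)
print(predictions)
```

Dropping an unseen extension means the model treats that row as "none of the known extensions", which is a simplification; whether that is acceptable depends on the use case.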