{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model: _Decision Trees for Model Selection_\n",
"In this practical use case, we aim to detect malware on Android devices by analyzing the network traffic each device generates, using ensembles of decision trees.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Author: José Alamo Palomino"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case Study: Malware Detection on Android"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sophisticated, advanced Android malware can detect the emulator used by the malware analyst and, in response, alter its behavior to evade detection. To overcome this problem, we install the Android applications on a real device and capture their network traffic. See our publicly available Android Sandbox.\n",
"\n",
"The CICAAGM dataset is captured by installing Android applications on real smartphones in a semi-automated process. The dataset is generated from 1900 applications in the following three categories:\n",
"\n",
"### 1. Adware (250 applications)\n",
"\n",
"* **Airpush:** designed to deliver unsolicited advertisements to the user's system in order to steal information.\n",
"\n",
"* **Dowgin:** designed as an advertising library that can also steal the user's information.\n",
"\n",
"* **Kemoge:** designed to take over a user's Android device. This adware is a botnet hybrid and disguises itself as popular applications through repackaging.\n",
"\n",
"* **Mobidash:** designed to display ads and compromise the user's personal information.\n",
"\n",
"* **Shuanet:** like Kemoge, Shuanet is also designed to take over a user's device.\n",
"\n",
"### 2. General Malware (150 applications)\n",
"\n",
"* **AVpass:** designed to be distributed disguised as a clock application.\n",
"\n",
"* **FakeAV:** designed as a scam that tricks the user into buying a full version of the software in order to remediate non-existent infections.\n",
"\n",
"* **FakeFlash / FakePlayer:** designed as a fake Flash application that directs users to a website (after a successful installation).\n",
"\n",
"* **GGtracker:** designed for SMS fraud (it sends SMS messages to a premium-rate number) and information theft.\n",
"\n",
"* **Penetho:** designed as a fake service (a hacktool for Android devices that can be used to crack WiFi passwords). The malware can also infect the user's computer through infected email attachments, fake updates, external media, and infected documents.\n",
"\n",
"### 3. Benign (1500 applications)\n",
"\n",
"* 2015 Google Play market (top free popular and top free new)\n",
"* 2016 Google Play market (top free popular and top free new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observation:\n",
"The dataset was built by installing applications on Android devices and then capturing the network traffic those devices generated.\n",
"\n",
"The goal is to train a Random Forest algorithm that can distinguish network traffic flows belonging to the classes `benign`, `asware` (the spelling the dataset uses for its adware label), and `GeneralMalware`, so that when new network traffic arrives, the model returns one of these three categories."
]
},
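{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the goal stated above, the sketch below trains a Random Forest on synthetic three-class data and scores it with a weighted F1. It is only a sketch under assumptions: the data comes from `make_classification` rather than the CICAAGM flows, and the hyperparameters (`n_estimators=50`) and the names `X_demo`, `y_demo`, and `clf_demo` are hypothetical choices, not settings taken from this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import f1_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in for the flow features (assumption: not the real data).\n",
"X_demo, y_idx = make_classification(\n",
"    n_samples=600, n_features=10, n_informative=5,\n",
"    n_classes=3, random_state=42)\n",
"# Map the integer classes to the three label strings used in the dataset.\n",
"labels = np.array(['benign', 'asware', 'GeneralMalware'])\n",
"y_demo = labels[y_idx]\n",
"\n",
"# Hold out 20% of the synthetic rows for evaluation, keeping class balance.\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)\n",
"\n",
"# Hypothetical hyperparameters, chosen only to keep the demo fast.\n",
"clf_demo = RandomForestClassifier(n_estimators=50, random_state=42)\n",
"clf_demo.fit(X_tr, y_tr)\n",
"\n",
"# Each prediction is one of the three label strings.\n",
"y_pred = clf_demo.predict(X_te)\n",
"print('Weighted F1:', f1_score(y_te, y_pred, average='weighted'))\n",
"```\n",
"\n",
"With the real dataset, the same `fit` and `predict` calls would presumably receive the flow features and the `calss` label column instead of the synthetic arrays."
]
},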
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">1. Importing the required libraries</h2>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import RobustScaler\n",
"from sklearn.metrics import f1_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">2. Helper functions</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Function for splitting the dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):\n",
"    # First split: 60% for training, 40% held out.\n",
"    strat = df[stratify] if stratify else None\n",
"    train_set, test_set = train_test_split(\n",
"        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    # Second split: divide the held-out 40% evenly into validation and test.\n",
"    strat = test_set[stratify] if stratify else None\n",
"    val_set, test_set = train_test_split(\n",
"        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    return (train_set, val_set, test_set)"
]
},
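{
"cell_type": "markdown",
"metadata": {},
"source": [
"The helper above applies `train_test_split` twice: it first holds out 40% of the rows and then halves that hold-out, which gives a 60/20/20 train/validation/test partition. A minimal sketch on a toy frame shows the resulting sizes; the `toy` DataFrame, its `label` column, and the `*_part` names are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Toy data: 100 rows with two balanced classes (hypothetical, for illustration).\n",
"toy = pd.DataFrame({'x': range(100), 'label': ['a', 'b'] * 50})\n",
"\n",
"# Stage 1: keep 60% for training, hold out 40%.\n",
"train_part, rest = train_test_split(\n",
"    toy, test_size=0.4, random_state=42, stratify=toy['label'])\n",
"# Stage 2: split the hold-out in half, giving 20% validation and 20% test.\n",
"val_part, test_part = train_test_split(\n",
"    rest, test_size=0.5, random_state=42, stratify=rest['label'])\n",
"\n",
"print(len(train_part), len(val_part), len(test_part))  # 60 20 20\n",
"```\n",
"\n",
"Passing the label column to `stratify` keeps the class proportions roughly equal across the three subsets, which can matter when the classes are imbalanced."
]
},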
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Function for separating the input features from the output label"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def remove_labels(df, label_name):\n",
"    # Features: every column except the label.\n",
"    X = df.drop(label_name, axis=1)\n",
"    # Target: a copy of the label column.\n",
"    y = df[label_name].copy()\n",
"    return (X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">3. Reading the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>530</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>50240627</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" <td>4767</td>\n",
" <td>6107</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>207.260870</td>\n",
" <td>...</td>\n",
" <td>9842879.0</td>\n",
" <td>9964749</td>\n",
" <td>1.196806e+05</td>\n",
" <td>2</td>\n",
" <td>317952</td>\n",
" <td>107008</td>\n",
" <td>11</td>\n",
" <td>23</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>35471450</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>52</td>\n",
" <td>104</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>35300000.0</td>\n",
" <td>35290631</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>3904</td>\n",
" <td>88704</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>32</td>\n",
" <td>asware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>41713629</td>\n",
" <td>12</td>\n",
" <td>26</td>\n",
" <td>1821</td>\n",
" <td>18643</td>\n",
" <td>40</td>\n",
" <td>40</td>\n",
" <td>489</td>\n",
" <td>1390</td>\n",
" <td>151.750000</td>\n",
" <td>...</td>\n",
" <td>20200000.0</td>\n",
" <td>32711382</td>\n",
" <td>1.770000e+07</td>\n",
" <td>2</td>\n",
" <td>227456</td>\n",
" <td>2432</td>\n",
" <td>23</td>\n",
" <td>12</td>\n",
" <td>20</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>50110119</td>\n",
" <td>20</td>\n",
" <td>23</td>\n",
" <td>4130</td>\n",
" <td>6043</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>206.500000</td>\n",
" <td>...</td>\n",
" <td>9873329.4</td>\n",
" <td>9906007</td>\n",
" <td>4.737363e+04</td>\n",
" <td>2</td>\n",
" <td>266112</td>\n",
" <td>59904</td>\n",
" <td>11</td>\n",
" <td>20</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"... ... ... ... ... ... \n",
"631950 530 1 1 74 334 \n",
"631951 50240627 23 24 4767 6107 \n",
"631952 35471450 1 2 52 104 \n",
"631953 41713629 12 26 1821 18643 \n",
"631954 50110119 20 23 4130 6043 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... \\\n",
"0 52 52 679 1390 53.431138 ... \n",
"1 75 124 75 124 75.000000 ... \n",
"2 52 -1 83 -1 62.333333 ... \n",
"3 52 52 706 1390 160.888889 ... \n",
"4 52 52 331 1005 132.000000 ... \n",
"... ... ... ... ... ... ... \n",
"631950 74 334 74 334 74.000000 ... \n",
"631951 52 52 533 855 207.260870 ... \n",
"631952 52 52 52 52 52.000000 ... \n",
"631953 40 40 489 1390 151.750000 ... \n",
"631954 52 52 533 855 206.500000 ... \n",
"\n",
" mean_idle max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 0.0 -1 0.000000e+00 2 4194240 \n",
"1 0.0 -1 0.000000e+00 2 0 \n",
"2 0.0 -1 0.000000e+00 4 101888 \n",
"3 0.0 -1 0.000000e+00 2 4194240 \n",
"4 0.0 -1 0.000000e+00 2 155136 \n",
"... ... ... ... ... ... \n",
"631950 0.0 -1 0.000000e+00 2 0 \n",
"631951 9842879.0 9964749 1.196806e+05 2 317952 \n",
"631952 35300000.0 35290631 0.000000e+00 2 3904 \n",
"631953 20200000.0 32711382 1.770000e+07 2 227456 \n",
"631954 9873329.4 9906007 4.737363e+04 2 266112 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"... ... ... ... \n",
"631950 0 0 1 \n",
"631951 107008 11 23 \n",
"631952 88704 1 1 \n",
"631953 2432 23 12 \n",
"631954 59904 11 20 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"... ... ... \n",
"631950 0 benign \n",
"631951 32 GeneralMalware \n",
"631952 32 asware \n",
"631953 20 benign \n",
"631954 32 GeneralMalware \n",
"\n",
"[631955 rows x 80 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('datasets/TotalFeatures-ISCXFlowMeter.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">4. Visualizing the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>261876</td>\n",
" <td>7</td>\n",
" <td>6</td>\n",
" <td>1618</td>\n",
" <td>882</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>730</td>\n",
" <td>477</td>\n",
" <td>231.142857</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>926720</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>14</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>104</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>5824</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>29675</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>806635</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>239</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>59.750000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>107008</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>56620</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1074</td>\n",
" <td>719</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>592</td>\n",
" <td>667</td>\n",
" <td>358.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>128512</td>\n",
" <td>10816</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"5 261876 7 6 1618 882 \n",
"6 14 2 0 104 0 \n",
"7 29675 1 1 71 213 \n",
"8 806635 4 0 239 0 \n",
"9 56620 3 2 1074 719 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... mean_idle \\\n",
"0 52 52 679 1390 53.431138 ... 0.0 \n",
"1 75 124 75 124 75.000000 ... 0.0 \n",
"2 52 -1 83 -1 62.333333 ... 0.0 \n",
"3 52 52 706 1390 160.888889 ... 0.0 \n",
"4 52 52 331 1005 132.000000 ... 0.0 \n",
"5 52 52 730 477 231.142857 ... 0.0 \n",
"6 52 -1 52 -1 52.000000 ... 0.0 \n",
"7 71 213 71 213 71.000000 ... 0.0 \n",
"8 52 -1 83 -1 59.750000 ... 0.0 \n",
"9 52 52 592 667 358.000000 ... 0.0 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 -1 0.0 2 4194240 \n",
"1 -1 0.0 2 0 \n",
"2 -1 0.0 4 101888 \n",
"3 -1 0.0 2 4194240 \n",
"4 -1 0.0 2 155136 \n",
"5 -1 0.0 2 4194240 \n",
"6 -1 0.0 3 5824 \n",
"7 -1 0.0 2 0 \n",
"8 -1 0.0 5 107008 \n",
"9 -1 0.0 3 128512 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"5 926720 3 7 \n",
"6 -1 0 2 \n",
"7 0 0 1 \n",
"8 -1 0 4 \n",
"9 10816 1 3 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"5 32 benign \n",
"6 32 benign \n",
"7 0 benign \n",
"8 32 benign \n",
"9 32 benign \n",
"\n",
"[10 rows x 80 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>min_idle</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>...</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.00000</td>\n",
" <td>631955.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2.195245e+07</td>\n",
" <td>6.728514</td>\n",
" <td>10.431934</td>\n",
" <td>9.540172e+02</td>\n",
" <td>1.206042e+04</td>\n",
" <td>141.475727</td>\n",
" <td>44.357688</td>\n",
" <td>263.675901</td>\n",
" <td>183.248084</td>\n",
" <td>174.959706</td>\n",
" <td>...</td>\n",
" <td>1.997327e+07</td>\n",
" <td>2.031228e+07</td>\n",
" <td>2.075238e+07</td>\n",
" <td>4.663875e+05</td>\n",
" <td>2.360896</td>\n",
" <td>9.620796e+05</td>\n",
" <td>3.104519e+05</td>\n",
" <td>9.733144</td>\n",
" <td>6.72471</td>\n",
" <td>19.965713</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.900578e+08</td>\n",
" <td>174.161354</td>\n",
" <td>349.424019</td>\n",
" <td>8.235040e+04</td>\n",
" <td>4.824716e+05</td>\n",
" <td>157.680880</td>\n",
" <td>89.099554</td>\n",
" <td>289.644383</td>\n",
" <td>371.863224</td>\n",
" <td>162.024811</td>\n",
" <td>...</td>\n",
" <td>1.897986e+08</td>\n",
" <td>1.897902e+08</td>\n",
" <td>1.899721e+08</td>\n",
" <td>6.199704e+06</td>\n",
" <td>3.041810</td>\n",
" <td>1.705655e+06</td>\n",
" <td>6.647956e+05</td>\n",
" <td>347.877923</td>\n",
" <td>174.13813</td>\n",
" <td>14.914261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-1.800000e+01</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000e+00</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>6.900000e+01</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2.445000e+04</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.840000e+02</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>8.761600e+04</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.759751e+06</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>4.270000e+02</td>\n",
" <td>1.670000e+02</td>\n",
" <td>108.000000</td>\n",
" <td>52.000000</td>\n",
" <td>421.000000</td>\n",
" <td>115.000000</td>\n",
" <td>356.000000</td>\n",
" <td>...</td>\n",
" <td>1.013498e+06</td>\n",
" <td>1.291379e+06</td>\n",
" <td>1.306116e+06</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>3.046400e+05</td>\n",
" <td>9.049600e+04</td>\n",
" <td>1.000000</td>\n",
" <td>3.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>4.431076e+10</td>\n",
" <td>48255.000000</td>\n",
" <td>74768.000000</td>\n",
" <td>4.049644e+07</td>\n",
" <td>1.039222e+08</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1500.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>...</td>\n",
" <td>4.431072e+10</td>\n",
" <td>4.430000e+10</td>\n",
" <td>4.431072e+10</td>\n",
" <td>8.470000e+08</td>\n",
" <td>2269.000000</td>\n",
" <td>4.194240e+06</td>\n",
" <td>4.194240e+06</td>\n",
" <td>74524.000000</td>\n",
" <td>48255.00000</td>\n",
" <td>44.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 79 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 6.319550e+05 \n",
"mean 2.195245e+07 6.728514 10.431934 9.540172e+02 \n",
"std 1.900578e+08 174.161354 349.424019 8.235040e+04 \n",
"min -1.800000e+01 0.000000 0.000000 0.000000e+00 \n",
"25% 0.000000e+00 1.000000 0.000000 6.900000e+01 \n",
"50% 2.445000e+04 1.000000 0.000000 1.840000e+02 \n",
"75% 1.759751e+06 3.000000 1.000000 4.270000e+02 \n",
"max 4.431076e+10 48255.000000 74768.000000 4.049644e+07 \n",
"\n",
" total_bpktl min_fpktl min_bpktl max_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 631955.000000 \n",
"mean 1.206042e+04 141.475727 44.357688 263.675901 \n",
"std 4.824716e+05 157.680880 89.099554 289.644383 \n",
"min 0.000000e+00 -1.000000 -1.000000 -1.000000 \n",
"25% 0.000000e+00 52.000000 -1.000000 52.000000 \n",
"50% 0.000000e+00 52.000000 -1.000000 83.000000 \n",
"75% 1.670000e+02 108.000000 52.000000 421.000000 \n",
"max 1.039222e+08 1390.000000 1390.000000 1500.000000 \n",
"\n",
" max_bpktl mean_fpktl ... min_idle mean_idle \\\n",
"count 631955.000000 631955.000000 ... 6.319550e+05 6.319550e+05 \n",
"mean 183.248084 174.959706 ... 1.997327e+07 2.031228e+07 \n",
"std 371.863224 162.024811 ... 1.897986e+08 1.897902e+08 \n",
"min -1.000000 0.000000 ... -1.000000e+00 0.000000e+00 \n",
"25% -1.000000 52.000000 ... -1.000000e+00 0.000000e+00 \n",
"50% -1.000000 83.000000 ... -1.000000e+00 0.000000e+00 \n",
"75% 115.000000 356.000000 ... 1.013498e+06 1.291379e+06 \n",
"max 1390.000000 1390.000000 ... 4.431072e+10 4.430000e+10 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"count 6.319550e+05 6.319550e+05 631955.000000 6.319550e+05 \n",
"mean 2.075238e+07 4.663875e+05 2.360896 9.620796e+05 \n",
"std 1.899721e+08 6.199704e+06 3.041810 1.705655e+06 \n",
"min -1.000000e+00 0.000000e+00 2.000000 -1.000000e+00 \n",
"25% -1.000000e+00 0.000000e+00 2.000000 0.000000e+00 \n",
"50% -1.000000e+00 0.000000e+00 2.000000 8.761600e+04 \n",
"75% 1.306116e+06 0.000000e+00 2.000000 3.046400e+05 \n",
"max 4.431072e+10 8.470000e+08 2269.000000 4.194240e+06 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"count 6.319550e+05 631955.000000 631955.00000 \n",
"mean 3.104519e+05 9.733144 6.72471 \n",
"std 6.647956e+05 347.877923 174.13813 \n",
"min -1.000000e+00 0.000000 0.00000 \n",
"25% -1.000000e+00 0.000000 1.00000 \n",
"50% -1.000000e+00 0.000000 1.00000 \n",
"75% 9.049600e+04 1.000000 3.00000 \n",
"max 4.194240e+06 74524.000000 48255.00000 \n",
"\n",
" min_seg_size_forward \n",
"count 631955.000000 \n",
"mean 19.965713 \n",
"std 14.914261 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 32.000000 \n",
"75% 32.000000 \n",
"max 44.000000 \n",
"\n",
"[8 rows x 79 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 631955 entries, 0 to 631954\n",
"Data columns (total 80 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 duration 631955 non-null int64 \n",
" 1 total_fpackets 631955 non-null int64 \n",
" 2 total_bpackets 631955 non-null int64 \n",
" 3 total_fpktl 631955 non-null int64 \n",
" 4 total_bpktl 631955 non-null int64 \n",
" 5 min_fpktl 631955 non-null int64 \n",
" 6 min_bpktl 631955 non-null int64 \n",
" 7 max_fpktl 631955 non-null int64 \n",
" 8 max_bpktl 631955 non-null int64 \n",
" 9 mean_fpktl 631955 non-null float64\n",
" 10 mean_bpktl 631955 non-null float64\n",
" 11 std_fpktl 631955 non-null float64\n",
" 12 std_bpktl 631955 non-null float64\n",
" 13 total_fiat 631955 non-null int64 \n",
" 14 total_biat 631955 non-null int64 \n",
" 15 min_fiat 631955 non-null int64 \n",
" 16 min_biat 631955 non-null int64 \n",
" 17 max_fiat 631955 non-null int64 \n",
" 18 max_biat 631955 non-null int64 \n",
" 19 mean_fiat 631955 non-null float64\n",
" 20 mean_biat 631955 non-null float64\n",
" 21 std_fiat 631955 non-null float64\n",
" 22 std_biat 631955 non-null float64\n",
" 23 fpsh_cnt 631955 non-null int64 \n",
" 24 bpsh_cnt 631955 non-null int64 \n",
" 25 furg_cnt 631955 non-null int64 \n",
" 26 burg_cnt 631955 non-null int64 \n",
" 27 total_fhlen 631955 non-null int64 \n",
" 28 total_bhlen 631955 non-null int64 \n",
" 29 fPktsPerSecond 631955 non-null float64\n",
" 30 bPktsPerSecond 631955 non-null float64\n",
" 31 flowPktsPerSecond 631955 non-null float64\n",
" 32 flowBytesPerSecond 631955 non-null float64\n",
" 33 min_flowpktl 631955 non-null int64 \n",
" 34 max_flowpktl 631955 non-null int64 \n",
" 35 mean_flowpktl 631955 non-null float64\n",
" 36 std_flowpktl 631955 non-null float64\n",
" 37 min_flowiat 631955 non-null int64 \n",
" 38 max_flowiat 631955 non-null int64 \n",
" 39 mean_flowiat 631955 non-null float64\n",
" 40 std_flowiat 631955 non-null float64\n",
" 41 flow_fin 631955 non-null int64 \n",
" 42 flow_syn 631955 non-null int64 \n",
" 43 flow_rst 631955 non-null int64 \n",
" 44 flow_psh 631955 non-null int64 \n",
" 45 flow_ack 631955 non-null int64 \n",
" 46 flow_urg 631955 non-null int64 \n",
" 47 flow_cwr 631955 non-null int64 \n",
" 48 flow_ece 631955 non-null int64 \n",
" 49 downUpRatio 631955 non-null float64\n",
" 50 avgPacketSize 631955 non-null float64\n",
" 51 fAvgSegmentSize 631955 non-null float64\n",
" 52 fHeaderBytes 631955 non-null int64 \n",
" 53 fAvgBytesPerBulk 631955 non-null int64 \n",
" 54 fAvgPacketsPerBulk 631955 non-null int64 \n",
" 55 fAvgBulkRate 631955 non-null int64 \n",
" 56 bVarianceDataBytes 631955 non-null float64\n",
" 57 bAvgSegmentSize 631955 non-null int64 \n",
" 58 bAvgBytesPerBulk 631955 non-null int64 \n",
" 59 bAvgPacketsPerBulk 631955 non-null int64 \n",
" 60 bAvgBulkRate 631955 non-null int64 \n",
" 61 sflow_fpacket 631955 non-null int64 \n",
" 62 sflow_fbytes 631955 non-null int64 \n",
" 63 sflow_bpacket 631955 non-null int64 \n",
" 64 sflow_bbytes 631955 non-null int64 \n",
" 65 min_active 631955 non-null int64 \n",
" 66 mean_active 631955 non-null float64\n",
" 67 max_active 631955 non-null int64 \n",
" 68 std_active 631955 non-null float64\n",
" 69 min_idle 631955 non-null int64 \n",
" 70 mean_idle 631955 non-null float64\n",
" 71 max_idle 631955 non-null int64 \n",
" 72 std_idle 631955 non-null float64\n",
" 73 FFNEPD 631955 non-null int64 \n",
" 74 Init_Win_bytes_forward 631955 non-null int64 \n",
" 75 Init_Win_bytes_backward 631955 non-null int64 \n",
" 76 RRT_samples_clnt 631955 non-null int64 \n",
" 77 Act_data_pkt_forward 631955 non-null int64 \n",
" 78 min_seg_size_forward 631955 non-null int64 \n",
" 79 calss 631955 non-null object \n",
"dtypes: float64(24), int64(55), object(1)\n",
"memory usage: 385.7+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Longitud del conjunto de datos: 631955\n",
"Número de características del conjunto de datos: 80\n"
]
}
],
"source": [
"print(\"Longitud del conjunto de datos:\", len(df))\n",
"print(\"Número de características del conjunto de datos:\", len(df.columns))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"benign 471597\n",
"asware 155613\n",
"GeneralMalware 4745\n",
"Name: calss, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"calss\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">5. División del conjunto de datos</h2>"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Dividimos el conjunto de datos\n",
"train_set, val_set, test_set = train_val_test_split(df)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"X_train, y_train = remove_labels(train_set, 'calss')\n",
"X_val, y_val = remove_labels(val_set, 'calss')\n",
"X_test, y_test = remove_labels(test_set, 'calss')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Longitud del Training Set: 379173\n",
"Longitud del Validation Set: 126391\n",
"Longitud del Test Set: 126391\n"
]
}
],
"source": [
"print(\"Longitud del Training Set:\", len(train_set))\n",
"print(\"Longitud del Validation Set:\", len(val_set))\n",
"print(\"Longitud del Test Set:\", len(test_set))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">6. Random Forests</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modelo entrenado con el conjunto de datos sin escalar"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=None, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100,\n",
" n_jobs=-1, oob_score=False, random_state=42, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"clf_rnd = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)\n",
"clf_rnd.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predecimos con el conjunto de datos de validación"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['asware', 'asware', 'benign', ..., 'benign', 'asware', 'benign'],\n",
" dtype=object)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred = clf_rnd.predict(X_val)\n",
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.9329474731171657\n"
]
}
],
"source": [
"# En un 93.94% de las ocasiones, el algoritmo está clasificando correctamente.\n",
"print(\"F1 Score:\", f1_score(y_pred, y_val, average='weighted'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">7. Selección del modelo</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tanto los árboles de decisión como los conjuntos de árboles (al igual que otros algoritmos más complejos) presentan un gran número de **hiperparámetros** que deben evaluarse para conseguir el mejor modelo. Una de las formas más comunes de seleccionarlos es mediante técnicas automáticas de selección del modelo."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Importamos la clase GridSearchCV de sklearn.model_selection\n",
"from sklearn.model_selection import GridSearchCV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uso de Grid Search para selección del modelo\n",
"# Lo que se hace es generar una matriz o un grid de parámetros \n",
"param_grid = [\n",
"\n",
" # Prueba todas las combinaciones posibles en el primer diccionario\n",
" # Se prueban 9 combinaciones de hiperparámetros (3×3) \n",
" {'n_estimators': [100, 500, 1000],\n",
" 'max_leaf_nodes': [16, 24, 36]},\n",
" \n",
" # Prueba todas las combinaciones posibles en el segundo diccionario\n",
" # luego se prueban 6 (2×3) combinaciones con bootstrap establecido como False\n",
" {'bootstrap': [False], \n",
" 'n_estimators': [100, 500], \n",
" 'max_features': [2, 3, 4]},\n",
" ]\n",
"\n",
"# n_jobs=-1 -> utiliza todos los hilos de nuestro procesador\n",
"rnd_clf = RandomForestClassifier(n_jobs=-1, random_state=42) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Instanciamos la clase GridSearchCV en el objeto \"grid_search\" y le pasamos:\n",
"\n",
"# 1. el objeto \"rnd_clf\" que se usó para instanciar la clase RandomForestClassifier\n",
"# 2. nuestro grid de parámetros \"param_grid\" definido anteriormente\n",
"# 3. cv=5, esto quiere para cada combinación de hiperparámetros, se entrena el modelo:\n",
"# rnd_clf = RandomForestClassifier(n_jobs=-1, random_state=42) en cada uno de los 5 subconjuntos creados.\n",
"\n",
"# Calculará ese F1 score y hará una media de ese F1 score para decir cual ha sido el resultado de la combinación de \n",
"# hiperparámetros.\n",
"# Al entrenarse en 5 subconjuntos, nos aseguramos de que no haya overfitting.\n",
"\n",
"# Se Entrena en 5 subconjuntos, eso es un total de (9+6)*5=75 rondas de entrenamiento\n",
"grid_search = GridSearchCV(rnd_clf, param_grid, cv=5,\n",
" scoring='f1_weighted', return_train_score=True)\n",
"grid_search.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Accedemos a los mejores parámetros\n",
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Accedemos al mejor estimador, al algoritmo que mejor resultado a dado\n",
"grid_search.best_estimator_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sacamos por orden cual ha sido el resultado de cada una de las combinaciones de los hiperparámetros\n",
"cvres = grid_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
" print(\"F1 score:\", mean_score, \"-\", \"Parámetros:\", params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Es importante entender que hay parámetros, como las combinaciones de caracteristicas que hemos visto en otros ejercicios, que pueden tratarse como hiperparámetros y evaluarse de esta misma manera**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Esta manera que hemos visto para explorar hiperparámetros esta bien si no tenemos una gran cantidad de combinaciones y tenemos claros los posibles valores. En caso contrario, es posiblemente más eficiente utilizar **RandomizedSearchCV**, que fuciona de manera similar al caso anterior, pero realizando una búsqueda sobre valores randomizados"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# En lugar de proporcionarle los valores exactos con los que tiene que realizar las combinaciones de los hiperparámetros \n",
"# como en GridSearchCV, en RandomizedSearchCV lo que se hace es proporcionarle rangos.\n",
"\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import randint\n",
"\n",
"param_distribs = {\n",
" 'n_estimators': randint(low=1, high=200), # buscar para el hiperparámetro \"n_estimators\", valores entre 1 y 200 \n",
" 'max_depth': randint(low=8, high=50), # buscar para el hiperparámetro \"max_depth\", valores entre 8 y 50 \n",
" }\n",
"\n",
"# Instanciamos la clase RandomForestClassifier en el objeto \"rnd_clf\"\n",
"rnd_clf = RandomForestClassifier(n_jobs=-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Instanciamos la clase RandomizedSearchCV en el objeto \"rnd_search\" y le pasamos:\n",
"# 1. el objeto \"rnd_clf\" que se usó para instanciar la clase RandomForestClassifier\n",
"\n",
"# 2. la distribución de los parámetros \"param_distribs\"\n",
"\n",
"# 3. se quiere 5 iteraciones \"n_iter=5\", es decir, que busque 5 valores para cada uno\n",
"\n",
"# 4. cv=2, esto quiere decir que divida el conjunto de datos en 2 subconjuntos\n",
"\n",
"# 5. y la métrica \"scoring='f1_weighted'\", para evaluar que tal se comporta cada uno de esos entrenamientos para esa\n",
"# combinación de parámetros.\n",
"\n",
"# Se entrena en 2 subconjuntos, eso es un total de 5*2=10 rondas de entrenamiento\n",
"rnd_search = RandomizedSearchCV(rnd_clf, param_distributions=param_distribs,\n",
" n_iter=5, cv=2, scoring='f1_weighted')\n",
"rnd_search.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Accedemos a los mejores parámetros\n",
"rnd_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Puedo exportar de \"rnd_search\" el mejor estimador, es decir, puedo exportar el mejor algoritmo, el mejor modelo\n",
"# entrenado y listo para predecir. El mejor estimador será el que se ha entrenado con los siguientes valores:\n",
"# {'max_depth': 24, 'n_estimators': 163}\n",
"rnd_search.best_estimator_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Representamos el f1 score de cada uno de los modelos\n",
"\n",
"# OBS: F1 score: 0.9281607204544084 - Parámetros: {'max_depth': 24, 'n_estimators': 163}\n",
"# Es el resultado de entrenar 2 subconjuntos de entrenamiento, entonces F1 score: 0.9281607204544084\n",
"# será una media ponderada de ambos.\n",
"\n",
"cvres = rnd_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
" print(\"F1 score:\", mean_score, \"-\", \"Parámetros:\", params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">8. Modelo final</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Una vez seleccionados los mejores hiperparámetros utilizando alguna de las técnicas utilizadas anteriormente, podemos obtener el modelo a partir del atributo **best_estimator_**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# observamos todos los parámetros del mejor modelo entre las combinaciones aleatorias \n",
"rnd_search.best_estimator_.get_params()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Seleccionamos el mejor modelo\n",
"clf_rnd = rnd_search.best_estimator_\n",
"clf_rnd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predecimos con el conjunto de datos de entrenamiento"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train_pred = clf_rnd.predict(X_train)\n",
"y_train_pred"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Predicción con el conjunto de datos de entrenamiento\n",
"# Puede estar produciendo overfitting\n",
"print(\"F1 score Train Set:\", f1_score(y_train_pred, y_train, average='weighted'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predecimos con el conjunto de datos de validación"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_val_pred = clf_rnd.predict(X_val)\n",
"y_val_pred"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Predicción con el conjunto de datos de validación\n",
"# Vemos que se produce menos overfitting que usando árboles de decisión\n",
"print(\"F1 score Validation Set:\", f1_score(y_val_pred, y_val, average='weighted'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}