Using the Support Vector Machine model to detect fraudulent URLs
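Section 5.2 of the notebook below fills null values with the median of each feature. As a rough, dependency-free illustration of that idea (not the notebook's own code, which relies on scikit-learn's `SimpleImputer`), the sketch below learns the medians from a few training rows and reuses them for a validation row, which is the usual way to keep held-out data from influencing preprocessing. The helper names `fit_medians` and `impute` and the toy values are hypothetical; only the two column names are taken from the dataset.

```python
import math
from statistics import median


def fit_medians(rows):
    """Learn one median per column, ignoring missing (NaN) values."""
    columns = rows[0].keys()
    return {c: median(r[c] for r in rows if not math.isnan(r[c])) for c in columns}


def impute(rows, medians):
    """Return copies of the rows with every NaN replaced by the learned median."""
    return [
        {c: (medians[c] if math.isnan(v) else v) for c, v in r.items()}
        for r in rows
    ]


nan = float("nan")

# Toy rows (hypothetical values) using two columns that contain nulls in the dataset
train = [
    {"avgpathtokenlen": 4.0, "Entropy_Filename": 0.7},
    {"avgpathtokenlen": nan, "Entropy_Filename": 0.9},
    {"avgpathtokenlen": 6.0, "Entropy_Filename": nan},
    {"avgpathtokenlen": 5.0, "Entropy_Filename": 0.8},
]
val = [{"avgpathtokenlen": nan, "Entropy_Filename": 0.5}]

medians = fit_medians(train)         # statistics come from the training rows only
train_prep = impute(train, medians)
val_prep = impute(val, medians)      # the validation row reuses the training medians

print(medians)
print(val_prep)
```

With scikit-learn the same pattern is typically written as one `fit_transform` call on the training features followed by `transform` calls on the validation and test features.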
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model: _Support Vector Machine (SVM)_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_<h2 style=\"color:blue\"> Author: José Alamo Palomino</h2>_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case Study: Detecting Malicious URLs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Description\n",
"\n",
"The Web has become a major platform for online criminal activity, and URLs are the main vehicle in this domain. To counter these problems, the security community has focused its efforts mostly on techniques for blacklisting malicious URLs.\n",
"\n",
"While blacklisting protects users from known malicious domains, it solves only part of the problem. New malicious URLs spring up across the web in large numbers and usually get a head start in this race. In addition, trusted Alexa-ranked websites can carry compromised fraudulent URLs, known as defacement URLs.\n",
"\n",
"We explore a lightweight approach to detecting and categorizing malicious URLs according to their attack type, and show that lexical analysis is effective and efficient for the proactive detection of these URLs. We also study the effect of obfuscation techniques on malicious URLs to determine which obfuscation technique targets each specific type of malicious URL. We study mainly five types of URL:\n",
"\n",
"* **Benign URLs**: more than 35,300 benign URLs were collected from Alexa's top websites. The domains were passed through a Heritrix web crawler to extract the URLs. Around half a million unique URLs were crawled initially and then filtered to remove duplicate and domain-only URLs. The extracted URLs were then checked with VirusTotal to keep only the benign ones.\n",
"\n",
"* **Spam URLs**: around 12,000 spam URLs were collected from the publicly available WEBSPAM-UK2007 dataset.\n",
"\n",
"* **Phishing URLs**: around 10,000 phishing URLs were taken from OpenPhish, a repository of active phishing sites.\n",
"\n",
"* **Malware URLs**: more than 11,500 URLs related to malware websites were obtained from DNS-BH, a project that maintains a list of malware sites.\n",
"\n",
"* **Defacement URLs**: more than 45,450 URLs belong to the defacement category. These are trusted, Alexa-ranked websites that host fraudulent or hidden URLs containing malicious web pages.\n",
"\n",
"Obfuscation is a common method for masking malicious URLs. An attacker who wants to evade static analysis of lexical URL features uses obfuscation techniques so that malicious URLs become statistically similar to benign ones. In this research, the obfuscation techniques applied to URLs are analyzed to detect malicious intent. We mainly analyze spam, phishing and malware URLs to see which kinds of obfuscation techniques are applied to them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NOTE\n",
"\n",
"<h2 style = color:red>Only benign and phishing URLs will be used for the support vector machine analysis.</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">1. Importing the required libraries</h2>"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import numpy as np\n",
"from sklearn.metrics import f1_score\n",
"from sklearn.preprocessing import StandardScaler, RobustScaler\n",
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">2. Helper functions</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Function for splitting the dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):\n",
"    \"\"\"Split df into 60% training, 20% validation and 20% test sets.\"\"\"\n",
"    strat = df[stratify] if stratify else None\n",
"    # First split: 60% for training, 40% held out\n",
"    train_set, test_set = train_test_split(\n",
"        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    strat = test_set[stratify] if stratify else None\n",
"    # Second split: divide the held-out 40% evenly into validation and test\n",
"    val_set, test_set = train_test_split(\n",
"        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    return (train_set, val_set, test_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Plotting the decision boundary"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# This function is used later in the notebook.\n",
"# It expects a fitted linear-kernel classifier trained on two features (coef_ is only defined for linear kernels).\n",
"def plot_svc_decision_boundary(svm_clf, xmin, xmax):\n",
"    w = svm_clf.coef_[0]\n",
"    b = svm_clf.intercept_[0]\n",
"\n",
"    # At the decision boundary, w0*x0 + w1*x1 + b = 0\n",
"    # => x1 = -w0/w1 * x0 - b/w1\n",
"    x0 = np.linspace(xmin, xmax, 200)\n",
"    decision_boundary = -w[0]/w[1] * x0 - b/w[1]\n",
"\n",
"    # Vertical offset of the margin lines, where w0*x0 + w1*x1 + b = +/-1\n",
"    margin = 1/w[1]\n",
"    gutter_up = decision_boundary + margin\n",
"    gutter_down = decision_boundary - margin\n",
"\n",
"    # Highlight the support vectors, then draw the boundary and both margins\n",
"    svs = svm_clf.support_vectors_\n",
"    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')\n",
"    plt.plot(x0, decision_boundary, \"k-\", linewidth=2)\n",
"    plt.plot(x0, gutter_up, \"k--\", linewidth=2)\n",
"    plt.plot(x0, gutter_down, \"k--\", linewidth=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">3. Reading the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Querylength</th>\n",
" <th>domain_token_count</th>\n",
" <th>path_token_count</th>\n",
" <th>avgdomaintokenlen</th>\n",
" <th>longdomaintokenlen</th>\n",
" <th>avgpathtokenlen</th>\n",
" <th>tld</th>\n",
" <th>charcompvowels</th>\n",
" <th>charcompace</th>\n",
" <th>ldl_url</th>\n",
" <th>...</th>\n",
" <th>SymbolCount_FileName</th>\n",
" <th>SymbolCount_Extension</th>\n",
" <th>SymbolCount_Afterpath</th>\n",
" <th>Entropy_URL</th>\n",
" <th>Entropy_Domain</th>\n",
" <th>Entropy_DirectoryName</th>\n",
" <th>Entropy_Filename</th>\n",
" <th>Entropy_Extension</th>\n",
" <th>Entropy_Afterpath</th>\n",
" <th>URL_Type_obf_Type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>12</td>\n",
" <td>5.5</td>\n",
" <td>8</td>\n",
" <td>4.083334</td>\n",
" <td>2</td>\n",
" <td>15</td>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0.676804</td>\n",
" <td>0.860529</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>12</td>\n",
" <td>5.0</td>\n",
" <td>10</td>\n",
" <td>3.583333</td>\n",
" <td>3</td>\n",
" <td>12</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0.715629</td>\n",
" <td>0.776796</td>\n",
" <td>0.693127</td>\n",
" <td>0.738315</td>\n",
" <td>1.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>11</td>\n",
" <td>4.0</td>\n",
" <td>5</td>\n",
" <td>4.750000</td>\n",
" <td>2</td>\n",
" <td>16</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0.677701</td>\n",
" <td>1.000000</td>\n",
" <td>0.677704</td>\n",
" <td>0.916667</td>\n",
" <td>0.00000</td>\n",
" <td>0.898227</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>4.5</td>\n",
" <td>7</td>\n",
" <td>5.714286</td>\n",
" <td>2</td>\n",
" <td>15</td>\n",
" <td>10</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0.696067</td>\n",
" <td>0.879588</td>\n",
" <td>0.818007</td>\n",
" <td>0.753585</td>\n",
" <td>0.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>19</td>\n",
" <td>2</td>\n",
" <td>10</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>2.250000</td>\n",
" <td>2</td>\n",
" <td>9</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>0.747202</td>\n",
" <td>0.833700</td>\n",
" <td>0.655459</td>\n",
" <td>0.829535</td>\n",
" <td>0.83615</td>\n",
" <td>0.823008</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>10</td>\n",
" <td>5.5</td>\n",
" <td>9</td>\n",
" <td>4.100000</td>\n",
" <td>2</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0.732981</td>\n",
" <td>0.860529</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>12</td>\n",
" <td>4.5</td>\n",
" <td>6</td>\n",
" <td>5.333334</td>\n",
" <td>2</td>\n",
" <td>24</td>\n",
" <td>9</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0.692383</td>\n",
" <td>0.939794</td>\n",
" <td>0.910795</td>\n",
" <td>0.673973</td>\n",
" <td>0.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>11</td>\n",
" <td>3.5</td>\n",
" <td>4</td>\n",
" <td>3.909091</td>\n",
" <td>2</td>\n",
" <td>15</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0.707365</td>\n",
" <td>0.916667</td>\n",
" <td>0.916667</td>\n",
" <td>0.690332</td>\n",
" <td>0.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>9</td>\n",
" <td>2.5</td>\n",
" <td>3</td>\n",
" <td>4.555555</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0.742606</td>\n",
" <td>1.000000</td>\n",
" <td>0.785719</td>\n",
" <td>0.808833</td>\n",
" <td>1.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>13</td>\n",
" <td>4.5</td>\n",
" <td>6</td>\n",
" <td>5.307692</td>\n",
" <td>2</td>\n",
" <td>16</td>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0.734633</td>\n",
" <td>0.939794</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.00000</td>\n",
" <td>-1.000000</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" Querylength domain_token_count path_token_count avgdomaintokenlen \\\n",
"0 0 2 12 5.5 \n",
"1 0 3 12 5.0 \n",
"2 2 2 11 4.0 \n",
"3 0 2 7 4.5 \n",
"4 19 2 10 6.0 \n",
"5 0 2 10 5.5 \n",
"6 0 2 12 4.5 \n",
"7 0 2 11 3.5 \n",
"8 0 2 9 2.5 \n",
"9 0 2 13 4.5 \n",
"\n",
" longdomaintokenlen avgpathtokenlen tld charcompvowels charcompace \\\n",
"0 8 4.083334 2 15 7 \n",
"1 10 3.583333 3 12 8 \n",
"2 5 4.750000 2 16 11 \n",
"3 7 5.714286 2 15 10 \n",
"4 9 2.250000 2 9 5 \n",
"5 9 4.100000 2 15 11 \n",
"6 6 5.333334 2 24 9 \n",
"7 4 3.909091 2 15 6 \n",
"8 3 4.555555 2 6 3 \n",
"9 6 5.307692 2 16 9 \n",
"\n",
" ldl_url ... SymbolCount_FileName SymbolCount_Extension \\\n",
"0 0 ... -1 -1 \n",
"1 2 ... 1 0 \n",
"2 0 ... 2 0 \n",
"3 0 ... 0 0 \n",
"4 0 ... 5 4 \n",
"5 0 ... -1 -1 \n",
"6 0 ... 0 0 \n",
"7 0 ... 0 0 \n",
"8 0 ... 1 0 \n",
"9 1 ... -1 -1 \n",
"\n",
" SymbolCount_Afterpath Entropy_URL Entropy_Domain Entropy_DirectoryName \\\n",
"0 -1 0.676804 0.860529 -1.000000 \n",
"1 -1 0.715629 0.776796 0.693127 \n",
"2 1 0.677701 1.000000 0.677704 \n",
"3 -1 0.696067 0.879588 0.818007 \n",
"4 3 0.747202 0.833700 0.655459 \n",
"5 -1 0.732981 0.860529 -1.000000 \n",
"6 -1 0.692383 0.939794 0.910795 \n",
"7 -1 0.707365 0.916667 0.916667 \n",
"8 -1 0.742606 1.000000 0.785719 \n",
"9 -1 0.734633 0.939794 -1.000000 \n",
"\n",
" Entropy_Filename Entropy_Extension Entropy_Afterpath URL_Type_obf_Type \n",
"0 -1.000000 -1.00000 -1.000000 benign \n",
"1 0.738315 1.00000 -1.000000 benign \n",
"2 0.916667 0.00000 0.898227 benign \n",
"3 0.753585 0.00000 -1.000000 benign \n",
"4 0.829535 0.83615 0.823008 benign \n",
"5 -1.000000 -1.00000 -1.000000 benign \n",
"6 0.673973 0.00000 -1.000000 benign \n",
"7 0.690332 0.00000 -1.000000 benign \n",
"8 0.808833 1.00000 -1.000000 benign \n",
"9 -1.000000 -1.00000 -1.000000 benign \n",
"\n",
"[10 rows x 80 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"FinalDataset/Phishing.csv\")\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Querylength</th>\n",
" <th>domain_token_count</th>\n",
" <th>path_token_count</th>\n",
" <th>avgdomaintokenlen</th>\n",
" <th>longdomaintokenlen</th>\n",
" <th>avgpathtokenlen</th>\n",
" <th>tld</th>\n",
" <th>charcompvowels</th>\n",
" <th>charcompace</th>\n",
" <th>ldl_url</th>\n",
" <th>...</th>\n",
" <th>SymbolCount_Directoryname</th>\n",
" <th>SymbolCount_FileName</th>\n",
" <th>SymbolCount_Extension</th>\n",
" <th>SymbolCount_Afterpath</th>\n",
" <th>Entropy_URL</th>\n",
" <th>Entropy_Domain</th>\n",
" <th>Entropy_DirectoryName</th>\n",
" <th>Entropy_Filename</th>\n",
" <th>Entropy_Extension</th>\n",
" <th>Entropy_Afterpath</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15096.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>...</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>15367.000000</td>\n",
" <td>13541.000000</td>\n",
" <td>15177.000000</td>\n",
" <td>15364.000000</td>\n",
" <td>15364.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>3.446021</td>\n",
" <td>2.543698</td>\n",
" <td>8.477061</td>\n",
" <td>5.851956</td>\n",
" <td>10.027461</td>\n",
" <td>5.289936</td>\n",
" <td>2.543698</td>\n",
" <td>12.659986</td>\n",
" <td>8.398516</td>\n",
" <td>1.910913</td>\n",
" <td>...</td>\n",
" <td>2.120843</td>\n",
" <td>1.124618</td>\n",
" <td>0.500813</td>\n",
" <td>-0.158782</td>\n",
" <td>0.721684</td>\n",
" <td>0.854232</td>\n",
" <td>0.634859</td>\n",
" <td>0.682896</td>\n",
" <td>0.313617</td>\n",
" <td>-0.723793</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>14.151453</td>\n",
" <td>0.944938</td>\n",
" <td>4.660250</td>\n",
" <td>2.064581</td>\n",
" <td>5.281090</td>\n",
" <td>3.535097</td>\n",
" <td>0.944938</td>\n",
" <td>8.562206</td>\n",
" <td>6.329007</td>\n",
" <td>4.657731</td>\n",
" <td>...</td>\n",
" <td>2.777307</td>\n",
" <td>2.570246</td>\n",
" <td>2.261013</td>\n",
" <td>2.535939</td>\n",
" <td>0.049246</td>\n",
" <td>0.072641</td>\n",
" <td>0.510992</td>\n",
" <td>0.502288</td>\n",
" <td>0.576910</td>\n",
" <td>0.649785</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.500000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.419560</td>\n",
" <td>0.561913</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>5.000000</td>\n",
" <td>4.500000</td>\n",
" <td>7.000000</td>\n",
" <td>3.800000</td>\n",
" <td>2.000000</td>\n",
" <td>6.000000</td>\n",
" <td>4.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.687215</td>\n",
" <td>0.798231</td>\n",
" <td>0.709532</td>\n",
" <td>0.707165</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>8.000000</td>\n",
" <td>5.500000</td>\n",
" <td>9.000000</td>\n",
" <td>4.500000</td>\n",
" <td>2.000000</td>\n",
" <td>11.000000</td>\n",
" <td>7.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.723217</td>\n",
" <td>0.859793</td>\n",
" <td>0.785949</td>\n",
" <td>0.814038</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>11.000000</td>\n",
" <td>6.666666</td>\n",
" <td>12.000000</td>\n",
" <td>5.571429</td>\n",
" <td>3.000000</td>\n",
" <td>17.000000</td>\n",
" <td>11.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.757949</td>\n",
" <td>0.916667</td>\n",
" <td>0.859582</td>\n",
" <td>0.916667</td>\n",
" <td>1.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>173.000000</td>\n",
" <td>19.000000</td>\n",
" <td>68.000000</td>\n",
" <td>29.500000</td>\n",
" <td>63.000000</td>\n",
" <td>105.000000</td>\n",
" <td>19.000000</td>\n",
" <td>94.000000</td>\n",
" <td>62.000000</td>\n",
" <td>58.000000</td>\n",
" <td>...</td>\n",
" <td>24.000000</td>\n",
" <td>31.000000</td>\n",
" <td>30.000000</td>\n",
" <td>29.000000</td>\n",
" <td>0.869701</td>\n",
" <td>1.000000</td>\n",
" <td>0.962479</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 79 columns</p>\n",
"</div>"
],
"text/plain": [
" Querylength domain_token_count path_token_count avgdomaintokenlen \\\n",
"count 15367.000000 15367.000000 15367.000000 15367.000000 \n",
"mean 3.446021 2.543698 8.477061 5.851956 \n",
"std 14.151453 0.944938 4.660250 2.064581 \n",
"min 0.000000 2.000000 0.000000 1.500000 \n",
"25% 0.000000 2.000000 5.000000 4.500000 \n",
"50% 0.000000 2.000000 8.000000 5.500000 \n",
"75% 0.000000 3.000000 11.000000 6.666666 \n",
"max 173.000000 19.000000 68.000000 29.500000 \n",
"\n",
" longdomaintokenlen avgpathtokenlen tld charcompvowels \\\n",
"count 15367.000000 15096.000000 15367.000000 15367.000000 \n",
"mean 10.027461 5.289936 2.543698 12.659986 \n",
"std 5.281090 3.535097 0.944938 8.562206 \n",
"min 2.000000 0.000000 2.000000 0.000000 \n",
"25% 7.000000 3.800000 2.000000 6.000000 \n",
"50% 9.000000 4.500000 2.000000 11.000000 \n",
"75% 12.000000 5.571429 3.000000 17.000000 \n",
"max 63.000000 105.000000 19.000000 94.000000 \n",
"\n",
" charcompace ldl_url ... SymbolCount_Directoryname \\\n",
"count 15367.000000 15367.000000 ... 15367.000000 \n",
"mean 8.398516 1.910913 ... 2.120843 \n",
"std 6.329007 4.657731 ... 2.777307 \n",
"min 0.000000 0.000000 ... -1.000000 \n",
"25% 4.000000 0.000000 ... 1.000000 \n",
"50% 7.000000 0.000000 ... 2.000000 \n",
"75% 11.000000 1.000000 ... 3.000000 \n",
"max 62.000000 58.000000 ... 24.000000 \n",
"\n",
" SymbolCount_FileName SymbolCount_Extension SymbolCount_Afterpath \\\n",
"count 15367.000000 15367.000000 15367.000000 \n",
"mean 1.124618 0.500813 -0.158782 \n",
"std 2.570246 2.261013 2.535939 \n",
"min -1.000000 -1.000000 -1.000000 \n",
"25% 0.000000 0.000000 -1.000000 \n",
"50% 0.000000 0.000000 -1.000000 \n",
"75% 1.000000 0.000000 -1.000000 \n",
"max 31.000000 30.000000 29.000000 \n",
"\n",
" Entropy_URL Entropy_Domain Entropy_DirectoryName Entropy_Filename \\\n",
"count 15367.000000 15367.000000 13541.000000 15177.000000 \n",
"mean 0.721684 0.854232 0.634859 0.682896 \n",
"std 0.049246 0.072641 0.510992 0.502288 \n",
"min 0.419560 0.561913 -1.000000 -1.000000 \n",
"25% 0.687215 0.798231 0.709532 0.707165 \n",
"50% 0.723217 0.859793 0.785949 0.814038 \n",
"75% 0.757949 0.916667 0.859582 0.916667 \n",
"max 0.869701 1.000000 0.962479 1.000000 \n",
"\n",
" Entropy_Extension Entropy_Afterpath \n",
"count 15364.000000 15364.000000 \n",
"mean 0.313617 -0.723793 \n",
"std 0.576910 0.649785 \n",
"min -1.000000 -1.000000 \n",
"25% 0.000000 -1.000000 \n",
"50% 0.000000 -1.000000 \n",
"75% 1.000000 -1.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
"[8 rows x 79 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 15367 entries, 0 to 15366\n",
"Data columns (total 80 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Querylength 15367 non-null int64 \n",
" 1 domain_token_count 15367 non-null int64 \n",
" 2 path_token_count 15367 non-null int64 \n",
" 3 avgdomaintokenlen 15367 non-null float64\n",
" 4 longdomaintokenlen 15367 non-null int64 \n",
" 5 avgpathtokenlen 15096 non-null float64\n",
" 6 tld 15367 non-null int64 \n",
" 7 charcompvowels 15367 non-null int64 \n",
" 8 charcompace 15367 non-null int64 \n",
" 9 ldl_url 15367 non-null int64 \n",
" 10 ldl_domain 15367 non-null int64 \n",
" 11 ldl_path 15367 non-null int64 \n",
" 12 ldl_filename 15367 non-null int64 \n",
" 13 ldl_getArg 15367 non-null int64 \n",
" 14 dld_url 15367 non-null int64 \n",
" 15 dld_domain 15367 non-null int64 \n",
" 16 dld_path 15367 non-null int64 \n",
" 17 dld_filename 15367 non-null int64 \n",
" 18 dld_getArg 15367 non-null int64 \n",
" 19 urlLen 15367 non-null int64 \n",
" 20 domainlength 15367 non-null int64 \n",
" 21 pathLength 15367 non-null int64 \n",
" 22 subDirLen 15367 non-null int64 \n",
" 23 fileNameLen 15367 non-null int64 \n",
" 24 this.fileExtLen 15367 non-null int64 \n",
" 25 ArgLen 15367 non-null int64 \n",
" 26 pathurlRatio 15367 non-null float64\n",
" 27 ArgUrlRatio 15367 non-null float64\n",
" 28 argDomanRatio 15367 non-null float64\n",
" 29 domainUrlRatio 15367 non-null float64\n",
" 30 pathDomainRatio 15367 non-null float64\n",
" 31 argPathRatio 15367 non-null float64\n",
" 32 executable 15367 non-null int64 \n",
" 33 isPortEighty 15367 non-null int64 \n",
" 34 NumberofDotsinURL 15367 non-null int64 \n",
" 35 ISIpAddressInDomainName 15367 non-null int64 \n",
" 36 CharacterContinuityRate 15367 non-null float64\n",
" 37 LongestVariableValue 15367 non-null int64 \n",
" 38 URL_DigitCount 15367 non-null int64 \n",
" 39 host_DigitCount 15367 non-null int64 \n",
" 40 Directory_DigitCount 15367 non-null int64 \n",
" 41 File_name_DigitCount 15367 non-null int64 \n",
" 42 Extension_DigitCount 15367 non-null int64 \n",
" 43 Query_DigitCount 15367 non-null int64 \n",
" 44 URL_Letter_Count 15367 non-null int64 \n",
" 45 host_letter_count 15367 non-null int64 \n",
" 46 Directory_LetterCount 15367 non-null int64 \n",
" 47 Filename_LetterCount 15367 non-null int64 \n",
" 48 Extension_LetterCount 15367 non-null int64 \n",
" 49 Query_LetterCount 15367 non-null int64 \n",
" 50 LongestPathTokenLength 15367 non-null int64 \n",
" 51 Domain_LongestWordLength 15367 non-null int64 \n",
" 52 Path_LongestWordLength 15367 non-null int64 \n",
" 53 sub-Directory_LongestWordLength 15367 non-null int64 \n",
" 54 Arguments_LongestWordLength 15367 non-null int64 \n",
" 55 URL_sensitiveWord 15367 non-null int64 \n",
" 56 URLQueries_variable 15367 non-null int64 \n",
" 57 spcharUrl 15367 non-null int64 \n",
" 58 delimeter_Domain 15367 non-null int64 \n",
" 59 delimeter_path 15367 non-null int64 \n",
" 60 delimeter_Count 15367 non-null int64 \n",
" 61 NumberRate_URL 15367 non-null float64\n",
" 62 NumberRate_Domain 15367 non-null float64\n",
" 63 NumberRate_DirectoryName 15358 non-null float64\n",
" 64 NumberRate_FileName 15358 non-null float64\n",
" 65 NumberRate_Extension 8012 non-null float64\n",
" 66 NumberRate_AfterPath 15364 non-null float64\n",
" 67 SymbolCount_URL 15367 non-null int64 \n",
" 68 SymbolCount_Domain 15367 non-null int64 \n",
" 69 SymbolCount_Directoryname 15367 non-null int64 \n",
" 70 SymbolCount_FileName 15367 non-null int64 \n",
" 71 SymbolCount_Extension 15367 non-null int64 \n",
" 72 SymbolCount_Afterpath 15367 non-null int64 \n",
" 73 Entropy_URL 15367 non-null float64\n",
" 74 Entropy_Domain 15367 non-null float64\n",
" 75 Entropy_DirectoryName 13541 non-null float64\n",
" 76 Entropy_Filename 15177 non-null float64\n",
" 77 Entropy_Extension 15364 non-null float64\n",
" 78 Entropy_Afterpath 15364 non-null float64\n",
" 79 URL_Type_obf_Type 15367 non-null object \n",
"dtypes: float64(21), int64(58), object(1)\n",
"memory usage: 9.4+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"benign 7781\n",
"phishing 7586\n",
"Name: URL_Type_obf_Type, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"URL_Type_obf_Type\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"avgpathtokenlen True\n",
"NumberRate_DirectoryName True\n",
"NumberRate_FileName True\n",
"NumberRate_Extension True\n",
"NumberRate_AfterPath True\n",
"Entropy_DirectoryName True\n",
"Entropy_Filename True\n",
"Entropy_Extension True\n",
"Entropy_Afterpath True\n",
"dtype: bool"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any column contains null values\n",
"is_null = df.isna().any()\n",
"is_null[is_null]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"argPathRatio True\n",
"dtype: bool"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any column contains infinite values\n",
"# Some features are computed from other values (for example, as a division), which can produce infinities\n",
"is_inf = df.isin([np.inf, -np.inf]).any()\n",
"is_inf[is_inf]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting two input features"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# The 15367 URLs are plotted using the following two variables:\n",
"# * domainUrlRatio = domain-to-URL ratio (X axis)\n",
"# * domainlength = domain length (Y axis)\n",
"\n",
"# The categories of the variable \"URL_Type_obf_Type\" are drawn in different colors:\n",
"# benign (green) 7781 URLs\n",
"# phishing (red) 7586 URLs\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"plt.scatter(df[\"domainUrlRatio\"][df['URL_Type_obf_Type'] == \"phishing\"], df[\"domainlength\"][df['URL_Type_obf_Type'] == \"phishing\"], c=\"r\", marker=\".\")\n",
"plt.scatter(df[\"domainUrlRatio\"][df['URL_Type_obf_Type'] == \"benign\"], df[\"domainlength\"][df['URL_Type_obf_Type'] == \"benign\"], c=\"g\", marker=\"x\")\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=13)\n",
"plt.ylabel(\"domainlength\", fontsize=13)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">4. Splitting the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Split the dataset into training, validation and test sets\n",
"train_set, val_set, test_set = train_val_test_split(df)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 9220\n",
"Validation set size: 3073\n",
"Test set size: 3074\n"
]
}
],
"source": [
"print(\"Training set size:\", len(train_set))\n",
"print(\"Validation set size:\", len(val_set))\n",
"print(\"Test set size:\", len(test_set))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For each subset, we separate the labels from the input features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.1 Training set"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"X_train = train_set.drop(\"URL_Type_obf_Type\", axis=1)\n",
"y_train = train_set[\"URL_Type_obf_Type\"].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.2 Validation set"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"X_val = val_set.drop(\"URL_Type_obf_Type\", axis=1)\n",
"y_val = val_set[\"URL_Type_obf_Type\"].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.3 Test set"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"X_test = test_set.drop(\"URL_Type_obf_Type\", axis=1)\n",
"y_test = test_set[\"URL_Type_obf_Type\"].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">5. Preparing the dataset</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.1 Dropping the attribute that contains infinite values"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"argPathRatio True\n",
"dtype: bool"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Confirm that the variable \"argPathRatio\" contains infinite values\n",
"is_inf = df.isin([np.inf, -np.inf]).any()\n",
"is_inf[is_inf]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# We decided to drop the \"argPathRatio\" variable after weighing alternatives such as removing the affected rows or\n",
"# replacing the infinite values with a fixed value; we concluded that dropping the attribute is the best choice for\n",
"# this kind of algorithm.\n",
"X_train = X_train.drop(\"argPathRatio\", axis=1)\n",
"X_val = X_val.drop(\"argPathRatio\", axis=1)\n",
"X_test = X_test.drop(\"argPathRatio\", axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.2 Filling null values with the median"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Import the SimpleImputer class from sklearn\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Instantiate SimpleImputer with the \"median\" strategy\n",
"imputer = SimpleImputer(strategy=\"median\")\n",
"\n",
"# The imputer finds the null values in the data it receives and replaces them with the column median,\n",
"# so there is no need to list the input features by name.\n",
"\n",
"# The \"median\" strategy only works with numeric data. Nothing has to be dropped here because every input feature\n",
"# is numeric (integer or float)."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Replace null values with the MEDIAN to obtain the preprocessed subsets.\n",
"# The medians are learned from the training set only (fit_transform) and then reused on the validation and\n",
"# test sets (transform), so that no statistics from held-out data leak into preprocessing.\n",
"# One drawback is that the imputer returns NumPy arrays instead of DataFrames.\n",
"X_train_prep = imputer.fit_transform(X_train)\n",
"X_val_prep = imputer.transform(X_val)\n",
"X_test_prep = imputer.transform(X_test)"
]
},
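{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how `SimpleImputer` learns the median in `fit` and applies it in `transform`. The values and the `toy_` names are made up for illustration and are not part of the URL dataset. Fitting on the training rows only and reusing the learned median elsewhere keeps held-out data from influencing the preprocessing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Toy data (hypothetical values, not taken from the URL dataset)\n",
"toy_train = pd.DataFrame({\"f\": [1.0, 2.0, np.nan, 9.0]})\n",
"toy_val = pd.DataFrame({\"f\": [100.0, np.nan]})\n",
"\n",
"toy_imputer = SimpleImputer(strategy=\"median\")\n",
"toy_imputer.fit(toy_train)             # learns the median of the non-null training values 1, 2 and 9\n",
"print(toy_imputer.statistics_)         # the learned median, 2.0\n",
"print(toy_imputer.transform(toy_val))  # the NaN in the validation rows is replaced by 2.0, not by 100.0"
]
},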
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(X_train_prep)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# To work around this, convert the results back into Pandas DataFrames\n",
"X_train_prep = pd.DataFrame(X_train_prep, columns=X_train.columns, index=y_train.index)\n",
"X_val_prep = pd.DataFrame(X_val_prep, columns=X_val.columns, index=y_val.index)\n",
"X_test_prep = pd.DataFrame(X_test_prep, columns=X_test.columns, index=y_test.index)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Querylength</th>\n",
" <th>domain_token_count</th>\n",
" <th>path_token_count</th>\n",
" <th>avgdomaintokenlen</th>\n",
" <th>longdomaintokenlen</th>\n",
" <th>avgpathtokenlen</th>\n",
" <th>tld</th>\n",
" <th>charcompvowels</th>\n",
" <th>charcompace</th>\n",
" <th>ldl_url</th>\n",
" <th>...</th>\n",
" <th>SymbolCount_Directoryname</th>\n",
" <th>SymbolCount_FileName</th>\n",
" <th>SymbolCount_Extension</th>\n",
" <th>SymbolCount_Afterpath</th>\n",
" <th>Entropy_URL</th>\n",
" <th>Entropy_Domain</th>\n",
" <th>Entropy_DirectoryName</th>\n",
" <th>Entropy_Filename</th>\n",
" <th>Entropy_Extension</th>\n",
" <th>Entropy_Afterpath</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2134</th>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>6.0</td>\n",
" <td>2.000000</td>\n",
" <td>2.0</td>\n",
" <td>8.666667</td>\n",
" <td>2.0</td>\n",
" <td>17.0</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.681183</td>\n",
" <td>0.827729</td>\n",
" <td>0.702637</td>\n",
" <td>0.849605</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>18.0</td>\n",
" <td>3.250000</td>\n",
" <td>5.0</td>\n",
" <td>1.000000</td>\n",
" <td>4.0</td>\n",
" <td>18.0</td>\n",
" <td>13.0</td>\n",
" <td>2.0</td>\n",
" <td>...</td>\n",
" <td>12.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>0.695232</td>\n",
" <td>0.820160</td>\n",
" <td>0.682849</td>\n",
" <td>0.875578</td>\n",
" <td>0.000000</td>\n",
" <td>0.778747</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13622</th>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>6.666666</td>\n",
" <td>14.0</td>\n",
" <td>4.000000</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.836006</td>\n",
" <td>0.869991</td>\n",
" <td>0.879588</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15182</th>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>3.333333</td>\n",
" <td>4.0</td>\n",
" <td>3.000000</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.731804</td>\n",
" <td>0.796490</td>\n",
" <td>0.796658</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8013</th>\n",
" <td>74.0</td>\n",
" <td>2.0</td>\n",
" <td>13.0</td>\n",
" <td>9.500000</td>\n",
" <td>17.0</td>\n",
" <td>7.875000</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>29.0</td>\n",
" <td>26.0</td>\n",
" <td>...</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>0.653371</td>\n",
" <td>0.820569</td>\n",
" <td>0.758055</td>\n",
" <td>0.714969</td>\n",
" <td>0.712215</td>\n",
" <td>0.708031</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12408</th>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>8.333333</td>\n",
" <td>19.0</td>\n",
" <td>3.750000</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.726479</td>\n",
" <td>0.789538</td>\n",
" <td>0.800705</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>509</th>\n",
" <td>20.0</td>\n",
" <td>2.0</td>\n",
" <td>13.0</td>\n",
" <td>4.500000</td>\n",
" <td>6.0</td>\n",
" <td>3.000000</td>\n",
" <td>2.0</td>\n",
" <td>24.0</td>\n",
" <td>17.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>14.0</td>\n",
" <td>13.0</td>\n",
" <td>12.0</td>\n",
" <td>0.678515</td>\n",
" <td>0.796658</td>\n",
" <td>0.871049</td>\n",
" <td>0.695112</td>\n",
" <td>0.701662</td>\n",
" <td>0.698106</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10714</th>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>8.0</td>\n",
" <td>6.666666</td>\n",
" <td>14.0</td>\n",
" <td>4.250000</td>\n",
" <td>3.0</td>\n",
" <td>11.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.745348</td>\n",
" <td>0.869991</td>\n",
" <td>0.788921</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3986</th>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>6.0</td>\n",
" <td>6.500000</td>\n",
" <td>10.0</td>\n",
" <td>4.500000</td>\n",
" <td>2.0</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.760843</td>\n",
" <td>0.798231</td>\n",
" <td>0.822491</td>\n",
" <td>0.796670</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>748</th>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>8.0</td>\n",
" <td>4.000000</td>\n",
" <td>5.0</td>\n",
" <td>5.750000</td>\n",
" <td>2.0</td>\n",
" <td>14.0</td>\n",
" <td>14.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.0</td>\n",
" <td>0.709062</td>\n",
" <td>0.929897</td>\n",
" <td>0.884735</td>\n",
" <td>0.674994</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 78 columns</p>\n",
"</div>"
],
"text/plain": [
" Querylength domain_token_count path_token_count avgdomaintokenlen \\\n",
"2134 0.0 2.0 6.0 2.000000 \n",
"9178 0.0 4.0 18.0 3.250000 \n",
"13622 0.0 3.0 3.0 6.666666 \n",
"15182 0.0 3.0 5.0 3.333333 \n",
"8013 74.0 2.0 13.0 9.500000 \n",
"12408 0.0 3.0 4.0 8.333333 \n",
"509 20.0 2.0 13.0 4.500000 \n",
"10714 0.0 3.0 8.0 6.666666 \n",
"3986 0.0 2.0 6.0 6.500000 \n",
"748 0.0 2.0 8.0 4.000000 \n",
"\n",
" longdomaintokenlen avgpathtokenlen tld charcompvowels charcompace \\\n",
"2134 2.0 8.666667 2.0 17.0 10.0 \n",
"9178 5.0 1.000000 4.0 18.0 13.0 \n",
"13622 14.0 4.000000 3.0 1.0 1.0 \n",
"15182 4.0 3.000000 3.0 5.0 2.0 \n",
"8013 17.0 7.875000 2.0 21.0 29.0 \n",
"12408 19.0 3.750000 3.0 5.0 1.0 \n",
"509 6.0 3.000000 2.0 24.0 17.0 \n",
"10714 14.0 4.250000 3.0 11.0 5.0 \n",
"3986 10.0 4.500000 2.0 7.0 7.0 \n",
"748 5.0 5.750000 2.0 14.0 14.0 \n",
"\n",
" ldl_url ... SymbolCount_Directoryname SymbolCount_FileName \\\n",
"2134 0.0 ... 2.0 0.0 \n",
"9178 2.0 ... 12.0 3.0 \n",
"13622 1.0 ... 1.0 0.0 \n",
"15182 0.0 ... 2.0 1.0 \n",
"8013 26.0 ... 4.0 5.0 \n",
"12408 0.0 ... 2.0 0.0 \n",
"509 0.0 ... 1.0 14.0 \n",
"10714 0.0 ... 4.0 0.0 \n",
"3986 0.0 ... 2.0 0.0 \n",
"748 1.0 ... 2.0 0.0 \n",
"\n",
" SymbolCount_Extension SymbolCount_Afterpath Entropy_URL \\\n",
"2134 0.0 -1.0 0.681183 \n",
"9178 0.0 4.0 0.695232 \n",
"13622 0.0 -1.0 0.836006 \n",
"15182 0.0 -1.0 0.731804 \n",
"8013 4.0 3.0 0.653371 \n",
"12408 0.0 -1.0 0.726479 \n",
"509 13.0 12.0 0.678515 \n",
"10714 0.0 -1.0 0.745348 \n",
"3986 0.0 -1.0 0.760843 \n",
"748 0.0 -1.0 0.709062 \n",
"\n",
" Entropy_Domain Entropy_DirectoryName Entropy_Filename \\\n",
"2134 0.827729 0.702637 0.849605 \n",
"9178 0.820160 0.682849 0.875578 \n",
"13622 0.869991 0.879588 1.000000 \n",
"15182 0.796490 0.796658 1.000000 \n",
"8013 0.820569 0.758055 0.714969 \n",
"12408 0.789538 0.800705 1.000000 \n",
"509 0.796658 0.871049 0.695112 \n",
"10714 0.869991 0.788921 1.000000 \n",
"3986 0.798231 0.822491 0.796670 \n",
"748 0.929897 0.884735 0.674994 \n",
"\n",
" Entropy_Extension Entropy_Afterpath \n",
"2134 0.000000 -1.000000 \n",
"9178 0.000000 0.778747 \n",
"13622 0.000000 -1.000000 \n",
"15182 1.000000 -1.000000 \n",
"8013 0.712215 0.708031 \n",
"12408 0.000000 -1.000000 \n",
"509 0.701662 0.698106 \n",
"10714 0.000000 -1.000000 \n",
"3986 0.000000 -1.000000 \n",
"748 0.000000 -1.000000 \n",
"\n",
"[10 rows x 78 columns]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_prep.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Series([], dtype: bool)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any null values remain in the training set\n",
"is_null = X_train_prep.isna().any()\n",
"is_null[is_null]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Series([], dtype: bool)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any infinite values remain in the training set\n",
"is_inf = X_train_prep.isin([np.inf, -np.inf]).any()\n",
"is_inf[is_inf]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">6. SVM: Linear kernel</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will train the algorithm in **two ways**:\n",
"\n",
"<h4>1st way (REDUCED DATASET):</h4>\n",
"Training on only 2 input features (domainUrlRatio and domainlength), so that the decision boundary can be plotted.\n",
"\n",
"<h4>2nd way (FULL DATASET):</h4> \n",
"Training on all 78 input features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.1 Reduced dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Training the algorithm on the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Reduce the dataset to 2 attributes (domainUrlRatio and domainlength) so it can be plotted:\n",
"X_train_reduced = X_train_prep[[\"domainUrlRatio\", \"domainlength\"]].copy()\n",
"X_val_reduced = X_val_prep[[\"domainUrlRatio\", \"domainlength\"]].copy()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>domainUrlRatio</th>\n",
" <th>domainlength</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2134</th>\n",
" <td>0.072464</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>0.166667</td>\n",
" <td>16.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13622</th>\n",
" <td>0.511628</td>\n",
" <td>22.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15182</th>\n",
" <td>0.315789</td>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8013</th>\n",
" <td>0.107527</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5191</th>\n",
" <td>0.116667</td>\n",
" <td>14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13418</th>\n",
" <td>0.477273</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5390</th>\n",
" <td>0.157895</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>860</th>\n",
" <td>0.072917</td>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7270</th>\n",
" <td>0.207547</td>\n",
" <td>11.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9220 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" domainUrlRatio domainlength\n",
"2134 0.072464 5.0\n",
"9178 0.166667 16.0\n",
"13622 0.511628 22.0\n",
"15182 0.315789 12.0\n",
"8013 0.107527 20.0\n",
"... ... ...\n",
"5191 0.116667 14.0\n",
"13418 0.477273 21.0\n",
"5390 0.157895 9.0\n",
"860 0.072917 7.0\n",
"7270 0.207547 11.0\n",
"\n",
"[9220 rows x 2 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reduced training set containing only the two selected input features\n",
"X_train_reduced"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SVM Large Margin Classification\n",
"We now train the SVM with a linear kernel."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=50, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import the SVC class from sklearn.svm\n",
"from sklearn.svm import SVC\n",
"\n",
"# Instantiate SVC as \"svm_clf\" with kernel=\"linear\" and C=50. C is the regularization parameter: larger values\n",
"# penalize margin violations more heavily and yield a narrower margin.\n",
"svm_clf = SVC(kernel=\"linear\", C=50)\n",
"\n",
"# Train the SVM by calling fit with the reduced training set and the training labels.\n",
"svm_clf.fit(X_train_reduced, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Plotting the decision boundary**"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"def plot_svc_decision_boundary(svm_clf, xmin, xmax):\n",
"    # Weights and bias of the linear model: w[0]*x0 + w[1]*x1 + b = 0 on the boundary\n",
"    w = svm_clf.coef_[0]\n",
"    b = svm_clf.intercept_[0]\n",
"\n",
"    # Solve the boundary equation for x1 over a grid of x0 values\n",
"    x0 = np.linspace(xmin, xmax, 200)\n",
"    decision_boundary = -w[0]/w[1] * x0 - b/w[1]\n",
"\n",
"    # Vertical offset of the lines where the decision function equals +1 and -1\n",
"    margin = 1/w[1]\n",
"    gutter_up = decision_boundary + margin\n",
"    gutter_down = decision_boundary - margin\n",
"\n",
"    # Highlight the support vectors, then draw the boundary and both margin lines\n",
"    svs = svm_clf.support_vectors_\n",
"    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')\n",
"    plt.plot(x0, decision_boundary, \"k-\", linewidth=2)\n",
"    plt.plot(x0, gutter_up, \"k--\", linewidth=2)\n",
"    plt.plot(x0, gutter_down, \"k--\", linewidth=2)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12, 6))\n",
"plt.plot(X_train_reduced.values[:, 0][y_train==\"phishing\"], X_train_reduced.values[:, 1][y_train==\"phishing\"], \"g^\")\n",
"plt.plot(X_train_reduced.values[:, 0][y_train==\"benign\"], X_train_reduced.values[:, 1][y_train==\"benign\"], \"bs\")\n",
"plot_svc_decision_boundary(svm_clf, 0, 1)\n",
"plt.title(\"$C = {}$\".format(svm_clf.C), fontsize=16)\n",
"plt.axis([0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=13)\n",
"plt.ylabel(\"domainlength\", fontsize=13)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Samples of the phishing class are drawn in green (triangles) and samples of the benign class in blue\n",
"# (squares).\n",
"\n",
"# The solid line is the decision boundary built by the SVM and the dashed lines mark the margin. It separates\n",
"# the benign examples from the phishing ones reasonably well. The points highlighted in light red are the\n",
"# support vectors, that is, the training examples that lie on the margin or violate it."
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Model validation\n",
"# Predict the target variable \"URL_Type_obf_Type\" for the validation data \"X_val_reduced\"\n",
"y_pred = svm_clf.predict(X_val_reduced)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8142614601018675\n"
]
}
],
"source": [
"# Compare the predictions with the true labels.\n",
"# The F1 score for the phishing class on the validation set is 0.8143. F1 is the harmonic mean of precision\n",
"# and recall, so it is not the same as the percentage of URLs classified correctly, but it summarizes how well\n",
"# the model detects phishing URLs it has never seen.\n",
"print(\"F1 Score:\", f1_score(y_val, y_pred, pos_label='phishing'))"
]
},
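{
"cell_type": "markdown",
"metadata": {},
"source": [
"The F1 score and accuracy are different quantities. A minimal sketch on toy labels (hypothetical values and `toy_` names, not the notebook's predictions) shows how they can diverge: accuracy counts every correct prediction, while F1 for the phishing class only combines the precision and recall of that class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, f1_score\n",
"\n",
"# Toy labels (hypothetical, for illustration only)\n",
"toy_true = [\"benign\", \"benign\", \"benign\", \"phishing\", \"phishing\", \"benign\"]\n",
"toy_pred = [\"benign\", \"benign\", \"phishing\", \"phishing\", \"benign\", \"benign\"]\n",
"\n",
"# Accuracy: 4 of the 6 predictions match the true labels\n",
"print(\"Accuracy:\", accuracy_score(toy_true, toy_pred))\n",
"\n",
"# F1 for phishing: TP=1, FP=1, FN=1, so F1 = 2*TP / (2*TP + FP + FN) = 0.5\n",
"print(\"F1 (phishing):\", f1_score(toy_true, toy_pred, pos_label=\"phishing\"))"
]
},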
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**As shown later, scaling the dataset is very important for certain kernels. For the linear kernel it is less critical, although it may still improve the results.**"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('scaler',\n",
" RobustScaler(copy=True, quantile_range=(25.0, 75.0),\n",
" with_centering=True, with_scaling=True)),\n",
" ('linear_svc',\n",
" SVC(C=50, break_ties=False, cache_size=200, class_weight=None,\n",
" coef0=0.0, decision_function_shape='ovr', degree=3,\n",
" gamma='scale', kernel='linear', max_iter=-1,\n",
" probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False))],\n",
" verbose=False)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Apply scaling with a pipeline: first the RobustScaler() transformer to scale the data, then the SVM with a\n",
"# linear kernel.\n",
"svm_clf_sc = Pipeline([\n",
"    (\"scaler\", RobustScaler()),\n",
"    (\"linear_svc\", SVC(kernel=\"linear\", C=50)),\n",
"])\n",
"\n",
"svm_clf_sc.fit(X_train_reduced, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the validation set\n",
"y_pred = svm_clf_sc.predict(X_val_reduced)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8142614601018675\n"
]
}
],
"source": [
"# The F1 score for the phishing class on unseen validation URLs is again 0.8143.\n",
"# With a linear kernel, scaling makes no measurable difference here.\n",
"# With nonlinear kernels, scaling does tend to give better results.\n",
"print(\"F1 Score:\", f1_score(y_val, y_pred, pos_label='phishing'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.2 Full dataset"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Training on the full set of input features\n",
"from sklearn.svm import SVC\n",
"\n",
"svm_clf = SVC(kernel=\"linear\", C=1)\n",
"svm_clf.fit(X_train_prep, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['benign', 'benign', 'benign', ..., 'phishing', 'phishing',\n",
" 'phishing'], dtype=object)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict on the validation data\n",
"y_pred = svm_clf.predict(X_val_prep)\n",
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.9611330698287219\n"
]
}
],
"source": [
"# With all the input features, the F1 score for the phishing class on the validation set rises to 0.9611\n",
"print(\"F1 Score:\", f1_score(y_val, y_pred, pos_label='phishing'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">7. SVM: Nonlinear kernel</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.1. Polynomial Kernel (I)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Training the algorithm on the reduced dataset"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"# To plot the decision boundary, the target variable in y_train and y_val has to be numeric.\n",
"\n",
"# We map 'benign' to 0 and 'phishing' to 1. An explicit comparison is used instead of factorize(), which numbers\n",
"# the categories by order of first appearance and could therefore encode the two subsets differently.\n",
"\n",
"y_train_num = (y_train == \"phishing\").astype(int).to_numpy()\n",
"y_val_num = (y_val == \"phishing\").astype(int).to_numpy()"
]
},
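{
"cell_type": "markdown",
"metadata": {},
"source": [
"When labels are encoded as integers, it is worth checking that every subset uses the same mapping. A minimal sketch with toy labels (hypothetical `toy_` series, not the notebook's data) shows that `factorize` numbers categories by order of first appearance, whereas an explicit comparison always maps `phishing` to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Two toy label series with the same categories in a different order\n",
"toy_a = pd.Series([\"benign\", \"phishing\", \"benign\"])\n",
"toy_b = pd.Series([\"phishing\", \"benign\", \"benign\"])\n",
"\n",
"# factorize: the first category encountered gets code 0\n",
"print(toy_a.factorize()[0])  # benign is 0 in this series\n",
"print(toy_b.factorize()[0])  # phishing is 0 in this series\n",
"\n",
"# Explicit comparison: phishing is always 1 and benign is always 0\n",
"print((toy_a == \"phishing\").astype(int).to_numpy())\n",
"print((toy_b == \"phishing\").astype(int).to_numpy())"
]
},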
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('poly_features',\n",
" PolynomialFeatures(degree=3, include_bias=True,\n",
" interaction_only=False, order='C')),\n",
" ('scaler',\n",
" StandardScaler(copy=True, with_mean=True, with_std=True)),\n",
" ('svm_clf',\n",
" LinearSVC(C=20, class_weight=None, dual=True,\n",
" fit_intercept=True, intercept_scaling=1,\n",
" loss='hinge', max_iter=100000, multi_class='ovr',\n",
" penalty='l2', random_state=42, tol=0.0001,\n",
" verbose=0))],\n",
" verbose=False)"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using a pipeline, we apply some transformations before training a linear SVM.\n",
"\n",
"# \"poly_features\", PolynomialFeatures(degree=3) -> expands the input features into polynomial features. It takes\n",
"# the input attributes (here the 2 features of the reduced dataset) and generates all polynomial combinations of\n",
"# them up to degree 3.\n",
"\n",
"# \"scaler\", StandardScaler() -> scales the features.\n",
"\n",
"# \"svm_clf\", LinearSVC(C=20, loss=\"hinge\", random_state=42, max_iter=100000) -> trains a linear SVM. Because it\n",
"# now receives polynomial features, the resulting decision boundary is nonlinear in the original\n",
"# feature space.\n",
"\n",
"from sklearn.svm import LinearSVC\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"polynomial_svm_clf = Pipeline([\n",
"    (\"poly_features\", PolynomialFeatures(degree=3)),\n",
"    (\"scaler\", StandardScaler()),\n",
"    (\"svm_clf\", LinearSVC(C=20, loss=\"hinge\", random_state=42, max_iter=100000))\n",
"])\n",
"\n",
"polynomial_svm_clf.fit(X_train_reduced, y_train_num)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Plotting the decision boundary**"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"def plot_dataset(X, y):\n",
"    # Class 1 (phishing) in green, class 0 (benign) in blue\n",
"    plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"g.\")\n",
"    plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"b.\")"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x360 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
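],
"source": [
"def plot_predictions(clf, axes):\n",
"    # Build a 100x100 grid covering the plotting area [xmin, xmax, ymin, ymax]\n",
"    x0s = np.linspace(axes[0], axes[1], 100)\n",
"    x1s = np.linspace(axes[2], axes[3], 100)\n",
"    x0, x1 = np.meshgrid(x0s, x1s)\n",
"    X = np.c_[x0.ravel(), x1.ravel()]\n",
"    # Predicted class and decision-function value at every grid point\n",
"    y_pred = clf.predict(X).reshape(x0.shape)\n",
"    y_decision = clf.decision_function(X).reshape(x0.shape)\n",
"    # Shade the predicted regions and overlay the decision-function contours\n",
"    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)\n",
"    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)\n",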
"\n",
"fig, axes = plt.subplots(ncols=2, figsize=(15,5), sharey=True)\n",
"plt.sca(axes[0])\n",
"plot_dataset(X_train_reduced.values, y_train_num)\n",
"plot_predictions(polynomial_svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.sca(axes[1])\n",
"plot_predictions(polynomial_svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Benign URLs are shown in blue and phishing URLs in green.\n",
"# Generating polynomial features yields a nonlinear decision boundary (pink). It appears to separate the two\n",
"# classes fairly well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the reduced validation set\n",
"y_pred = polynomial_svm_clf.predict(X_val_reduced)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8574514038876889\n"
]
}
],
"source": [
"# The F1 score on the validation set is 0.8575\n",
"print(\"F1 Score:\", f1_score(y_val_num, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.2. Polynomial Kernel (II)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a simpler way to train an SVM with a polynomial kernel: the **kernel** parameter of the `SVC` class in **sklearn**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Training the algorithm on the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=20, break_ties=False, cache_size=200, class_weight=None, coef0=10,\n",
" decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Instantiate SVC as \"svm_clf\" with kernel=\"poly\", C=20 (the regularization parameter that controls the margin),\n",
"# degree=3 for a polynomial kernel of degree 3, and coef0=10, the independent term of the kernel.\n",
"svm_clf = SVC(kernel=\"poly\", degree=3, coef0=10, C=20)\n",
"svm_clf.fit(X_train_reduced, y_train_num)"
]
},
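{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `kernel=\"poly\"`, scikit-learn evaluates the kernel $K(a, b) = (\\gamma \\, a \\cdot b + c_0)^d$, where $d$ is `degree` and $c_0$ is `coef0`. A minimal sketch checks this formula against `sklearn.metrics.pairwise.polynomial_kernel` on two toy points. The points and the value of `toy_gamma` are assumptions chosen by hand for illustration; the classifier above uses `gamma=\"scale\"`, which is derived from the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics.pairwise import polynomial_kernel\n",
"\n",
"# Two toy points with two features each (hypothetical values)\n",
"toy_u = np.array([[0.2, 10.0]])\n",
"toy_v = np.array([[0.5, 20.0]])\n",
"toy_gamma, toy_coef0, toy_degree = 0.1, 10, 3  # gamma is fixed by hand for this sketch\n",
"\n",
"# Kernel value computed directly from the formula\n",
"toy_manual = (toy_gamma * toy_u @ toy_v.T + toy_coef0) ** toy_degree\n",
"\n",
"# Same value computed by scikit-learn\n",
"toy_library = polynomial_kernel(toy_u, toy_v, degree=toy_degree, gamma=toy_gamma, coef0=toy_coef0)\n",
"\n",
"print(np.allclose(toy_manual, toy_library))"
]
},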
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Plotting the decision boundary**"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x360 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, axes = plt.subplots(ncols=2, figsize=(15,5), sharey=True)\n",
"plt.sca(axes[0])\n",
"plot_dataset(X_train_reduced.values, y_train_num)\n",
"plot_predictions(svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.sca(axes[1])\n",
"plot_predictions(svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the reduced validation set\n",
"y_pred = svm_clf.predict(X_val_reduced)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8249238062986793\n"
]
}
],
"source": [
"# The F1 score on the validation set is 0.8249\n",
"print(\"F1 Score:\", f1_score(y_val_num, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the full dataset**"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=40, break_ties=False, cache_size=200, class_weight=None, coef0=10,\n",
" decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"svm_clf = SVC(kernel=\"poly\", degree=3, coef0=10, C=40)\n",
"svm_clf.fit(X_train_prep, y_train_num)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the validation set\n",
"y_pred = svm_clf.predict(X_val_prep)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.9715984147952443\n"
]
}
],
"source": [
"# F1 score on the full validation set is about 0.972 (harmonic mean of precision and recall, not the share of correct predictions)\n",
"print(\"F1 Score:\", f1_score(y_val_num, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.3. Gaussian Kernel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Training the algorithm with the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('scaler',\n",
" RobustScaler(copy=True, quantile_range=(25.0, 75.0),\n",
" with_centering=True, with_scaling=True)),\n",
" ('svm_clf',\n",
" SVC(C=1000, break_ties=False, cache_size=200,\n",
" class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma=0.5,\n",
" kernel='rbf', max_iter=-1, probability=False,\n",
" random_state=None, shrinking=True, tol=0.001,\n",
" verbose=False))],\n",
" verbose=False)"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use a Pipeline to scale the features before they reach the classifier.\n",
"\n",
"# \"scaler\", RobustScaler() -> scales the features using the median and the interquartile range.\n",
"\n",
"# \"svm_clf\", SVC(kernel=\"rbf\", gamma=0.5, C=1000) -> kernel=\"rbf\" selects the Gaussian (RBF) kernel.\n",
"\n",
"rbf_kernel_svm_clf = Pipeline([\n",
" (\"scaler\", RobustScaler()),\n",
" (\"svm_clf\", SVC(kernel=\"rbf\", gamma=0.5, C=1000))\n",
" ])\n",
"\n",
"rbf_kernel_svm_clf.fit(X_train_reduced, y_train_num)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Plotting the decision boundary**"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x360 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, axes = plt.subplots(ncols=2, figsize=(15,5), sharey=True)\n",
"plt.sca(axes[0])\n",
"plot_dataset(X_train_reduced.values, y_train_num)\n",
"plot_predictions(rbf_kernel_svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.sca(axes[1])\n",
"plot_predictions(rbf_kernel_svm_clf, [0, 1, -100, 250])\n",
"plt.xlabel(\"domainUrlRatio\", fontsize=11)\n",
"plt.ylabel(\"domainlength\", fontsize=11)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the reduced validation set\n",
"y_pred = rbf_kernel_svm_clf.predict(X_val_reduced)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8617363344051447\n"
]
}
],
"source": [
"# F1 score on the reduced validation set is about 0.862 (harmonic mean of precision and recall, not the share of correct predictions)\n",
"print(\"F1 Score:\", f1_score(y_val_num, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Prediction with the full dataset**"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('scaler',\n",
" RobustScaler(copy=True, quantile_range=(25.0, 75.0),\n",
" with_centering=True, with_scaling=True)),\n",
" ('svm_clf',\n",
" SVC(C=1000, break_ties=False, cache_size=200,\n",
" class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma=0.05,\n",
" kernel='rbf', max_iter=-1, probability=False,\n",
" random_state=None, shrinking=True, tol=0.001,\n",
" verbose=False))],\n",
" verbose=False)"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rbf_kernel_svm_clf = Pipeline([\n",
" (\"scaler\", RobustScaler()),\n",
" (\"svm_clf\", SVC(kernel=\"rbf\", gamma=0.05, C=1000))\n",
" ])\n",
"\n",
"rbf_kernel_svm_clf.fit(X_train_prep, y_train_num)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the full validation set\n",
"y_pred = rbf_kernel_svm_clf.predict(X_val_prep)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.9640522875816993\n"
]
}
],
"source": [
"# F1 score on the full validation set is about 0.964 (harmonic mean of precision and recall, not the share of correct predictions)\n",
"print(\"F1 Score:\", f1_score(y_val_num, y_pred))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
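
The notebook reports an F1 score for each model. As a minimal sketch of what that number measures, the block below computes binary F1 by hand on made-up labels (not the notebook's data) and compares it with plain accuracy. The helper names `confusion_counts`, `f1_binary`, and `accuracy` are illustrative assumptions, not scikit-learn functions, and returning 0.0 when there are no positives at all is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    # Assumption for this sketch: report 0.0 when F1 is undefined.
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy labels: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print("Accuracy:", accuracy(y_true, y_pred))
print("F1 Score:", f1_binary(y_true, y_pred))
# Swapping the arguments exchanges FP and FN, so binary F1 is unchanged.
print("F1 (swapped):", f1_binary(y_pred, y_true))
```

On these toy labels, 8 of 10 predictions are correct, yet F1 is only 2/3, because true negatives do not enter the F1 formula. Binary F1 happens to be the same if the two arguments are swapped, but scikit-learn documents the order as `f1_score(y_true, y_pred)`, and the order does matter for other metrics such as precision and recall.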