@GabrielCzar
Last active June 25, 2018 03:36
Machine Learning Final Project - Salary Prediction for Any UK Job Ad
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BsX4l60fEwAa"
},
"source": [
"# Job Salary Prediction\n",
"_Predict the salary of any UK job ad based on its contents_\n",
"\n",
"### Job Data\n",
"\n",
"- **Id**: Identificador para cada job.\n",
"\n",
"- **Title**: Texto livre com o titulo ou resumo da vaga.\n",
"\n",
"- **FullDescription**: Descrição da vaga sem qualquer informação salarial.\n",
"\n",
"- **LocationRaw**: Localização da vaga em texto livre.\n",
"\n",
"- **LocationNormalized**: Localização aproximada a partir da convesao do texto livre.\n",
"\n",
"- **ContractType**: full_time ou part_time.\n",
"\n",
"- **ContractTime**: permanent or contract.\n",
"\n",
"- **Company**: Nome da empresa.\n",
"\n",
"- **Category**: Qual das 30 categorias de trabalho padrão esse anúncio se encaixa, inferida de uma maneira muito confusa com base na origem da origem do anúncio. Sabemos que há muito barulho e erro nesse campo.\n",
"\n",
"- **SalaryRaw**: Descrição salarial em texto livre.\n",
"\n",
"- **SalaryNormalised**: Salario bruto anual. Valor que estamos tentando prever.\n",
"\n",
"- **SourceName**: Nome do site ou anunciante da vaga.\n",
"\n",
"### Location Tree\n",
"\n",
"Este é um conjunto de dados suplementares que descreve o relacionamento hierárquico entre os diferentes locais normalizados mostrados nos dados do trabalho. É provável que existam relações significativas entre os salários dos empregos em uma área geográfica semelhante, por exemplo, os salários médios em Londres e no Sudeste são mais altos do que no resto do Reino Unido.\n",
"\n",
"### Saida\n",
"\n",
"\n",
" Id,SalaryNormalized\n",
" 13656201,36205\n",
" 14663195,74570\n",
" 16530664,31910.50\n",
" ... \n",
" \n",
"### Sizes\n",
"\n",
"- Train:\n",
" - 421M \n",
" - 244768 entries\n",
"- Test: \n",
" - 206M\n",
" - 122463 entries\n",
" \n",
"### Problema\n",
"- Regressão Linear\n",
" - Determinar os salarios a partir de anúncios\n",
" \n",
"### Métricas\n",
"- Mean Squared Error – MSE\n",
"- Mean Absolute Error – MAE"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "gx6OgD6XEwAd"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1333,
"status": "ok",
"timestamp": 1529805951283,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "AA0eH24pEwAe",
"outputId": "7b086481-a9af-48e5-f9dc-bab2a63cb8c6"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.svm import SVR\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"from sklearn.model_selection import KFold, cross_validate\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression\n",
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n",
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cLl3SdOOEwAh"
},
"source": [
"## Dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1837,
"status": "ok",
"timestamp": 1529800925284,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "jJyRejGsGJ78",
"outputId": "f37fe0c7-41b3-4850-ae3b-f457e006c051"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 811M\r\n",
"drwxr-xr-x 4 unknown unknown 4,0K jun 24 16:18 .\r\n",
"drwxr-xr-x 7 unknown unknown 4,0K jun 19 01:34 ..\r\n",
"drwxr-xr-x 8 unknown unknown 4,0K jun 19 01:37 .git\r\n",
"-rw-r--r-- 1 unknown unknown 19 jun 17 18:09 .gitignore\r\n",
"drwxr-xr-x 2 unknown unknown 4,0K jun 23 23:38 .ipynb_checkpoints\r\n",
"-rw-r--r-- 1 unknown unknown 108K jun 24 16:18 Job Salary Prediction.ipynb\r\n",
"-rw-r--r-- 1 unknown unknown 111K jun 24 01:47 Job_Salary_Prediction__v1.ipynb\r\n",
"-rw-r--r-- 1 unknown unknown 108K jun 24 04:30 Job_Salary_Prediction__v2.ipynb\r\n",
"-rw-r--r-- 1 unknown unknown 161K jun 18 02:07 List_12__Clustering.ipynb\r\n",
"-rw-r--r-- 1 unknown unknown 376K jun 19 01:31 List_13__Clusterization_Hierarchical.ipynb\r\n",
"-rw-r--r-- 1 unknown unknown 216 jun 18 01:16 README.md\r\n",
"-rw-r--r-- 1 unknown unknown 206M fev 21 2013 Test_rev1.csv\r\n",
"-rw-r--r-- 1 unknown unknown 62M jun 23 23:55 Test_rev1.zip\r\n",
"-rw-r--r-- 1 unknown unknown 421M fev 21 2013 Train_rev1.csv\r\n",
"-rw-r--r-- 1 unknown unknown 123M jun 23 23:49 Train_rev1.zip\r\n"
]
}
],
"source": [
"!ls -lha"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df_job_data = pd.read_csv('Train_rev1.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"df_test_rev1 = pd.read_csv('Test_rev1.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "PRC1vAQrEwAr"
},
"source": [
"## Informações"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bQe_4GOHEwAr"
},
"source": [
"### Job Data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 216
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3211,
"status": "ok",
"timestamp": 1529801029609,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "GozC7kFvEwAs",
"outputId": "825320fc-27e0-4ece-8a9e-773c6e668fbb"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Title</th>\n",
" <th>FullDescription</th>\n",
" <th>LocationRaw</th>\n",
" <th>LocationNormalized</th>\n",
" <th>ContractType</th>\n",
" <th>ContractTime</th>\n",
" <th>Company</th>\n",
" <th>Category</th>\n",
" <th>SalaryRaw</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>Engineering Systems Analyst</td>\n",
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n",
" <td>Dorking, Surrey, Surrey</td>\n",
" <td>Dorking</td>\n",
" <td>NaN</td>\n",
" <td>permanent</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>Engineering Jobs</td>\n",
" <td>20000 - 30000/annum 20-30K</td>\n",
" <td>25000</td>\n",
" <td>cv-library.co.uk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>Stress Engineer Glasgow</td>\n",
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n",
" <td>Glasgow, Scotland, Scotland</td>\n",
" <td>Glasgow</td>\n",
" <td>NaN</td>\n",
" <td>permanent</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>Engineering Jobs</td>\n",
" <td>25000 - 35000/annum 25-35K</td>\n",
" <td>30000</td>\n",
" <td>cv-library.co.uk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Title \\\n",
"0 12612628 Engineering Systems Analyst \n",
"1 12612830 Stress Engineer Glasgow \n",
"\n",
" FullDescription \\\n",
"0 Engineering Systems Analyst Dorking Surrey Sal... \n",
"1 Stress Engineer Glasgow Salary **** to **** We... \n",
"\n",
" LocationRaw LocationNormalized ContractType ContractTime \\\n",
"0 Dorking, Surrey, Surrey Dorking NaN permanent \n",
"1 Glasgow, Scotland, Scotland Glasgow NaN permanent \n",
"\n",
" Company Category SalaryRaw \\\n",
"0 Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K \n",
"1 Gregory Martin International Engineering Jobs 25000 - 35000/annum 25-35K \n",
"\n",
" SalaryNormalized SourceName \n",
"0 25000 cv-library.co.uk \n",
"1 30000 cv-library.co.uk "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_job_data.head(n=2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 306
},
"colab_type": "code",
"executionInfo": {
"elapsed": 5385,
"status": "ok",
"timestamp": 1529801041548,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "Hvfq0CKlEwA0",
"outputId": "8184344d-3cb5-43bb-fb11-c8fbbc94eb5a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 244768 entries, 0 to 244767\n",
"Data columns (total 12 columns):\n",
"Id 244768 non-null int64\n",
"Title 244767 non-null object\n",
"FullDescription 244768 non-null object\n",
"LocationRaw 244768 non-null object\n",
"LocationNormalized 244768 non-null object\n",
"ContractType 65442 non-null object\n",
"ContractTime 180863 non-null object\n",
"Company 212338 non-null object\n",
"Category 244768 non-null object\n",
"SalaryRaw 244768 non-null object\n",
"SalaryNormalized 244768 non-null int64\n",
"SourceName 244767 non-null object\n",
"dtypes: int64(2), object(10)\n",
"memory usage: 22.4+ MB\n"
]
}
],
"source": [
"df_job_data.info()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f63273053c8>]],\n",
" dtype=object)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1008x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_job_data.hist(column='SalaryNormalized', figsize=(14,6))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>SalaryNormalized</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>2.447680e+05</td>\n",
" <td>244768.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>6.970142e+07</td>\n",
" <td>34122.577576</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.129813e+06</td>\n",
" <td>17640.543124</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.261263e+07</td>\n",
" <td>5000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>6.869550e+07</td>\n",
" <td>21500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>6.993700e+07</td>\n",
" <td>30000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>7.162606e+07</td>\n",
" <td>42500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.270524e+07</td>\n",
" <td>200000.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id SalaryNormalized\n",
"count 2.447680e+05 244768.000000\n",
"mean 6.970142e+07 34122.577576\n",
"std 3.129813e+06 17640.543124\n",
"min 1.261263e+07 5000.000000\n",
"25% 6.869550e+07 21500.000000\n",
"50% 6.993700e+07 30000.000000\n",
"75% 7.162606e+07 42500.000000\n",
"max 7.270524e+07 200000.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_job_data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "O5pvNAk2EwA6"
},
"source": [
"### Test"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 162
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2533,
"status": "ok",
"timestamp": 1529801091483,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "eLuOUoyREwA6",
"outputId": "92b02f00-f266-44ec-8661-23fd9d420996"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Title</th>\n",
" <th>FullDescription</th>\n",
" <th>LocationRaw</th>\n",
" <th>LocationNormalized</th>\n",
" <th>ContractType</th>\n",
" <th>ContractTime</th>\n",
" <th>Company</th>\n",
" <th>Category</th>\n",
" <th>SourceName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11888454</td>\n",
" <td>Business Development Manager</td>\n",
" <td>The Company: Our client is a national training...</td>\n",
" <td>Tyne Wear, North East</td>\n",
" <td>Newcastle Upon Tyne</td>\n",
" <td>NaN</td>\n",
" <td>permanent</td>\n",
" <td>Asset Appointments</td>\n",
" <td>Teaching Jobs</td>\n",
" <td>cv-library.co.uk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>11988350</td>\n",
" <td>Internal Account Manager</td>\n",
" <td>The Company: Founded in **** our client is a U...</td>\n",
" <td>Tyne and Wear, North East</td>\n",
" <td>Newcastle Upon Tyne</td>\n",
" <td>NaN</td>\n",
" <td>permanent</td>\n",
" <td>Asset Appointments</td>\n",
" <td>Consultancy Jobs</td>\n",
" <td>cv-library.co.uk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Title \\\n",
"0 11888454 Business Development Manager \n",
"1 11988350 Internal Account Manager \n",
"\n",
" FullDescription \\\n",
"0 The Company: Our client is a national training... \n",
"1 The Company: Founded in **** our client is a U... \n",
"\n",
" LocationRaw LocationNormalized ContractType ContractTime \\\n",
"0 Tyne Wear, North East Newcastle Upon Tyne NaN permanent \n",
"1 Tyne and Wear, North East Newcastle Upon Tyne NaN permanent \n",
"\n",
" Company Category SourceName \n",
"0 Asset Appointments Teaching Jobs cv-library.co.uk \n",
"1 Asset Appointments Consultancy Jobs cv-library.co.uk "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test_rev1.head(n=2)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 272
},
"colab_type": "code",
"executionInfo": {
"elapsed": 741,
"status": "ok",
"timestamp": 1529801095157,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "23KJVhIKEwA_",
"outputId": "e92fd119-9ad8-4f17-f9f1-c2cf9736e848"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 122463 entries, 0 to 122462\n",
"Data columns (total 10 columns):\n",
"Id 122463 non-null int64\n",
"Title 122463 non-null object\n",
"FullDescription 122463 non-null object\n",
"LocationRaw 122463 non-null object\n",
"LocationNormalized 122463 non-null object\n",
"ContractType 33013 non-null object\n",
"ContractTime 90702 non-null object\n",
"Company 106202 non-null object\n",
"Category 122463 non-null object\n",
"SourceName 122463 non-null object\n",
"dtypes: int64(1), object(9)\n",
"memory usage: 9.3+ MB\n"
]
}
],
"source": [
"df_test_rev1.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OBLh0PIxEwBF"
},
"source": [
"## Pré-processamento"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 955,
"status": "ok",
"timestamp": 1529801099122,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "aqPyCLOfEwBG",
"outputId": "e842d9b0-3453-4c2b-fe3f-6c43ffce7024"
},
"outputs": [],
"source": [
"def normalizeTextField(df, field):\n",
" vectorizer = CountVectorizer(max_features=100)\n",
" fields = vectorizer.fit_transform(df[field]).toarray()\n",
" # Generate field names\n",
" fcols = np.vectorize(lambda x: field + str(x))(np.arange(2))\n",
" # Reduz a dimensionalidade para 2 \n",
" pca = PCA(n_components = 2)\n",
" _df = pd.DataFrame(pca.fit_transform(fields), columns=fcols)\n",
" # Concatena o dataframe com o novo\n",
" df = pd.concat([df, _df], join ='inner', axis=1)\n",
" del df[field]\n",
" return df"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OHzogV28EwBJ"
},
"source": [
"### SalaryRaw"
]
},
{
"cell_type": "code",
"execution_count": 202,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1138,
"status": "ok",
"timestamp": 1529801103174,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "Tf_wd3knEwBK",
"outputId": "f9844ff9-02c1-4641-9498-8250636e6d09"
},
"outputs": [],
"source": [
"del df_job_data['SalaryRaw']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SUuskyQsEwBP"
},
"source": [
"### Remove ContractType"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SkLiRtjpEwBP"
},
"source": [
"Grande quantidade de valores null"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2707,
"status": "ok",
"timestamp": 1529801107862,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "PwsCuAoyEwBQ",
"outputId": "f083dfd4-58a1-4bbc-efcf-cb7b42e6ae3d"
},
"outputs": [],
"source": [
"del df_job_data['ContractType']\n",
"del df_test_rev1['ContractType']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TuGX7DRrEwBW"
},
"source": [
"### Remove ContractTime"
]
},
{
"cell_type": "code",
"execution_count": 204,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 887,
"status": "ok",
"timestamp": 1529801110023,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "L7JlYf_dEwBY",
"outputId": "6f32ac6e-6608-4fa6-b977-8059eae0b64a"
},
"outputs": [],
"source": [
"del df_job_data['ContractTime']\n",
"del df_test_rev1['ContractTime']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3qa4BYJUEwBb"
},
"source": [
"### Removendo Category"
]
},
{
"cell_type": "code",
"execution_count": 205,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 738,
"status": "ok",
"timestamp": 1529801113956,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "9QF_BvwYEwBe",
"outputId": "916c7390-15a2-4db6-f7d5-728b75d8028b"
},
"outputs": [],
"source": [
"del df_job_data['Category']\n",
"del df_test_rev1['Category']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iILrtpDxEwBi"
},
"source": [
"### Removendo Location Raw"
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 963,
"status": "ok",
"timestamp": 1529801118238,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "XYyvZMbjEwBi",
"outputId": "261cae9d-276a-4549-bd7d-8dbbbd23d067"
},
"outputs": [],
"source": [
"del df_job_data['LocationRaw']\n",
"del df_test_rev1['LocationRaw']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "uIOf83lkEwBm"
},
"source": [
"### Company"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {},
"outputs": [],
"source": [
"df_job_data['Company'].replace(value='NULL', to_replace=np.nan, inplace=True)\n",
"df_test_rev1['Company'].replace(value='NULL', to_replace=np.nan, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Gregory Martin International', 'Indigo 21 Ltd',\n",
" 'Code Blue Recruitment', ..., 'Jobs North ',\n",
" 'National Army Museum', 'DMC Healthcare'], dtype=object)"
]
},
"execution_count": 208,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_job_data['Company'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(20813,)"
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_job_data['Company'].unique().shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JeIbKYJCEwBz"
},
"source": [
"### Removendo linhas com valores NULL"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 778,
"status": "ok",
"timestamp": 1529801127234,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "1goRZOL-EwB1",
"outputId": "e2b250f2-03d5-48f8-ddf4-08d0f7764ce8"
},
"outputs": [],
"source": [
"df_job_data.dropna(subset=['Title'], inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 748,
"status": "ok",
"timestamp": 1529801129354,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "MEUOmmdmEwB4",
"outputId": "ced2f425-5e70-476e-cb98-17dfbcd74d78"
},
"outputs": [],
"source": [
"df_job_data.dropna(subset=['SourceName'], inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FDBRPSj_EwB8"
},
"source": [
"### Retirando Label"
]
},
{
"cell_type": "code",
"execution_count": 213,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 789,
"status": "ok",
"timestamp": 1529801134604,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "8LftYmQUEwB8",
"outputId": "c0d28223-86eb-4b0f-889a-c8eceddfe985"
},
"outputs": [],
"source": [
"y = df_job_data['SalaryNormalized'].values"
]
},
{
"cell_type": "code",
"execution_count": 214,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 741,
"status": "ok",
"timestamp": 1529801137435,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "bHra0m_cEwCA",
"outputId": "8ef54773-01f3-4faa-a8e4-cae54008bf11"
},
"outputs": [
{
"data": {
"text/plain": [
"array([25000, 30000, 30000, ..., 22800, 22800, 42500])"
]
},
"execution_count": 214,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ztzrX_FOEwCE"
},
"source": [
"### Retirando IDS"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 820,
"status": "ok",
"timestamp": 1529801142077,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "iLsjAp48EwCF",
"outputId": "cedfc00b-981e-41c6-c794-5c6c0baed4eb"
},
"outputs": [],
"source": [
"idx_job = df_job_data['Id'].values"
]
},
{
"cell_type": "code",
"execution_count": 216,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1004,
"status": "ok",
"timestamp": 1529801144276,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "2WC4Yn_-EwCI",
"outputId": "07086f3f-25a5-4a5e-bb4a-a5b1ef476189"
},
"outputs": [
{
"data": {
"text/plain": [
"array([12612628, 12612830, 12612844, ..., 72705213, 72705216, 72705235])"
]
},
"execution_count": 216,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx_job"
]
},
{
"cell_type": "code",
"execution_count": 217,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 762,
"status": "ok",
"timestamp": 1529801146895,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "QpyY_VXBEwCM",
"outputId": "e9977d70-2ff9-42e6-c5b3-0a39274c318e"
},
"outputs": [],
"source": [
"idx_test = df_test_rev1['Id'].values"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 732,
"status": "ok",
"timestamp": 1529801149196,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "noj4zYiaEwCT",
"outputId": "74e041cb-20cc-4c37-8778-c3c51041d3d9"
},
"outputs": [
{
"data": {
"text/plain": [
"array([11888454, 11988350, 12612558, ..., 72705210, 72705214, 72705218])"
]
},
"execution_count": 218,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx_test"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "slLrezFsEwCZ"
},
"source": [
"### Juntando conteudo"
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 765,
"status": "ok",
"timestamp": 1529801154922,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "c4kx1jyOEwCa",
"outputId": "55517387-f516-4921-f719-5cb541f8b58c"
},
"outputs": [
{
"data": {
"text/plain": [
"(244766, 7)"
]
},
"execution_count": 219,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_job_tuple = df_job_data.shape\n",
"df_job_tuple"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 736,
"status": "ok",
"timestamp": 1529801157401,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "pZ2qSLGwEwCf",
"outputId": "b6916215-2bd4-42fe-dcd1-b37875b26a1e"
},
"outputs": [
{
"data": {
"text/plain": [
"(122463, 6)"
]
},
"execution_count": 220,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test_tuple = df_test_rev1.shape\n",
"df_test_tuple"
]
},
{
"cell_type": "code",
"execution_count": 221,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 755,
"status": "ok",
"timestamp": 1529801161403,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "eRG7FgLoEwCl",
"outputId": "0186aee5-5356-4a00-9be1-55b35d7908e6"
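},
"outputs": [],
"source": [
"# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent (assumes pandas is imported as pd)\n",
"df = pd.concat([df_job_data, df_test_rev1], sort=False)"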
]
},
{
"cell_type": "code",
"execution_count": 222,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 779,
"status": "ok",
"timestamp": 1529801163957,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "c2ZxM7ThEwCo",
"outputId": "f206b2c9-60b8-4f31-b590-899ab1321e55"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 7)"
]
},
"execution_count": 222,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WUHN-XxLEwCv"
},
"source": [
"#### LocationNormalized"
]
},
{
"cell_type": "code",
"execution_count": 223,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3215,
"status": "ok",
"timestamp": 1529801169844,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "iQKRzS_DEwCv",
"outputId": "c008de64-1b60-46e4-8f64-536af8a08e0b"
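},
"outputs": [],
"source": [
"# Replace the text column with two numeric columns, LocationNormalized0 and LocationNormalized1\n",
"df = normalizeTextField(df, 'LocationNormalized')"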
]
},
{
"cell_type": "code",
"execution_count": 224,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 745,
"status": "ok",
"timestamp": 1529801171609,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "q2FnUvsLEwC0",
"outputId": "cb9bef89-0efd-43f9-88f5-b2a0b1c3eafa"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 8)"
]
},
"execution_count": 224,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 225,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Title</th>\n",
" <th>FullDescription</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>Engineering Systems Analyst</td>\n",
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>Stress Engineer Glasgow</td>\n",
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12612844</td>\n",
" <td>Modelling and simulation analyst</td>\n",
" <td>Mathematical Modeller / Simulation Analyst / O...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.120516</td>\n",
" <td>-0.241914</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12613049</td>\n",
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n",
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>27500.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>12613647</td>\n",
" <td>Pioneer, Miser Engineering Systems Analyst</td>\n",
" <td>Pioneer, Miser Engineering Systems Analyst Do...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Title \\\n",
"0 12612628 Engineering Systems Analyst \n",
"1 12612830 Stress Engineer Glasgow \n",
"2 12612844 Modelling and simulation analyst \n",
"3 12613049 Engineering Systems Analyst / Mathematical Mod... \n",
"4 12613647 Pioneer, Miser Engineering Systems Analyst \n",
"\n",
" FullDescription \\\n",
"0 Engineering Systems Analyst Dorking Surrey Sal... \n",
"1 Stress Engineer Glasgow Salary **** to **** We... \n",
"2 Mathematical Modeller / Simulation Analyst / O... \n",
"3 Engineering Systems Analyst / Mathematical Mod... \n",
"4 Pioneer, Miser Engineering Systems Analyst Do... \n",
"\n",
" Company SalaryNormalized SourceName \\\n",
"0 Gregory Martin International 25000.0 cv-library.co.uk \n",
"1 Gregory Martin International 30000.0 cv-library.co.uk \n",
"2 Gregory Martin International 30000.0 cv-library.co.uk \n",
"3 Gregory Martin International 27500.0 cv-library.co.uk \n",
"4 Gregory Martin International 25000.0 cv-library.co.uk \n",
"\n",
" LocationNormalized0 LocationNormalized1 \n",
"0 -0.116790 -0.229172 \n",
"1 -0.118995 -0.237572 \n",
"2 -0.120516 -0.241914 \n",
"3 -0.122604 -0.249312 \n",
"4 -0.122604 -0.249312 "
]
},
"execution_count": 225,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tL-laH_pEwC-"
},
"source": [
"#### Title"
]
},
{
"cell_type": "code",
"execution_count": 226,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 4337,
"status": "ok",
"timestamp": 1529801179499,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "TpmwNKR_EwC_",
"outputId": "f83246de-8b37-4d08-8a4b-f69de188cfcc"
},
"outputs": [],
"source": [
"df = normalizeTextField(df, 'Title')"
]
},
{
"cell_type": "code",
"execution_count": 227,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 991,
"status": "ok",
"timestamp": 1529801182206,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "kB93el4PEwDC",
"outputId": "223e13d3-1c58-4d59-d82b-8c15f5517cc2"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 9)"
]
},
"execution_count": 227,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 228,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>FullDescription</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12612844</td>\n",
" <td>Mathematical Modeller / Simulation Analyst / O...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.120516</td>\n",
" <td>-0.241914</td>\n",
" <td>-0.204017</td>\n",
" <td>0.064045</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12613049</td>\n",
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>27500.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>12613647</td>\n",
" <td>Pioneer, Miser Engineering Systems Analyst Do...</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id FullDescription \\\n",
"0 12612628 Engineering Systems Analyst Dorking Surrey Sal... \n",
"1 12612830 Stress Engineer Glasgow Salary **** to **** We... \n",
"2 12612844 Mathematical Modeller / Simulation Analyst / O... \n",
"3 12613049 Engineering Systems Analyst / Mathematical Mod... \n",
"4 12613647 Pioneer, Miser Engineering Systems Analyst Do... \n",
"\n",
" Company SalaryNormalized SourceName \\\n",
"0 Gregory Martin International 25000.0 cv-library.co.uk \n",
"1 Gregory Martin International 30000.0 cv-library.co.uk \n",
"2 Gregory Martin International 30000.0 cv-library.co.uk \n",
"3 Gregory Martin International 27500.0 cv-library.co.uk \n",
"4 Gregory Martin International 25000.0 cv-library.co.uk \n",
"\n",
" LocationNormalized0 LocationNormalized1 Title0 Title1 \n",
"0 -0.116790 -0.229172 -0.211709 0.010168 \n",
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n",
"2 -0.120516 -0.241914 -0.204017 0.064045 \n",
"3 -0.122604 -0.249312 -0.211709 0.010168 \n",
"4 -0.122604 -0.249312 -0.211709 0.010168 "
]
},
"execution_count": 228,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xDIMGEN7EwDG"
},
"source": [
"#### FullDescription"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 68085,
"status": "ok",
"timestamp": 1529801253123,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "nDp6SmCVEwDG",
"outputId": "e6f242f1-591f-47f1-8454-d2e2eb75ee00"
},
"outputs": [],
"source": [
"df = normalizeTextField(df, 'FullDescription')"
]
},
{
"cell_type": "code",
"execution_count": 230,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2471,
"status": "ok",
"timestamp": 1529801284445,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "jOBnnMV8EwDK",
"outputId": "a027949e-77b5-4018-9e74-15a7a3dddfda"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 10)"
]
},
"execution_count": 230,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 231,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.530014</td>\n",
" <td>2.881801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" <td>1.115408</td>\n",
" <td>-2.899837</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12612844</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.120516</td>\n",
" <td>-0.241914</td>\n",
" <td>-0.204017</td>\n",
" <td>0.064045</td>\n",
" <td>-1.111251</td>\n",
" <td>2.198475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12613049</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>27500.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.890457</td>\n",
" <td>3.393423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>12613647</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>cv-library.co.uk</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-19.451188</td>\n",
" <td>2.751042</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName \\\n",
"0 12612628 Gregory Martin International 25000.0 cv-library.co.uk \n",
"1 12612830 Gregory Martin International 30000.0 cv-library.co.uk \n",
"2 12612844 Gregory Martin International 30000.0 cv-library.co.uk \n",
"3 12613049 Gregory Martin International 27500.0 cv-library.co.uk \n",
"4 12613647 Gregory Martin International 25000.0 cv-library.co.uk \n",
"\n",
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n",
"0 -0.116790 -0.229172 -0.211709 0.010168 \n",
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n",
"2 -0.120516 -0.241914 -0.204017 0.064045 \n",
"3 -0.122604 -0.249312 -0.211709 0.010168 \n",
"4 -0.122604 -0.249312 -0.211709 0.010168 \n",
"\n",
" FullDescription0 FullDescription1 \n",
"0 -18.530014 2.881801 \n",
"1 1.115408 -2.899837 \n",
"2 -1.111251 2.198475 \n",
"3 -18.890457 3.393423 \n",
"4 -19.451188 2.751042 "
]
},
"execution_count": 231,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3UqZ9i79EwDN"
},
"source": [
"#### SourceName"
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 819,
"status": "ok",
"timestamp": 1529801289739,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "Wakkvtn4EwDV",
"outputId": "39e2df5b-a80a-4d86-89da-8c83a777754b"
},
"outputs": [],
"source": [
"# Label-encode SourceName: each value becomes its index in the sorted array of unique names\n",
"_, sources = np.unique(df['SourceName'], return_inverse=True)"
]
},
{
"cell_type": "code",
"execution_count": 233,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3545,
"status": "ok",
"timestamp": 1529801294803,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "W_FwJ5TYEwDZ",
"outputId": "486c3805-18bc-40d4-e525-002b70e3453c"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229,)"
]
},
"execution_count": 233,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sources.shape"
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 4481,
"status": "ok",
"timestamp": 1529801299695,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "JN1a4NevEwDj",
"outputId": "2c7e337a-0f1b-4ec8-cfcd-1f605d426501"
},
"outputs": [],
"source": [
"df['SourceName'] = sources"
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 702,
"status": "ok",
"timestamp": 1529801300859,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "BBjy-ZqbEwDo",
"outputId": "639a972b-b8d7-49e6-e630-f6a234e1cb57"
},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 10)"
]
},
"execution_count": 235,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 236,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 160
},
"colab_type": "code",
"executionInfo": {
"elapsed": 749,
"status": "ok",
"timestamp": 1529801304114,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "CTVv0buBEwDw",
"outputId": "5b58b9d4-c607-4c6d-a6a4-75a7d04bce92"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>25000.0</td>\n",
" <td>42</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.530014</td>\n",
" <td>2.881801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>Gregory Martin International</td>\n",
" <td>30000.0</td>\n",
" <td>42</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" <td>1.115408</td>\n",
" <td>-2.899837</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName \\\n",
"0 12612628 Gregory Martin International 25000.0 42 \n",
"1 12612830 Gregory Martin International 30000.0 42 \n",
"\n",
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n",
"0 -0.116790 -0.229172 -0.211709 0.010168 \n",
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n",
"\n",
" FullDescription0 FullDescription1 \n",
"0 -18.530014 2.881801 \n",
"1 1.115408 -2.899837 "
]
},
"execution_count": 236,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(n=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Company"
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {},
"outputs": [],
"source": [
"# Label-encode Company the same way: integer index into the sorted unique company names\n",
"_, companies = np.unique(df['Company'], return_inverse=True)"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(367229,)"
]
},
"execution_count": 238,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"companies.shape"
]
},
{
"cell_type": "code",
"execution_count": 239,
"metadata": {},
"outputs": [],
"source": [
"df['Company'] = companies"
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(367229, 10)"
]
},
"execution_count": 240,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>9229</td>\n",
" <td>25000.0</td>\n",
" <td>42</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.530014</td>\n",
" <td>2.881801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>9229</td>\n",
" <td>30000.0</td>\n",
" <td>42</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" <td>1.115408</td>\n",
" <td>-2.899837</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n",
"0 12612628 9229 25000.0 42 -0.116790 \n",
"1 12612830 9229 30000.0 42 -0.118995 \n",
"\n",
" LocationNormalized1 Title0 Title1 FullDescription0 FullDescription1 \n",
"0 -0.229172 -0.211709 0.010168 -18.530014 2.881801 \n",
"1 -0.237572 -0.379568 -0.578663 1.115408 -2.899837 "
]
},
"execution_count": 241,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(n=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Post-processing"
]
},
{
"cell_type": "code",
"execution_count": 242,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>12612628</td>\n",
" <td>9229</td>\n",
" <td>25000.0</td>\n",
" <td>42</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.530014</td>\n",
" <td>2.881801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12612830</td>\n",
" <td>9229</td>\n",
" <td>30000.0</td>\n",
" <td>42</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" <td>1.115408</td>\n",
" <td>-2.899837</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12612844</td>\n",
" <td>9229</td>\n",
" <td>30000.0</td>\n",
" <td>42</td>\n",
" <td>-0.120516</td>\n",
" <td>-0.241914</td>\n",
" <td>-0.204017</td>\n",
" <td>0.064045</td>\n",
" <td>-1.111251</td>\n",
" <td>2.198475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12613049</td>\n",
" <td>9229</td>\n",
" <td>27500.0</td>\n",
" <td>42</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.890457</td>\n",
" <td>3.393423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>12613647</td>\n",
" <td>9229</td>\n",
" <td>25000.0</td>\n",
" <td>42</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-19.451188</td>\n",
" <td>2.751042</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n",
"0 12612628 9229 25000.0 42 -0.116790 \n",
"1 12612830 9229 30000.0 42 -0.118995 \n",
"2 12612844 9229 30000.0 42 -0.120516 \n",
"3 12613049 9229 27500.0 42 -0.122604 \n",
"4 12613647 9229 25000.0 42 -0.122604 \n",
"\n",
" LocationNormalized1 Title0 Title1 FullDescription0 FullDescription1 \n",
"0 -0.229172 -0.211709 0.010168 -18.530014 2.881801 \n",
"1 -0.237572 -0.379568 -0.578663 1.115408 -2.899837 \n",
"2 -0.241914 -0.204017 0.064045 -1.111251 2.198475 \n",
"3 -0.249312 -0.211709 0.010168 -18.890457 3.393423 \n",
"4 -0.249312 -0.211709 0.010168 -19.451188 2.751042 "
]
},
"execution_count": 242,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 160
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1926,
"status": "ok",
"timestamp": 1529801314400,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "QB7-RNIHEwD4",
"outputId": "1cc11811-1a6a-4572-e7c7-6f37bdef14f9"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>122458</th>\n",
" <td>72703426</td>\n",
" <td>22483</td>\n",
" <td>NaN</td>\n",
" <td>95</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.140759</td>\n",
" <td>0.027805</td>\n",
" <td>-16.425155</td>\n",
" <td>3.326807</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122459</th>\n",
" <td>72703453</td>\n",
" <td>232</td>\n",
" <td>NaN</td>\n",
" <td>95</td>\n",
" <td>-0.118020</td>\n",
" <td>-0.233316</td>\n",
" <td>-0.148008</td>\n",
" <td>0.061885</td>\n",
" <td>-17.558738</td>\n",
" <td>2.838631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122460</th>\n",
" <td>72705210</td>\n",
" <td>14637</td>\n",
" <td>NaN</td>\n",
" <td>64</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.187463</td>\n",
" <td>0.364341</td>\n",
" <td>-11.138799</td>\n",
" <td>-0.978168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122461</th>\n",
" <td>72705214</td>\n",
" <td>14637</td>\n",
" <td>NaN</td>\n",
" <td>64</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>0.868984</td>\n",
" <td>-0.102670</td>\n",
" <td>-3.389519</td>\n",
" <td>-0.760346</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122462</th>\n",
" <td>72705218</td>\n",
" <td>14637</td>\n",
" <td>NaN</td>\n",
" <td>64</td>\n",
" <td>-0.118635</td>\n",
" <td>-0.235408</td>\n",
" <td>-0.168568</td>\n",
" <td>0.034974</td>\n",
" <td>-13.765711</td>\n",
" <td>-0.120907</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n",
"122458 72703426 22483 NaN 95 -0.116790 \n",
"122459 72703453 232 NaN 95 -0.118020 \n",
"122460 72705210 14637 NaN 64 -0.116790 \n",
"122461 72705214 14637 NaN 64 -0.116790 \n",
"122462 72705218 14637 NaN 64 -0.118635 \n",
"\n",
" LocationNormalized1 Title0 Title1 FullDescription0 \\\n",
"122458 -0.229172 -0.140759 0.027805 -16.425155 \n",
"122459 -0.233316 -0.148008 0.061885 -17.558738 \n",
"122460 -0.229172 -0.187463 0.364341 -11.138799 \n",
"122461 -0.229172 0.868984 -0.102670 -3.389519 \n",
"122462 -0.235408 -0.168568 0.034974 -13.765711 \n",
"\n",
" FullDescription1 \n",
"122458 3.326807 \n",
"122459 2.838631 \n",
"122460 -0.978168 \n",
"122461 -0.760346 \n",
"122462 -0.120907 "
]
},
"execution_count": 243,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3.672290e+05</td>\n",
" <td>367229.000000</td>\n",
" <td>244766.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" <td>367229.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>6.969881e+07</td>\n",
" <td>12360.855161</td>\n",
" <td>34122.192494</td>\n",
" <td>88.657734</td>\n",
" <td>-0.003500</td>\n",
" <td>-0.003753</td>\n",
" <td>-0.000547</td>\n",
" <td>0.002329</td>\n",
" <td>-0.026405</td>\n",
" <td>0.014674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.127609e+06</td>\n",
" <td>6570.361799</td>\n",
" <td>17639.753029</td>\n",
" <td>56.313850</td>\n",
" <td>0.456049</td>\n",
" <td>0.351458</td>\n",
" <td>0.429128</td>\n",
" <td>0.318787</td>\n",
" <td>12.385516</td>\n",
" <td>4.597004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.188845e+07</td>\n",
" <td>0.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.568546</td>\n",
" <td>-0.420773</td>\n",
" <td>-1.119124</td>\n",
" <td>-2.701985</td>\n",
" <td>-19.732022</td>\n",
" <td>-40.437940</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>6.869505e+07</td>\n",
" <td>7040.000000</td>\n",
" <td>21500.000000</td>\n",
" <td>42.000000</td>\n",
" <td>-0.124085</td>\n",
" <td>-0.236447</td>\n",
" <td>-0.204409</td>\n",
" <td>-0.110315</td>\n",
" <td>-8.602899</td>\n",
" <td>-2.505604</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>6.993552e+07</td>\n",
" <td>13860.000000</td>\n",
" <td>30000.000000</td>\n",
" <td>85.000000</td>\n",
" <td>-0.117893</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.171657</td>\n",
" <td>0.034974</td>\n",
" <td>-2.243902</td>\n",
" <td>0.173910</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>7.162515e+07</td>\n",
" <td>17047.000000</td>\n",
" <td>42500.000000</td>\n",
" <td>154.000000</td>\n",
" <td>-0.116790</td>\n",
" <td>0.107018</td>\n",
" <td>-0.130477</td>\n",
" <td>0.057478</td>\n",
" <td>5.772098</td>\n",
" <td>2.474907</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.270524e+07</td>\n",
" <td>24854.000000</td>\n",
" <td>200000.000000</td>\n",
" <td>168.000000</td>\n",
" <td>1.290674</td>\n",
" <td>0.648287</td>\n",
" <td>3.708054</td>\n",
" <td>3.296137</td>\n",
" <td>244.121250</td>\n",
" <td>58.723964</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName \\\n",
"count 3.672290e+05 367229.000000 244766.000000 367229.000000 \n",
"mean 6.969881e+07 12360.855161 34122.192494 88.657734 \n",
"std 3.127609e+06 6570.361799 17639.753029 56.313850 \n",
"min 1.188845e+07 0.000000 5000.000000 0.000000 \n",
"25% 6.869505e+07 7040.000000 21500.000000 42.000000 \n",
"50% 6.993552e+07 13860.000000 30000.000000 85.000000 \n",
"75% 7.162515e+07 17047.000000 42500.000000 154.000000 \n",
"max 7.270524e+07 24854.000000 200000.000000 168.000000 \n",
"\n",
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n",
"count 367229.000000 367229.000000 367229.000000 367229.000000 \n",
"mean -0.003500 -0.003753 -0.000547 0.002329 \n",
"std 0.456049 0.351458 0.429128 0.318787 \n",
"min -0.568546 -0.420773 -1.119124 -2.701985 \n",
"25% -0.124085 -0.236447 -0.204409 -0.110315 \n",
"50% -0.117893 -0.229172 -0.171657 0.034974 \n",
"75% -0.116790 0.107018 -0.130477 0.057478 \n",
"max 1.290674 0.648287 3.708054 3.296137 \n",
"\n",
" FullDescription0 FullDescription1 \n",
"count 367229.000000 367229.000000 \n",
"mean -0.026405 0.014674 \n",
"std 12.385516 4.597004 \n",
"min -19.732022 -40.437940 \n",
"25% -8.602899 -2.505604 \n",
"50% -2.243902 0.173910 \n",
"75% 5.772098 2.474907 \n",
"max 244.121250 58.723964 "
]
},
"execution_count": 244,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Company</th>\n",
" <th>SalaryNormalized</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Id</th>\n",
" <td>1.000000</td>\n",
" <td>-0.020986</td>\n",
" <td>0.047094</td>\n",
" <td>0.109891</td>\n",
" <td>0.032935</td>\n",
" <td>0.057275</td>\n",
" <td>0.002192</td>\n",
" <td>-0.002024</td>\n",
" <td>0.035829</td>\n",
" <td>0.004801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Company</th>\n",
" <td>-0.020986</td>\n",
" <td>1.000000</td>\n",
" <td>0.004974</td>\n",
" <td>0.027165</td>\n",
" <td>-0.007489</td>\n",
" <td>-0.017697</td>\n",
" <td>-0.003113</td>\n",
" <td>0.001284</td>\n",
" <td>-0.003085</td>\n",
" <td>0.004680</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SalaryNormalized</th>\n",
" <td>0.047094</td>\n",
" <td>0.004974</td>\n",
" <td>1.000000</td>\n",
" <td>0.123441</td>\n",
" <td>0.082108</td>\n",
" <td>0.050715</td>\n",
" <td>0.013384</td>\n",
" <td>-0.077149</td>\n",
" <td>0.030054</td>\n",
" <td>0.031389</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SourceName</th>\n",
" <td>0.109891</td>\n",
" <td>0.027165</td>\n",
" <td>0.123441</td>\n",
" <td>1.000000</td>\n",
" <td>0.017216</td>\n",
" <td>0.112476</td>\n",
" <td>0.049994</td>\n",
" <td>0.020802</td>\n",
" <td>0.071979</td>\n",
" <td>-0.021501</td>\n",
" </tr>\n",
" <tr>\n",
" <th>LocationNormalized0</th>\n",
" <td>0.032935</td>\n",
" <td>-0.007489</td>\n",
" <td>0.082108</td>\n",
" <td>0.017216</td>\n",
" <td>1.000000</td>\n",
" <td>0.000530</td>\n",
" <td>0.050502</td>\n",
" <td>0.044066</td>\n",
" <td>0.018854</td>\n",
" <td>0.003637</td>\n",
" </tr>\n",
" <tr>\n",
" <th>LocationNormalized1</th>\n",
" <td>0.057275</td>\n",
" <td>-0.017697</td>\n",
" <td>0.050715</td>\n",
" <td>0.112476</td>\n",
" <td>0.000530</td>\n",
" <td>1.000000</td>\n",
" <td>0.039818</td>\n",
" <td>0.016730</td>\n",
" <td>0.046324</td>\n",
" <td>-0.014547</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Title0</th>\n",
" <td>0.002192</td>\n",
" <td>-0.003113</td>\n",
" <td>0.013384</td>\n",
" <td>0.049994</td>\n",
" <td>0.050502</td>\n",
" <td>0.039818</td>\n",
" <td>1.000000</td>\n",
" <td>-0.004641</td>\n",
" <td>0.120983</td>\n",
" <td>-0.020667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Title1</th>\n",
" <td>-0.002024</td>\n",
" <td>0.001284</td>\n",
" <td>-0.077149</td>\n",
" <td>0.020802</td>\n",
" <td>0.044066</td>\n",
" <td>0.016730</td>\n",
" <td>-0.004641</td>\n",
" <td>1.000000</td>\n",
" <td>0.004257</td>\n",
" <td>-0.139567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FullDescription0</th>\n",
" <td>0.035829</td>\n",
" <td>-0.003085</td>\n",
" <td>0.030054</td>\n",
" <td>0.071979</td>\n",
" <td>0.018854</td>\n",
" <td>0.046324</td>\n",
" <td>0.120983</td>\n",
" <td>0.004257</td>\n",
" <td>1.000000</td>\n",
" <td>-0.002455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FullDescription1</th>\n",
" <td>0.004801</td>\n",
" <td>0.004680</td>\n",
" <td>0.031389</td>\n",
" <td>-0.021501</td>\n",
" <td>0.003637</td>\n",
" <td>-0.014547</td>\n",
" <td>-0.020667</td>\n",
" <td>-0.139567</td>\n",
" <td>-0.002455</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Company SalaryNormalized SourceName \\\n",
"Id 1.000000 -0.020986 0.047094 0.109891 \n",
"Company -0.020986 1.000000 0.004974 0.027165 \n",
"SalaryNormalized 0.047094 0.004974 1.000000 0.123441 \n",
"SourceName 0.109891 0.027165 0.123441 1.000000 \n",
"LocationNormalized0 0.032935 -0.007489 0.082108 0.017216 \n",
"LocationNormalized1 0.057275 -0.017697 0.050715 0.112476 \n",
"Title0 0.002192 -0.003113 0.013384 0.049994 \n",
"Title1 -0.002024 0.001284 -0.077149 0.020802 \n",
"FullDescription0 0.035829 -0.003085 0.030054 0.071979 \n",
"FullDescription1 0.004801 0.004680 0.031389 -0.021501 \n",
"\n",
" LocationNormalized0 LocationNormalized1 Title0 \\\n",
"Id 0.032935 0.057275 0.002192 \n",
"Company -0.007489 -0.017697 -0.003113 \n",
"SalaryNormalized 0.082108 0.050715 0.013384 \n",
"SourceName 0.017216 0.112476 0.049994 \n",
"LocationNormalized0 1.000000 0.000530 0.050502 \n",
"LocationNormalized1 0.000530 1.000000 0.039818 \n",
"Title0 0.050502 0.039818 1.000000 \n",
"Title1 0.044066 0.016730 -0.004641 \n",
"FullDescription0 0.018854 0.046324 0.120983 \n",
"FullDescription1 0.003637 -0.014547 -0.020667 \n",
"\n",
" Title1 FullDescription0 FullDescription1 \n",
"Id -0.002024 0.035829 0.004801 \n",
"Company 0.001284 -0.003085 0.004680 \n",
"SalaryNormalized -0.077149 0.030054 0.031389 \n",
"SourceName 0.020802 0.071979 -0.021501 \n",
"LocationNormalized0 0.044066 0.018854 0.003637 \n",
"LocationNormalized1 0.016730 0.046324 -0.014547 \n",
"Title0 -0.004641 0.120983 -0.020667 \n",
"Title1 1.000000 0.004257 -0.139567 \n",
"FullDescription0 0.004257 1.000000 -0.002455 \n",
"FullDescription1 -0.139567 -0.002455 1.000000 "
]
},
"execution_count": 248,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corr()"
]
},
{
"cell_type": "code",
"execution_count": 254,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Id 0.047094\n",
"Company 0.004974\n",
"SalaryNormalized 1.000000\n",
"SourceName 0.123441\n",
"LocationNormalized0 0.082108\n",
"LocationNormalized1 0.050715\n",
"Title0 0.013384\n",
"Title1 -0.077149\n",
"FullDescription0 0.030054\n",
"FullDescription1 0.031389\n",
"dtype: float64"
]
},
"execution_count": 254,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corrwith(df['SalaryNormalized'])"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f631ee06d30>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6333601be0>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f63335fa2b0>],\n",
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f631dcb7940>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631e24bc88>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6324e9e128>],\n",
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f631de40320>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631de78978>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6320624048>],\n",
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f63335f0d68>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f63335c4d68>,\n",
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631e3d9438>]],\n",
" dtype=object)"
]
},
"execution_count": 256,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 1008x864 with 12 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.hist(figsize=(14, 12))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-EhpTVryEwD9"
},
"source": [
"### Splitting Train and Test"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {},
"outputs": [],
"source": [
"del df['Id']\n",
"del df['SalaryNormalized']"
]
},
{
"cell_type": "code",
"execution_count": 258,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "qKGGNDhVVf4B"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Company</th>\n",
" <th>SourceName</th>\n",
" <th>LocationNormalized0</th>\n",
" <th>LocationNormalized1</th>\n",
" <th>Title0</th>\n",
" <th>Title1</th>\n",
" <th>FullDescription0</th>\n",
" <th>FullDescription1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9229</td>\n",
" <td>42</td>\n",
" <td>-0.116790</td>\n",
" <td>-0.229172</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.530014</td>\n",
" <td>2.881801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9229</td>\n",
" <td>42</td>\n",
" <td>-0.118995</td>\n",
" <td>-0.237572</td>\n",
" <td>-0.379568</td>\n",
" <td>-0.578663</td>\n",
" <td>1.115408</td>\n",
" <td>-2.899837</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9229</td>\n",
" <td>42</td>\n",
" <td>-0.120516</td>\n",
" <td>-0.241914</td>\n",
" <td>-0.204017</td>\n",
" <td>0.064045</td>\n",
" <td>-1.111251</td>\n",
" <td>2.198475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9229</td>\n",
" <td>42</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-18.890457</td>\n",
" <td>3.393423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9229</td>\n",
" <td>42</td>\n",
" <td>-0.122604</td>\n",
" <td>-0.249312</td>\n",
" <td>-0.211709</td>\n",
" <td>0.010168</td>\n",
" <td>-19.451188</td>\n",
" <td>2.751042</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Company SourceName LocationNormalized0 LocationNormalized1 Title0 \\\n",
"0 9229 42 -0.116790 -0.229172 -0.211709 \n",
"1 9229 42 -0.118995 -0.237572 -0.379568 \n",
"2 9229 42 -0.120516 -0.241914 -0.204017 \n",
"3 9229 42 -0.122604 -0.249312 -0.211709 \n",
"4 9229 42 -0.122604 -0.249312 -0.211709 \n",
"\n",
" Title1 FullDescription0 FullDescription1 \n",
"0 0.010168 -18.530014 2.881801 \n",
"1 -0.578663 1.115408 -2.899837 \n",
"2 0.064045 -1.111251 2.198475 \n",
"3 0.010168 -18.890457 3.393423 \n",
"4 0.010168 -19.451188 2.751042 "
]
},
"execution_count": 258,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 259,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1817,
"status": "ok",
"timestamp": 1529801321689,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "doodiP6IEwD_",
"outputId": "c8d75d4e-d9d0-4969-cf45-96d28e9c2b46"
},
"outputs": [],
"source": [
"# The first df_job_tuple[0] rows of df are the training rows; keep every column.\n",
"X_train = df.values[:df_job_tuple[0], :]"
]
},
{
"cell_type": "code",
"execution_count": 260,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1466,
"status": "ok",
"timestamp": 1529801324123,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "2Ryw0iShEwEE",
"outputId": "5c518610-2da8-4e01-8c86-9def87aa91aa"
},
"outputs": [],
"source": [
"# The test rows follow the training rows in df, so start where the training block ends.\n",
"X_test = df.values[df_job_tuple[0]:, :]"
]
},
{
"cell_type": "code",
"execution_count": 261,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 728,
"status": "ok",
"timestamp": 1529801329170,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "9RLhOM49EwEI",
"outputId": "e666aaef-3a0e-4ccf-bace-2ad37b65971e"
},
"outputs": [
{
"data": {
"text/plain": [
"((244766, 8), (122463, 8))"
]
},
"execution_count": 261,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, X_test.shape"
]
},
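{
"cell_type": "markdown",
"metadata": {},
"source": [
"When training and test rows are stacked in a single array, they can be separated again by row position. The cell below is a minimal sketch on toy data: `n_train`, `n_test`, `stacked`, `toy_train` and `toy_test` are hypothetical names used only for illustration and do not touch the notebook's own variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical stand-ins: 5 training rows stacked on top of 3 test rows, 2 features each.\n",
"n_train, n_test = 5, 3\n",
"stacked = np.arange((n_train + n_test) * 2).reshape(n_train + n_test, 2)\n",
"\n",
"# Rows before n_train form the training block; the remaining rows form the test block.\n",
"toy_train = stacked[:n_train]\n",
"toy_test = stacked[n_train:]\n",
"\n",
"# The two blocks must not overlap and must cover every row of the stacked array.\n",
"assert toy_train.shape == (n_train, 2)\n",
"assert toy_test.shape == (n_test, 2)\n",
"assert np.array_equal(np.vstack([toy_train, toy_test]), stacked)"
]
},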
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2QZB6KKaEwEM"
},
"source": [
"### Creating the Scaler"
]
},
{
"cell_type": "code",
"execution_count": 263,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 780,
"status": "ok",
"timestamp": 1529801336408,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "yl3cggxyEwEN",
"outputId": "c223576e-ef0a-460a-ed48-dd16d9e76615"
},
"outputs": [],
"source": [
"scaler = StandardScaler()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "btL2O-tJEwEn"
},
"source": [
"### Creating the Folds"
]
},
{
"cell_type": "code",
"execution_count": 264,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 773,
"status": "ok",
"timestamp": 1529801345950,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "SLlImgbfEwEo",
"outputId": "e546ce8e-65e9-4332-df8d-bbe1b14a723e"
},
"outputs": [],
"source": [
"n_splits = 10\n",
"kfold = KFold(n_splits=n_splits)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Function to Run the Models"
]
},
{
"cell_type": "code",
"execution_count": 265,
"metadata": {},
"outputs": [],
"source": [
"def cross_validation(model, X, y):\n",
"    '''Scale the features, then cross-validate `model` on the shared folds, scoring MAE and MSE.'''\n",
"    scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error']\n",
"    # The scaler sits inside the pipeline so it is refit on the training part of each fold.\n",
"    pipeline = Pipeline([('transformer', scaler), ('estimator', model)])\n",
"\n",
"    return cross_validate(pipeline, X=X, y=y, cv=kfold, n_jobs=1, verbose=5, scoring=scoring, return_train_score=True)"
]
},
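{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what `cross_validation` returns, the cell below runs the same pipeline pattern on synthetic data. `LinearRegression`, the `toy_*` names and the generated data are assumptions made for illustration; they are not part of the original experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_validate\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic regression problem: 200 samples, 3 features, a small amount of noise.\n",
"rng = np.random.RandomState(0)\n",
"toy_X = rng.normal(size=(200, 3))\n",
"toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)\n",
"\n",
"# Same structure as above: scaler and estimator chained in one pipeline.\n",
"toy_pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', LinearRegression())])\n",
"toy_cv = cross_validate(\n",
"    toy_pipeline, toy_X, toy_y,\n",
"    cv=KFold(n_splits=5),\n",
"    scoring=['neg_mean_absolute_error', 'neg_mean_squared_error'],\n",
"    return_train_score=True,\n",
")\n",
"\n",
"# The result is a dict with one array per metric, holding one value per fold.\n",
"sorted(toy_cv.keys())"
]
},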
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "YfAiDB8dEwEv"
},
"source": [
"## Creating the Models"
]
},
{
"cell_type": "code",
"execution_count": 266,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 753,
"status": "ok",
"timestamp": 1529801353639,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "GwdEADLvEwEw",
"outputId": "38ba3f27-9e4d-4adf-b0fd-0612650296b4"
},
"outputs": [],
"source": [
"rf_model = RandomForestRegressor(n_estimators=50, min_samples_split=30, random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 672,
"status": "ok",
"timestamp": 1529801355131,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "JEHZ_sfvEwE0",
"outputId": "cadc9dd1-3a5e-4e30-a4c2-83d40c5cdb36"
},
"outputs": [],
"source": [
"gb_model = GradientBoostingRegressor(min_samples_split=30, random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 268,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1055,
"status": "ok",
"timestamp": 1529801357880,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "WvFkxrQsEwE6",
"outputId": "6f3778eb-e7fc-4f03-fb58-68cf164a6a52"
},
"outputs": [],
"source": [
"ada_model = AdaBoostRegressor(random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 269,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 37
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1476,
"status": "ok",
"timestamp": 1529801359857,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "KtNyotnoEwE9",
"outputId": "8d3f5269-5f05-4dfc-9fd4-5718ca61f69a"
},
"outputs": [],
"source": [
"knn_model = KNeighborsRegressor()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "p4gR0Ps9EwFH"
},
"source": [
"## Training"
]
},
{
"cell_type": "code",
"execution_count": 270,
"metadata": {},
"outputs": [],
"source": [
"def calc_metrics(cv):\n",
"    '''Return the tuples (rmse_train, rmse_test), (mae_train, mae_test), averaged over the folds.'''\n",
"    time_train = np.sum(cv['fit_time']) / n_splits\n",
"    print('Average training time: %f s per fold (%d folds)' % (time_train, n_splits))\n",
"    # scikit-learn negates error scores, hence np.abs; RMSE is computed per fold, then averaged.\n",
"    train_rmse = np.sum(np.sqrt(np.abs(cv['train_neg_mean_squared_error']))) / n_splits\n",
"    print('RMSE Train: %.2f' % train_rmse)\n",
"    test_rmse = np.sum(np.sqrt(np.abs(cv['test_neg_mean_squared_error']))) / n_splits\n",
"    print('RMSE Test: %.2f' % test_rmse)\n",
"    # MAE must come from the absolute-error scores, not the squared-error ones.\n",
"    mae_train = np.sum(np.abs(cv['train_neg_mean_absolute_error'])) / n_splits\n",
"    print('MAE Train: %.2f' % mae_train)\n",
"    mae_test = np.sum(np.abs(cv['test_neg_mean_absolute_error'])) / n_splits\n",
"    print('MAE Test: %.2f' % mae_test)\n",
"    return (train_rmse, test_rmse), (mae_train, mae_test)"
]
},
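{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn reports error metrics as negated scores so that larger is always better, and it stores each metric under its own key. The sketch below uses made-up fold scores (the `toy_scores` values are hypothetical) to show how RMSE and MAE are recovered from their respective keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical per-fold scores in scikit-learn's negated convention.\n",
"toy_scores = {\n",
"    'test_neg_mean_squared_error': np.array([-4.0, -9.0, -16.0]),\n",
"    'test_neg_mean_absolute_error': np.array([-1.5, -2.5, -3.5]),\n",
"}\n",
"\n",
"# Negate to recover the errors; take the square root per fold for RMSE, then average.\n",
"toy_rmse = np.mean(np.sqrt(-toy_scores['test_neg_mean_squared_error']))\n",
"toy_mae = np.mean(-toy_scores['test_neg_mean_absolute_error'])\n",
"\n",
"print('RMSE: %.2f, MAE: %.2f' % (toy_rmse, toy_mae))"
]
},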
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN"
]
},
{
"cell_type": "code",
"execution_count": 271,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "3EU7lJjeAm3G"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-12685.973305552152, neg_mean_squared_error=-304122865.90516645, total= 11.0s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.4min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-13356.993520447766, neg_mean_squared_error=-325638520.3811414, total= 9.9s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 2.8min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-13712.026449319768, neg_mean_squared_error=-320672771.6496646, total= 8.4s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 4.1min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-13522.491236671161, neg_mean_squared_error=-325192332.7262622, total= 9.4s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 5.4min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-13496.07007394697, neg_mean_squared_error=-325081517.22830087, total= 9.4s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13322.775740491072, neg_mean_squared_error=-325439213.57288396, total= 9.1s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13444.718589638831, neg_mean_squared_error=-330733038.4749485, total= 9.4s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13544.787800294167, neg_mean_squared_error=-335721280.6270453, total= 9.2s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13538.33361660402, neg_mean_squared_error=-325569272.2439647, total= 8.8s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13588.120166693907, neg_mean_squared_error=-331692605.1628795, total= 9.5s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 13.2min finished\n"
]
},
{
"data": {
"text/plain": [
"{'fit_time': array([0.93913698, 0.95052528, 0.84312224, 0.82429481, 0.79533362,\n",
" 0.80535126, 0.97619629, 0.81978226, 0.80367112, 0.82421613]),\n",
" 'score_time': array([10.03360677, 8.97757196, 7.5689044 , 8.57959175, 8.64531207,\n",
" 8.25824142, 8.41873646, 8.33647299, 7.96712136, 8.70186543]),\n",
" 'test_neg_mean_absolute_error': array([-12685.97330555, -13356.99352045, -13712.02644932, -13522.49123667,\n",
" -13496.07007395, -13322.77574049, -13444.71858964, -13544.78780029,\n",
" -13538.3336166 , -13588.12016669]),\n",
" 'train_neg_mean_absolute_error': array([-10891.56388472, -10806.09391027, -10788.03617521, -10809.45374848,\n",
" -10802.44566728, -10824.32854931, -10805.95799174, -10783.61143039,\n",
" -10805.23312543, -10775.263036 ]),\n",
" 'test_neg_mean_squared_error': array([-3.04122866e+08, -3.25638520e+08, -3.20672772e+08, -3.25192333e+08,\n",
" -3.25081517e+08, -3.25439214e+08, -3.30733038e+08, -3.35721281e+08,\n",
" -3.25569272e+08, -3.31692605e+08]),\n",
" 'train_neg_mean_squared_error': array([-2.14095979e+08, -2.12305434e+08, -2.12535813e+08, -2.12849730e+08,\n",
" -2.12372106e+08, -2.12559687e+08, -2.12092152e+08, -2.11452972e+08,\n",
" -2.12838701e+08, -2.11672096e+08])}"
]
},
"execution_count": 271,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_knn = cross_validation(model=knn_model, X=X_train, y=y)\n",
"cv_knn"
]
},
{
"cell_type": "code",
"execution_count": 272,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average training time: 0.858163 s per fold (10 folds)\n",
"RMSE Train: 14576.59\n",
"RMSE Test: 18025.97\n",
"MAE Train: 10809.20\n",
"MAE Test: 13421.23\n"
]
}
],
"source": [
"knn_rmse, knn_mae = calc_metrics(cv_knn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ADA"
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "W_R1eykiEwFO"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-18026.59717156553, neg_mean_squared_error=-436203278.82177216, total= 10.0s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 10.6s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-14086.39406115166, neg_mean_squared_error=-303697370.69939584, total= 6.5s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 17.4s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-14497.848275125934, neg_mean_squared_error=-305129969.17546636, total= 8.1s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 26.0s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-16808.006081666368, neg_mean_squared_error=-390860148.630144, total= 10.2s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 36.8s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-14068.020373887357, neg_mean_squared_error=-302624342.10342324, total= 6.9s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-14126.87077556214, neg_mean_squared_error=-306093109.0017542, total= 7.4s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-13732.683230498624, neg_mean_squared_error=-293095706.45876855, total= 7.5s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-14217.656180605974, neg_mean_squared_error=-312771073.1200797, total= 6.9s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-14173.262098189978, neg_mean_squared_error=-306589571.9856712, total= 7.4s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-14050.543581768403, neg_mean_squared_error=-304371643.70640105, total= 5.7s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 1.3min finished\n"
]
},
{
"data": {
"text/plain": [
"{'fit_time': array([ 9.90668845, 6.45981526, 8.04847336, 10.13515735, 6.87510085,\n",
" 7.37868142, 7.44444728, 6.859236 , 7.38190651, 5.6998508 ]),\n",
" 'score_time': array([0.0626862 , 0.03798509, 0.04893899, 0.06131077, 0.04354501,\n",
" 0.0439086 , 0.04419541, 0.04118729, 0.04363227, 0.03371882]),\n",
" 'test_neg_mean_absolute_error': array([-18026.59717157, -14086.39406115, -14497.84827513, -16808.00608167,\n",
" -14068.02037389, -14126.87077556, -13732.6832305 , -14217.65618061,\n",
" -14173.26209819, -14050.54358177]),\n",
" 'train_neg_mean_absolute_error': array([-16281.59898887, -14006.81536966, -14404.75011922, -16784.77793312,\n",
" -14042.15380637, -14185.868585 , -13943.36532318, -14082.49068848,\n",
" -14081.79809772, -13897.6004175 ]),\n",
" 'test_neg_mean_squared_error': array([-4.36203279e+08, -3.03697371e+08, -3.05129969e+08, -3.90860149e+08,\n",
" -3.02624342e+08, -3.06093109e+08, -2.93095706e+08, -3.12771073e+08,\n",
" -3.06589572e+08, -3.04371644e+08]),\n",
" 'train_neg_mean_squared_error': array([-3.72203634e+08, -3.00704999e+08, -3.11884955e+08, -3.91016317e+08,\n",
" -3.01096915e+08, -3.05386553e+08, -2.98062609e+08, -3.03837926e+08,\n",
" -3.02289561e+08, -2.96651457e+08])}"
]
},
"execution_count": 273,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_ada = cross_validation(model=ada_model, X=X_train, y=y)\n",
"cv_ada"
]
},
{
"cell_type": "code",
"execution_count": 274,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average training time: 7.618936 s per fold (10 folds)\n",
"RMSE Train: 17820.08\n",
"RMSE Test: 18020.35\n",
"MAE Train: 14571.12\n",
"MAE Test: 14778.79\n"
]
}
],
"source": [
"ada_rmse, ada_mae = calc_metrics(cv_ada)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gradient Boosting"
]
},
{
"cell_type": "code",
"execution_count": 275,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 1866
},
"colab_type": "code",
"executionInfo": {
"elapsed": 620046,
"status": "error",
"timestamp": 1527489481723,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "2IRh66SLAgbA",
"outputId": "dce47b87-c196-4d84-d65c-e12a64a195db"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11157.757141783874, neg_mean_squared_error=-236307514.546646, total= 27.7s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 28.4s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-11794.54542317465, neg_mean_squared_error=-252768448.2961616, total= 27.7s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 56.7s remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-11974.000602985794, neg_mean_squared_error=-248563334.5069125, total= 27.4s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 1.4min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-11718.60288485207, neg_mean_squared_error=-249430319.5574618, total= 27.7s\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 1.9min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-11680.106617317362, neg_mean_squared_error=-249995110.64746425, total= 27.7s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11418.145479275334, neg_mean_squared_error=-244904057.39809507, total= 27.1s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11568.160365604053, neg_mean_squared_error=-248934116.92652842, total= 27.1s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11794.06479295684, neg_mean_squared_error=-256482078.08974105, total= 27.3s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11780.26542657853, neg_mean_squared_error=-249402499.14327267, total= 27.2s\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-12171.39535453055, neg_mean_squared_error=-267072995.2332186, total= 26.9s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 4.7min finished\n"
]
},
{
"data": {
"text/plain": [
"{'fit_time': array([27.61209464, 27.65011168, 27.2816534 , 27.59758711, 27.61559343,\n",
" 27.00480771, 27.07606125, 27.19110966, 27.1590209 , 26.84476233]),\n",
" 'score_time': array([0.06998491, 0.07221317, 0.0750978 , 0.07314777, 0.07354879,\n",
" 0.07360482, 0.07236648, 0.07219744, 0.07217264, 0.07455015]),\n",
" 'test_neg_mean_absolute_error': array([-11157.75714178, -11794.54542317, -11974.00060299, -11718.60288485,\n",
" -11680.10661732, -11418.14547928, -11568.1603656 , -11794.06479296,\n",
" -11780.26542658, -12171.39535453]),\n",
" 'train_neg_mean_absolute_error': array([-11722.73349644, -11628.15233363, -11615.83555431, -11668.20734179,\n",
" -11658.00362845, -11690.10426759, -11671.62031954, -11645.17506191,\n",
" -11630.84285931, -11582.4413844 ]),\n",
" 'test_neg_mean_squared_error': array([-2.36307515e+08, -2.52768448e+08, -2.48563335e+08, -2.49430320e+08,\n",
" -2.49995111e+08, -2.44904057e+08, -2.48934117e+08, -2.56482078e+08,\n",
" -2.49402499e+08, -2.67072995e+08]),\n",
" 'train_neg_mean_squared_error': array([-2.49767278e+08, -2.48153348e+08, -2.48356487e+08, -2.48617632e+08,\n",
" -2.48557717e+08, -2.49443714e+08, -2.48808790e+08, -2.48419772e+08,\n",
" -2.47943080e+08, -2.46573006e+08])}"
]
},
"execution_count": 275,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_gb = cross_validation(model=gb_model, X=X_train, y=y)\n",
"cv_gb"
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tempo médio de treinamento: 27.303280 seg. Para 1 / 10 folds\n",
"RMSE Train: 15762.72\n",
"RMSE Test: 15821.84\n",
"MAE Train: 248464082.25\n",
"MAE Test: 250386047.43\n"
]
}
],
"source": [
"gb_rmse, gb_mae = calc_metrics(cv_gb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Random Forest"
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 442
},
"colab_type": "code",
"executionInfo": {
"elapsed": 6282845,
"status": "ok",
"timestamp": 1527482202118,
"user": {
"displayName": "Gabriel Cesar",
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg",
"userId": "109223051625932368282"
},
"user_tz": 180
},
"id": "qxLDy6a0EwFI",
"outputId": "6bd7664a-5ea8-4c8d-ea48-d055e86a75e1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-10455.372119829986, neg_mean_squared_error=-217434181.2922031, total= 1.7min\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.8min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-10928.521278788283, neg_mean_squared_error=-220613589.8554696, total= 1.8min\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 3.7min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-11142.203312201136, neg_mean_squared_error=-222814795.3751126, total= 1.8min\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 5.6min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-10898.332185039357, neg_mean_squared_error=-219861113.39863864, total= 1.8min\n",
"[CV] ................................................................\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 7.4min remaining: 0.0s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] , neg_mean_absolute_error=-10757.151731450098, neg_mean_squared_error=-216575519.43371564, total= 1.7min\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-10533.368663378076, neg_mean_squared_error=-213042538.59741327, total= 1.8min\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-10782.932878683925, neg_mean_squared_error=-218226930.35436878, total= 1.8min\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-10895.059390163617, neg_mean_squared_error=-223279548.11551553, total= 1.7min\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-10771.6545234562, neg_mean_squared_error=-215919823.7076418, total= 1.7min\n",
"[CV] ................................................................\n",
"[CV] , neg_mean_absolute_error=-11297.096975680786, neg_mean_squared_error=-235400339.70382506, total= 1.7min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 18.4min finished\n"
]
},
{
"data": {
"text/plain": [
"{'fit_time': array([103.4280529 , 106.41595149, 106.93811679, 105.83180761,\n",
" 104.17053199, 106.16159153, 108.26112056, 101.72895241,\n",
" 100.85266733, 100.95680785]),\n",
" 'score_time': array([0.60349679, 0.60762167, 0.62306309, 0.60878992, 0.64006448,\n",
" 0.68612719, 0.59401655, 0.6063838 , 0.59904742, 0.59874058]),\n",
" 'test_neg_mean_absolute_error': array([-10455.37211983, -10928.52127879, -11142.2033122 , -10898.33218504,\n",
" -10757.15173145, -10533.36866338, -10782.93287868, -10895.05939016,\n",
" -10771.65452346, -11297.09697568]),\n",
" 'train_neg_mean_absolute_error': array([-8369.41769811, -8320.75310612, -8326.0715284 , -8345.51824293,\n",
" -8325.88116065, -8363.95158476, -8342.9088071 , -8324.69521833,\n",
" -8343.51265885, -8257.02241748]),\n",
" 'test_neg_mean_squared_error': array([-2.17434181e+08, -2.20613590e+08, -2.22814795e+08, -2.19861113e+08,\n",
" -2.16575519e+08, -2.13042539e+08, -2.18226930e+08, -2.23279548e+08,\n",
" -2.15919824e+08, -2.35400340e+08]),\n",
" 'train_neg_mean_squared_error': array([-1.31514389e+08, -1.31258313e+08, -1.31510278e+08, -1.31873412e+08,\n",
" -1.31235206e+08, -1.32228487e+08, -1.31570988e+08, -1.30997437e+08,\n",
" -1.31894629e+08, -1.29364998e+08])}"
]
},
"execution_count": 277,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_rf = cross_validation(model=rf_model, X=X_train, y=y)\n",
"cv_rf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### RMSE & MAE"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tempo médio de treinamento: 104.474560 seg. Para 1 / 10 folds\n",
"RMSE Train: 11460.53\n",
"RMSE Test: 14841.79\n",
"MAE Train: 131344813.73\n",
"MAE Test: 220316837.98\n"
]
}
],
"source": [
"rf_rmse, rf_mae = calc_metrics(cv_rf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualização dos resultados"
]
},
{
"cell_type": "code",
"execution_count": 279,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[14576.5875465 , 18025.96891592],\n",
" [17820.07682486, 18020.34891532],\n",
" [15762.72186577, 15821.84438009],\n",
" [11460.53036498, 14841.79143433]])"
]
},
"execution_count": 279,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rmses = np.array([knn_rmse, ada_rmse, gb_rmse, rf_rmse])\n",
"rmses"
]
},
{
"cell_type": "code",
"execution_count": 280,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[2.12477467e+08, 3.24986342e+08],\n",
" [3.18313493e+08, 3.26143621e+08],\n",
" [2.48464082e+08, 2.50386047e+08],\n",
" [1.31344814e+08, 2.20316838e+08]])"
]
},
"execution_count": 280,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maes = np.array([knn_mae, ada_mae, gb_mae, rf_mae])\n",
"maes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RMSE "
]
},
{
"cell_type": "code",
"execution_count": 281,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,1,'RMSE Train')"
]
},
"execution_count": 281,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n",
"plt.bar(range(rmses.shape[0]), rmses[:, 0])\n",
"plt.xticks(range(rmses.shape[0]), class_names)\n",
"plt.title('RMSE Train')"
]
},
{
"cell_type": "code",
"execution_count": 282,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,1,'RMSE Test')"
]
},
"execution_count": 282,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n",
"plt.bar(range(rmses.shape[0]), rmses[:, 1])\n",
"plt.xticks(range(rmses.shape[0]), class_names)\n",
"plt.title('RMSE Test')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MAE"
]
},
{
"cell_type": "code",
"execution_count": 283,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,1,'MAE Train')"
]
},
"execution_count": 283,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEmJJREFUeJzt3XuQZGV5x/HvL7B4CQjRnRgCi2OUKi8kgtkgxJAQMVULGDEJRjYJomVqE6MJJuZCrMTbX+aiSaFEag2o4AWNGrMqXgtTQgWQARcCrsblImwwYXR1cYVIVp/80WdN09vD9Oz0TM+++/1Ude25PKf7mVOzv3nnPad7UlVIktryQ5NuQJI0foa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdpiSX5iSQ7J92H9i+Gu1a0JHckeSDJ6oHtm5NUkumB7a/tth8/sP1FSb6XZOfA48cH6o4a2F9JvtO3ftJCv4aquq2qDl7ocdJiGO7aF9wOrN+9kuQngUcMFiUJcDawHThnyPNcXVUHDzzu7i+oqjv793ebn9a37cohr3vAIr42aUkY7toXXAq8sG/9HOCSIXUnAT8OnAucleSgpWgmybuSXJDkE0m+A5yU5LndbxPfTnJnkr/sq39ikupbvyrJ65L8W1f/iSSPXopetf8y3LUvuAZ4VJInd6PkFwDvGlJ3DvAR4H3d+nOWsKffAF4HHAJcDewEfgs4FPhl4NwkD/X6v0Gv38cCPwz80RL2qv3QRMM9ycVJ7kly8wi1RyX5bJIvJLkpyWnL0aNWjN2j918CvgT8Z//OJI8Eng+8p6r+F/gAe07NnJDkW32PWxfRzz9X1dVV9f2q+m5VXVFVN3frNwKXAb/wEMdfVFVfqar7gH8Cjl1EL9IeJj1yfwewbsTavwDeX1XHAWcB/7BUTWlFupTeaPdFDJ+S+RVgF3B5t/5u4NQkU30111TVYX2PJyyin7v6V5KcmORfk8wm2QH8NrB6+KEA/Fff8n2AF1w1VhMN96r6HL2LXz+Q5AndHOT1Sa5M8qTd5cCjuuVDgQddCFPbquqr9C6sngZ8aEjJOfQC8s4k/0VvNLyKvgux425pYP0y4IPAmqo6FPhHIEv02tK8Dpx0A0NsBH63qr6S5Bn0RujPAl4LfCrJ79Obo3z25FrUhLwE+JGq+k6SH3zvJjkCOAU4Fbipr/4V9EL//GXo7RBge1X9T5IT6P12+dFleF1pqBUV7kkOBn4W+KfeXW0APKz7dz3wjqp6Y5ITgUuTHFNV359Aq5qAqpprjvxsYHNVfap/Y5LzgVcmOabbdOKQNxP9YlVdN4b2Xgr8TZILgc8C7wceOYbnlfZKJv3HOro3oXy0qo5J8ijgy1V1+JC6W4B1VXVXt34bcEJV3bOc/UrSvmDSF1QfpKruBW5P8nzovSklydO63XfS+9WbJE8GHg7MTqRRSVrhJjpyT/Je4GR6dxX8N/Aa4ArgrcDh9C6IXVZVr0/yFOBt9C6aFfCng7+GS5J6Jj4tI0kavxU1LSNJGo+J3S2zevXqmp6entTLS9I+6frrr/96VU3NVzexcJ+enmZmZmZSLy9J+6QkXx2lzmkZSWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lq0Ir6Yx1aPtPnfWzSLUzUHW84fdItSEvKkbskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ2aN9yTPDzJ55PcmOSWJK8bUvOwJO9LsjXJtUmml6JZSdJoRhm5fxd4VlU9DTgWWJfkhIGalwDfrKonAn8H/NV425QkLcS84V49O7vVVd2jBsrOAN7ZLX8AOCVJxtalJGlBRppzT3JAks3APcCnq+ragZIjgLsAqmoXsAN4zDgblSSNbqRwr6rvVdWxwJHA8UmOGSgZNkofHN2TZEOSmSQzs7OzC+9WkjSSBd0tU1XfAv4VWDewax
uwBiDJgcChwPYhx2+sqrVVtXZqamqvGpYkzW+Uu2WmkhzWLT8CeDbwpYGyTcA53fKZwBVVtcfIXZK0PEb5yN/DgXcmOYDeD4P3V9VHk7wemKmqTcBFwKVJttIbsZ+1ZB1LkuY1b7hX1U3AcUO2v7pv+X+A54+3NUnS3vIdqpLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAbNG+5J1iT5bJItSW5Jcu6QmpOT7EiyuXu8emnalSSN4sARanYBr6yqG5IcAlyf5NNV9cWBuiur6jnjb1GStFDzjtyr6mtVdUO3/G1gC3DEUjcmSdp7C5pzTzINHAdcO2T3iUluTPLxJE+d4/gNSWaSzMzOzi64WUnSaEaZlgEgycHAB4FXVNW9A7tvAB5XVTuTnAZ8GDh68DmqaiOwEWDt2rW1111LEzZ93scm3cJE3fGG0yfdguYx0sg9ySp6wf7uqvrQ4P6qureqdnbLlwOrkqwea6eSpJGNcrdMgIuALVX1pjlqfqyrI8nx3fN+Y5yNSpJGN8q0zDOBs4F/T7K52/Yq4CiAqroQOBN4aZJdwP3AWVXltIskTci84V5VVwGZp+YtwFvG1ZQkaXF8h6okNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoFH+QPaKM33exybdwkTd8YbTJ92CpBXOkbskNchwl6QGGe6S1CDDXZIaNG+4J1mT5LNJtiS5Jcm5Q2qS5PwkW5PclOTpS9OuJGkUo9wtswt4ZVXdkOQQ4Pokn66qL/bVnAoc3T2eAby1+1eSNAHzjtyr6mtVdUO3/G1gC3DEQNkZwCXVcw1wWJLDx96tJGkkC5pzTzINHAdcO7DrCOCuvvVt7PkDgCQbkswkmZmdnV1Yp5KkkY0c7kkOBj4IvKKq7h3cPeSQ2mND1caqWltVa6emphbWqSRpZCOFe5JV9IL93VX1oSEl24A1fetHAncvvj1J0t4Y5W6ZABcBW6rqTXOUbQJe2N01cwKwo6q+NsY+JUkLMMrdMs8Ezgb+PcnmbturgKMAqupC4HLgNGArcB/w4vG3Kkka1bzhXlVXMXxOvb+mgJeNqylJ0uL4DlVJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGjRvuCe5OMk9SW6eY//JSXYk2dw9Xj3+NiVJC3HgCDXvAN4CXPIQNVdW1XPG0pEkadHmHblX1eeA7cvQiyRpTMY1535ikhuTfDzJU+cqSrIhyUySmdnZ2TG9tCRp0DjC/QbgcVX1NODNwIfnKqyqjVW1tqrWTk1NjeGlJUnDLDrcq+reqtrZLV8OrEqyetGdSZL22qLDPcmPJUm3fHz3nN9Y7PNKkvbevHfLJHkvcDKwOsk24DXAKoCquhA4E3hpkl3A/cBZVVVL1rEkaV7zhntVrZ9n/1vo3SopSVohfIeqJDXIcJekBhnuktQgw12SGmS4S1KDDHdJatAonwopSWM1fd7HJt3CRN3xhtOX/DUcuUtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KD5g33JBcnuSfJzXPsT5Lzk2xNclOSp4+/TUnSQowycn8HsO4h9p8KHN09NgBvXXxbkqTFmDfcq+pzwPaHKDkDuKR6rgEOS3L4uBqUJC3cOObcjwDu6lvf1m
3bQ5INSWaSzMzOzo7hpSVJw4wj3DNkWw0rrKqNVbW2qtZOTU2N4aUlScOMI9y3AWv61o8E7h7D80qS9tI4wn0T8MLurpkTgB1V9bUxPK8kaS8dOF9BkvcCJwOrk2wDXgOsAqiqC4HLgdOArcB9wIuXqllJ0mjmDfeqWj/P/gJeNraOJEmL5jtUJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWrQSOGeZF2SLyfZmuS8IftflGQ2yebu8dvjb1WSNKoD5ytIcgBwAfBLwDbguiSbquqLA6Xvq6qXL0GPkqQFGmXkfjywtapuq6oHgMuAM5a2LUnSYowS7kcAd/Wtb+u2Dfq1JDcl+UCSNcOeKMmGJDNJZmZnZ/eiXUnSKEYJ9wzZVgPrHwGmq+qngM8A7xz2RFW1sarWVtXaqamphXUqSRrZKOG+DegfiR8J3N1fUFXfqKrvdqtvA356PO1JkvbGKOF+HXB0kscnOQg4C9jUX5Dk8L7V5wJbxteiJGmh5r1bpqp2JXk58EngAODiqrolyeuBmaraBPxBkucCu4DtwIuWsGdJ0jzmDXeAqrocuHxg26v7lv8c+PPxtiZJ2lu+Q1WSGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktSgkcI9ybokX06yNcl5Q/Y/LMn7uv3XJpked6OSpNHNG+5JDgAuAE4FngKsT/KUgbKXAN+sqicCfwf81bgblSSNbpSR+/HA1qq6raoeAC4DzhioOQN4Z7f8AeCUJBlfm5KkhThwhJojgLv61rcBz5irpqp2JdkBPAb4en9Rkg3Ahm51Z5Iv703TK8BqBr625ZQ2fi/yHC6O529x9uXz97hRikYJ92Ej8NqLGqpqI7BxhNdc0ZLMVNXaSfexL/McLo7nb3H2h/M3yrTMNmBN3/qRwN1z1SQ5EDgU2D6OBiVJCzdKuF8HHJ3k8UkOAs4CNg3UbALO6ZbPBK6oqj1G7pKk5THvtEw3h/5y4JPAAcDFVXVLktcDM1W1CbgIuDTJVnoj9rOWsukVYJ+fWloBPIeL4/lbnObPXxxgS1J7fIeqJDXIcJekBhnuA5Ls7Fs+LclXkhyV5LVJ7kvyo3PUVpI39q3/cZLXLlvjK0iSX+nOx5O69ekk9yf5QpItST6f5Jwhx/1LkquXv+OVK8ljk7wnyW1Jrk9ydXd+T06yI8nmJDcl+Uz/96Z6knyvO0c3J/lIksO67bu/Jzf3PQ6adL/jZLjPIckpwJuBdVV1Z7f568Ar5zjku8CvJlm9HP2tcOuBq3jwhfVbq+q4qnpyt/0Pk7x4987uP93TgcOSPH5Zu12hund5fxj4XFX9RFX9NL1zd2RXcmVVHVtVP0XvrraXTajVlez+7hwdQ+9mj/5zdGu3b/fjgQn1uCQM9yGSnAS8DTi9qm7t23Ux8IIkjx5y2C56V+D/cBlaXLGSHAw8k97nDQ29a6qqbgP+CPiDvs2/BnyE3sdbtH631aieBTxQVRfu3lBVX62qN/cXdT8EDgG+ucz97Wuupvdu+v2C4b6nhwH/Ajyvqr40sG8nvYA/d45jLwB+M8mhS9jfSvc84BNV9R/A9iRPn6PuBuBJfevrgfd2j/VL2+I+46n0ztNcTkqyGbgTeDa9700N0X0A4ik8+D06T+ibkrlgQq0tGcN9T/8L/Bu9kecw5wPnJHnU4I6quhe4hAePSPc36+mNvun+nSuof/CRFUkeCzwRuKr7obAryTFL2uU+KMkFSW5Mcl23afe0zBrg7cBfT7C9leoR3Q/AbwCPBj7dt69/Wq
a5KS3DfU/fB34d+JkkrxrcWVXfAt4D/N4cx/89vR8MP7xkHa5QSR5DbyrhH5PcAfwJ8AKGf/bQccCWbvkFwI8At3fHTePUDMAt9K5DANAF0CnA1JDaTcDPL1Nf+5L7q+pYeh+2dRD70XUJw32IqroPeA69KZZhI/g3Ab/DkHf4VtV24P3MPfJv2ZnAJVX1uKqa7kaUt/P/FwCB3p0KwN/Su2ANvdH9uu6YaWD3hcP93RXAw5O8tG/bI+eo/Tng1jn27feqage936j/OMmqSfezHAz3OXQhvQ74iyRnDOz7OvDP9Obnh3kjvY8U3d+sp3de+n0QeBW9+c0vJNlC74ffm6vq7V3QHwVcs/uAqroduDfJ4EdL71e6z2d6HvALSW5P8nl6fzfhz7qSk7r54huBs5n7Ti4BVfUF4Eb2k4GDHz8gSQ1y5C5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoP+D9bs5l7GT47SAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n",
"plt.bar(range(maes.shape[0]), maes[:, 0])\n",
"plt.xticks(range(maes.shape[0]), class_names)\n",
"plt.title('MAE Train')"
]
},
{
"cell_type": "code",
"execution_count": 284,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,1,'MAE Test')"
]
},
"execution_count": 284,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEoFJREFUeJzt3X+w5XVdx/HnK3ZRC4V0r4mwcE0pf6CAboQZxUg2KziiiclWiGazjWliaUWOo+SMM9iUFkjSmqSQv39ka+KvBhtwEvSCC4krtfwQNjCurC6SJK69++N8t45nz/Wee++5e+5+9vmY+c6e7/fzPue873d2X+dzP+d7zqaqkCS15Ucm3YAkafwMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw137jCS3Jrk/yZqB41uSVJLpgePndsePHzj+oiTfT3LvwPbIgbojBsYryX/17Z+4hJ/l60l+frH3l+ZjuGtfcwuwYfdOkicCDxosShLgTGAHcNaQx/l8VR00sN3RX1BVt/WPd4eP6Tt25dh+KmnMDHftay4FXti3fxZwyZC6E4FHAmcDZyQ5cDmaSfKgJH+R5PZuNn5Bkgd0Y49I8skk30pyd5LLu+MfBB4OfLr7DeAVy9Gb9m+Gu/Y1VwEPSfK4JAcALwD+bkjdWcDHgPd3+89apn7eAhwOPBH4aeCngHO6sT8CbgTWAIcC5wJU1fOBu4Bf7n4DOH+ZetN+bKLhnuTiJHcl+fIItUck+WySLyW5Pskpe6NHrUi7Z+/PAL4K/Ef/YJIfBZ4PvKeqvgd8iD2XZk7oZtS7t5sW2kSSVcBvAmdX1beqaidwHnBGV/I9er89HFFV91fVFQt9DmmxJj1zfyewfsTa1wIfqKrj6P3j+avlakor3qXArwEvYviSzHOBXcBl3f67gWcmmeqruaqqDunbHr2IPh4JrAZu2P0iAXyU3pILwBuBO4DPJtmW5PcX8RzSokw03LuZzI7+Y0ke3a1TXpPkyiSP3V0OPKS7fTC9fzTaD1XV1+i9sXoK8JEhJWcBBwG3Jfk68EF6IbxhSO1S3EnvReTRfS8SB1fVw7o+d1bV2VV1JPA84LVJnrb7xxhzL9IPmPTMfZhNwO9W1VOAV/P/M/Rzgd9Isp3ejOx3J9OeVoiXAE+vqv/qP5jkMOBkemvsx3bbMcCbGH7VzKJ1Sz4XA3+ZZE161iZ5RtfLs5M8qrtyZyfw/W4D+E/gJ8fZj9RvRYV7koOAnwM+mGQL8Nf03oiC3qzrnVV1OL0Z26VJVlT/2nuq6qaqmhkydCawpao+XVVf370B5wNPSnJ0V/fUIde5/8wiWnklvd8iZ+gF+CeBx3RjjwP+Gfg2cAXwZ1V1VTf2RuCN3XLOyxfxvNIPlUn/Zx3dB0/+saqOTvIQ4MaqOnRI3Q3A+qq6vdu/GTihqu7am/1K0r5gRc18q+oe4JYkz4feB1GSHNMN30bv122SPA54IDA7kUYlaYWb6Mw9yXuBk+hdB/yfwOuBy4G30VuOWQ28r6rekOTxwNvpvVFWwB9W1acn0bckrXQTX5aRJI3filqWkSSNx6pJPfGaNWtqenp6Uk8vSfuka6655htVNTVf3cTCfXp6mpmZYVeySZLmkuRro9S5LCNJDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ2a2CdUNVnT53x80i1M1K3nnTrpFqRl5cxdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGzRvuSR6Y5AtJrktyQ5I/GVLzgCTvT7ItydVJppejWUnSaEaZuX8XeHpVHQMcC6xPcsJAzUuAb1bVY4C3AG8ab5uSpIWYN9yr595ud3W31UDZacC7utsfAk5OkrF1KUlakJHW3JMckGQLcBfwmaq6eqDkMOB2gKraBewEHjbORiVJoxsp3Kvq+1V1LHA4cHySowdKhs3SB2
f3JNmYZCbJzOzs7MK7lSSNZEFXy1TVt4B/BtYPDG0H1gIkWQUcDOwYcv9NVbWuqtZNTU0tqmFJ0vxGuVpmKskh3e0HAb8EfHWgbDNwVnf7dODyqtpj5i5J2jtG+crfQ4F3JTmA3ovBB6rqH5O8AZipqs3AO4BLk2yjN2M/Y9k6liTNa95wr6rrgeOGHH9d3+3/Bp4/3tYkSYvlJ1QlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNmjfck6xN8tkkW5PckOTsITUnJdmZZEu3vW552pUkjWLVCDW7gFdV1bVJHgxck+QzVfWVgborq+pZ429RkrRQ887cq+rOqrq2u/1tYCtw2HI3JklavAWtuSeZBo4Drh4y/NQk1yX5RJInzHH/jUlmkszMzs4uuFlJ0mhGWZYBIMlBwIeBV1bVPQPD1wJHVtW9SU4BPgocNfgYVbUJ2ASwbt26WnTX0oRNn/PxSbcwUbeed+qkW9A8Rpq5J1lNL9jfXVUfGRyvqnuq6t7u9mXA6iRrxtqpJGlko1wtE+AdwNaqevMcNY/o6khyfPe4d4+zUUnS6EZZlnkacCbwr0m2dMdeAxwBUFUXAacDL02yC7gPOKOqXHaRpAmZN9yr6nNA5ql5K/DWcTUlSVoaP6EqSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNWiU/yB7xZk+5+OTbmGibj3v1Em3IGmFc+YuSQ0y3CWpQYa7JDXIcJekBs0b7knWJvlskq1Jbkhy9pCaJDk/ybYk1yd58vK0K0kaxShXy+wCXlVV1yZ5MHBNks9U1Vf6ap4JHNVtPwu8rftTkjQB887cq+rOqrq2u/1tYCtw2EDZacAl1XMVcEiSQ8ferSRpJAtac08yDRwHXD0wdBhwe9/+dvZ8ASDJxiQzSWZmZ2cX1qkkaWQjh3uSg4APA6+sqnsGh4fcpfY4ULWpqtZV1bqpqamFdSpJGtlI4Z5kNb1gf3dVfWRIyXZgbd/+4cAdS29PkrQYo1wtE+AdwNaqevMcZZuBF3ZXzZwA7KyqO8fYpyRpAUa5WuZpwJnAvybZ0h17DXAEQFVdBFwGnAJsA74DvHj8rUqSRjVvuFfV5xi+pt5fU8DLxtWUJGlp/ISqJDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ2aN9yTXJzkriRfnmP8pCQ7k2zptteNv01J0kKsGqHmncBbgUt+SM2VVfWssXQkSVqyeWfuVXUFsGMv9CJJGpNxrbk/Ncl1ST6R5AlzFSXZmGQmyczs7OyYnlqSNGgc4X4tcGRVHQNcAHx0rsKq2lRV66pq3dTU1BieWpI0zJLDvaruqap7u9uXAauTrFlyZ5KkRVtyuCd5RJJ0t4/vHvPupT6uJGnx5r1aJsl7gZOANUm2A68HVgNU1UXA6cBLk+wC7gPOqKpato4lSfOaN9yrasM842+ld6mkJGmF8BOqktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGjfCukJI3V9Dkfn3QLE3Xreacu+3M4c5ekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGzRvuSS5OcleSL88xniTnJ9mW5PokTx5/m5KkhRhl5v5OYP0PGX8mcFS3bQTetvS2JElLMW+4V9UVwI
4fUnIacEn1XAUckuTQcTUoSVq4cay5Hwbc3re/vTu2hyQbk8wkmZmdnR3DU0uShhlHuGfIsRpWWFWbqmpdVa2bmpoaw1NLkoYZR7hvB9b27R8O3DGGx5UkLdI4wn0z8MLuqpkTgJ1VdecYHleStEir5itI8l7gJGBNku3A64HVAFV1EXAZcAqwDfgO8OLlalaSNJp5w72qNswzXsDLxtaRJGnJ/ISqJDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0aKdyTrE9yY5JtSc4ZMv6iJLNJtnTbb42/VUnSqFbNV5DkAOBC4BnAduCLSTZX1VcGSt9fVS9fhh4lSQs0ysz9eGBbVd1cVfcD7wNOW962JElLMUq4Hwbc3re/vTs26HlJrk/yoSRrhz1Qko1JZpLMzM7OLqJdSdIoRgn3DDlWA/sfA6ar6knAPwHvGvZAVbWpqtZV1bqpqamFdSpJGtko4b4d6J+JHw7c0V9QVXdX1Xe73bcDTxlPe5KkxRgl3L8IHJXkUUkOBM4ANvcXJDm0b/fZwNbxtShJWqh5r5apql1JXg58CjgAuLiqbkjyBmCmqjYDr0jybGAXsAN40TL2LEmax7zhDlBVlwGXDRx7Xd/tPwb+eLytSZIWy0+oSlKDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGjRSuCdZn+TGJNuSnDNk/AFJ3t+NX51ketyNSpJGN2+4JzkAuBB4JvB4YEOSxw+UvQT4ZlU9BngL8KZxNypJGt0oM/fjgW1VdXNV3Q+8DzhtoOY04F3d7Q8BJyfJ+NqUJC3EqhFqDgNu79vfDvzsXDVVtSvJTuBhwDf6i5JsBDZ2u/cmuXExTa8Aaxj42famtPF7kedwaTx/S7Mvn78jRykaJdyHzcBrETVU1SZg0wjPuaIlmamqdZPuY1/mOVwaz9/S7A/nb5Rlme3A2r79w4E75qpJsgo4GNgxjgYlSQs3Srh/ETgqyaOSHAicAWweqNkMnNXdPh24vKr2mLlLkvaOeZdlujX0lwOfAg4ALq6qG5K8AZipqs3AO4BLk2yjN2M/YzmbXgH2+aWlFcBzuDSev6Vp/vzFCbYktcdPqEpSgwx3SWqQ4T4gyb19t09J8u9JjkhybpLvJHn4HLWV5M/79l+d5Ny91vgKkuS53fl4bLc/neS+JF9KsjXJF5KcNeR+/5Dk83u/45UryU8keU+Sm5Nck+Tz3fk9KcnOJFuSXJ/kn/r/bqonyfe7c/TlJB9Lckh3fPffyS1924GT7necDPc5JDkZuABYX1W3dYe/Abxqjrt8F/iVJGv2Rn8r3Abgc/zgG+s3VdVxVfW47vjvJXnx7sHuH92TgUOSPGqvdrtCdZ/y/ihwRVX9ZFU9hd65O7wrubKqjq2qJ9G7qu1lE2p1JbuvO0dH07vYo/8c3dSN7d7un1CPy8JwHyLJicDbgVOr6qa+oYuBFyR56JC77aL3Dvzv7YUWV6wkBwFPo/d9Q0Ovmqqqm4HfB17Rd/h5wMfofb1F61dbjerpwP1VddHuA1X1taq6oL+oexF4MPDNvdzfvubz9D5Nv18w3Pf0AOAfgOdU1VcHxu6lF/Bnz3HfC4FfT3LwMva30j0H+GRV/RuwI8mT56i7Fnhs3/4G4L3dtmF5W9xnPIHeeZrLiUm2ALcBv0Tv76aG6L4A8WR+8DM6j+5bkrlwQq0tG8N9T98D/oXezHOY84GzkjxkcKCq7gEu4QdnpPubDfRm33R/zhXU//eVFUl+AngM8LnuRWFXkqOXtct9UJILk1yX5I
vdod3LMmuBvwX+dILtrVQP6l4A7wYeCnymb6x/Waa5JS3DfU//A/wq8DNJXjM4WFXfAt4D/M4c9/8Lei8MP7ZsHa5QSR5Gbynhb5LcCvwB8AKGf/fQccDW7vYLgB8HbunuN41LMwA30HsfAoAugE4GpobUbgZ+YS/1tS+5r6qOpfdlWweyH70vYbgPUVXfAZ5Fb4ll2Az+zcBvM+QTvlW1A/gAc8/8W3Y6cElVHVlV092M8hb+/w1AoHelAvBn9N6wht7sfn13n2lg9xuH+7vLgQcmeWnfsR+do/bngZvmGNvvVdVOer9RvzrJ6kn3szcY7nPoQno98Nokpw2MfQP4e3rr88P8Ob2vFN3fbKB3Xvp9GHgNvfXNLyXZSu/F74Kq+tsu6I8Artp9h6q6BbgnyeBXS+9Xuu9neg7wi0luSfIFev9vwh91JSd268XXAWcy95VcAqrqS8B17CcTB79+QJIa5MxdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QG/S806QHj5O9/ZAAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n",
"plt.bar(range(maes.shape[0]), maes[:, 0])\n",
"plt.xticks(range(maes.shape[0]), class_names)\n",
"plt.title('MAE Test')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dificuldades"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Trabalhar com textos nas features do dataset.\n",
"- Substituir valores que estão faltando.\n",
"- Saber quando remover uma feature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aprendizados"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Utilização das ferramentas ja existentes para realizar atividades que seriam realizadas manualmente.\n",
"- Um pouco de conhecimento sobre como funciona uma predição onde as features contém textos."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Possíveis melhorias futuras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- O score que é retornada é, portanto, negativo quando o score deve ser minimizado e positivo se for um score que deva ser maximizado. Portanto, minizar o score é uma melhoria futura.\n",
"- Verificar o comportamento com as features que não foram utilizadas.\n",
"- Utilizar outros algoritmos para realizar a predição."
]
}
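,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The point about negative scores can be made concrete with a small sketch. The `neg_mean_absolute_error` and `neg_mean_squared_error` scores shown in the cross-validation outputs are errors multiplied by -1, so the sign has to be flipped before reporting them. The cell below is only an illustration under stated assumptions, not the notebook's own `calc_metrics`: `summarize_errors` is a hypothetical helper, and `cv_gb_test` simply copies the per-fold test scores from the Gradient Boosting output. Note also that the values printed as `MAE` in the metric cells match the mean of the squared-error folds (for example 250386047.43 for Gradient Boosting), so they appear to be MSE rather than MAE; the helper reports the three quantities separately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Per-fold test scores copied from the Gradient Boosting cross-validation output.\n",
"cv_gb_test = {\n",
"    'test_neg_mean_absolute_error': np.array([\n",
"        -11157.75714178, -11794.54542317, -11974.00060299, -11718.60288485,\n",
"        -11680.10661732, -11418.14547928, -11568.1603656, -11794.06479296,\n",
"        -11780.26542658, -12171.39535453]),\n",
"    'test_neg_mean_squared_error': np.array([\n",
"        -2.36307515e+08, -2.52768448e+08, -2.48563335e+08, -2.49430320e+08,\n",
"        -2.49995111e+08, -2.44904057e+08, -2.48934117e+08, -2.56482078e+08,\n",
"        -2.49402499e+08, -2.67072995e+08]),\n",
"}\n",
"\n",
"\n",
"def summarize_errors(cv_results, split='test'):\n",
"    # Hypothetical helper (not part of the original notebook).\n",
"    # The neg_* scores are negated errors, so flip the sign first.\n",
"    mae_per_fold = -cv_results[split + '_neg_mean_absolute_error']\n",
"    mse_per_fold = -cv_results[split + '_neg_mean_squared_error']\n",
"    return {\n",
"        'mae': mae_per_fold.mean(),\n",
"        'mse': mse_per_fold.mean(),\n",
"        # RMSE is computed per fold and then averaged across folds.\n",
"        'rmse': np.sqrt(mse_per_fold).mean(),\n",
"    }\n",
"\n",
"\n",
"gb_summary = summarize_errors(cv_gb_test)\n",
"for name, value in gb_summary.items():\n",
"    print('%s test: %.2f' % (name.upper(), value))"
]
}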
],
"metadata": {
"colab": {
"collapsed_sections": [
"OHzogV28EwBJ",
"SUuskyQsEwBP",
"TuGX7DRrEwBW",
"3qa4BYJUEwBb",
"iILrtpDxEwBi",
"uIOf83lkEwBm",
"JeIbKYJCEwBz",
"FDBRPSj_EwB8",
"ztzrX_FOEwCE",
"slLrezFsEwCZ",
"WUHN-XxLEwCv",
"tL-laH_pEwC-",
"xDIMGEN7EwDG",
"3UqZ9i79EwDN",
"-EhpTVryEwD9",
"btL2O-tJEwEn"
],
"default_view": {},
"name": "Job Salary Prediction.ipynb",
"provenance": [],
"version": "0.3.2",
"views": {}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}