@tkhan0
Last active November 10, 2019 22:26
Data Cleaning -Part 2.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Diverse features in a dataset\n\nThe dataset that we will normally deal with will potentially be present in different magnitudes and range. In order to make the data uniform and of same scale and magnitude, we have various transformation techniques. These techniques will ensure that every feature of the data set has an appropriate effect on model's prediction.\n\n\nYou will encounter that our of the many features in the data, some will have high magnitude like the salary feature and some would have low magnitude like the overall job experience of an employee. But we should'nt ignore the data because of it lowe magnitude. It could be as important as the feature with high magnitude or even more important. So to ensure that the prediction that out model makes doesn't vary because of the varied magnitudes of data, we perform scaling, standardization or normalization. These are the 3 ways in which we could encounter the challenge of magnitude in our dara."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dtypes = df.dtypes",
"execution_count": 126,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dtypes",
"execution_count": 127,
"outputs": [
{
"data": {
"text/plain": "id int64\ndiagnosis object\nradius_mean float64\ntexture_mean float64\nperimeter_mean float64\narea_mean float64\nsmoothness_mean float64\ncompactness_mean float64\nconcavity_mean float64\nconcave points_mean float64\nsymmetry_mean float64\nfractal_dimension_mean float64\nradius_se float64\ntexture_se float64\nperimeter_se float64\narea_se float64\nsmoothness_se float64\ncompactness_se float64\nconcavity_se float64\nconcave points_se float64\nsymmetry_se float64\nfractal_dimension_se float64\nradius_worst float64\ntexture_worst float64\nperimeter_worst float64\narea_worst float64\nsmoothness_worst float64\ncompactness_worst float64\nconcavity_worst float64\nconcave points_worst float64\nsymmetry_worst float64\nfractal_dimension_worst float64\ndtype: object"
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"scrolled": false,
"trusted": true
},
"cell_type": "code",
"source": "info = pd.concat([null_,dtypes],axis=1,keys=['null','type'])\nprint(info)",
"execution_count": 128,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": " null type\nid False int64\ndiagnosis False object\nradius_mean False float64\ntexture_mean False float64\nperimeter_mean False float64\narea_mean False float64\nsmoothness_mean False float64\ncompactness_mean False float64\nconcavity_mean False float64\nconcave points_mean False float64\nsymmetry_mean False float64\nfractal_dimension_mean False float64\nradius_se False float64\ntexture_se False float64\nperimeter_se False float64\narea_se False float64\nsmoothness_se False float64\ncompactness_se False float64\nconcavity_se False float64\nconcave points_se False float64\nsymmetry_se False float64\nfractal_dimension_se False float64\nradius_worst False float64\ntexture_worst False float64\nperimeter_worst False float64\narea_worst False float64\nsmoothness_worst False float64\ncompactness_worst False float64\nconcavity_worst False float64\nconcave points_worst False float64\nsymmetry_worst False float64\nfractal_dimension_worst False float64\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Approach - I\n\n## 1. Standard Scalar \n\nStandard Scalar is a technique in where the data will be transformed such that the mean value will be zero and the standard deviation will be 1.\n\nStandard scalar is an efficient method for scaling the data and we can use it in almost every case. However it wont have much effect if it comes to decision trees or regressors. But still keep this as the default scaling method. If you feel that this is not scaling the data then we could always switch to other scaling methods.\n\nBelow is the formula for **Standard Scalar**:\n\n![standard+scalar.PNG](attachment:standard+scalar.PNG)\n\nTo calculate the **mean** we will use the below formula:\n\n![mean.PNG](attachment:mean.PNG)\n\nTo calculate the **standard deviation** we will use the below formula:\n\n![standard+deviation.PNG](attachment:standard+deviation.PNG)\n\nlet's take an example to understand the Standard scalar method\n\nlet's say we have 10 random values x : 44, 50, 38, 96, 42, 47, 40, 39, 46, 50\n\n1. The **Mean** of x will be 44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50/ 10 = 492/10 = 49.2\n\n\n2. 
To calculate the standard deviation, lets create the below table\n\n| x | (x - 49.2) | (x - 49.2)^2 |\n|---|----------|----------------|\n|44 | -5.2 | 27.04 |\n|50 |0.8 |0.64 |\n|38 |-11.2 |125.44 |\n|96 |46.8 |2190.24 |\n|42 |-7.2 |51.84 |\n|47 |-2.2 |4.84 |\n|40 |-9.2 |86.64 |\n|39 |-10.2 |104.04 |\n|46 |-3.2 |10.24 |\n|50 |0.8 |0.64 |\n|Total|- |2600.4 |\n\nNow as per the formula of standard deviation: \\begin{equation*}\\sqrt{{1/N}\\sum_{i=1}^n (x - 49.2)^2}\\end{equation*}\n\n\\begin{equation*}=\\sqrt{2600.4/10}\\end{equation*}\n\\begin{equation*}=\\sqrt{260.04}\\end{equation*}\n\\begin{equation*}=16.13\\end{equation*}\n\nNow, the standard scalar for all the points are:\n\n\\begin{equation*}z =x - \\mu / \\sigma \\end{equation*}\n\\begin{equation*}z =x - \\mu / \\sigma \\end{equation*}\n\\begin{equation*} =-5.2/16.13 = -0.322 \\end{equation*}\n\\begin{equation*} =0.8/16.13 = 0.05 \\end{equation*}\n\\begin{equation*} =-11.2/16.13 = 0.694 \\end{equation*}\n\\begin{equation*} =46.8/16.13 = 2.90 \\end{equation*}\n\\begin{equation*} =-7.2/16.13 = -0.45 \\end{equation*}\n\\begin{equation*} =-2.2/16.13 = -0.136 \\end{equation*}\n\\begin{equation*} =-9.2/16.13 = -0.57 \\end{equation*}\n\\begin{equation*} =-10.2/16.13 = -0.632 \\end{equation*}\n\\begin{equation*} =-3.2/16.13 = -0.2 \\end{equation*}\n\\begin{equation*} =0.8/16.13 = 0.05\\end{equation*}\n\nIf you add all those above standard scalar points and calculate the mean then you will get a value which is approximately 0 and standard deviation approximately close to 1.",
"attachments": {
"standard+scalar.PNG": {
"image/png": ""
},
"mean.PNG": {
"image/png": ""
},
"standard+deviation.PNG": {
"image/png": ""
}
}
},
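{
"metadata": {},
"cell_type": "markdown",
"source": "The hand calculation above can be cross-checked with NumPy. The cell below is a minimal sketch added for illustration and is not part of the original analysis; the variable names `x`, `mu`, `sigma` and `z` are chosen here, and the standard deviation is the population form (dividing by N), matching the formula above."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\n\n# The ten example values from the worked calculation above\nx = np.array([44, 50, 38, 96, 42, 47, 40, 39, 46, 50], dtype=float)\n\nmu = x.mean()  # arithmetic mean\nsigma = np.sqrt(((x - mu) ** 2).mean())  # population standard deviation (divide by N)\nz = (x - mu) / sigma  # standardized values\n\nprint('mean:', round(mu, 2))\nprint('std :', round(sigma, 2))\nprint('z   :', np.round(z, 2))\nprint('mean of z:', round(z.mean(), 6), '| std of z:', round(z.std(), 6))",
"execution_count": null,
"outputs": []
},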
{
"metadata": {},
"cell_type": "markdown",
"source": "#### Example: Let us implement the StandardScalar in a smaller set of array and see how it works."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.preprocessing import StandardScaler\nimport numpy as np\n\n### 4 samples/observations and 2 variables/features\ndata = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])\nscaler = StandardScaler()\nscaled_data = scaler.fit_transform(data)\n",
"execution_count": 129,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\tkhan050\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\sklearn\\utils\\validation.py:475: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.\n warnings.warn(msg, DataConversionWarning)\n"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "print(\"Data: \", data)\nprint(\"Standard Scalar Data: \",scaled_data)",
"execution_count": 130,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Data: [[0 0]\n [1 0]\n [0 1]\n [1 1]]\nStandard Scalar Data: [[-1. -1.]\n [ 1. -1.]\n [-1. 1.]\n [ 1. 1.]]\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### If you calculate the Standard Scalar using the math formula you will get the same result"
},
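{
"metadata": {},
"cell_type": "markdown",
"source": "As a quick check by hand (a sketch added for illustration, using the population standard deviation from the formula above): each column of the small array holds two 0s and two 1s, so for either column\n\n\\begin{equation*}\\mu = (0 + 0 + 1 + 1)/4 = 0.5\\end{equation*}\n\\begin{equation*}\\sigma = \\sqrt{\\frac{1}{4}\\left(4 \\times 0.5^2\\right)} = \\sqrt{0.25} = 0.5\\end{equation*}\n\\begin{equation*}z = (0 - 0.5)/0.5 = -1, \\qquad z = (1 - 0.5)/0.5 = 1\\end{equation*}\n\nEvery 0 maps to -1 and every 1 maps to 1, which agrees with the array printed above."
},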
{
"metadata": {},
"cell_type": "markdown",
"source": "### Now lets implement Standard Scalar to out data"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\ndataset = '/Data.csv'",
"execution_count": 150,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(dataset)",
"execution_count": 151,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = df.loc[:, ~df.columns.str.contains('^Unnamed')]",
"execution_count": 152,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 153,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>diagnosis</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>...</th>\n <th>radius_worst</th>\n <th>texture_worst</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>842302</td>\n <td>M</td>\n <td>17.99</td>\n <td>10.38</td>\n <td>122.80</td>\n <td>1001.0</td>\n <td>0.11840</td>\n <td>0.27760</td>\n <td>0.3001</td>\n <td>0.14710</td>\n <td>...</td>\n <td>25.38</td>\n <td>17.33</td>\n <td>184.60</td>\n <td>2019.0</td>\n <td>0.1622</td>\n <td>0.6656</td>\n <td>0.7119</td>\n <td>0.2654</td>\n <td>0.4601</td>\n <td>0.11890</td>\n </tr>\n <tr>\n <th>1</th>\n <td>842517</td>\n <td>M</td>\n <td>20.57</td>\n <td>17.77</td>\n <td>132.90</td>\n <td>1326.0</td>\n <td>0.08474</td>\n <td>0.07864</td>\n <td>0.0869</td>\n <td>0.07017</td>\n <td>...</td>\n <td>24.99</td>\n <td>23.41</td>\n <td>158.80</td>\n <td>1956.0</td>\n <td>0.1238</td>\n <td>0.1866</td>\n <td>0.2416</td>\n <td>0.1860</td>\n <td>0.2750</td>\n <td>0.08902</td>\n </tr>\n <tr>\n <th>2</th>\n <td>84300903</td>\n <td>M</td>\n <td>19.69</td>\n <td>21.25</td>\n <td>130.00</td>\n <td>1203.0</td>\n <td>0.10960</td>\n <td>0.15990</td>\n <td>0.1974</td>\n <td>0.12790</td>\n <td>...</td>\n <td>23.57</td>\n <td>25.53</td>\n <td>152.50</td>\n <td>1709.0</td>\n <td>0.1444</td>\n <td>0.4245</td>\n 
<td>0.4504</td>\n <td>0.2430</td>\n <td>0.3613</td>\n <td>0.08758</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84348301</td>\n <td>M</td>\n <td>11.42</td>\n <td>20.38</td>\n <td>77.58</td>\n <td>386.1</td>\n <td>0.14250</td>\n <td>0.28390</td>\n <td>0.2414</td>\n <td>0.10520</td>\n <td>...</td>\n <td>14.91</td>\n <td>26.50</td>\n <td>98.87</td>\n <td>567.7</td>\n <td>0.2098</td>\n <td>0.8663</td>\n <td>0.6869</td>\n <td>0.2575</td>\n <td>0.6638</td>\n <td>0.17300</td>\n </tr>\n <tr>\n <th>4</th>\n <td>84358402</td>\n <td>M</td>\n <td>20.29</td>\n <td>14.34</td>\n <td>135.10</td>\n <td>1297.0</td>\n <td>0.10030</td>\n <td>0.13280</td>\n <td>0.1980</td>\n <td>0.10430</td>\n <td>...</td>\n <td>22.54</td>\n <td>16.67</td>\n <td>152.20</td>\n <td>1575.0</td>\n <td>0.1374</td>\n <td>0.2050</td>\n <td>0.4000</td>\n <td>0.1625</td>\n <td>0.2364</td>\n <td>0.07678</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 32 columns</p>\n</div>",
"text/plain": " id diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n0 842302 M 17.99 10.38 122.80 1001.0 \n1 842517 M 20.57 17.77 132.90 1326.0 \n2 84300903 M 19.69 21.25 130.00 1203.0 \n3 84348301 M 11.42 20.38 77.58 386.1 \n4 84358402 M 20.29 14.34 135.10 1297.0 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.11840 0.27760 0.3001 0.14710 \n1 0.08474 0.07864 0.0869 0.07017 \n2 0.10960 0.15990 0.1974 0.12790 \n3 0.14250 0.28390 0.2414 0.10520 \n4 0.10030 0.13280 0.1980 0.10430 \n\n ... radius_worst texture_worst perimeter_worst \\\n0 ... 25.38 17.33 184.60 \n1 ... 24.99 23.41 158.80 \n2 ... 23.57 25.53 152.50 \n3 ... 14.91 26.50 98.87 \n4 ... 22.54 16.67 152.20 \n\n area_worst smoothness_worst compactness_worst concavity_worst \\\n0 2019.0 0.1622 0.6656 0.7119 \n1 1956.0 0.1238 0.1866 0.2416 \n2 1709.0 0.1444 0.4245 0.4504 \n3 567.7 0.2098 0.8663 0.6869 \n4 1575.0 0.1374 0.2050 0.4000 \n\n concave points_worst symmetry_worst fractal_dimension_worst \n0 0.2654 0.4601 0.11890 \n1 0.1860 0.2750 0.08902 \n2 0.2430 0.3613 0.08758 \n3 0.2575 0.6638 0.17300 \n4 0.1625 0.2364 0.07678 \n\n[5 rows x 32 columns]"
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### 1.Check whether there is any missing data. If there is, drop the missing data:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "missing = df.isna().any()",
"execution_count": 154,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "missing",
"execution_count": 155,
"outputs": [
{
"data": {
"text/plain": "id False\ndiagnosis False\nradius_mean False\ntexture_mean False\nperimeter_mean False\narea_mean False\nsmoothness_mean False\ncompactness_mean False\nconcavity_mean False\nconcave points_mean False\nsymmetry_mean False\nfractal_dimension_mean False\nradius_se False\ntexture_se False\nperimeter_se False\narea_se False\nsmoothness_se False\ncompactness_se False\nconcavity_se False\nconcave points_se False\nsymmetry_se False\nfractal_dimension_se False\nradius_worst False\ntexture_worst False\nperimeter_worst False\narea_worst False\nsmoothness_worst False\ncompactness_worst False\nconcavity_worst False\nconcave points_worst False\nsymmetry_worst False\nfractal_dimension_worst False\ndtype: bool"
},
"execution_count": 155,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### 2. Check the datatypes"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.dtypes",
"execution_count": 156,
"outputs": [
{
"data": {
"text/plain": "id int64\ndiagnosis object\nradius_mean float64\ntexture_mean float64\nperimeter_mean float64\narea_mean float64\nsmoothness_mean float64\ncompactness_mean float64\nconcavity_mean float64\nconcave points_mean float64\nsymmetry_mean float64\nfractal_dimension_mean float64\nradius_se float64\ntexture_se float64\nperimeter_se float64\narea_se float64\nsmoothness_se float64\ncompactness_se float64\nconcavity_se float64\nconcave points_se float64\nsymmetry_se float64\nfractal_dimension_se float64\nradius_worst float64\ntexture_worst float64\nperimeter_worst float64\narea_worst float64\nsmoothness_worst float64\ncompactness_worst float64\nconcavity_worst float64\nconcave points_worst float64\nsymmetry_worst float64\nfractal_dimension_worst float64\ndtype: object"
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Concatenate both the above results"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "concatenate = pd.concat([missing,dtypes],axis=1,keys=['Null','datatype'])\nprint(concatenate)",
"execution_count": 157,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": " Null datatype\nid False int64\ndiagnosis False object\nradius_mean False float64\ntexture_mean False float64\nperimeter_mean False float64\narea_mean False float64\nsmoothness_mean False float64\ncompactness_mean False float64\nconcavity_mean False float64\nconcave points_mean False float64\nsymmetry_mean False float64\nfractal_dimension_mean False float64\nradius_se False float64\ntexture_se False float64\nperimeter_se False float64\narea_se False float64\nsmoothness_se False float64\ncompactness_se False float64\nconcavity_se False float64\nconcave points_se False float64\nsymmetry_se False float64\nfractal_dimension_se False float64\nradius_worst False float64\ntexture_worst False float64\nperimeter_worst False float64\narea_worst False float64\nsmoothness_worst False float64\ncompactness_worst False float64\nconcavity_worst False float64\nconcave points_worst False float64\nsymmetry_worst False float64\nfractal_dimension_worst False float64\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As you could see we have one column \"**diagnosis**\" which is a categorical column. We need to convert it into numerical column before applying StandardScalar() or else it won't fit. We will use the one hot encoding method for that"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies = pd.get_dummies(df, drop_first=False)",
"execution_count": 163,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies.head()",
"execution_count": 164,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>842302</td>\n <td>17.99</td>\n <td>10.38</td>\n <td>122.80</td>\n <td>1001.0</td>\n <td>0.11840</td>\n <td>0.27760</td>\n <td>0.3001</td>\n <td>0.14710</td>\n <td>0.2419</td>\n <td>...</td>\n <td>184.60</td>\n <td>2019.0</td>\n <td>0.1622</td>\n <td>0.6656</td>\n <td>0.7119</td>\n <td>0.2654</td>\n <td>0.4601</td>\n <td>0.11890</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>842517</td>\n <td>20.57</td>\n <td>17.77</td>\n <td>132.90</td>\n <td>1326.0</td>\n <td>0.08474</td>\n <td>0.07864</td>\n <td>0.0869</td>\n <td>0.07017</td>\n <td>0.1812</td>\n <td>...</td>\n <td>158.80</td>\n <td>1956.0</td>\n <td>0.1238</td>\n <td>0.1866</td>\n <td>0.2416</td>\n <td>0.1860</td>\n <td>0.2750</td>\n <td>0.08902</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>84300903</td>\n <td>19.69</td>\n <td>21.25</td>\n <td>130.00</td>\n <td>1203.0</td>\n <td>0.10960</td>\n <td>0.15990</td>\n <td>0.1974</td>\n <td>0.12790</td>\n <td>0.2069</td>\n <td>...</td>\n <td>152.50</td>\n <td>1709.0</td>\n <td>0.1444</td>\n <td>0.4245</td>\n <td>0.4504</td>\n 
<td>0.2430</td>\n <td>0.3613</td>\n <td>0.08758</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84348301</td>\n <td>11.42</td>\n <td>20.38</td>\n <td>77.58</td>\n <td>386.1</td>\n <td>0.14250</td>\n <td>0.28390</td>\n <td>0.2414</td>\n <td>0.10520</td>\n <td>0.2597</td>\n <td>...</td>\n <td>98.87</td>\n <td>567.7</td>\n <td>0.2098</td>\n <td>0.8663</td>\n <td>0.6869</td>\n <td>0.2575</td>\n <td>0.6638</td>\n <td>0.17300</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>84358402</td>\n <td>20.29</td>\n <td>14.34</td>\n <td>135.10</td>\n <td>1297.0</td>\n <td>0.10030</td>\n <td>0.13280</td>\n <td>0.1980</td>\n <td>0.10430</td>\n <td>0.1809</td>\n <td>...</td>\n <td>152.20</td>\n <td>1575.0</td>\n <td>0.1374</td>\n <td>0.2050</td>\n <td>0.4000</td>\n <td>0.1625</td>\n <td>0.2364</td>\n <td>0.07678</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>",
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 842302 17.99 10.38 122.80 1001.0 \n1 842517 20.57 17.77 132.90 1326.0 \n2 84300903 19.69 21.25 130.00 1203.0 \n3 84348301 11.42 20.38 77.58 386.1 \n4 84358402 20.29 14.34 135.10 1297.0 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.11840 0.27760 0.3001 0.14710 \n1 0.08474 0.07864 0.0869 0.07017 \n2 0.10960 0.15990 0.1974 0.12790 \n3 0.14250 0.28390 0.2414 0.10520 \n4 0.10030 0.13280 0.1980 0.10430 \n\n symmetry_mean ... perimeter_worst area_worst smoothness_worst \\\n0 0.2419 ... 184.60 2019.0 0.1622 \n1 0.1812 ... 158.80 1956.0 0.1238 \n2 0.2069 ... 152.50 1709.0 0.1444 \n3 0.2597 ... 98.87 567.7 0.2098 \n4 0.1809 ... 152.20 1575.0 0.1374 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 0.6656 0.7119 0.2654 0.4601 \n1 0.1866 0.2416 0.1860 0.2750 \n2 0.4245 0.4504 0.2430 0.3613 \n3 0.8663 0.6869 0.2575 0.6638 \n4 0.2050 0.4000 0.1625 0.2364 \n\n fractal_dimension_worst diagnosis_B diagnosis_M \n0 0.11890 0 1 \n1 0.08902 0 1 \n2 0.08758 0 1 \n3 0.17300 0 1 \n4 0.07678 0 1 \n\n[5 rows x 33 columns]"
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies.columns",
"execution_count": 165,
"outputs": [
{
"data": {
"text/plain": "Index(['id', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',\n 'smoothness_mean', 'compactness_mean', 'concavity_mean',\n 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',\n 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',\n 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',\n 'fractal_dimension_se', 'radius_worst', 'texture_worst',\n 'perimeter_worst', 'area_worst', 'smoothness_worst',\n 'compactness_worst', 'concavity_worst', 'concave points_worst',\n 'symmetry_worst', 'fractal_dimension_worst', 'diagnosis_B',\n 'diagnosis_M'],\n dtype='object')"
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### 2. Now perform standard scaling and print the first five rows of the new dataset. To do so, use the StandardScaler() class from sklearn.preprocessing and implement the fit_transorm() method: "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn import preprocessing\nstandarsScalar = preprocessing.StandardScaler().fit_transform(df_dummies)",
"execution_count": 170,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "standarsScalar",
"execution_count": 171,
"outputs": [
{
"data": {
"text/plain": "array([[-0.23640517, 1.09706398, -2.07333501, ..., 1.93701461,\n -1.29767572, 1.29767572],\n [-0.23640344, 1.82982061, -0.35363241, ..., 0.28118999,\n -1.29767572, 1.29767572],\n [ 0.43174109, 1.57988811, 0.45618695, ..., 0.20139121,\n -1.29767572, 1.29767572],\n ...,\n [-0.23572747, 0.70228425, 2.0455738 , ..., -0.31840916,\n -1.29767572, 1.29767572],\n [-0.23572517, 1.83834103, 2.33645719, ..., 2.21963528,\n -1.29767572, 1.29767572],\n [-0.24240586, -1.80840125, 1.22179204, ..., -0.75120669,\n 0.77060855, -0.77060855]])"
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "scale_frame = pd.DataFrame(standarsScalar,columns=df_dummies.columns)",
"execution_count": 173,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "scale_frame.head()",
"execution_count": 174,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>-0.236405</td>\n <td>1.097064</td>\n <td>-2.073335</td>\n <td>1.269934</td>\n <td>0.984375</td>\n <td>1.568466</td>\n <td>3.283515</td>\n <td>2.652874</td>\n <td>2.532475</td>\n <td>2.217515</td>\n <td>...</td>\n <td>2.303601</td>\n <td>2.001237</td>\n <td>1.307686</td>\n <td>2.616665</td>\n <td>2.109526</td>\n <td>2.296076</td>\n <td>2.750622</td>\n <td>1.937015</td>\n <td>-1.297676</td>\n <td>1.297676</td>\n </tr>\n <tr>\n <th>1</th>\n <td>-0.236403</td>\n <td>1.829821</td>\n <td>-0.353632</td>\n <td>1.685955</td>\n <td>1.908708</td>\n <td>-0.826962</td>\n <td>-0.487072</td>\n <td>-0.023846</td>\n <td>0.548144</td>\n <td>0.001392</td>\n <td>...</td>\n <td>1.535126</td>\n <td>1.890489</td>\n <td>-0.375612</td>\n <td>-0.430444</td>\n <td>-0.146749</td>\n <td>1.087084</td>\n <td>-0.243890</td>\n <td>0.281190</td>\n <td>-1.297676</td>\n <td>1.297676</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0.431741</td>\n <td>1.579888</td>\n <td>0.456187</td>\n <td>1.566503</td>\n <td>1.558884</td>\n <td>0.942210</td>\n <td>1.052926</td>\n <td>1.363478</td>\n <td>2.037231</td>\n 
<td>0.939685</td>\n <td>...</td>\n <td>1.347475</td>\n <td>1.456285</td>\n <td>0.527407</td>\n <td>1.082932</td>\n <td>0.854974</td>\n <td>1.955000</td>\n <td>1.152255</td>\n <td>0.201391</td>\n <td>-1.297676</td>\n <td>1.297676</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0.432121</td>\n <td>-0.768909</td>\n <td>0.253732</td>\n <td>-0.592687</td>\n <td>-0.764464</td>\n <td>3.283553</td>\n <td>3.402909</td>\n <td>1.915897</td>\n <td>1.451707</td>\n <td>2.867383</td>\n <td>...</td>\n <td>-0.249939</td>\n <td>-0.550021</td>\n <td>3.394275</td>\n <td>3.893397</td>\n <td>1.989588</td>\n <td>2.175786</td>\n <td>6.046041</td>\n <td>4.935010</td>\n <td>-1.297676</td>\n <td>1.297676</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0.432201</td>\n <td>1.750297</td>\n <td>-1.151816</td>\n <td>1.776573</td>\n <td>1.826229</td>\n <td>0.280372</td>\n <td>0.539340</td>\n <td>1.371011</td>\n <td>1.428493</td>\n <td>-0.009560</td>\n <td>...</td>\n <td>1.338539</td>\n <td>1.220724</td>\n <td>0.220556</td>\n <td>-0.313395</td>\n <td>0.613179</td>\n <td>0.729259</td>\n <td>-0.868353</td>\n <td>-0.397100</td>\n <td>-1.297676</td>\n <td>1.297676</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>",
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 -0.236405 1.097064 -2.073335 1.269934 0.984375 \n1 -0.236403 1.829821 -0.353632 1.685955 1.908708 \n2 0.431741 1.579888 0.456187 1.566503 1.558884 \n3 0.432121 -0.768909 0.253732 -0.592687 -0.764464 \n4 0.432201 1.750297 -1.151816 1.776573 1.826229 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 1.568466 3.283515 2.652874 2.532475 \n1 -0.826962 -0.487072 -0.023846 0.548144 \n2 0.942210 1.052926 1.363478 2.037231 \n3 3.283553 3.402909 1.915897 1.451707 \n4 0.280372 0.539340 1.371011 1.428493 \n\n symmetry_mean ... perimeter_worst area_worst smoothness_worst \\\n0 2.217515 ... 2.303601 2.001237 1.307686 \n1 0.001392 ... 1.535126 1.890489 -0.375612 \n2 0.939685 ... 1.347475 1.456285 0.527407 \n3 2.867383 ... -0.249939 -0.550021 3.394275 \n4 -0.009560 ... 1.338539 1.220724 0.220556 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 2.616665 2.109526 2.296076 2.750622 \n1 -0.430444 -0.146749 1.087084 -0.243890 \n2 1.082932 0.854974 1.955000 1.152255 \n3 3.893397 1.989588 2.175786 6.046041 \n4 -0.313395 0.613179 0.729259 -0.868353 \n\n fractal_dimension_worst diagnosis_B diagnosis_M \n0 1.937015 -1.297676 1.297676 \n1 0.281190 -1.297676 1.297676 \n2 0.201391 -1.297676 1.297676 \n3 4.935010 -1.297676 1.297676 \n4 -0.397100 -1.297676 1.297676 \n\n[5 rows x 33 columns]"
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As you could see above we have scaled the data into a uniform range of same scale using the **StandardScalar** scaling method. Now it will become easy to ingest the data to the model and get accurate results."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Approach - II\n\n## 1. MinMax Scalar \n\nMinMax Scalar is another method of scaling the data. The formula for the MinMax scaler is :\n \n![MinMax.PNG](attachment:MinMax.PNG)\n\nLet's take an example to calculate MinMax Scalar\n\nWe have a 5 numbers: -1000.1, -200.2, 500.5, 600.6, 9000.9\n\nTo calculate the MinMax scalar:\n\n\\begin{equation*} z = (-1000.1 -min(x))/(max(x) -min(x))\\end{equation*}\n\n\\begin{equation*} =(-1000.1+1000.1)/(9000.9+1000.1) = 0 \\end{equation*}\n\\begin{equation*} =(-200.2+1000.1)/(9000.9+1000.1) = 0.079982 \\end{equation*}\n\\begin{equation*} =(500.5+1000.1)/(9000.9+1000.1) = 0.150045 \\end{equation*}\n\\begin{equation*} =(600.6.1+1000.1)/(9000.9+1000.1) =0.16005399 \\end{equation*}\n\\begin{equation*} =(9000.9+1000.1)/(9000.9+1000.1) = 1 \\end{equation*}\n\nLet's implement it in code and see:\n\n**MinMax Scalar** is a good method for scaling your data but the disadvantage that is there for this method is your scaled data will be bounded between 0, 1 which will have lower standard deviation and will also downsize or diminish the effect of outliers. We won't really be able to understand if some of the data is an outlier or not as the difference will be minimal due to such small boundary.",
"attachments": {
"MinMax.PNG": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAASAAAABWCAYAAABvj0voAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAFiUAABYlAUlSJPAAABK9SURBVHhe7Z3PSxtdF8ffP0aQko1xIZaX0lIoIgGRiIKCDy1YFCkWH7AguBCyCGQhdhGwEHwXwUJFoUSh1EXFhZKFqFCSRTELSSGQgBBByEI47z33x2QymUwmyUx+TM7nYXjqTO5MJnPne8+999xz/gMEQRAdggSIIIiOQQJEEETHIAEiCKJjkAARBNExSIAIgugYJEAEQXQMEiCCcJLSAxSLJflH65TuH6D0JP/wICRABOEQpT97sOAPwPqPrNzTKiW4igRgYDwEpwW5y2OQABGEE/w9YOLzDKbjGbnDIZ6ysL80xEQoAlePcp+HIAEiPMwDnEZmYXhwCJ7/ewRZt7oyT2mIjj+DgZk9uHXjGo9nsMnEzbd0BHm5yyuQABHeJceskkEmDHwbgq0bud9hUl9YN2nwDYQvnRv7MZI/XOT3sPrjQe7xBiRAhHdhlklsknVfUIBGN+DnvdzvJPdHsIznZ9aPUyM/pjxdQ5hZQQOvdyDloUFpEiDC2zyVoOjiTFLqyxsucMvH7lsmqWj7rtUuSIAIollw7Oc1du8WYT8n97nJZQR8aG19PIKi3NXrkAARRLOkd+AVCsLUHtzKXa5SOoNNvN7gGvz0iAKRAPUI+fMdWJ58wVtA38tFCB9noMT+uz0MwfRLHOcYgbEPu5DqqanaAuy/xxeKffepIDzHMY7INUAxDYnICkzw+8L7DcJy5Ahu5b3lz3dh9V0AhvnLyMq+W4NY0uAoowagRwMwIX+38KU8ZvO6WFZ/XSPZb2/F58JJuceCpwdIsWc1Pz7CywzPyO/8VICL6AqMjbLz+F/A9OaJxWxdFvb/we89pLuX3oYEqOsRzmi+yQic3pV4Rb4Ii7GAsckg+MY3IMH253+FYIzt84WSrETvUCo+QPZyh/vQ8Jf5/aJw5kOBxRexeA3RGXHMt7QH+/hbzO3AVUHcZf54TXRLBgOwdaO7cz72U4BUfFEe1wtQo9c1n/6+CIvj89/qDD9LX57hj3uQwu/N/o5zARxi4ijvh32f1P8W4Tk73/TX2udT11w49IZnIglQl1P8wV4wfwhO9a3wZUS8NKwCb56xCv1nl4sP39eTviLKIhH3tHBY+QKWzkLyGNuqHPIyEJ8Sx3xoxRjRTcVXWw12r2s2xlMuu/DdWgz4NL3BR0hMq2P5WUDfxeLJhiaUvs8m9yG5+iytM7N77UFIgLoZOchpbBE103+QCRM2+rkj0ZL7g7BV5YvCLKjPzFJiFkLsj9zVdeiEwM8ExtgF0QTXrOXXlX1/UC2+dgWoznWry17DlrSeqo/p4NP0rKE4r3wuypJR40elm23RiIyuQOKv+IwZmnBtnvWUpVsLEqBu5ol1Ic6vIVvR4j/Az39l5f33xMZsSIb7wvgmt+Giawcu64iIpRA4JEANX/cawjWP6XjMwtV5GvIV4qZmz57Bqy9puc8emgCZfd8ehASo13hKahV/wul1R3oK0sGuhW352O44hYcFyIy/BzAvy/IudAOQABGdRU39Dr6BaGONZxfTXwKE43rivI1Pp5MAER2lPP4TgYsKs74E2Utdd411307DszA2/oLPlP20GFfoPL0oQOWyjVox2vjPPweVyzd4l9vYXavkNh4UZe1M/fcAJEDdTI51g9A/ZFAtQnyAxEdZeY3jP5k9mMYZlTvxZ/YrE59IEop30tzv6lmTXhSg8lic1ZR46WYHJvhgdUBYrJr3dPX4D8664YznhYWeXUVEWauZsl6CBKiL0Vo7tvGZMLXwETe9oEg/E5+aGeGVfAUS7L0oHq/wz6//6uY5k14UoPI6sNoDySU43ZTnVzNhv1UX2iBcj6xLN14vnlBZ9LyyKp4EqItRYwVjoTPIP2aEyEwGxX
TtzK7wK7lPw/46Rs3T+cdoCzCzEEdnOqMfUReBDoHF3BmEpVUw8DoCpzkZ1lTeR/a7EFHclr9ntcWlomwStqQf0MDUNlwYyubPItoLX122uesqND8hC98rIVIvYDmeZuWTXGS4Aykrhw1GEb/LX3YPc+zZLrEume781aip/yD3HfICJEDdDBOQn5sYUAsr3QhMb4qgWvnkDiyo5QL+F7AQOTOvuHfYLRMVvTvtH2WBiCURE3yTSyzQIpEWCC7FEMfYJu87fGmvLG7D4+I4X+7AnQpbua786oiySM18iBSPaYgtiaUg/FnFrpnoyCU0/PuwbXQWVr/iflmmFsqaw5AcclevQwLkYUTrOwR8vBKDpXenCvUwakzuDUR/y10uoiziV96Z/iQB8ixqsPP1NmudS3ARGml4toaoT+k8xK0b90VBiV15osELkAB5lcKRMNdxcSoGTFdjRoSzPGXaM87GZzm7uTvdHCRAnkWuAfOPwPMZXDEvdxOOo9ZxOZ4RQwNn04aYyK25E1a2g5AAEYQD3MZnRffIBQ3i/kE4je/BLjQJEEE4AcZp4kkEjeFCWoTnGxuCBaZsXhzBIwEiCKdAEYq+henotUNikYXEpyCsGuIUeYnmBKiQhNgn4Z/ie7mmi1+C4w6sFTCucSEIgjChcQF6TMKmcqBS2+gK7GNYUNZXHfMvwn5XL3wkCKJbaFiAeG6i8Q2IX6JregFuL48gqjw9BwPgXHbINMQw3a1e6JraAhDzjt8WQXiKhgUon05Cyrj49zEN0bmRqpi6BEEQVrQ+CF1MwtZkENZ/1BCfvyKkhG/dTvjQ7sDckqKNNtrMtlZoTYDuDmD5ZRDCZ7XjoRR/bcDw4Ags95B1ZPYj00YbbeZbKzQtQMXkNkyMLkL8jxe9EwiCaAdNCVD2xwaMmYX5NM3i0Cw0CE0QXqdhAcp+WwTfeMQ0xYtwGV+BxD379589WB4PwNjoEEx/SXvSi5MgiNZoSICyhyLNLQZqmv8UgdjhGaTuHqB4l4ZE9C13TBz7fA2lEvoKLXLfIBGA2yyzJEEQ/Y5tASpdRmAMB5NjrIuVu4bonIzIp9tUSEmMQ/wKg2armDRTFAqCIIhqbApQho/HVIQbeMpCQoYL9b2chc3DdDmk5COzih6ZaMlgTe6FKSAcQT4vR8DIixR6sTPIWNb6uNUt4WS9qEFTg9D2wCh8aCW5E6KAcAaxfGYWor8dEg2Pr97uWh4zEF8agrH1kzqB7e0jej0B2LRws2kV9wSoeAKr2DV7L7plpXuqjt2GqGDOx5lR5yXP+DYh0zINzOw5PtQhxn2dXGJViWsCpPJRLX9n6onZGVz4cYgWuGcNhJ91n10K8SkCdAVg66Z/G57iWYRnvqiMGOE8qS8BF3saKhrjIiRcmEhyTYCuPmP36y3E7x7gIszMuF/eSKTmDTDBHT4fF2cnS9ci59b4DqT6suHRJT3EcVK3Mpmmd3g42FcRp2IQmSDTAfmM2XgdwL0umFwDNjwagHmeC0nuJzqPzM7pw4D1cpcb5Jn5zq3g4/5sfFIxkYAQ84+tn7jxG7QvU8ZVBBss59MPuTgITXQnKr1vG3JZqcR9mEivTxsgnoHVrZkkZv3wrK8fj9xf6K1SSjt8LRKgfsNONk/HUN2Q9iTu6zd4bC72LCtyzLvFE+tS87TQYqWDU5AA9Rkqu+ZAm/JL3caD/HqvvtCCPGeRTr5tyxOPg9F4vWew+sO57iQJkBX6HOHjI+zfYtA2f74Lq+9kLnHML/5uDWJJ2QoV05CIrGi5xLHscuQIbk3N8BJkf+G5gvCcty5swzGzT9uQSFc/ZDWmUrVFcICzPOipcprzXOiY61wU54ilMc9g/puNKfJcEqIf5HfjOejFfZT+HMDmnIiCOTy+AjErH6KkzNFu+B7dgPo9ee54/rzQKnyA1HEElifL+dwnPkQgoaI+sN8k9umtzDOPZd/CaiwJ+QprsjL3PP/9+DMS2LoultVf18jfA5
jnn4vART1LFs99GIJ5XofZdWdkfX0qwEV0RdwLu8/pTWsfouy3t7y8T3cvrUICZAV6luYykAjhNCc+7CAsLAXAN7cDVwVRMfLHa7LCBCB8uAcL/gCsH2eEN2rxGqKYNRMf2tJR1QuY/YpT1ez4f0NwKjvWxd8HsM6jAJg486GX8f0DZC932HXEeQemInAqZ7KK5xHRT2eVdz5yAKe/swav5Czs/yPK8XzxFnBfHn8Qwr+y/F60c+NLw++R7c+dwSZ+V38ILmq8J/oX5Uru6hrw97xjzwh9aPh3XGTPF535ZIPBXtyrqHxG/kWIH+JvMgvRy4J4vrkj7sqAx/kaSH5SAY795H9jfRDH9QLU6HVNp7+VsNdLACF9hIY/7kEK6yz7O87FcQgmJmVdZt819b9FeM7ON/3V4mwuNCYkQHaQlhDfqvI+ZSA+JY+ZON+VzkLymHHKu3KaNnwpdzPw5RdCEoBaKcfVwuCKz/CBQisHQNaP52WGYOtG7jKD+wgZHRRVWSamvPsmlueoe0vUGoZQKaKr7r970FuWaj2jRomJrDzGGxmDQ57qYpqPqemesYnVYPe6ZmM8WlmThk0P9xEy+OCVryt8h4onG7Iuse9h5S5wsy0/51xjQgJkB50AVVcGXSUzq4S6snqRQdBRbYK1kMMzrBXSi9pTUnvZa4+dFCChWtDxbWaRMYFggvCKmTa1jBG7YsAHN1mlrZAxzZJ5JoVJXZ+1pIbWvxIlXO0aq2ic8gs5VPWM9MJr1vKXy5r9pnYFqM51TcoKPzvzYxp8woE1JOeVT0d1wwemmDCxv1VqacxuY+kwKf2BBgaZxV77gTcECZAdLESkopKZmaaWZWthXfk0HlkrKU384VHWv6+XlVOrQNYChIkHru4qa5g2eD24Bj8bmoct34v9+28v1iJi/SycEaDGr3sVqX1M4zELV+dpw/iUGrxuYmLAZv1pBBIgO7gtQPcZuDjchc0PQdYvV4OQcrOqYAyx7kp8drWes1sLFcjYatqnGQFi1hV3sGth+2jdNdHjWQEyo8qSbQASoA7hlgAVryEmc6r5Xi5C+DAJtzkMp2Bd+SrA1efqs35mGjtgAVXTQqvZlAC1l34SoOYtWQYJUIdwQ4AeWQWTg7hVA5B1Kp8GznCwa/uWNuTMGfu35Xoddd4Gx2N0raZx9qx0d13VXatAG0x927UZc3tRgLSyDfpzaZascfaMx3M3dtcMZPZggn8nG1P/NiEBsoMLApT6IrxYzcXAUPkKJ7A+tQE/K8a/ZR5+P2vJMAa3GkgctJoFK1syta0R7P4IfxElZiqyQVWr+ZSB+EydqVtdq1lzpqzD9KIAaZaMWZ2TlG52+CSHNlOqIpSyckZLFmdrfVbuFIiqy/5tmgVrK+0WINbSTMsyvPLxl7iykprF8hFhGViZmvn51TowC29WrZVjG58JUwsecaucfuWuAPW6fWq9Eq4Hk7u6jV4UoPq/a9lzGRslPhOm1nOxrWI2V1rj9SKXaqLn4Kp4EiArZIjL26+qorAH9zWjhb3kCw1zSdhSfkBT23CRY/vQ+U+WzX5X1gPGRsJ8+jJkZk5EC8D9vqU9SOH6GlYmzx0RAzA9JwVqZheuTmTrhOf7fQb7EZEAACvW6nG5IhVvdrWu0sB4CBKZQvl6EuW3UtPfQwWSY+VPcyW4xSwo/iBM8C7eLMT+sM9wz9oNJoD1A1Xlv8vfrk1LPxqCO3YW4DSsGoM3PMmm9nzZ/4t3cu0cbh+PIIv7UHBl2YvP0g+INSRbSUPZ3JkISaIviz9CK9dVaF3b2palaORewHI8zc6d5CIzNilW6KMvF0aoKP1l9XduyGQYoBo19T/hoD8FCZAV0npRSxv4JpdYhC9V6ybc7cVxuTwDLSHLsvL8hmUb3B3+0y4TMXbsMS0HqIfg+VwI9v+wLhcOPOLSAN25ylaJrsUcDZhfD1GtoIXpnv0R4oG0sOzwDBMyDPVQSEJUDpij8D1fisCpjRAQIo
yDs+uHnEJZIHxJhHxGYokFWiTy91S/N25qhpJZJLbK4qaehVwGgZZHK9ctoyxTMx8iiVaHxPkWeFgc1qgclp/vwOgsrH61Ey5H1fc3NZ1jm4EEqN/QxgGcXdVsTjuv1X+osblXTipCLZRl7HBoFRKgPkStQXM9UFibAp/1LZh/DweZ2xBvSYmd5YRDE5AA9SPKg9rVON0yljBlRXEVEXvb+cQCFcjZzrp+Zk1AAtSn4OAwji+5VnHlTN4YxQFyl9I1bOEEgYuNCZ+iZ3XFjXE8EqC+RWU7cL5V01pMVy0sQkOKvSsJQKW17Fb2FBKgfkbGirEzBWsfnK0LiGn8LnU89CLFJPqFOZy/S6sfrCFxupGSkAD1O1jJPgVh9diZwcXS5Q5Mv9uBC6c81QjbFJM7MI8BxhzSoOzxGkx8crJxqoYEiCCIjkECRBBExyABIgiiY5AAEQTRMUiACILoGCRABEF0DBIggiA6BgkQQRAdgwSIIIiOQQJEEETHIAEiCKJjkAARBNEhAP4PLeSnRWmom00AAAAASUVORK5CYII="
}
}
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nfrom sklearn.preprocessing import MinMaxScaler\n\n# Create feature\nx = np.array([[-1000.1],\n [-200.2],\n [500.5],\n [600.6],\n [9000.9]])\n\n# Create scaler\nscaler = MinMaxScaler()\n\n# Transform the feature\nstandardized = scaler.fit_transform(x)",
"execution_count": 183,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "standardized",
"execution_count": 184,
"outputs": [
{
"data": {
"text/plain": "array([[0. ],\n [0.079982 ],\n [0.150045 ],\n [0.16005399],\n [1. ]])"
},
"execution_count": 184,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### 1. Let's implement MinMax scalar to our existing dataset"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn import preprocessing\nimport pandas as pd\ndataset = '/Data.csv'\n\ndf = pd.read_csv(dataset)\ndf = df.loc[:, ~df.columns.str.contains('^Unnamed')]\n\ndf_dummies = pd.get_dummies(df, drop_first=False)",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies.head()",
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 842302 17.99 10.38 122.80 1001.0 \n1 842517 20.57 17.77 132.90 1326.0 \n2 84300903 19.69 21.25 130.00 1203.0 \n3 84348301 11.42 20.38 77.58 386.1 \n4 84358402 20.29 14.34 135.10 1297.0 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.11840 0.27760 0.3001 0.14710 \n1 0.08474 0.07864 0.0869 0.07017 \n2 0.10960 0.15990 0.1974 0.12790 \n3 0.14250 0.28390 0.2414 0.10520 \n4 0.10030 0.13280 0.1980 0.10430 \n\n symmetry_mean ... perimeter_worst area_worst smoothness_worst \\\n0 0.2419 ... 184.60 2019.0 0.1622 \n1 0.1812 ... 158.80 1956.0 0.1238 \n2 0.2069 ... 152.50 1709.0 0.1444 \n3 0.2597 ... 98.87 567.7 0.2098 \n4 0.1809 ... 152.20 1575.0 0.1374 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 0.6656 0.7119 0.2654 0.4601 \n1 0.1866 0.2416 0.1860 0.2750 \n2 0.4245 0.4504 0.2430 0.3613 \n3 0.8663 0.6869 0.2575 0.6638 \n4 0.2050 0.4000 0.1625 0.2364 \n\n fractal_dimension_worst diagnosis_B diagnosis_M \n0 0.11890 0 1 \n1 0.08902 0 1 \n2 0.08758 0 1 \n3 0.17300 0 1 \n4 0.07678 0 1 \n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>842302</td>\n <td>17.99</td>\n <td>10.38</td>\n <td>122.80</td>\n <td>1001.0</td>\n <td>0.11840</td>\n <td>0.27760</td>\n <td>0.3001</td>\n <td>0.14710</td>\n <td>0.2419</td>\n <td>...</td>\n <td>184.60</td>\n <td>2019.0</td>\n <td>0.1622</td>\n <td>0.6656</td>\n <td>0.7119</td>\n <td>0.2654</td>\n <td>0.4601</td>\n <td>0.11890</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>842517</td>\n <td>20.57</td>\n <td>17.77</td>\n <td>132.90</td>\n <td>1326.0</td>\n <td>0.08474</td>\n <td>0.07864</td>\n <td>0.0869</td>\n <td>0.07017</td>\n <td>0.1812</td>\n <td>...</td>\n <td>158.80</td>\n <td>1956.0</td>\n <td>0.1238</td>\n <td>0.1866</td>\n <td>0.2416</td>\n <td>0.1860</td>\n <td>0.2750</td>\n <td>0.08902</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>84300903</td>\n <td>19.69</td>\n <td>21.25</td>\n <td>130.00</td>\n <td>1203.0</td>\n <td>0.10960</td>\n <td>0.15990</td>\n <td>0.1974</td>\n <td>0.12790</td>\n <td>0.2069</td>\n <td>...</td>\n <td>152.50</td>\n <td>1709.0</td>\n <td>0.1444</td>\n <td>0.4245</td>\n <td>0.4504</td>\n 
<td>0.2430</td>\n <td>0.3613</td>\n <td>0.08758</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84348301</td>\n <td>11.42</td>\n <td>20.38</td>\n <td>77.58</td>\n <td>386.1</td>\n <td>0.14250</td>\n <td>0.28390</td>\n <td>0.2414</td>\n <td>0.10520</td>\n <td>0.2597</td>\n <td>...</td>\n <td>98.87</td>\n <td>567.7</td>\n <td>0.2098</td>\n <td>0.8663</td>\n <td>0.6869</td>\n <td>0.2575</td>\n <td>0.6638</td>\n <td>0.17300</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>84358402</td>\n <td>20.29</td>\n <td>14.34</td>\n <td>135.10</td>\n <td>1297.0</td>\n <td>0.10030</td>\n <td>0.13280</td>\n <td>0.1980</td>\n <td>0.10430</td>\n <td>0.1809</td>\n <td>...</td>\n <td>152.20</td>\n <td>1575.0</td>\n <td>0.1374</td>\n <td>0.2050</td>\n <td>0.4000</td>\n <td>0.1625</td>\n <td>0.2364</td>\n <td>0.07678</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "minmax_scaled_data = preprocessing.MinMaxScaler().fit_transform(df_dummies)\nminmax_scaled_data_frame = pd.DataFrame(minmax_scaled_data,columns=df_dummies.columns)",
"execution_count": 190,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "minmax_scaled_data_frame.head()",
"execution_count": 191,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0.000915</td>\n <td>0.521037</td>\n <td>0.022658</td>\n <td>0.545989</td>\n <td>0.363733</td>\n <td>0.593753</td>\n <td>0.792037</td>\n <td>0.703140</td>\n <td>0.731113</td>\n <td>0.686364</td>\n <td>...</td>\n <td>0.668310</td>\n <td>0.450698</td>\n <td>0.601136</td>\n <td>0.619292</td>\n <td>0.568610</td>\n <td>0.912027</td>\n <td>0.598462</td>\n <td>0.418864</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0.000915</td>\n <td>0.643144</td>\n <td>0.272574</td>\n <td>0.615783</td>\n <td>0.501591</td>\n <td>0.289880</td>\n <td>0.181768</td>\n <td>0.203608</td>\n <td>0.348757</td>\n <td>0.379798</td>\n <td>...</td>\n <td>0.539818</td>\n <td>0.435214</td>\n <td>0.347553</td>\n <td>0.154563</td>\n <td>0.192971</td>\n <td>0.639175</td>\n <td>0.233590</td>\n <td>0.222878</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0.092495</td>\n <td>0.601496</td>\n <td>0.390260</td>\n <td>0.595743</td>\n <td>0.449417</td>\n <td>0.514309</td>\n <td>0.431017</td>\n <td>0.462512</td>\n <td>0.635686</td>\n <td>0.509596</td>\n <td>...</td>\n 
<td>0.508442</td>\n <td>0.374508</td>\n <td>0.483590</td>\n <td>0.385375</td>\n <td>0.359744</td>\n <td>0.835052</td>\n <td>0.403706</td>\n <td>0.213433</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0.092547</td>\n <td>0.210090</td>\n <td>0.360839</td>\n <td>0.233501</td>\n <td>0.102906</td>\n <td>0.811321</td>\n <td>0.811361</td>\n <td>0.565604</td>\n <td>0.522863</td>\n <td>0.776263</td>\n <td>...</td>\n <td>0.241347</td>\n <td>0.094008</td>\n <td>0.915472</td>\n <td>0.814012</td>\n <td>0.548642</td>\n <td>0.884880</td>\n <td>1.000000</td>\n <td>0.773711</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0.092559</td>\n <td>0.629893</td>\n <td>0.156578</td>\n <td>0.630986</td>\n <td>0.489290</td>\n <td>0.430351</td>\n <td>0.347893</td>\n <td>0.463918</td>\n <td>0.518390</td>\n <td>0.378283</td>\n <td>...</td>\n <td>0.506948</td>\n <td>0.341575</td>\n <td>0.437364</td>\n <td>0.172415</td>\n <td>0.319489</td>\n <td>0.558419</td>\n <td>0.157500</td>\n <td>0.142595</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>",
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 0.000915 0.521037 0.022658 0.545989 0.363733 \n1 0.000915 0.643144 0.272574 0.615783 0.501591 \n2 0.092495 0.601496 0.390260 0.595743 0.449417 \n3 0.092547 0.210090 0.360839 0.233501 0.102906 \n4 0.092559 0.629893 0.156578 0.630986 0.489290 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.593753 0.792037 0.703140 0.731113 \n1 0.289880 0.181768 0.203608 0.348757 \n2 0.514309 0.431017 0.462512 0.635686 \n3 0.811321 0.811361 0.565604 0.522863 \n4 0.430351 0.347893 0.463918 0.518390 \n\n symmetry_mean ... perimeter_worst area_worst smoothness_worst \\\n0 0.686364 ... 0.668310 0.450698 0.601136 \n1 0.379798 ... 0.539818 0.435214 0.347553 \n2 0.509596 ... 0.508442 0.374508 0.483590 \n3 0.776263 ... 0.241347 0.094008 0.915472 \n4 0.378283 ... 0.506948 0.341575 0.437364 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 0.619292 0.568610 0.912027 0.598462 \n1 0.154563 0.192971 0.639175 0.233590 \n2 0.385375 0.359744 0.835052 0.403706 \n3 0.814012 0.548642 0.884880 1.000000 \n4 0.172415 0.319489 0.558419 0.157500 \n\n fractal_dimension_worst diagnosis_B diagnosis_M \n0 0.418864 0.0 1.0 \n1 0.222878 0.0 1.0 \n2 0.213433 0.0 1.0 \n3 0.773711 0.0 1.0 \n4 0.142595 0.0 1.0 \n\n[5 rows x 33 columns]"
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As you could see above, we have used MinMax scalar method here to scale our data. Now the value of all the features in the table has been transformed to a uniform range between 0 and 1"
},
{
"metadata": {
"trusted": true
},
"cell_type": "markdown",
"source": "## Bucketing of Data\n\nIn order to bucket the data for easy maintainability, we group the continuous data into discrete buckets. The reason we do this is, when we are training the model, discrete data works well and faster in comparison to the continuous data.\nEven if continuous data contains more information, it will make the model slow. \n\nThere are few techniques for data discretization/bucketing:\n\n1. Binning\n2. Using a Histogram\n\nThere are few challenges though in data discretization. Picking the range of each bucket is a challenge,choosing the number of intervals or bins and deciding their width. \n\nHere we make use of a function called **pandas.cut()**. This function is useful to achieve the bucketing and sorting of segmented data.\n\n#### Lets import out dataset and implement data discretization to see how it looks"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies.head()",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 4,
"data": {
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 842302 17.99 10.38 122.80 1001.0 \n1 842517 20.57 17.77 132.90 1326.0 \n2 84300903 19.69 21.25 130.00 1203.0 \n3 84348301 11.42 20.38 77.58 386.1 \n4 84358402 20.29 14.34 135.10 1297.0 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.11840 0.27760 0.3001 0.14710 \n1 0.08474 0.07864 0.0869 0.07017 \n2 0.10960 0.15990 0.1974 0.12790 \n3 0.14250 0.28390 0.2414 0.10520 \n4 0.10030 0.13280 0.1980 0.10430 \n\n symmetry_mean ... perimeter_worst area_worst smoothness_worst \\\n0 0.2419 ... 184.60 2019.0 0.1622 \n1 0.1812 ... 158.80 1956.0 0.1238 \n2 0.2069 ... 152.50 1709.0 0.1444 \n3 0.2597 ... 98.87 567.7 0.2098 \n4 0.1809 ... 152.20 1575.0 0.1374 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 0.6656 0.7119 0.2654 0.4601 \n1 0.1866 0.2416 0.1860 0.2750 \n2 0.4245 0.4504 0.2430 0.3613 \n3 0.8663 0.6869 0.2575 0.6638 \n4 0.2050 0.4000 0.1625 0.2364 \n\n fractal_dimension_worst diagnosis_B diagnosis_M \n0 0.11890 0 1 \n1 0.08902 0 1 \n2 0.08758 0 1 \n3 0.17300 0 1 \n4 0.07678 0 1 \n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>perimeter_worst</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>842302</td>\n <td>17.99</td>\n <td>10.38</td>\n <td>122.80</td>\n <td>1001.0</td>\n <td>0.11840</td>\n <td>0.27760</td>\n <td>0.3001</td>\n <td>0.14710</td>\n <td>0.2419</td>\n <td>...</td>\n <td>184.60</td>\n <td>2019.0</td>\n <td>0.1622</td>\n <td>0.6656</td>\n <td>0.7119</td>\n <td>0.2654</td>\n <td>0.4601</td>\n <td>0.11890</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>842517</td>\n <td>20.57</td>\n <td>17.77</td>\n <td>132.90</td>\n <td>1326.0</td>\n <td>0.08474</td>\n <td>0.07864</td>\n <td>0.0869</td>\n <td>0.07017</td>\n <td>0.1812</td>\n <td>...</td>\n <td>158.80</td>\n <td>1956.0</td>\n <td>0.1238</td>\n <td>0.1866</td>\n <td>0.2416</td>\n <td>0.1860</td>\n <td>0.2750</td>\n <td>0.08902</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>84300903</td>\n <td>19.69</td>\n <td>21.25</td>\n <td>130.00</td>\n <td>1203.0</td>\n <td>0.10960</td>\n <td>0.15990</td>\n <td>0.1974</td>\n <td>0.12790</td>\n <td>0.2069</td>\n <td>...</td>\n <td>152.50</td>\n <td>1709.0</td>\n <td>0.1444</td>\n <td>0.4245</td>\n <td>0.4504</td>\n 
<td>0.2430</td>\n <td>0.3613</td>\n <td>0.08758</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84348301</td>\n <td>11.42</td>\n <td>20.38</td>\n <td>77.58</td>\n <td>386.1</td>\n <td>0.14250</td>\n <td>0.28390</td>\n <td>0.2414</td>\n <td>0.10520</td>\n <td>0.2597</td>\n <td>...</td>\n <td>98.87</td>\n <td>567.7</td>\n <td>0.2098</td>\n <td>0.8663</td>\n <td>0.6869</td>\n <td>0.2575</td>\n <td>0.6638</td>\n <td>0.17300</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>84358402</td>\n <td>20.29</td>\n <td>14.34</td>\n <td>135.10</td>\n <td>1297.0</td>\n <td>0.10030</td>\n <td>0.13280</td>\n <td>0.1980</td>\n <td>0.10430</td>\n <td>0.1809</td>\n <td>...</td>\n <td>152.20</td>\n <td>1575.0</td>\n <td>0.1374</td>\n <td>0.2050</td>\n <td>0.4000</td>\n <td>0.1625</td>\n <td>0.2364</td>\n <td>0.07678</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies['Bucketed'] = pd.cut(df_dummies['radius_mean'],3,labels=['Low_mean','Average_mean','High_mean'])",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_dummies.head()",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 8,
"data": {
"text/plain": " id radius_mean texture_mean perimeter_mean area_mean \\\n0 842302 17.99 10.38 122.80 1001.0 \n1 842517 20.57 17.77 132.90 1326.0 \n2 84300903 19.69 21.25 130.00 1203.0 \n3 84348301 11.42 20.38 77.58 386.1 \n4 84358402 20.29 14.34 135.10 1297.0 \n\n smoothness_mean compactness_mean concavity_mean concave points_mean \\\n0 0.11840 0.27760 0.3001 0.14710 \n1 0.08474 0.07864 0.0869 0.07017 \n2 0.10960 0.15990 0.1974 0.12790 \n3 0.14250 0.28390 0.2414 0.10520 \n4 0.10030 0.13280 0.1980 0.10430 \n\n symmetry_mean ... area_worst smoothness_worst \\\n0 0.2419 ... 2019.0 0.1622 \n1 0.1812 ... 1956.0 0.1238 \n2 0.2069 ... 1709.0 0.1444 \n3 0.2597 ... 567.7 0.2098 \n4 0.1809 ... 1575.0 0.1374 \n\n compactness_worst concavity_worst concave points_worst symmetry_worst \\\n0 0.6656 0.7119 0.2654 0.4601 \n1 0.1866 0.2416 0.1860 0.2750 \n2 0.4245 0.4504 0.2430 0.3613 \n3 0.8663 0.6869 0.2575 0.6638 \n4 0.2050 0.4000 0.1625 0.2364 \n\n fractal_dimension_worst diagnosis_B diagnosis_M Bucketed \n0 0.11890 0 1 Average_mean \n1 0.08902 0 1 Average_mean \n2 0.08758 0 1 Average_mean \n3 0.17300 0 1 Low_mean \n4 0.07678 0 1 Average_mean \n\n[5 rows x 34 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>radius_mean</th>\n <th>texture_mean</th>\n <th>perimeter_mean</th>\n <th>area_mean</th>\n <th>smoothness_mean</th>\n <th>compactness_mean</th>\n <th>concavity_mean</th>\n <th>concave points_mean</th>\n <th>symmetry_mean</th>\n <th>...</th>\n <th>area_worst</th>\n <th>smoothness_worst</th>\n <th>compactness_worst</th>\n <th>concavity_worst</th>\n <th>concave points_worst</th>\n <th>symmetry_worst</th>\n <th>fractal_dimension_worst</th>\n <th>diagnosis_B</th>\n <th>diagnosis_M</th>\n <th>Bucketed</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>842302</td>\n <td>17.99</td>\n <td>10.38</td>\n <td>122.80</td>\n <td>1001.0</td>\n <td>0.11840</td>\n <td>0.27760</td>\n <td>0.3001</td>\n <td>0.14710</td>\n <td>0.2419</td>\n <td>...</td>\n <td>2019.0</td>\n <td>0.1622</td>\n <td>0.6656</td>\n <td>0.7119</td>\n <td>0.2654</td>\n <td>0.4601</td>\n <td>0.11890</td>\n <td>0</td>\n <td>1</td>\n <td>Average_mean</td>\n </tr>\n <tr>\n <th>1</th>\n <td>842517</td>\n <td>20.57</td>\n <td>17.77</td>\n <td>132.90</td>\n <td>1326.0</td>\n <td>0.08474</td>\n <td>0.07864</td>\n <td>0.0869</td>\n <td>0.07017</td>\n <td>0.1812</td>\n <td>...</td>\n <td>1956.0</td>\n <td>0.1238</td>\n <td>0.1866</td>\n <td>0.2416</td>\n <td>0.1860</td>\n <td>0.2750</td>\n <td>0.08902</td>\n <td>0</td>\n <td>1</td>\n <td>Average_mean</td>\n </tr>\n <tr>\n <th>2</th>\n <td>84300903</td>\n <td>19.69</td>\n <td>21.25</td>\n <td>130.00</td>\n <td>1203.0</td>\n <td>0.10960</td>\n <td>0.15990</td>\n <td>0.1974</td>\n <td>0.12790</td>\n <td>0.2069</td>\n <td>...</td>\n <td>1709.0</td>\n <td>0.1444</td>\n <td>0.4245</td>\n <td>0.4504</td>\n <td>0.2430</td>\n 
<td>0.3613</td>\n <td>0.08758</td>\n <td>0</td>\n <td>1</td>\n <td>Average_mean</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84348301</td>\n <td>11.42</td>\n <td>20.38</td>\n <td>77.58</td>\n <td>386.1</td>\n <td>0.14250</td>\n <td>0.28390</td>\n <td>0.2414</td>\n <td>0.10520</td>\n <td>0.2597</td>\n <td>...</td>\n <td>567.7</td>\n <td>0.2098</td>\n <td>0.8663</td>\n <td>0.6869</td>\n <td>0.2575</td>\n <td>0.6638</td>\n <td>0.17300</td>\n <td>0</td>\n <td>1</td>\n <td>Low_mean</td>\n </tr>\n <tr>\n <th>4</th>\n <td>84358402</td>\n <td>20.29</td>\n <td>14.34</td>\n <td>135.10</td>\n <td>1297.0</td>\n <td>0.10030</td>\n <td>0.13280</td>\n <td>0.1980</td>\n <td>0.10430</td>\n <td>0.1809</td>\n <td>...</td>\n <td>1575.0</td>\n <td>0.1374</td>\n <td>0.2050</td>\n <td>0.4000</td>\n <td>0.1625</td>\n <td>0.2364</td>\n <td>0.07678</td>\n <td>0</td>\n <td>1</td>\n <td>Average_mean</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 34 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Train Test Split\n\nNow that our data is processed and everything is ready, we need to split it into training and testing sets. \n\nIn supervised learning we need to first train the model so that it will understand the underlying features and patterns for the data. We will do this by using the training set. Once the model has been trained, it will make predictions about the data using the test set based on the learning that it had.\n\nLater we can compare the results of the test set with the actual results and see the accuracy of our model.\n\nTo do this we will use the **Graduate Admissions** dataset:\n\nThe dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are :\n\n1. GRE Scores ( out of 340 )\n2. TOEFL Scores ( out of 120 )\n3. University Rating ( out of 5 )\n4. Statement of Purpose and Letter of Recommendation Strength ( out of 5 )\n5. Undergraduate GPA ( out of 10 )\n6. Research Experience ( either 0 or 1 )\n7. Chance of Admit ( ranging from 0 to 1 )"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd",
"execution_count": 15,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dataset = '/graduate-admissions/Admission_Predict_Ver1.2.csv'",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(dataset,header = 0)",
"execution_count": 17,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 18,
"data": {
"text/plain": " Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \\\n0 1 337 118 4 4.5 4.5 9.65 \n1 2 324 107 4 4.0 4.5 8.87 \n2 3 316 104 3 3.0 3.5 8.00 \n3 4 322 110 3 3.5 2.5 8.67 \n4 5 314 103 2 2.0 3.0 8.21 \n\n Research Admit \n0 1 0.92 \n1 1 0.76 \n2 1 0.72 \n3 1 0.80 \n4 0 0.65 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Serial No.</th>\n <th>GRE Score</th>\n <th>TOEFL Score</th>\n <th>University Rating</th>\n <th>SOP</th>\n <th>LOR</th>\n <th>CGPA</th>\n <th>Research</th>\n <th>Admit</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>337</td>\n <td>118</td>\n <td>4</td>\n <td>4.5</td>\n <td>4.5</td>\n <td>9.65</td>\n <td>1</td>\n <td>0.92</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>324</td>\n <td>107</td>\n <td>4</td>\n <td>4.0</td>\n <td>4.5</td>\n <td>8.87</td>\n <td>1</td>\n <td>0.76</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>316</td>\n <td>104</td>\n <td>3</td>\n <td>3.0</td>\n <td>3.5</td>\n <td>8.00</td>\n <td>1</td>\n <td>0.72</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>322</td>\n <td>110</td>\n <td>3</td>\n <td>3.5</td>\n <td>2.5</td>\n <td>8.67</td>\n <td>1</td>\n <td>0.80</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>314</td>\n <td>103</td>\n <td>2</td>\n <td>2.0</td>\n <td>3.0</td>\n <td>8.21</td>\n <td>0</td>\n <td>0.65</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. Let's create a variable X to store the independent features."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X = df.drop('Admit ',axis = 1)",
"execution_count": 21,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X.head()",
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 22,
"data": {
"text/plain": " Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \\\n0 1 337 118 4 4.5 4.5 9.65 \n1 2 324 107 4 4.0 4.5 8.87 \n2 3 316 104 3 3.0 3.5 8.00 \n3 4 322 110 3 3.5 2.5 8.67 \n4 5 314 103 2 2.0 3.0 8.21 \n\n Research \n0 1 \n1 1 \n2 1 \n3 1 \n4 0 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Serial No.</th>\n <th>GRE Score</th>\n <th>TOEFL Score</th>\n <th>University Rating</th>\n <th>SOP</th>\n <th>LOR</th>\n <th>CGPA</th>\n <th>Research</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>337</td>\n <td>118</td>\n <td>4</td>\n <td>4.5</td>\n <td>4.5</td>\n <td>9.65</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>324</td>\n <td>107</td>\n <td>4</td>\n <td>4.0</td>\n <td>4.5</td>\n <td>8.87</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>316</td>\n <td>104</td>\n <td>3</td>\n <td>3.0</td>\n <td>3.5</td>\n <td>8.00</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>322</td>\n <td>110</td>\n <td>3</td>\n <td>3.5</td>\n <td>2.5</td>\n <td>8.67</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>314</td>\n <td>103</td>\n <td>2</td>\n <td>2.0</td>\n <td>3.0</td>\n <td>8.21</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. Printing the shape of the new feature"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X.shape",
"execution_count": 23,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 23,
"data": {
"text/plain": "(500, 8)"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The first value above represents the number of observations(500), and the second value represents the number of features(8)"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3. We will use a variable **y** for the target value. "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "y = df['Admit ']",
"execution_count": 24,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "y.head()",
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 25,
"data": {
"text/plain": "0 0.92\n1 0.76\n2 0.72\n3 0.80\n4 0.65\nName: Admit , dtype: float64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 4. Now lets split the data set into test train sets. We will typically split the data here in 80:20 ratio where 80% is the training set and 20% is the testing set."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.model_selection import train_test_split",
"execution_count": 26,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2, random_state = 0)",
"execution_count": 27,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### So, in the above code, the test size is 0.2 which means that it is a 80:20 split. **train_test_split** splits the arrays or matrices into train and test subsets in a random way. If we run the code everytime without the random_state, we will get a different result."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 5. Printing the shape of the train and test sets"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "print(\"X-Train Shape: \",X_train.shape)\nprint(\"X-Test Shape: \",X_test.shape)\nprint(\"y-Train Shape: \",y_train.shape)\nprint(\"y-Test Shape: \",y_test.shape)",
"execution_count": 28,
"outputs": [
{
"output_type": "stream",
"text": "X-Train Shape: (400, 8)\nX-Test Shape: (100, 8)\ny-Train Shape: (400,)\ny-Test Shape: (100,)\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.0",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "683943c3329aa1566a9306b16d8cdb49",
"data": {
"description": "Data Cleaning -Part 2.ipynb",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/683943c3329aa1566a9306b16d8cdb49"
}
},
"nbformat": 4,
"nbformat_minor": 2
}