Z30G0D/Random_forest_regressor

## Random_forest_regressor
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automobile prices regression\n",
    "## hello all\n",
    "This is a simple random forest regressor for predicting automobile prices. The dataset is located <a href=\"https://archive.ics.uci.edu/ml/datasets/automobile\">here</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random Forest Regression\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn import model_selection\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "# Preprocessed copy of the UCI automobile data (path is relative to the notebook)\n",
    "place = \"imports-85-preprocessed.csv\"\n",
    "dataframe = pd.read_csv(place)\n",
    "# The first 14 columns are the features; column 15 (price) is the target\n",
    "X = dataframe.iloc[:, 0:14].copy()\n",
    "Y = dataframe.iloc[:, 14].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I edited the CSV file and kept only the features shown below (I wanted a pure regression problem that needs little preprocessing). Let's look at the first few rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_of_doors</th>\n",
       "      <th>wheel_base</th>\n",
       "      <th>length</th>\n",
       "      <th>width</th>\n",
       "      <th>height</th>\n",
       "      <th>curb_weight</th>\n",
       "      <th>engine_size</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression_ratio</th>\n",
       "      <th>horse_power</th>\n",
       "      <th>peak_rpm</th>\n",
       "      <th>city_mpg</th>\n",
       "      <th>highway_mpg</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>two</td>\n",
       "      <td>88.6</td>\n",
       "      <td>168.8</td>\n",
       "      <td>64.1</td>\n",
       "      <td>48.8</td>\n",
       "      <td>2548</td>\n",
       "      <td>130</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>88.6</td>\n",
       "      <td>168.8</td>\n",
       "      <td>64.1</td>\n",
       "      <td>48.8</td>\n",
       "      <td>2548</td>\n",
       "      <td>130</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>two</td>\n",
       "      <td>94.5</td>\n",
       "      <td>171.2</td>\n",
       "      <td>65.5</td>\n",
       "      <td>52.4</td>\n",
       "      <td>2823</td>\n",
       "      <td>152</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  num_of_doors  wheel_base  length  width  height  curb_weight  engine_size  \\\n",
       "0          two        88.6   168.8   64.1    48.8         2548          130   \n",
       "1          two        88.6   168.8   64.1    48.8         2548          130   \n",
       "2          two        94.5   171.2   65.5    52.4         2823          152   \n",
       "\n",
       "   bore  stroke  compression_ratio  horse_power  peak_rpm  city_mpg  \\\n",
       "0  3.47    2.68                9.0          111      5000        21   \n",
       "1  3.47    2.68                9.0          111      5000        21   \n",
       "2  2.68    3.47                9.0          154      5000        19   \n",
       "\n",
       "   highway_mpg  \n",
       "0           27  \n",
       "1           27  \n",
       "2           26  "
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head(n=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    13495.0\n",
       "1    16500.0\n",
       "2    16500.0\n",
       "Name: price, dtype: float64"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Y.head(n=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "def num_doors_dummy(data):\n",
    "    \"\"\"Replace the 'num_of_doors' column with one-hot (dummy) columns.\"\"\"\n",
    "    # One 0/1 column per distinct value ('four', 'two')\n",
    "    dummies = pd.get_dummies(data['num_of_doors'])\n",
    "    # Drop on a new frame instead of in place, so the caller's frame is not mutated\n",
    "    data = data.drop(['num_of_doors'], axis=1)\n",
    "    data = pd.concat([data, dummies], axis=1)\n",
    "    return data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "X = num_doors_dummy(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>wheel_base</th>\n",
       "      <th>length</th>\n",
       "      <th>width</th>\n",
       "      <th>height</th>\n",
       "      <th>curb_weight</th>\n",
       "      <th>engine_size</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression_ratio</th>\n",
       "      <th>horse_power</th>\n",
       "      <th>peak_rpm</th>\n",
       "      <th>city_mpg</th>\n",
       "      <th>highway_mpg</th>\n",
       "      <th>four</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>88.6</td>\n",
       "      <td>168.8</td>\n",
       "      <td>64.1</td>\n",
       "      <td>48.8</td>\n",
       "      <td>2548</td>\n",
       "      <td>130</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>88.6</td>\n",
       "      <td>168.8</td>\n",
       "      <td>64.1</td>\n",
       "      <td>48.8</td>\n",
       "      <td>2548</td>\n",
       "      <td>130</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>94.5</td>\n",
       "      <td>171.2</td>\n",
       "      <td>65.5</td>\n",
       "      <td>52.4</td>\n",
       "      <td>2823</td>\n",
       "      <td>152</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>99.8</td>\n",
       "      <td>176.6</td>\n",
       "      <td>66.2</td>\n",
       "      <td>54.3</td>\n",
       "      <td>2337</td>\n",
       "      <td>109</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>99.4</td>\n",
       "      <td>176.6</td>\n",
       "      <td>66.4</td>\n",
       "      <td>54.3</td>\n",
       "      <td>2824</td>\n",
       "      <td>136</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>99.8</td>\n",
       "      <td>177.3</td>\n",
       "      <td>66.3</td>\n",
       "      <td>53.1</td>\n",
       "      <td>2507</td>\n",
       "      <td>136</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.5</td>\n",
       "      <td>110</td>\n",
       "      <td>5500</td>\n",
       "      <td>19</td>\n",
       "      <td>25</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>105.8</td>\n",
       "      <td>192.7</td>\n",
       "      <td>71.4</td>\n",
       "      <td>55.7</td>\n",
       "      <td>2844</td>\n",
       "      <td>136</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.5</td>\n",
       "      <td>110</td>\n",
       "      <td>5500</td>\n",
       "      <td>19</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>105.8</td>\n",
       "      <td>192.7</td>\n",
       "      <td>71.4</td>\n",
       "      <td>55.7</td>\n",
       "      <td>2954</td>\n",
       "      <td>136</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.5</td>\n",
       "      <td>110</td>\n",
       "      <td>5500</td>\n",
       "      <td>19</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   wheel_base  length  width  height  curb_weight  engine_size  bore  stroke  \\\n",
       "0        88.6   168.8   64.1    48.8         2548          130  3.47    2.68   \n",
       "1        88.6   168.8   64.1    48.8         2548          130  3.47    2.68   \n",
       "2        94.5   171.2   65.5    52.4         2823          152  2.68    3.47   \n",
       "3        99.8   176.6   66.2    54.3         2337          109  3.19    3.40   \n",
       "4        99.4   176.6   66.4    54.3         2824          136  3.19    3.40   \n",
       "5        99.8   177.3   66.3    53.1         2507          136  3.19    3.40   \n",
       "6       105.8   192.7   71.4    55.7         2844          136  3.19    3.40   \n",
       "7       105.8   192.7   71.4    55.7         2954          136  3.19    3.40   \n",
       "\n",
       "   compression_ratio  horse_power  peak_rpm  city_mpg  highway_mpg  four  two  \n",
       "0                9.0          111      5000        21           27     0    1  \n",
       "1                9.0          111      5000        21           27     0    1  \n",
       "2                9.0          154      5000        19           26     0    1  \n",
       "3               10.0          102      5500        24           30     1    0  \n",
       "4                8.0          115      5500        18           22     1    0  \n",
       "5                8.5          110      5500        19           25     0    1  \n",
       "6                8.5          110      5500        19           25     1    0  \n",
       "7                8.5          110      5500        19           25     1    0  "
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head(n=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see if there are missing values and impute them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((array([], dtype=int64), array([], dtype=int64)),\n",
       " (array([  9,  44,  45, 125], dtype=int64),))"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.where(pd.isnull(X)), np.where(pd.isnull(Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, the Y column (the target values) has missing entries, but only 4 out of 199 samples. Let's fill them in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fill the missing prices with the mean price (reassign rather than modify in place)\n",
    "Y = Y.fillna(Y.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We replaced the missing values with the mean price. Let's check again for missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([], dtype=int64),)"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.where(pd.isnull(Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, no missing values. Let's continue with the random forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7446599177861954\n"
     ]
    }
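   ],
   "source": [
    "seed = 7\n",
    "num_trees = 20\n",
    "max_features = 9\n",
    "# random_state only affects KFold when shuffle=True, so it is not passed here\n",
    "kfold = model_selection.KFold(n_splits=3)\n",
    "# Seed the forest instead, so repeated runs give the same score\n",
    "model = RandomForestRegressor(n_estimators=num_trees, max_features=max_features, random_state=seed)\n",
    "# For a regressor, cross_val_score reports R^2 by default\n",
    "results = model_selection.cross_val_score(model, X, Y, cv=kfold)\n",
    "print(results.mean())"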
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, we got a mean R² score of about 0.74 across the 3 cross-validation folds. Note that for a regressor `cross_val_score` reports R² by default, not accuracy. That's quite fair given how many features we dropped."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (myenv)",
   "language": "python",
   "name": "myenv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
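
The evaluation step in the notebook can also be tried without the CSV file. Below is a minimal sketch, not the notebook's own code: the data is synthetic (the `demo` frame, its three columns, and the price formula are made up for illustration), the folds are shuffled, and the same folds are scored twice, once with R² and once with mean absolute error, which is easier to read in price units.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the car data (hypothetical values, not the real CSV).
rng = np.random.RandomState(7)
n_samples = 120
demo = pd.DataFrame({
    "num_of_doors": rng.choice(["two", "four"], size=n_samples),
    "engine_size": rng.uniform(60, 300, size=n_samples),
    "curb_weight": rng.uniform(1500, 4000, size=n_samples),
})
# The made-up price depends on the features plus noise, so there is a signal to learn.
price = (50 * demo["engine_size"]
         + 3 * demo["curb_weight"]
         + rng.normal(0, 500, size=n_samples))

# One-hot encode the categorical column; get_dummies returns a new frame.
features = pd.get_dummies(demo, columns=["num_of_doors"])

# shuffle=True is what makes random_state meaningful for KFold.
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
model = RandomForestRegressor(n_estimators=20, random_state=7)

# R^2 per fold (this is also the default score for regressors).
r2_scores = cross_val_score(model, features, price, cv=kfold, scoring="r2")
# Same folds scored as mean absolute error; scikit-learn returns it negated.
mae_scores = -cross_val_score(model, features, price, cv=kfold,
                              scoring="neg_mean_absolute_error")

print("mean R^2:", r2_scores.mean())
print("mean absolute error:", mae_scores.mean())
```

On the real data, `features` and `price` would be the notebook's `X` and `Y`. Shuffling is a choice made for this sketch: if the rows of the CSV are sorted in some way, unshuffled folds can differ a lot from one another.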