@Z30G0D
Created May 16, 2018 15:38
A simple random forest regressor for automobile dataset
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automobile prices regression\n",
"## hello all\n",
"This is a simple random forest classifier for regression of automobile prices, the dataset is located <a href=\"https://archive.ics.uci.edu/ml/datasets/automobile\">here</a>"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"# Random Forest Classification\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn import model_selection\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"place = \"imports-85-preprocessed.csv\" \n",
"dataframe = pandas.read_csv(place)\n",
"X = dataframe.iloc[:,0:14]\n",
"Y = dataframe.iloc[:,14]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I edited the csv file and left only the stated below features (I wanted to create a pure regression problem with little amount of preprocessing) Let's visualize the first few rows"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>num_of_doors</th>\n",
" <th>wheel_base</th>\n",
" <th>length</th>\n",
" <th>width</th>\n",
" <th>height</th>\n",
" <th>curb_weight</th>\n",
" <th>engine_size</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression_ratio</th>\n",
" <th>horse_power</th>\n",
" <th>peak_rpm</th>\n",
" <th>city_mpg</th>\n",
" <th>highway_mpg</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>two</td>\n",
" <td>88.6</td>\n",
" <td>168.8</td>\n",
" <td>64.1</td>\n",
" <td>48.8</td>\n",
" <td>2548</td>\n",
" <td>130</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>two</td>\n",
" <td>88.6</td>\n",
" <td>168.8</td>\n",
" <td>64.1</td>\n",
" <td>48.8</td>\n",
" <td>2548</td>\n",
" <td>130</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>two</td>\n",
" <td>94.5</td>\n",
" <td>171.2</td>\n",
" <td>65.5</td>\n",
" <td>52.4</td>\n",
" <td>2823</td>\n",
" <td>152</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" num_of_doors wheel_base length width height curb_weight engine_size \\\n",
"0 two 88.6 168.8 64.1 48.8 2548 130 \n",
"1 two 88.6 168.8 64.1 48.8 2548 130 \n",
"2 two 94.5 171.2 65.5 52.4 2823 152 \n",
"\n",
" bore stroke compression_ratio horse_power peak_rpm city_mpg \\\n",
"0 3.47 2.68 9.0 111 5000 21 \n",
"1 3.47 2.68 9.0 111 5000 21 \n",
"2 2.68 3.47 9.0 154 5000 19 \n",
"\n",
" highway_mpg \n",
"0 27 \n",
"1 27 \n",
"2 26 "
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head(n=3)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 13495.0\n",
"1 16500.0\n",
"2 16500.0\n",
"Name: price, dtype: float64"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y.head(n=3)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"def num_doors_dummy(data):\n",
" \"\"\"This function is turning the 'num_of_doors' column to a dummy \"\"\"\n",
" dummies =pd.get_dummies(data.num_of_doors)\n",
" data.drop(['num_of_doors'], axis=1, inplace=True)\n",
" data = pd.concat([data, dummies], axis=1)\n",
" return data"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"X = num_doors_dummy(X)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>wheel_base</th>\n",
" <th>length</th>\n",
" <th>width</th>\n",
" <th>height</th>\n",
" <th>curb_weight</th>\n",
" <th>engine_size</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression_ratio</th>\n",
" <th>horse_power</th>\n",
" <th>peak_rpm</th>\n",
" <th>city_mpg</th>\n",
" <th>highway_mpg</th>\n",
" <th>four</th>\n",
" <th>two</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>88.6</td>\n",
" <td>168.8</td>\n",
" <td>64.1</td>\n",
" <td>48.8</td>\n",
" <td>2548</td>\n",
" <td>130</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>88.6</td>\n",
" <td>168.8</td>\n",
" <td>64.1</td>\n",
" <td>48.8</td>\n",
" <td>2548</td>\n",
" <td>130</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>94.5</td>\n",
" <td>171.2</td>\n",
" <td>65.5</td>\n",
" <td>52.4</td>\n",
" <td>2823</td>\n",
" <td>152</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>99.8</td>\n",
" <td>176.6</td>\n",
" <td>66.2</td>\n",
" <td>54.3</td>\n",
" <td>2337</td>\n",
" <td>109</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500</td>\n",
" <td>24</td>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>99.4</td>\n",
" <td>176.6</td>\n",
" <td>66.4</td>\n",
" <td>54.3</td>\n",
" <td>2824</td>\n",
" <td>136</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500</td>\n",
" <td>18</td>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>99.8</td>\n",
" <td>177.3</td>\n",
" <td>66.3</td>\n",
" <td>53.1</td>\n",
" <td>2507</td>\n",
" <td>136</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.5</td>\n",
" <td>110</td>\n",
" <td>5500</td>\n",
" <td>19</td>\n",
" <td>25</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>105.8</td>\n",
" <td>192.7</td>\n",
" <td>71.4</td>\n",
" <td>55.7</td>\n",
" <td>2844</td>\n",
" <td>136</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.5</td>\n",
" <td>110</td>\n",
" <td>5500</td>\n",
" <td>19</td>\n",
" <td>25</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>105.8</td>\n",
" <td>192.7</td>\n",
" <td>71.4</td>\n",
" <td>55.7</td>\n",
" <td>2954</td>\n",
" <td>136</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.5</td>\n",
" <td>110</td>\n",
" <td>5500</td>\n",
" <td>19</td>\n",
" <td>25</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" wheel_base length width height curb_weight engine_size bore stroke \\\n",
"0 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 \n",
"1 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 \n",
"2 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 \n",
"3 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 \n",
"4 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 \n",
"5 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 \n",
"6 105.8 192.7 71.4 55.7 2844 136 3.19 3.40 \n",
"7 105.8 192.7 71.4 55.7 2954 136 3.19 3.40 \n",
"\n",
" compression_ratio horse_power peak_rpm city_mpg highway_mpg four two \n",
"0 9.0 111 5000 21 27 0 1 \n",
"1 9.0 111 5000 21 27 0 1 \n",
"2 9.0 154 5000 19 26 0 1 \n",
"3 10.0 102 5500 24 30 1 0 \n",
"4 8.0 115 5500 18 22 1 0 \n",
"5 8.5 110 5500 19 25 0 1 \n",
"6 8.5 110 5500 19 25 1 0 \n",
"7 8.5 110 5500 19 25 1 0 "
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head(n=8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if there are missing values and imputate them"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"((array([], dtype=int64), array([], dtype=int64)),\n",
" (array([ 9, 44, 45, 125], dtype=int64),))"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.where(pd.isnull(X)), np.where(pd.isnull(Y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we have empty data values in the Y column (predicted values), we have only 4 values out of 199 samples. Let's complete them."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"Y.fillna(Y.mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We changed the missing values to the mean. Let's check again for missing values."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(array([], dtype=int64),)"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.where(pd.isnull(Y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great, no missing values. Let's continue with the random forest"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7446599177861954\n"
]
}
],
"source": [
"seed = 7\n",
"num_trees = 20\n",
"max_features = 9\n",
"kfold = model_selection.KFold(n_splits=3, random_state=seed)\n",
"model = RandomForestRegressor(n_estimators=num_trees, max_features=max_features)\n",
"results = model_selection.cross_val_score(model, X, Y, cv=kfold)\n",
"print(results.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we got a 74% accuracy on the cross validated k fold test set. Quite fair to the fact we reduced so many features."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (myenv)",
"language": "python",
"name": "myenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
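The steps in the notebook above (one-hot encoding the door count, mean-imputing the missing prices, and scoring a random forest with 3-fold cross-validation) can be tried without the CSV file. The following is only a minimal sketch under stated assumptions: the `toy` DataFrame and its values are synthetic stand-ins invented for illustration, not the real automobile data, and it uses only two of the numeric features. The `scoring="r2"` argument is spelled out to make explicit that the reported number is R², not accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the preprocessed CSV (hypothetical values, not the real data)
rng = np.random.RandomState(7)
n = 60
toy = pd.DataFrame({
    "num_of_doors": rng.choice(["two", "four"], size=n),
    "engine_size": rng.randint(70, 300, size=n),
    "curb_weight": rng.randint(1500, 4000, size=n),
})
toy["price"] = (50.0 * toy["engine_size"] + 2.0 * toy["curb_weight"]
                + rng.normal(0, 500, size=n))
toy.loc[[3, 10], "price"] = np.nan  # simulate a few missing targets

X = toy.drop(columns=["price"])
y = toy["price"]

# One-hot encode the door count, mirroring num_doors_dummy in the notebook
dummies = pd.get_dummies(X["num_of_doors"]).astype(int)
X = pd.concat([X.drop(columns=["num_of_doors"]), dummies], axis=1)

# Mean-impute the missing prices, as the notebook does
y = y.fillna(y.mean())

# 3-fold cross-validation; the seed goes to the model so the run is repeatable
model = RandomForestRegressor(n_estimators=20, random_state=7)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3), scoring="r2")
print(scores.mean())
```

Because the toy prices are generated from a simple formula, the score printed here says nothing about the real dataset; the sketch only shows the shape of the workflow.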