@clettieri
Created July 7, 2017 12:24
RentHopKaggle
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RentHop: Rental Listing Inquiries\n",
"\n",
"url = https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/\n",
"\n",
"### Understanding the Question\n",
"\n",
"Given a set of features for a rental listing, we are to predict how much interest (low, medium, high) a rental listing will receive. We are given labels for our data. Our predictions should be represented as class probability (as per the competition rules).\n",
"\n",
"This is a supervised classification problem.\n",
"\n",
"### Getting Started - Load & Inspect Data\n",
"\n",
"The data is available on kaggle at https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data. We are given 14 features in our data set and the label column is called 'interest_level'."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>building_id</th>\n",
" <th>created</th>\n",
" <th>description</th>\n",
" <th>display_address</th>\n",
" <th>features</th>\n",
" <th>interest_level</th>\n",
" <th>latitude</th>\n",
" <th>listing_id</th>\n",
" <th>longitude</th>\n",
" <th>manager_id</th>\n",
" <th>photos</th>\n",
" <th>price</th>\n",
" <th>street_address</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1.5</td>\n",
" <td>3</td>\n",
" <td>53a5b119ba8f7b61d4e010512e0dfc85</td>\n",
" <td>2016-06-24 07:54:24</td>\n",
" <td>A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...</td>\n",
" <td>Metropolitan Avenue</td>\n",
" <td>[]</td>\n",
" <td>medium</td>\n",
" <td>40.7145</td>\n",
" <td>7211212</td>\n",
" <td>-73.9425</td>\n",
" <td>5ba989232d0489da1b5f2c45f6688adc</td>\n",
" <td>[https://photos.renthop.com/2/7211212_1ed4542e...</td>\n",
" <td>3000</td>\n",
" <td>792 Metropolitan Avenue</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10000</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>c5c8a357cba207596b04d1afd1e4f130</td>\n",
" <td>2016-06-12 12:19:27</td>\n",
" <td></td>\n",
" <td>Columbus Avenue</td>\n",
" <td>[Doorman, Elevator, Fitness Center, Cats Allow...</td>\n",
" <td>low</td>\n",
" <td>40.7947</td>\n",
" <td>7150865</td>\n",
" <td>-73.9667</td>\n",
" <td>7533621a882f71e25173b27e3139d83d</td>\n",
" <td>[https://photos.renthop.com/2/7150865_be3306c5...</td>\n",
" <td>5465</td>\n",
" <td>808 Columbus Avenue</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100004</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>c3ba40552e2120b0acfc3cb5730bb2aa</td>\n",
" <td>2016-04-17 03:26:41</td>\n",
" <td>Top Top West Village location, beautiful Pre-w...</td>\n",
" <td>W 13 Street</td>\n",
" <td>[Laundry In Building, Dishwasher, Hardwood Flo...</td>\n",
" <td>high</td>\n",
" <td>40.7388</td>\n",
" <td>6887163</td>\n",
" <td>-74.0018</td>\n",
" <td>d9039c43983f6e564b1482b273bd7b01</td>\n",
" <td>[https://photos.renthop.com/2/6887163_de85c427...</td>\n",
" <td>2850</td>\n",
" <td>241 W 13 Street</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100007</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>28d9ad350afeaab8027513a3e52ac8d5</td>\n",
" <td>2016-04-18 02:22:02</td>\n",
" <td>Building Amenities - Garage - Garden - fitness...</td>\n",
" <td>East 49th Street</td>\n",
" <td>[Hardwood Floors, No Fee]</td>\n",
" <td>low</td>\n",
" <td>40.7539</td>\n",
" <td>6888711</td>\n",
" <td>-73.9677</td>\n",
" <td>1067e078446a7897d2da493d2f741316</td>\n",
" <td>[https://photos.renthop.com/2/6888711_6e660cee...</td>\n",
" <td>3275</td>\n",
" <td>333 East 49th Street</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100013</th>\n",
" <td>1.0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>2016-04-28 01:32:41</td>\n",
" <td>Beautifully renovated 3 bedroom flex 4 bedroom...</td>\n",
" <td>West 143rd Street</td>\n",
" <td>[Pre-War]</td>\n",
" <td>low</td>\n",
" <td>40.8241</td>\n",
" <td>6934781</td>\n",
" <td>-73.9493</td>\n",
" <td>98e13ad4b495b9613cef886d79a6291f</td>\n",
" <td>[https://photos.renthop.com/2/6934781_1fa4b41a...</td>\n",
" <td>3350</td>\n",
" <td>500 West 143rd Street</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bathrooms bedrooms building_id \\\n",
"10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 \n",
"10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 \n",
"100004 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa \n",
"100007 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 \n",
"100013 1.0 4 0 \n",
"\n",
" created \\\n",
"10 2016-06-24 07:54:24 \n",
"10000 2016-06-12 12:19:27 \n",
"100004 2016-04-17 03:26:41 \n",
"100007 2016-04-18 02:22:02 \n",
"100013 2016-04-28 01:32:41 \n",
"\n",
" description \\\n",
"10 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... \n",
"10000 \n",
"100004 Top Top West Village location, beautiful Pre-w... \n",
"100007 Building Amenities - Garage - Garden - fitness... \n",
"100013 Beautifully renovated 3 bedroom flex 4 bedroom... \n",
"\n",
" display_address \\\n",
"10 Metropolitan Avenue \n",
"10000 Columbus Avenue \n",
"100004 W 13 Street \n",
"100007 East 49th Street \n",
"100013 West 143rd Street \n",
"\n",
" features interest_level \\\n",
"10 [] medium \n",
"10000 [Doorman, Elevator, Fitness Center, Cats Allow... low \n",
"100004 [Laundry In Building, Dishwasher, Hardwood Flo... high \n",
"100007 [Hardwood Floors, No Fee] low \n",
"100013 [Pre-War] low \n",
"\n",
" latitude listing_id longitude manager_id \\\n",
"10 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc \n",
"10000 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d \n",
"100004 40.7388 6887163 -74.0018 d9039c43983f6e564b1482b273bd7b01 \n",
"100007 40.7539 6888711 -73.9677 1067e078446a7897d2da493d2f741316 \n",
"100013 40.8241 6934781 -73.9493 98e13ad4b495b9613cef886d79a6291f \n",
"\n",
" photos price \\\n",
"10 [https://photos.renthop.com/2/7211212_1ed4542e... 3000 \n",
"10000 [https://photos.renthop.com/2/7150865_be3306c5... 5465 \n",
"100004 [https://photos.renthop.com/2/6887163_de85c427... 2850 \n",
"100007 [https://photos.renthop.com/2/6888711_6e660cee... 3275 \n",
"100013 [https://photos.renthop.com/2/6934781_1fa4b41a... 3350 \n",
"\n",
" street_address \n",
"10 792 Metropolitan Avenue \n",
"10000 808 Columbus Avenue \n",
"100004 241 W 13 Street \n",
"100007 333 East 49th Street \n",
"100013 500 West 143rd Street "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"train_df = pd.read_json('train.json')\n",
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(49352, 15)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"bathrooms float64\n",
"bedrooms int64\n",
"building_id object\n",
"created object\n",
"description object\n",
"display_address object\n",
"features object\n",
"interest_level object\n",
"latitude float64\n",
"listing_id int64\n",
"longitude float64\n",
"manager_id object\n",
"photos object\n",
"price int64\n",
"street_address object\n",
"dtype: object"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"bathrooms 0\n",
"bedrooms 0\n",
"building_id 0\n",
"created 0\n",
"description 0\n",
"display_address 0\n",
"features 0\n",
"interest_level 0\n",
"latitude 0\n",
"listing_id 0\n",
"longitude 0\n",
"manager_id 0\n",
"photos 0\n",
"price 0\n",
"street_address 0\n",
"dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check for NaNs\n",
"train_df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Engineering\n",
"\n",
"Here I develop various features from the data set.\n",
"\n",
"##### Date Features - Month, Day, Hour\n",
"Perhaps the timing of the listing can predict how popular it will be. Leases tend to start on the 1st of the month maybe the day of the listing will matter. Also summers tend to see more transitions as school ends and students graduate into new jobs or internships.\n",
"\n",
"##### Price Vs Location Avg\n",
"This rounds off the latitude and longitude to make a location 'box'. Then it calculates the average price per room of all listings in that area. We then divide the actual price per room by the average price per room in the area to give a ratio of over or under priced for that location.\n",
"\n",
"##### Manager Skill\n",
"Heavily inspired from den3b's notebook @ https://www.kaggle.com/den3b81/two-sigma-connect-rental-listing-inquiries/improve-perfomances-using-manager-features. This is a score for all managers in the training set with at least 30 listings. The score is based on the % of their listings that are at the 3 interest levels.\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Date Features\n",
"def add_date_features(df):\n",
" '''(DataFrame) -> DataFrame\n",
" \n",
" Will add some specific columns based on the date\n",
" the listing was created.\n",
" '''\n",
" #Convert to datetime to make extraction easier\n",
" df['created'] = pd.to_datetime(df['created'])\n",
" #Extract features\n",
" df['created_month'] = df['created'].dt.month\n",
" df['created_day'] = df['created'].dt.day\n",
" df['created_hour'] = df['created'].dt.hour\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def compute_manager_skill(train_df):\n",
" '''(DataFrame) -> DataFrame\n",
" \n",
" Given the training data, build a column for manager skill.\n",
" Return this dataframe with manager skill. Only compute skill\n",
" for managers with 30+ listings.\n",
" '''\n",
" #Get dummies creates new binary columns for the categories in 'interest level' \n",
" #This creates 3 new cols = low, medium, high, with value 0 or 1\n",
" dummies = pd.get_dummies(train_df['interest_level'])\n",
" #Build new temporary dataframe\n",
" man_skill = pd.concat([train_df['manager_id'], dummies], axis=1)\n",
" #Get mean and total count for each manager\n",
" man_skill = pd.concat([man_skill.groupby('manager_id').mean(), man_skill.groupby('manager_id').count()], axis=1).iloc[:,:-2] #remove extra count cols\n",
" man_skill.columns = ['low', 'medium', 'high', 'count']\n",
" man_skill = man_skill.sort_values(by='count', ascending=False)\n",
" #Using man_skill['count'].describe(percentiles=[.8, .9, .95])\n",
" #looks like 10% about have 30 or more listings, that seems like a fair sample size to judge a managers skill\n",
" man_skill = man_skill[man_skill['count'] >= 30]\n",
" #Compute skill as average * weighting -> 0 for low, 1 for medium, 2 for high\n",
" #This inspired from den3b's notebook @\n",
" #https://www.kaggle.com/den3b81/two-sigma-connect-rental-listing-inquiries/improve-perfomances-using-manager-features\n",
" man_skill['skill'] = man_skill['medium']*1 + man_skill['high']*2\n",
" man_skill['manager_id'] = man_skill.index\n",
"\n",
" return man_skill"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Manager Skill\n",
"def add_manager_skill(data_df, man_skill_df):\n",
" '''(DataFrame, DataFrame) -> DataFrame\n",
" \n",
" Will add the skill columns to testing/ training sets\n",
" only for managers that in the training set have over 30 listings.\n",
" This info is passed from the man_skill_df\n",
" '''\n",
" #Now add Man_skill to train set\n",
" data_df = data_df.merge(man_skill, how='left', left_on='manager_id', right_on='manager_id')\n",
" data_df = data_df.drop('low', 1)\n",
" data_df = data_df.drop('medium', 1)\n",
" data_df = data_df.drop('high', 1)\n",
" data_df.fillna(0, inplace=True)\n",
" \n",
" return data_df"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Price vs Location Avg\n",
"def add_price_vs_loc_avg(df, per_room=True, per_listing=False):\n",
" '''(DataFrame, bool, bool) -> DataFrame\n",
" per_room will use the price per room as base\n",
" per_listing will use the 'price' as base\n",
" \n",
" Will add 'PriceVsLocAvg' to the current DataFrame.\n",
" '''\n",
" #Build Location\n",
" df['lat_round'] = df.apply(lambda x : round(x['latitude'],2), axis=1)\n",
" df['lon_round'] = df.apply(lambda x : round(x['longitude'],2), axis=1)\n",
" df['loc'] = df.apply(lambda x : tuple([x['lat_round'], x['lon_round']]), axis=1)\n",
" if per_room:\n",
" df['AvgLocPricePerRoom'] = df.apply(lambda x: df['PricePerRoom'][df['loc']==x['loc']].mean(), axis=1)\n",
" df['PricePerRoomVsLocAvg'] = df['PricePerRoom'] / df['AvgLocPricePerRoom']\n",
" if per_listing:\n",
" df['AvgLocPrice'] = df.apply(lambda x: df['price'][df['loc']==x['loc']].mean(), axis=1)\n",
" df['PriceVsLocAvg'] = df['price'] / df['AvgLocPrice']\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>building_id</th>\n",
" <th>created</th>\n",
" <th>description</th>\n",
" <th>display_address</th>\n",
" <th>features</th>\n",
" <th>interest_level</th>\n",
" <th>latitude</th>\n",
" <th>listing_id</th>\n",
" <th>...</th>\n",
" <th>lat_round</th>\n",
" <th>lon_round</th>\n",
" <th>loc</th>\n",
" <th>AvgLocPricePerRoom</th>\n",
" <th>PricePerRoomVsLocAvg</th>\n",
" <th>created_month</th>\n",
" <th>created_day</th>\n",
" <th>created_hour</th>\n",
" <th>count</th>\n",
" <th>skill</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.5</td>\n",
" <td>3</td>\n",
" <td>53a5b119ba8f7b61d4e010512e0dfc85</td>\n",
" <td>2016-06-24 07:54:24</td>\n",
" <td>A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...</td>\n",
" <td>Metropolitan Avenue</td>\n",
" <td>[]</td>\n",
" <td>medium</td>\n",
" <td>40.7145</td>\n",
" <td>7211212</td>\n",
" <td>...</td>\n",
" <td>40.71</td>\n",
" <td>-73.94</td>\n",
" <td>(40.71, -73.94)</td>\n",
" <td>765.314223</td>\n",
" <td>0.712720</td>\n",
" <td>6</td>\n",
" <td>24</td>\n",
" <td>7</td>\n",
" <td>90.0</td>\n",
" <td>1.255556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>c5c8a357cba207596b04d1afd1e4f130</td>\n",
" <td>2016-06-12 12:19:27</td>\n",
" <td></td>\n",
" <td>Columbus Avenue</td>\n",
" <td>[Doorman, Elevator, Fitness Center, Cats Allow...</td>\n",
" <td>low</td>\n",
" <td>40.7947</td>\n",
" <td>7150865</td>\n",
" <td>...</td>\n",
" <td>40.79</td>\n",
" <td>-73.97</td>\n",
" <td>(40.79, -73.97)</td>\n",
" <td>1133.218273</td>\n",
" <td>1.205637</td>\n",
" <td>6</td>\n",
" <td>12</td>\n",
" <td>12</td>\n",
" <td>86.0</td>\n",
" <td>1.011628</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>c3ba40552e2120b0acfc3cb5730bb2aa</td>\n",
" <td>2016-04-17 03:26:41</td>\n",
" <td>Top Top West Village location, beautiful Pre-w...</td>\n",
" <td>W 13 Street</td>\n",
" <td>[Laundry In Building, Dishwasher, Hardwood Flo...</td>\n",
" <td>high</td>\n",
" <td>40.7388</td>\n",
" <td>6887163</td>\n",
" <td>...</td>\n",
" <td>40.74</td>\n",
" <td>-74.00</td>\n",
" <td>(40.74, -74.0)</td>\n",
" <td>1205.296462</td>\n",
" <td>0.788188</td>\n",
" <td>4</td>\n",
" <td>17</td>\n",
" <td>3</td>\n",
" <td>134.0</td>\n",
" <td>1.305970</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>28d9ad350afeaab8027513a3e52ac8d5</td>\n",
" <td>2016-04-18 02:22:02</td>\n",
" <td>Building Amenities - Garage - Garden - fitness...</td>\n",
" <td>East 49th Street</td>\n",
" <td>[Hardwood Floors, No Fee]</td>\n",
" <td>low</td>\n",
" <td>40.7539</td>\n",
" <td>6888711</td>\n",
" <td>...</td>\n",
" <td>40.75</td>\n",
" <td>-73.97</td>\n",
" <td>(40.75, -73.97)</td>\n",
" <td>1078.049624</td>\n",
" <td>1.012631</td>\n",
" <td>4</td>\n",
" <td>18</td>\n",
" <td>2</td>\n",
" <td>191.0</td>\n",
" <td>1.057592</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>2016-04-28 01:32:41</td>\n",
" <td>Beautifully renovated 3 bedroom flex 4 bedroom...</td>\n",
" <td>West 143rd Street</td>\n",
" <td>[Pre-War]</td>\n",
" <td>low</td>\n",
" <td>40.8241</td>\n",
" <td>6934781</td>\n",
" <td>...</td>\n",
" <td>40.82</td>\n",
" <td>-73.95</td>\n",
" <td>(40.82, -73.95)</td>\n",
" <td>573.579717</td>\n",
" <td>0.973419</td>\n",
" <td>4</td>\n",
" <td>28</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 31 columns</p>\n",
"</div>"
],
"text/plain": [
" bathrooms bedrooms building_id created \\\n",
"0 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 \n",
"1 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 \n",
"2 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa 2016-04-17 03:26:41 \n",
"3 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 2016-04-18 02:22:02 \n",
"4 1.0 4 0 2016-04-28 01:32:41 \n",
"\n",
" description display_address \\\n",
"0 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue \n",
"1 Columbus Avenue \n",
"2 Top Top West Village location, beautiful Pre-w... W 13 Street \n",
"3 Building Amenities - Garage - Garden - fitness... East 49th Street \n",
"4 Beautifully renovated 3 bedroom flex 4 bedroom... West 143rd Street \n",
"\n",
" features interest_level latitude \\\n",
"0 [] medium 40.7145 \n",
"1 [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 \n",
"2 [Laundry In Building, Dishwasher, Hardwood Flo... high 40.7388 \n",
"3 [Hardwood Floors, No Fee] low 40.7539 \n",
"4 [Pre-War] low 40.8241 \n",
"\n",
" listing_id ... lat_round lon_round loc \\\n",
"0 7211212 ... 40.71 -73.94 (40.71, -73.94) \n",
"1 7150865 ... 40.79 -73.97 (40.79, -73.97) \n",
"2 6887163 ... 40.74 -74.00 (40.74, -74.0) \n",
"3 6888711 ... 40.75 -73.97 (40.75, -73.97) \n",
"4 6934781 ... 40.82 -73.95 (40.82, -73.95) \n",
"\n",
" AvgLocPricePerRoom PricePerRoomVsLocAvg created_month created_day \\\n",
"0 765.314223 0.712720 6 24 \n",
"1 1133.218273 1.205637 6 12 \n",
"2 1205.296462 0.788188 4 17 \n",
"3 1078.049624 1.012631 4 18 \n",
"4 573.579717 0.973419 4 28 \n",
"\n",
" created_hour count skill \n",
"0 7 90.0 1.255556 \n",
"1 12 86.0 1.011628 \n",
"2 3 134.0 1.305970 \n",
"3 2 191.0 1.057592 \n",
"4 1 0.0 0.000000 \n",
"\n",
"[5 rows x 31 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def add_features(df):\n",
" '''(DataFrame) -> DataFrame\n",
" \n",
" Will add new features to the current DataFrame.\n",
" '''\n",
" #Create # of Photos Column\n",
" df['NumPhotos'] = df.photos.str.len()\n",
" #Create # of Features Column\n",
" df['NumFeatures'] = df.features.str.len()\n",
" df['NumDescription'] = df.description.str.len()\n",
" #Total Rooms\n",
" df['TotalRooms'] = df['bathrooms'] + df['bedrooms']\n",
" #Room / Price\n",
" #Add one too all -assume every apartment is at least 1 room (studios)\n",
" df['PricePerRoom'] = df['price'] / (df['TotalRooms'] + 1.0)\n",
" df['PricePerBedRoom'] = df['price'] / (df['bedrooms'] + 1.0)\n",
" #Add Price vs Loc\n",
" df = add_price_vs_loc_avg(df)\n",
" #Add Date Features\n",
" df = add_date_features(df)\n",
" return df\n",
" \n",
"#Add features to Training Data\n",
"\n",
"train_df = add_features(train_df)\n",
"man_skill = compute_manager_skill(train_df)\n",
"train_df = add_manager_skill(train_df, man_skill)\n",
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>building_id</th>\n",
" <th>created</th>\n",
" <th>description</th>\n",
" <th>display_address</th>\n",
" <th>features</th>\n",
" <th>latitude</th>\n",
" <th>listing_id</th>\n",
" <th>longitude</th>\n",
" <th>...</th>\n",
" <th>lat_round</th>\n",
" <th>lon_round</th>\n",
" <th>loc</th>\n",
" <th>AvgLocPricePerRoom</th>\n",
" <th>PricePerRoomVsLocAvg</th>\n",
" <th>created_month</th>\n",
" <th>created_day</th>\n",
" <th>created_hour</th>\n",
" <th>count</th>\n",
" <th>skill</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>79780be1514f645d7e6be99a3de696c5</td>\n",
" <td>2016-06-11 05:29:41</td>\n",
" <td>Large with awesome terrace--accessible via bed...</td>\n",
" <td>Suffolk Street</td>\n",
" <td>[Elevator, Laundry in Building, Laundry in Uni...</td>\n",
" <td>40.7185</td>\n",
" <td>7142618</td>\n",
" <td>-73.9865</td>\n",
" <td>...</td>\n",
" <td>40.72</td>\n",
" <td>-73.99</td>\n",
" <td>(40.72, -73.99)</td>\n",
" <td>1096.100837</td>\n",
" <td>0.897119</td>\n",
" <td>6</td>\n",
" <td>11</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>2016-06-24 06:36:34</td>\n",
" <td>Prime Soho - between Bleecker and Houston - Ne...</td>\n",
" <td>Thompson Street</td>\n",
" <td>[Pre-War, Dogs Allowed, Cats Allowed]</td>\n",
" <td>40.7278</td>\n",
" <td>7210040</td>\n",
" <td>-74.0000</td>\n",
" <td>...</td>\n",
" <td>40.73</td>\n",
" <td>-74.00</td>\n",
" <td>(40.73, -74.0)</td>\n",
" <td>1213.440460</td>\n",
" <td>0.587173</td>\n",
" <td>6</td>\n",
" <td>24</td>\n",
" <td>6</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>3dbbb69fd52e0d25131aa1cd459c87eb</td>\n",
" <td>2016-06-03 04:29:40</td>\n",
" <td>New York chic has reached a new level ...</td>\n",
" <td>101 East 10th Street</td>\n",
" <td>[Doorman, Elevator, No Fee]</td>\n",
" <td>40.7306</td>\n",
" <td>7103890</td>\n",
" <td>-73.9890</td>\n",
" <td>...</td>\n",
" <td>40.73</td>\n",
" <td>-73.99</td>\n",
" <td>(40.73, -73.99)</td>\n",
" <td>1162.180848</td>\n",
" <td>1.077859</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>783d21d013a7e655bddc4ed0d461cc5e</td>\n",
" <td>2016-06-11 06:17:35</td>\n",
" <td>Step into this fantastic new Construction in t...</td>\n",
" <td>South Third Street\\r</td>\n",
" <td>[Roof Deck, Balcony, Elevator, Laundry in Buil...</td>\n",
" <td>40.7109</td>\n",
" <td>7143442</td>\n",
" <td>-73.9571</td>\n",
" <td>...</td>\n",
" <td>40.71</td>\n",
" <td>-73.96</td>\n",
" <td>(40.71, -73.96)</td>\n",
" <td>816.200305</td>\n",
" <td>1.010781</td>\n",
" <td>6</td>\n",
" <td>11</td>\n",
" <td>6</td>\n",
" <td>61.0</td>\n",
" <td>1.032787</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.0</td>\n",
" <td>2</td>\n",
" <td>6134e7c4dd1a98d9aee36623c9872b49</td>\n",
" <td>2016-04-12 05:24:17</td>\n",
" <td>~Take a stroll in Central Park, enjoy the ente...</td>\n",
" <td>Midtown West, 8th Ave</td>\n",
" <td>[Common Outdoor Space, Cats Allowed, Dogs Allo...</td>\n",
" <td>40.7650</td>\n",
" <td>6860601</td>\n",
" <td>-73.9845</td>\n",
" <td>...</td>\n",
" <td>40.77</td>\n",
" <td>-73.98</td>\n",
" <td>(40.77, -73.98)</td>\n",
" <td>2060.549580</td>\n",
" <td>0.475601</td>\n",
" <td>4</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" <td>72.0</td>\n",
" <td>1.236111</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 30 columns</p>\n",
"</div>"
],
"text/plain": [
" bathrooms bedrooms building_id created \\\n",
"0 1.0 1 79780be1514f645d7e6be99a3de696c5 2016-06-11 05:29:41 \n",
"1 1.0 2 0 2016-06-24 06:36:34 \n",
"2 1.0 1 3dbbb69fd52e0d25131aa1cd459c87eb 2016-06-03 04:29:40 \n",
"3 1.0 2 783d21d013a7e655bddc4ed0d461cc5e 2016-06-11 06:17:35 \n",
"4 2.0 2 6134e7c4dd1a98d9aee36623c9872b49 2016-04-12 05:24:17 \n",
"\n",
" description display_address \\\n",
"0 Large with awesome terrace--accessible via bed... Suffolk Street \n",
"1 Prime Soho - between Bleecker and Houston - Ne... Thompson Street \n",
"2 New York chic has reached a new level ... 101 East 10th Street \n",
"3 Step into this fantastic new Construction in t... South Third Street\\r \n",
"4 ~Take a stroll in Central Park, enjoy the ente... Midtown West, 8th Ave \n",
"\n",
" features latitude listing_id \\\n",
"0 [Elevator, Laundry in Building, Laundry in Uni... 40.7185 7142618 \n",
"1 [Pre-War, Dogs Allowed, Cats Allowed] 40.7278 7210040 \n",
"2 [Doorman, Elevator, No Fee] 40.7306 7103890 \n",
"3 [Roof Deck, Balcony, Elevator, Laundry in Buil... 40.7109 7143442 \n",
"4 [Common Outdoor Space, Cats Allowed, Dogs Allo... 40.7650 6860601 \n",
"\n",
" longitude ... lat_round lon_round loc \\\n",
"0 -73.9865 ... 40.72 -73.99 (40.72, -73.99) \n",
"1 -74.0000 ... 40.73 -74.00 (40.73, -74.0) \n",
"2 -73.9890 ... 40.73 -73.99 (40.73, -73.99) \n",
"3 -73.9571 ... 40.71 -73.96 (40.71, -73.96) \n",
"4 -73.9845 ... 40.77 -73.98 (40.77, -73.98) \n",
"\n",
" AvgLocPricePerRoom PricePerRoomVsLocAvg created_month created_day \\\n",
"0 1096.100837 0.897119 6 11 \n",
"1 1213.440460 0.587173 6 24 \n",
"2 1162.180848 1.077859 6 3 \n",
"3 816.200305 1.010781 6 11 \n",
"4 2060.549580 0.475601 4 12 \n",
"\n",
" created_hour count skill \n",
"0 5 0.0 0.000000 \n",
"1 6 0.0 0.000000 \n",
"2 4 0.0 0.000000 \n",
"3 6 61.0 1.032787 \n",
"4 5 72.0 1.236111 \n",
"\n",
"[5 rows x 30 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Add features to Test Data\n",
"#Load test data\n",
"test_df = pd.read_json('test.json')\n",
"#Add engineered features\n",
"test_df = add_features(test_df)\n",
"test_df = add_manager_skill(test_df, man_skill)\n",
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare Data for ML & Transform Features\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Apply same transforms to test features"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>building_id</th>\n",
" <th>created</th>\n",
" <th>description</th>\n",
" <th>display_address</th>\n",
" <th>features</th>\n",
" <th>latitude</th>\n",
" <th>listing_id</th>\n",
" <th>longitude</th>\n",
" <th>...</th>\n",
" <th>AvgLocPricePerRoom</th>\n",
" <th>PricePerRoomVsLocAvg</th>\n",
" <th>created_month</th>\n",
" <th>created_day</th>\n",
" <th>created_hour</th>\n",
" <th>count</th>\n",
" <th>skill</th>\n",
" <th>BuildingID</th>\n",
" <th>ManagerID</th>\n",
" <th>LocID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>79780be1514f645d7e6be99a3de696c5</td>\n",
" <td>2016-06-11 05:29:41</td>\n",
" <td>Large with awesome terrace--accessible via bed...</td>\n",
" <td>Suffolk Street</td>\n",
" <td>[Elevator, Laundry in Building, Laundry in Uni...</td>\n",
" <td>40.7185</td>\n",
" <td>7142618</td>\n",
" <td>-73.9865</td>\n",
" <td>...</td>\n",
" <td>1096.100837</td>\n",
" <td>0.897119</td>\n",
" <td>6</td>\n",
" <td>11</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>5535</td>\n",
" <td>3076</td>\n",
" <td>264</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>2016-06-24 06:36:34</td>\n",
" <td>Prime Soho - between Bleecker and Houston - Ne...</td>\n",
" <td>Thompson Street</td>\n",
" <td>[Pre-War, Dogs Allowed, Cats Allowed]</td>\n",
" <td>40.7278</td>\n",
" <td>7210040</td>\n",
" <td>-74.0000</td>\n",
" <td>...</td>\n",
" <td>1213.440460</td>\n",
" <td>0.587173</td>\n",
" <td>6</td>\n",
" <td>24</td>\n",
" <td>6</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0</td>\n",
" <td>3593</td>\n",
" <td>292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>3dbbb69fd52e0d25131aa1cd459c87eb</td>\n",
" <td>2016-06-03 04:29:40</td>\n",
" <td>New York chic has reached a new level ...</td>\n",
" <td>101 East 10th Street</td>\n",
" <td>[Doorman, Elevator, No Fee]</td>\n",
" <td>40.7306</td>\n",
" <td>7103890</td>\n",
" <td>-73.9890</td>\n",
" <td>...</td>\n",
" <td>1162.180848</td>\n",
" <td>1.077859</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>2813</td>\n",
" <td>2677</td>\n",
" <td>293</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>783d21d013a7e655bddc4ed0d461cc5e</td>\n",
" <td>2016-06-11 06:17:35</td>\n",
" <td>Step into this fantastic new Construction in t...</td>\n",
" <td>South Third Street\\r</td>\n",
" <td>[Roof Deck, Balcony, Elevator, Laundry in Buil...</td>\n",
" <td>40.7109</td>\n",
" <td>7143442</td>\n",
" <td>-73.9571</td>\n",
" <td>...</td>\n",
" <td>816.200305</td>\n",
" <td>1.010781</td>\n",
" <td>6</td>\n",
" <td>11</td>\n",
" <td>6</td>\n",
" <td>61.0</td>\n",
" <td>1.032787</td>\n",
" <td>5477</td>\n",
" <td>201</td>\n",
" <td>235</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.0</td>\n",
" <td>2</td>\n",
" <td>6134e7c4dd1a98d9aee36623c9872b49</td>\n",
" <td>2016-04-12 05:24:17</td>\n",
" <td>~Take a stroll in Central Park, enjoy the ente...</td>\n",
" <td>Midtown West, 8th Ave</td>\n",
" <td>[Common Outdoor Space, Cats Allowed, Dogs Allo...</td>\n",
" <td>40.7650</td>\n",
" <td>6860601</td>\n",
" <td>-73.9845</td>\n",
" <td>...</td>\n",
" <td>2060.549580</td>\n",
" <td>0.475601</td>\n",
" <td>4</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" <td>72.0</td>\n",
" <td>1.236111</td>\n",
" <td>4428</td>\n",
" <td>3157</td>\n",
" <td>384</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 33 columns</p>\n",
"</div>"
],
"text/plain": [
" bathrooms bedrooms building_id created \\\n",
"0 1.0 1 79780be1514f645d7e6be99a3de696c5 2016-06-11 05:29:41 \n",
"1 1.0 2 0 2016-06-24 06:36:34 \n",
"2 1.0 1 3dbbb69fd52e0d25131aa1cd459c87eb 2016-06-03 04:29:40 \n",
"3 1.0 2 783d21d013a7e655bddc4ed0d461cc5e 2016-06-11 06:17:35 \n",
"4 2.0 2 6134e7c4dd1a98d9aee36623c9872b49 2016-04-12 05:24:17 \n",
"\n",
" description display_address \\\n",
"0 Large with awesome terrace--accessible via bed... Suffolk Street \n",
"1 Prime Soho - between Bleecker and Houston - Ne... Thompson Street \n",
"2 New York chic has reached a new level ... 101 East 10th Street \n",
"3 Step into this fantastic new Construction in t... South Third Street\\r \n",
"4 ~Take a stroll in Central Park, enjoy the ente... Midtown West, 8th Ave \n",
"\n",
" features latitude listing_id \\\n",
"0 [Elevator, Laundry in Building, Laundry in Uni... 40.7185 7142618 \n",
"1 [Pre-War, Dogs Allowed, Cats Allowed] 40.7278 7210040 \n",
"2 [Doorman, Elevator, No Fee] 40.7306 7103890 \n",
"3 [Roof Deck, Balcony, Elevator, Laundry in Buil... 40.7109 7143442 \n",
"4 [Common Outdoor Space, Cats Allowed, Dogs Allo... 40.7650 6860601 \n",
"\n",
" longitude ... AvgLocPricePerRoom PricePerRoomVsLocAvg created_month \\\n",
"0 -73.9865 ... 1096.100837 0.897119 6 \n",
"1 -74.0000 ... 1213.440460 0.587173 6 \n",
"2 -73.9890 ... 1162.180848 1.077859 6 \n",
"3 -73.9571 ... 816.200305 1.010781 6 \n",
"4 -73.9845 ... 2060.549580 0.475601 4 \n",
"\n",
" created_day created_hour count skill BuildingID ManagerID LocID \n",
"0 11 5 0.0 0.000000 5535 3076 264 \n",
"1 24 6 0.0 0.000000 0 3593 292 \n",
"2 3 4 0.0 0.000000 2813 2677 293 \n",
"3 11 6 61.0 1.032787 5477 201 235 \n",
"4 12 5 72.0 1.236111 4428 3157 384 \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"\n",
"#ENCODE TEXT FEATURES\n",
"#Combine the train and test columns\n",
"manager_combo = train_df['manager_id'].append(test_df['manager_id'])\n",
"building_combo = train_df['building_id'].append(test_df['building_id'])\n",
"loc_combo = train_df['loc'].append(test_df['loc'])\n",
"#Encode building_id\n",
"le_building = LabelEncoder()\n",
"le_building.fit(building_combo)\n",
"#Transform Train & Test set\n",
"train_df['BuildingID'] = le_building.transform(train_df['building_id'])\n",
"test_df['BuildingID'] = le_building.transform(test_df['building_id'])\n",
"#Encode manager_id\n",
"le_manager = LabelEncoder()\n",
"le_manager.fit(manager_combo)\n",
"#Transform Train & Test set\n",
"train_df['ManagerID'] = le_manager.transform(train_df['manager_id'])\n",
"test_df['ManagerID'] = le_manager.transform(test_df['manager_id'])\n",
"#Encode loc\n",
"le_loc = LabelEncoder()\n",
"le_loc.fit(loc_combo)\n",
"#Transform Train & Test set\n",
"train_df['LocID'] = le_loc.transform(train_df['loc'])\n",
"test_df['LocID'] = le_loc.transform(test_df['loc'])\n",
"\n",
"#Inspect to verify\n",
"test_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Pickle for easy backup\n",
"train_df.to_pickle('train_df.pickle')\n",
"test_df.to_pickle('test_df.pickle')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select Features for Model"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Select Features\n",
"feature_cols = ['price', 'PricePerRoom', 'PricePerRoomVsLocAvg', 'BuildingID', 'NumDescription', 'ManagerID', 'NumPhotos',\n",
" 'NumFeatures', 'latitude', 'longitude', 'bedrooms', 'bathrooms', 'created_month', 'created_day', 'created_hour',\n",
" 'skill']\n",
"\n",
"#Prepare data for ML\n",
"X_train = train_df[feature_cols].values\n",
"X_test = test_df[feature_cols].values\n",
"\n",
"#Encode 'interest_level' to numerical\n",
"le_interest = LabelEncoder()\n",
"train_df['IL'] = le_interest.fit_transform(train_df['interest_level'])\n",
"#Set Train Y\n",
"Y = train_df['IL'].values"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1\n",
"1 0.087919 PricePerRoom\n",
"0 0.082722 price\n",
"3 0.078318 BuildingID\n",
"2 0.077875 PricePerRoomVsLocAvg\n",
"14 0.070911 created_hour\n",
"15 0.070178 skill\n",
"13 0.069552 created_day\n",
"4 0.068600 NumDescription\n",
"8 0.063100 latitude\n",
"9 0.061070 longitude\n",
"7 0.059877 NumFeatures\n",
"5 0.058974 ManagerID\n",
"6 0.058770 NumPhotos\n",
"12 0.037824 created_month\n",
"10 0.036804 bedrooms\n",
"11 0.017506 bathrooms\n"
]
}
],
"source": [
"#Find important features\n",
"from sklearn.ensemble import ExtraTreesClassifier\n",
"\n",
"model = ExtraTreesClassifier()\n",
"model.fit(X_train, Y)\n",
"importances = zip(model.feature_importances_, feature_cols)\n",
"importances = pd.DataFrame(importances)\n",
"print importances.sort_values(0, ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([u'high', u'low', u'medium'], dtype=object)"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Get Label encodings for reference later\n",
"le_interest.classes_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test ML Algorithm\n",
"\n",
"#### Random Forest Classifier"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-0.581278515186\n"
]
}
],
"source": [
"#RandomForest\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"#Initialize Model\n",
"rf = RandomForestClassifier(n_estimators=100, min_samples_split=20, criterion='entropy', n_jobs=-1)\n",
"#Create KFold (folds are not shuffled, so a random_state would have no effect)\n",
"kfold = KFold(n_splits=5)\n",
"cross_val_results = cross_val_score(rf, X_train, Y, cv=kfold, scoring='neg_log_loss')\n",
"print cross_val_results.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Grid Search for best RF parameters"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best Score: -0.580035\n",
"{'max_features': 'log2', 'min_samples_split': 20, 'criterion': 'entropy'}\n"
]
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {\n",
" 'min_samples_split' : [2, 4, 10, 20, 40],\n",
" 'criterion' : ['gini', 'entropy'],\n",
" 'max_features' : ['auto', 'log2', None]\n",
"}\n",
"\n",
"rf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1)\n",
"grid_search = GridSearchCV(estimator=rf100, param_grid=param_grid, cv=5, scoring='neg_log_loss')\n",
"grid_search.fit(X_train, Y)\n",
"\n",
"print \"Best Score: %f\" % grid_search.best_score_\n",
"print grid_search.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### XGB"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-0.564649263704\n"
]
}
],
"source": [
"from xgboost import XGBClassifier\n",
"\n",
"#Initialize Model\n",
"xgb = XGBClassifier(objective='multi:softprob', max_depth=8, subsample=0.7)\n",
"#Create cross validation generator (folds are not shuffled, so a random_state would have no effect)\n",
"kfold = KFold(n_splits=5)\n",
"#Train & Test model\n",
"cross_val_results = cross_val_score(xgb, X_train, Y, cv=kfold, scoring='neg_log_loss')\n",
"print cross_val_results.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### GridSearch for Best XGB Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"parameters = {'learning_rate': [0.1, 0.3],\n",
" 'min_child_weight': [3, 5, 8],\n",
" 'subsample' : [0.6, 0.7, 0.8],\n",
" 'max_depth' : [3, 5, 10]}\n",
"\n",
"xgb = XGBClassifier()\n",
"grid_search = GridSearchCV(xgb, parameters, n_jobs=-1, cv=10, scoring='neg_log_loss')\n",
"grid_search.fit(X_train, Y)\n",
"print \"Best Score: %f\" % grid_search.best_score_\n",
"print grid_search.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train Model & Make Submission\n"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#XGB\n",
"xgb = XGBClassifier(objective='multi:softprob', max_depth=8, subsample=0.7, n_estimators=100)\n",
"xgb.fit(X_train, Y)\n",
"predictions = xgb.predict_proba(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>listing_id</th>\n",
" <th>high</th>\n",
" <th>medium</th>\n",
" <th>low</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7142618</td>\n",
" <td>0.096804</td>\n",
" <td>0.490520</td>\n",
" <td>0.412676</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>7210040</td>\n",
" <td>0.081878</td>\n",
" <td>0.083418</td>\n",
" <td>0.834704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7103890</td>\n",
" <td>0.025219</td>\n",
" <td>0.104039</td>\n",
" <td>0.870742</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7143442</td>\n",
" <td>0.057672</td>\n",
" <td>0.364056</td>\n",
" <td>0.578271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6860601</td>\n",
" <td>0.072051</td>\n",
" <td>0.355544</td>\n",
" <td>0.572405</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" listing_id high medium low\n",
"0 7142618 0.096804 0.490520 0.412676\n",
"1 7210040 0.081878 0.083418 0.834704\n",
"2 7103890 0.025219 0.104039 0.870742\n",
"3 7143442 0.057672 0.364056 0.578271\n",
"4 6860601 0.072051 0.355544 0.572405"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Submission must be - listing_id, high, medium, low\n",
"#The column index of our probabilities comes from the label encoder earlier (0=high, 1=low, 2=medium)\n",
"submission_df = pd.DataFrame({'listing_id':test_df['listing_id'], 'high':predictions[:, 0],\n",
" 'medium':predictions[:, 2], 'low':predictions[:, 1]})\n",
"#Re-Order Columns for submission\n",
"cols = ['listing_id', 'high', 'medium', 'low']\n",
"submission_df = submission_df[cols]\n",
"submission_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Write to CSV for submission\n",
"submission_df.to_csv('xgb.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make an RF Submission"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Random Forest\n",
"rf = RandomForestClassifier(n_estimators=1000, min_samples_split=20, criterion='entropy', n_jobs=-1)\n",
"rf.fit(X_train, Y)\n",
"prediction_probabilites = rf.predict_proba(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.117786</td>\n",
" <td>price</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.122826</td>\n",
" <td>PricePerRoom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.108902</td>\n",
" <td>PricePerRoomVsLocAvg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.095866</td>\n",
" <td>BuildingID</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.067968</td>\n",
" <td>NumDescription</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.051013</td>\n",
" <td>ManagerID</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.044346</td>\n",
" <td>NumPhotos</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.046239</td>\n",
" <td>NumFeatures</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.068551</td>\n",
" <td>latitude</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.064987</td>\n",
" <td>longitude</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>0.019630</td>\n",
" <td>bedrooms</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>0.009462</td>\n",
" <td>bathrooms</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>0.011335</td>\n",
" <td>created_month</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>0.040890</td>\n",
" <td>created_day</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>0.054186</td>\n",
" <td>created_hour</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>0.076013</td>\n",
" <td>skill</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 0.117786 price\n",
"1 0.122826 PricePerRoom\n",
"2 0.108902 PricePerRoomVsLocAvg\n",
"3 0.095866 BuildingID\n",
"4 0.067968 NumDescription\n",
"5 0.051013 ManagerID\n",
"6 0.044346 NumPhotos\n",
"7 0.046239 NumFeatures\n",
"8 0.068551 latitude\n",
"9 0.064987 longitude\n",
"10 0.019630 bedrooms\n",
"11 0.009462 bathrooms\n",
"12 0.011335 created_month\n",
"13 0.040890 created_day\n",
"14 0.054186 created_hour\n",
"15 0.076013 skill"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check out feature importance\n",
"importances = zip(rf.feature_importances_, feature_cols)\n",
"importances = pd.DataFrame(importances)\n",
"importances"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>listing_id</th>\n",
" <th>high</th>\n",
" <th>medium</th>\n",
" <th>low</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7142618</td>\n",
" <td>0.073706</td>\n",
" <td>0.430420</td>\n",
" <td>0.495873</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>7210040</td>\n",
" <td>0.138937</td>\n",
" <td>0.203632</td>\n",
" <td>0.657431</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7103890</td>\n",
" <td>0.020215</td>\n",
" <td>0.127631</td>\n",
" <td>0.852154</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7143442</td>\n",
" <td>0.078329</td>\n",
" <td>0.263707</td>\n",
" <td>0.657964</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6860601</td>\n",
" <td>0.089804</td>\n",
" <td>0.366559</td>\n",
" <td>0.543637</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" listing_id high medium low\n",
"0 7142618 0.073706 0.430420 0.495873\n",
"1 7210040 0.138937 0.203632 0.657431\n",
"2 7103890 0.020215 0.127631 0.852154\n",
"3 7143442 0.078329 0.263707 0.657964\n",
"4 6860601 0.089804 0.366559 0.543637"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Submission must be - listing_id, high, medium, low\n",
"#The column index of our probabilities comes from the label encoder earlier (0=high, 1=low, 2=medium)\n",
"submission_df = pd.DataFrame({'listing_id':test_df['listing_id'], 'high':prediction_probabilites[:, 0],\n",
" 'medium':prediction_probabilites[:, 2], 'low':prediction_probabilites[:, 1]})\n",
"#Re-Order Columns for submission\n",
"cols = ['listing_id', 'high', 'medium', 'low']\n",
"submission_df = submission_df[cols]\n",
"submission_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"listing_id 0\n",
"high 0\n",
"medium 0\n",
"low 0\n",
"dtype: int64"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Verify all is well (no NaNs)\n",
"submission_df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Write to CSV for submission\n",
"submission_df.to_csv('rf.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"The XGB model scored 0.56831 and the random forest scored 0.58048 on Kaggle (multi-class log loss, lower is better), so XGB was the stronger of the two.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
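
The submission cells in the notebook above pick `predict_proba` columns by hard-coded index (0=high, 1=low, 2=medium), which depends on `LabelEncoder` sorting its classes alphabetically. The sketch below shows the same reordering done by class name, together with the multi-class log loss used for scoring. It is plain Python rather than scikit-learn, the helper names (`class_order`, `to_submission_rows`, `multiclass_log_loss`) are hypothetical, and the probabilities are made-up toy values, not notebook results.

```python
import math


def class_order(labels):
    """Sorted unique labels, the order LabelEncoder uses for its integer codes."""
    return sorted(set(labels))


def to_submission_rows(listing_ids, probabilities, classes,
                       wanted=("high", "medium", "low")):
    """Reorder each probability row from encoder order into submission order."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for listing_id, row in zip(listing_ids, probabilities):
        rows.append([listing_id] + [row[index[name]] for name in wanted])
    return rows


def multiclass_log_loss(true_labels, probabilities, classes, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    index = {name: i for i, name in enumerate(classes)}
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        # Clip to avoid log(0); the eps value is an assumption of this sketch.
        p = min(max(row[index[label]], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Toy data: rows are in encoder order (high, low, medium).
classes = class_order(["low", "medium", "high", "low"])
probs = [[0.1, 0.4, 0.5], [0.2, 0.7, 0.1]]
rows = to_submission_rows([101, 102], probs, classes)
loss = multiclass_log_loss(["medium", "low"], probs, classes)

print(classes)
print(rows)
print(round(loss, 4))
```

In the notebook itself, the same idea would amount to looking up each column position in `le_interest.classes_` instead of typing the indices by hand, so the mapping stays correct if the label set ever changes.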