@roycoding
Last active June 27, 2017 18:35
Beat the Benchmark: Bike Sharing Demand

Beating the Bike Sharing Demand benchmark

Day 3 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for shared bikes in Washington, D.C. based on historical usage data. For this regression problem, the evaluation metric is the root mean squared logarithmic error (RMSLE).
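For reference, here is a minimal sketch of how RMSLE can be computed with NumPy. The function name `rmsle` and the toy values are my own illustration, not part of the competition code; the formula is the usual one, the square root of the mean squared difference between log(prediction + 1) and log(actual + 1).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(x + 1), so zero counts do not cause an error
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy values, not competition data
actual = [0, 10, 100]
perfect = [0, 10, 100]
print(rmsle(actual, perfect))
print(rmsle(actual, [5, 5, 5]))
```

Because the error is measured on a log scale, being off by a factor of two costs about the same whether the true count is 10 or 1000.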

To beat the total mean count benchmark I tried two strategies, one very simple and another slightly more sophisticated. The first strategy was to use the per-month mean. The second was to use a rolling mean.

Per-month count means

Using pandas, I loaded the train and test data sets into Python. I then downsampled by month using the mean and upsampled by hour, filling in each hour with the mean value of its month.

import pandas as pd

train = pd.read_csv('train.csv', parse_dates=[0])
test = pd.read_csv('test.csv', parse_dates=[0])

# Create a time series based on the datetime and count columns
train_time = pd.Series(train['count'].values, index=train.datetime.values)

# Down-sample to calculate per-month means, labeled by month end
# ('M' is spelled 'ME' in pandas >= 2.2)
train_M = train_time.resample('M').mean()
# Add 2011-01-01 and 2012-12-31 23:00 and sort
train_M[pd.Timestamp(2011, 1, 1)] = train_M.iloc[0]
train_M[pd.Timestamp(2012, 12, 31, 23)] = train_M[pd.Timestamp(2012, 12, 31)]
train_M = train_M.sort_index()

# Up-sample per hour, back-filling each hour with its month's mean
# ('H' is spelled 'h' in pandas >= 2.2)
train_H = train_M.resample('H').bfill()

# Extract only test set timestamp values and write to file (no header row)
train_H.loc[test.datetime.values].to_csv('month_means.csv', header=False)

Finally, add a header line on the command line. (You could easily do this within Python instead.)

sed -i '1idatetime,count' month_means.csv
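As a sketch of the "within Python" alternative, pandas can write the header itself. The toy series `preds` and the temporary output path below are stand-ins I made up for illustration; `rename`, `index_label` and `header` are standard pandas options, but check them against your installed version.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the hourly predictions (hypothetical values)
preds = pd.Series(
    [12.5, 12.5, 30.0],
    index=pd.to_datetime(['2011-01-20 00:00',
                          '2011-01-20 01:00',
                          '2011-01-20 02:00']),
)

out_path = os.path.join(tempfile.mkdtemp(), 'month_means.csv')

# Name the value column via the Series name and the index column via index_label
preds.rename('count').to_csv(out_path, index_label='datetime', header=True)

with open(out_path) as f:
    first_line = f.readline().strip()
print(first_line)
```

With this, the `sed` step is unnecessary because the file already starts with the `datetime,count` line.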

This scored an RMSLE of 1.50836 on the public leaderboard (beating the benchmark value of 1.58456).
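The per-month mean idea can also be sketched without resampling, by grouping on a year-month key and looking up each test timestamp's month. This is an alternative route to the same kind of prediction, not the code used for the submission, and the small `train` and `test` frames below are made-up toy data.

```python
import pandas as pd

# Toy training data: two observations in each of two months (hypothetical values)
train = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-01-01 00:00', '2011-01-19 23:00',
                                '2011-02-01 00:00', '2011-02-19 23:00']),
    'count': [10, 30, 100, 200],
})

# Toy test timestamps that fall later in the same months
test = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-01-25 12:00', '2011-02-20 00:00']),
})

# Mean count per calendar month, keyed by a 'YYYY-MM' string
month_key = train['datetime'].dt.strftime('%Y-%m')
month_means = train.groupby(month_key)['count'].mean()

# Give every test row the mean of its own month
test['count'] = test['datetime'].dt.strftime('%Y-%m').map(month_means)
print(test)
```

Here the January test row gets the January mean and the February row gets the February mean, which is what the back-fill in the resampling approach achieves.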

Rolling mean

The second method was similar to the first, but instead of a per-month mean for the "missing" count values, a 10-day rolling mean was used.

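# Continuing from above code
import numpy as np

# Work on a copy so that train_time is left unchanged
train_i = train_time.copy()

# Add the final datetime from the test set
train_i[pd.Timestamp(2012, 12, 31, 23)] = np.nan

# Calculate the 10 day rolling mean (240 hourly rows)
rolling = train_i.rolling(240).mean()

# Up-sample per hour with back-fill, then pad any remaining NaNs forward
# ('H' is spelled 'h' in pandas >= 2.2)
rolling_H = rolling.resample('H').bfill().ffill()

# Extract only test set timestamp values and write to file (no header row)
rolling_H.loc[test.datetime.values].to_csv('rolling_mean.csv', header=False)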

This scored an RMSLE of 1.51226 on the public leaderboard, which was worse than the per-month mean result.
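To make the rolling mean step concrete, here is a small sketch on a toy series with a window of 3 rather than 240. The values are made up. It shows that a fixed-size window yields NaN until it has seen a full window of rows, which is why the NaNs have to be filled before writing predictions.

```python
import pandas as pd

# Toy series standing in for the hourly counts (hypothetical values)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3 rows; the real code uses 240 rows = 10 days * 24 hours
rolled = s.rolling(window=3).mean()
print(rolled.tolist())

# Back-fill the leading NaNs from the first complete window
filled = rolled.bfill()
print(filled.tolist())

hours_in_window = 10 * 24
print(hours_in_window)
```

Note that the window counts rows, not elapsed time, so with gaps in the hourly data 240 rows can span slightly more than 10 calendar days.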
