@roycoding
Last active June 27, 2017 18:37
Day 3: Mean benchmark of the Bike Sharing Demand Kaggle competition.

Mean benchmark for Bike Sharing Demand competition

In the Bike Sharing Demand competition on Kaggle, the goal is to predict demand for bike-share bikes in Washington, D.C. from historical usage data. It is a regression problem, and the evaluation metric is root mean squared logarithmic error (RMSLE).

I decided to recreate the mean value benchmark using Unix command-line tools. The benchmark predicts the mean count from the training set for every datetime in the test set, i.e. the same single value for every predicted count.

I used the csvkit suite of tools (csvcut and csvstat) along with sed to recreate the benchmark. This was my first time using csvkit, and I'm happy with it so far!

# Calculate the mean of the training set counts (column 12 of train.csv is the count column)
MEAN=$(csvcut -c 12 train.csv | csvstat --mean)
# Write the test set datetime stamps (column 1) plus a second column to csv:
# the header line gets ",count"; `1n` then prints it and moves on, so only data rows get the mean
csvcut -c 1 test.csv | sed -e "1 s/$/,count/; 1n; s/$/,$MEAN/" > mean-benchmark.csv
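
On a machine without csvkit, the same two steps can be sketched with awk alone. This is an illustrative alternative, not the commands used above. It assumes the count is the last column of the training file, the datetime is the first column of the test file, and no field contains a quoted comma. The two small files are made-up stand-ins for train.csv and test.csv, not competition data.

```shell
# Made-up stand-ins for train.csv and test.csv (hypothetical values)
workdir=$(mktemp -d)
printf 'datetime,count\n2011-01-01 00:00:00,10\n2011-01-01 01:00:00,20\n2011-01-01 02:00:00,30\n' > "$workdir/train.csv"
printf 'datetime,temp\n2011-01-20 00:00:00,9.84\n2011-01-20 01:00:00,9.02\n' > "$workdir/test.csv"

# Mean of the last column of the training file, skipping the header row
MEAN=$(awk -F, 'NR > 1 { sum += $NF; n++ } END { print sum / n }' "$workdir/train.csv")

# First column of the test file plus the constant prediction;
# the header row gets the column name instead of the mean
awk -F, -v mean="$MEAN" '
  NR == 1 { print $1 ",count"; next }
          { print $1 "," mean }
' "$workdir/test.csv" > "$workdir/mean-benchmark.csv"

cat "$workdir/mean-benchmark.csv"
```

Splitting on bare commas is the main shortcut here: csvkit parses quoted fields properly, while this awk version would break on them.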

This scores an RMSLE of 1.58456 on the public leaderboard.
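
The leaderboard score cannot be reproduced locally, since the test set counts are hidden, but the metric itself is easy to compute on any file that does contain counts. Below is a minimal sketch of RMSLE for a constant prediction, i.e. the square root of the mean of (log(p + 1) - log(a + 1))^2 over all rows, where p is the prediction and a is the actual count. The helper name `rmsle_constant` is hypothetical, the sample counts are made up, and the code assumes a header row, the count in the last column, and no quoted commas.

```shell
# rmsle_constant PRED FILE: RMSLE of predicting PRED for every row of FILE.
# Hypothetical helper, for illustration only.
rmsle_constant() {
  awk -F, -v pred="$1" '
    NR > 1 { d = log(pred + 1) - log($NF + 1); sum += d * d; n++ }
    END    { print sqrt(sum / n) }
  ' "$2"
}

# Made-up counts, not competition data
sample=$(mktemp)
printf 'datetime,count\n2011-01-01 00:00:00,1\n2011-01-01 01:00:00,7\n' > "$sample"

# Predicting 3 for actual counts 1 and 7 gives log-differences of +ln 2 and -ln 2
rmsle_constant 3 "$sample"
```

With these sample values the result is ln 2, about 0.693. Running the helper on train.csv with the training mean would give a training-set score, which is a sanity check and not the same thing as the public leaderboard number.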
