@roycoding
roycoding / 2014-03-11-gists.md
Last active June 19, 2017 18:47
Github Gists: Blogging for the lazy

Github Gists: Blogging for the lazy

Roy Keyes

11 March 2014 - This is a post on my blog.

Recently I decided to revamp my website. I wanted it to be simple and mobile friendly, to support Markdown-based blogging, and to not cost an arm and a leg to host.

Static sites are all the rage these days, and not without reason. They're cheap, fast, and portable. Of the several hosting options I looked at, including S3, Github seemed like the easiest. A site is included even with your free account and you can just push a git repo to publish.

Although static site generators are very popular, I decided that I would simply use a CSS framework like Bootstrap. Having built a few websites before, I knew I wanted to start "responsive" out of the box and use something with light mental overhead. At some point I came across Skeleton and it seemed to fit the bill.

@roycoding
roycoding / klackers.md
Last active September 1, 2020 10:58
Klackers Strategy

Klackers strategy via Monte Carlo

Roy Keyes

19 May 2014 - This is a post on my blog.

Klackers (a.k.a. Shut the Box) is a dice game, often played in bars and pubs. It's a game of chance, arithmetic, and strategy. This little project is intended to find the best simple strategy for playing Klackers. Maybe you can become a Klackers shark...?

The game

Klackers is played with dice on a game board like the one pictured below.

A Shut the Box game, via Wikipedia
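A Monte Carlo comparison of simple strategies could look roughly like the sketch below. The rules are not spelled out in this excerpt, so the sketch assumes the common Shut the Box variant: tiles 1 to 9, two dice on every roll, any set of open tiles summing to the roll may be shut, and the score is the sum of the tiles left open when no move is possible. The two strategies (`fewest_tiles`, `most_tiles`) are hypothetical examples, not necessarily the ones evaluated in the post.

```python
import random
from itertools import combinations


def legal_moves(open_tiles, roll):
    """Return every subset of the open tiles whose values sum to the roll."""
    tiles = sorted(open_tiles)
    return [
        combo
        for size in range(1, len(tiles) + 1)
        for combo in combinations(tiles, size)
        if sum(combo) == roll
    ]


def fewest_tiles(moves):
    """Hypothetical strategy: shut as few tiles as possible."""
    return min(moves, key=len)


def most_tiles(moves):
    """Hypothetical strategy: shut as many tiles as possible."""
    return max(moves, key=len)


def play_game(strategy, rng):
    """Play one game and return the sum of the tiles left open (0 is a win)."""
    open_tiles = set(range(1, 10))  # assumed board: tiles 1 through 9
    while open_tiles:
        roll = rng.randint(1, 6) + rng.randint(1, 6)  # assumed: always two dice
        moves = legal_moves(open_tiles, roll)
        if not moves:
            break
        open_tiles -= set(strategy(moves))
    return sum(open_tiles)


def mean_score(strategy, n_games=2000, seed=0):
    """Estimate the expected final score of a strategy by simulation."""
    rng = random.Random(seed)
    return sum(play_game(strategy, rng) for _ in range(n_games)) / n_games


if __name__ == "__main__":
    for strategy in (fewest_tiles, most_tiles):
        print(strategy.__name__, mean_score(strategy))
```

A lower mean score is better under these assumed rules. Raising `n_games` tightens the estimate at the cost of run time.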

@roycoding
roycoding / gcf.py
Last active August 29, 2015 14:05
Kaggle - Titanic: match the Gender, Class, Fare benchmark
# Python code for the Kaggle Titanic competition
# https://www.kaggle.com/c/titanic-gettingStarted
# This code implements the gender, class, fare benchmark.
# This is part of the Match 5 Kaggle Benchmarks in 5 Days challenge.
# https://www.kaggle.com/forums/t/9993/match-5-kaggle-benchmarks-in-5-days
import pandas as pd
import numpy as np
@roycoding
roycoding / pizza.md
Last active June 27, 2017 18:38
All zeros benchmark for Random Acts of Pizza

All zeros benchmark for Random Acts of Pizza competition

In the Random Acts of Pizza competition on Kaggle, the goal is to predict whether people posting on Reddit's Random Acts of Pizza subreddit will actually receive a free pizza based on their post. For this classification problem, the evaluation metric is AUC.

I recreated the all-zeros benchmark using a couple of Unix command-line tools.

  1. Create the CSV header:
echo "request_id,requester_received_pizza" > zero-benchmark.csv
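For comparison, the same benchmark can be sketched in Python. This is not the command-line approach used in the post. It assumes the test data is a JSON array of objects that each carry a `request_id` field, and it reuses the column names from the header above. The function name and sample IDs are made up for illustration.

```python
import csv
import io
import json


def all_zeros_submission(requests):
    """Build CSV text that predicts 0 (no pizza) for every request."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["request_id", "requester_received_pizza"])
    for request in requests:
        writer.writerow([request["request_id"], 0])
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in for the parsed test file; these request IDs are invented.
    sample = json.loads('[{"request_id": "t3_aaa"}, {"request_id": "t3_bbb"}]')
    print(all_zeros_submission(sample), end="")
```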
@roycoding
roycoding / bikeshare.md
Last active June 27, 2017 18:37
Day 3: Mean benchmark of the Bike Sharing Demand Kaggle competition.

Mean benchmark for Bike Sharing Demand competition

In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for bike share bikes in Washington DC based on historical usage data. For this regression problem, the evaluation metric is RMSLE.

I decided to recreate the mean value benchmark using Unix command-line tools. The benchmark consists of using the overall usage mean from the training set for all test set datetimes (i.e. using the same, single value for all predicted counts).

I used the csvkit suite of tools along with sed to recreate the benchmark. This was my first time using csvkit and I'm happy so far!
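The benchmark logic itself is small enough to sketch in plain Python, shown here as an alternative to the csvkit and sed pipeline rather than a reproduction of it. The sketch assumes the training CSV has `datetime` and `count` columns and that the submission uses the same two column names. The sample rows are invented.

```python
import csv
import io
from statistics import mean


def mean_benchmark(train_csv, test_csv):
    """Predict the overall training mean of `count` for every test datetime."""
    train_rows = csv.DictReader(io.StringIO(train_csv))
    overall_mean = mean(float(row["count"]) for row in train_rows)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["datetime", "count"])
    for row in csv.DictReader(io.StringIO(test_csv)):
        # Every test row gets the same single value.
        writer.writerow([row["datetime"], overall_mean])
    return out.getvalue()


if __name__ == "__main__":
    # Invented sample data standing in for the real train and test files.
    train = "datetime,count\n2011-01-01 00:00:00,10\n2011-01-01 01:00:00,20\n"
    test = "datetime\n2011-01-20 00:00:00\n"
    print(mean_benchmark(train, test), end="")
```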

@roycoding
roycoding / pizza-rf.md
Last active June 27, 2017 18:36
Beat the Benchmark: Random Acts of Pizza

Beating the Random Acts of Pizza Benchmark

The Random Acts of Pizza competition is about predicting whether a request for a free pizza on the Random Acts of Pizza subreddit is granted. The benchmark simply guesses that no pizzas are given (or that all are). This results in an AUC score of 0.5 (50%).

To beat the AUC = 0.5 benchmark with a simple model, I first looked at the training and test data to find simple features. I decided to use the word counts of the request title and comment text, as longer comments might be skipped by readers.
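To see why a constant guess lands exactly on the benchmark score, here is a minimal pure-Python AUC sketch. It uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name and toy labels are illustrative only, and this is not the competition's scoring code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [1, 0, 0, 1, 0]
    print(auc(labels, [0] * len(labels)))  # constant guess: every pair is a tie
```

Because every pair ties under a constant prediction, the result is one half regardless of whether the constant is 0 or 1.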

To build the model I first extracted only the desired fields from the original JSON files with jq and used json2csv to write out CSV.
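The word-count features could also be computed directly in Python, as a rough alternative to the jq and json2csv step. The field names `request_title` and `request_text_edit_aware` are assumptions about the JSON layout and should be checked against the actual data files. The sample record is invented.

```python
import json


def word_count_features(requests):
    """Return one row per request: (request_id, title words, text words)."""
    rows = []
    for request in requests:
        # Field names below are assumed, not confirmed by the post.
        title = request.get("request_title") or ""
        text = request.get("request_text_edit_aware") or ""
        rows.append((request["request_id"], len(title.split()), len(text.split())))
    return rows


if __name__ == "__main__":
    sample = json.loads(
        '[{"request_id": "t3_aaa",'
        ' "request_title": "Hungry student",'
        ' "request_text_edit_aware": "Finals week and no food"}]'
    )
    print(word_count_features(sample))
```

Whitespace splitting is the simplest possible word count. A real tokenizer would treat punctuation differently.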

@roycoding
roycoding / beat-bike.md
Last active June 27, 2017 18:35
Beat the Benchmark: Bike Sharing Demand

Beating the Bike Sharing Demand benchmark

Day 3 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for bike share bikes in Washington DC based on historical usage data. For this regression problem, the evaluation metric is RMSLE.
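As a reference for the metric, RMSLE (root mean squared logarithmic error) can be sketched as follows. This is a plain implementation of the standard definition, using `log(1 + x)` so that zero counts are handled. The function name is illustrative, and this is not Kaggle's own scoring code.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    )
    return math.sqrt(total / len(actual))


if __name__ == "__main__":
    print(rmsle([10, 20, 30], [12, 18, 33]))
```

Because the error is measured on a log scale, being off by a factor of two costs the same whether the true count is 10 or 1000.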

To beat the total mean count benchmark I tried two strategies, one very simple and another slightly more sophisticated. The first strategy was to use the per-month mean. The second was to use a rolling mean.

Per-month count means

Using pandas I loaded the train and test data sets into Python. I then downsampled by month using the mean and upsampled by hour, filling in each month with the appropriate mean value.
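The per-month idea can be illustrated without pandas. The sketch below groups training counts by (year, month) and gives each test timestamp the mean of its own month. It is a plain-Python stand-in for the resampling described above, not the original code, and it assumes every month in the test set also appears in the training set. The function names and sample values are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_means(train):
    """Map (year, month) to the mean count over that month's training rows."""
    by_month = defaultdict(list)
    for timestamp, count in train:
        by_month[(timestamp.year, timestamp.month)].append(count)
    return {month: mean(counts) for month, counts in by_month.items()}


def predict(test_times, means):
    """Give every test timestamp the mean of its own month."""
    return [means[(t.year, t.month)] for t in test_times]


if __name__ == "__main__":
    # Invented sample rows: (timestamp, rental count).
    train = [
        (datetime(2011, 1, 1, 0), 10),
        (datetime(2011, 1, 1, 1), 20),
        (datetime(2011, 2, 1, 0), 40),
    ]
    means = monthly_means(train)
    print(predict([datetime(2011, 1, 25, 5), datetime(2011, 2, 20, 0)], means))
```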

@roycoding
roycoding / forest.md
Last active January 23, 2018 08:05
Beat the Benchmark: Forest Cover Type Prediction

Beating the Forest Cover Type Prediction benchmark

Day 4 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.

To beat the all fir/spruce benchmark I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I was able to beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. By using 100 estimators (versus the then-default of 10), I was able to raise that accuracy score to 0.75455.
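A comparison of forest sizes might be set up as in the sketch below. It runs on synthetic 7-class data from `make_classification` because the competition data is not included here, so its scores will not match the leaderboard numbers above. Since scikit-learn 0.22 the default `n_estimators` is 100, so both sizes are set explicitly. The function name and data parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_forest_sizes(sizes=(10, 100), seed=0):
    """Fit one forest per size on synthetic data; return held-out accuracy."""
    # Synthetic stand-in for the forest cover data (7 classes, like the task).
    X, y = make_classification(
        n_samples=2000,
        n_features=10,
        n_informative=5,
        n_classes=7,
        n_clusters_per_class=1,
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    scores = {}
    for n_estimators in sizes:
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        model.fit(X_train, y_train)
        scores[n_estimators] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    for size, score in compare_forest_sizes().items():
        print(size, round(score, 4))
```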

Random Forest Cover Types

Using pandas I loaded the train and test data sets into Python.

@roycoding
roycoding / boxplots.md
Last active August 29, 2015 14:06
Salary box plots using matplotlib

This is some matplotlib scratch code to make a pretty boxplot as seen here.

import matplotlib.pyplot as plt

# Data is external: ra, tar, tas, ta, gar, gas, pa

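# Draw filled boxes with the outlier markers hidden (sym='').
bp = plt.boxplot([ra, tar + tas, ta, gar + gas, pa], widths=0.2, sym='', patch_artist=True)
# Color the caps and whiskers solid blue.
plt.setp(bp['caps'], color='blue', alpha=1)
plt.setp(bp['whiskers'], color='blue', alpha=1)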
@roycoding
roycoding / mr_patterns.md
Last active May 2, 2018 17:29
MapReduce Patterns

MapReduce Patterns

Roy Keyes

17 Sep 2014 - This is a post on my blog.

MapReduce is a powerful programming model for processing large sets of data in a distributed, parallel manner. It has proven very popular for many data processing tasks, particularly using the open source Hadoop implementation.

MapReduce basics

The most basic idea powering MapReduce is to break large data sets into smaller chunks, which are then processed separately (in parallel). The results of the chunk processing are then collected.
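The chunk, process, and collect idea can be sketched in a few lines of single-process Python, using word counting as the example task. This is only an illustration of the pattern, not Hadoop's API. The function names and chunk size are invented for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Break the input into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(word for line in lines for word in line.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2):
    """Count words by mapping over chunks and reducing the partial counts."""
    chunks = split_into_chunks(lines, chunk_size)
    # Each chunk is handled separately here; a real system would run the
    # map step on different workers in parallel.
    partials = map(map_chunk, chunks)
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(["a b", "b c", "c c"]))
```

The map step never looks outside its own chunk, which is what makes it safe to run the chunks in parallel.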

MapReduce