{
"metadata": {
"name": "",
"signature": "sha256:9f31c32ee67d337553ec82fc0d8e5dc1f6d0ec1967b0bc44df8f5a78b6498bbe"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Kaggle: Bike Sharing Demand using [Vowpal Wabbit](http://hunch.net/~vw)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Vowpal Wabbit or VW is a machine learning algorithm developed by [John Langford](http://research.microsoft.com/en-us/people/jcl/). VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms. For a deeper understanding of large scale online machine learning watch this fine [video tutorial](http://techtalks.tv/talks/online-linear-learning-part-1/57924/) with John Langford. To install and get started follow this [tutorial](https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial).\n",
"\n",
"As mentioned, it is particulary suited for terafeature datasets as its [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick) reduces the feature space to a number of buts, greatly reducing the computing time. Though this current Kaggle competition is not an apt dataset for VW, but I am putting this here just to serve as a primer on getting started with this impressive tool i.e. VW. "
]
},
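{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the hashing trick concrete, here is a minimal illustrative sketch (not VW's actual implementation, which uses murmurhash and per-namespace hashing): each feature name is hashed into one of 2^b buckets, so the weight vector has a fixed size no matter how many distinct feature names appear. VW defaults to b = 18 bits, which is why the training output further below reports `Num weight bits = 18`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Toy sketch of the hashing trick (illustrative only; VW itself uses\n",
"# murmurhash). A feature name maps to one of 2**num_bits weight slots.\n",
"def hash_feature(name, num_bits=18):\n",
"    return hash(name) % (2 ** num_bits)\n",
"\n",
"print hash_feature('humidity1Hbefore'), hash_feature('windavg6Hbefore')"
],
"language": "python",
"metadata": {},
"outputs": []
},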
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The background, dataset exploration, feature engineering and dataset preparation stages of the analysis are similar to the one done earlier [here](https://hail-data.quora.com/Kaggle-Bike-Sharing-Demand)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = pd.read_csv('data/bikesharing_train.csv')\n",
"print data.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(10886, 12)\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from datetime import datetime\n",
"hours = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').hour) # 0-23\n",
"hours.name = 'hours'\n",
"years = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').year) #2011 & 2012\n",
"years.name = 'years'\n",
"weekdays = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').weekday()) # 0-6\n",
"weekdays.name = 'weekdays'\n",
"months = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').month) # 0-11\n",
"months.name = 'months'\n",
"\n",
"sinHour = np.sin(2*np.pi*hours/23); sinHour.name = 'sinHour'\n",
"sinWeekDay = np.sin(2*np.pi*weekdays/6); sinWeekDay.name = 'sinWeekDay'\n",
"cosHour = np.cos(2*np.pi*hours/23); cosHour.name = 'cosHour'\n",
"cosWeekDay = np.cos(2*np.pi*weekdays/6); cosWeekDay.name = 'cosWeekDay'\n",
"\n",
"temp1Hbefore = data['atemp'].shift(1).fillna(1); temp1Hbefore.name = 'temp1Hbefore'\n",
"temp2Hbefore = data['atemp'].shift(2).fillna(1); temp2Hbefore.name = 'temp2Hbefore'\n",
"temp3Hbefore = data['atemp'].shift(3).fillna(1); temp3Hbefore.name = 'temp3Hbefore'\n",
"temp6Hbefore = data['atemp'].shift(6).fillna(1); temp6Hbefore.name = 'temp6Hbefore'\n",
"humidity1Hbefore = data['humidity'].shift(1).fillna(1); humidity1Hbefore.name = 'humidity1Hbefore'\n",
"humidity2Hbefore = data['humidity'].shift(2).fillna(1); humidity2Hbefore.name = 'humidity2Hbefore'\n",
"humidity3Hbefore = data['humidity'].shift(3).fillna(1); humidity3Hbefore.name = 'humidity3Hbefore'\n",
"humidity6Hbefore = data['humidity'].shift(6).fillna(1); humidity6Hbefore.name = 'humidity6Hbefore'\n",
"wind1Hbefore = data['windspeed'].shift(1).fillna(1); wind1Hbefore.name = 'wind1Hbefore'\n",
"wind2Hbefore = data['windspeed'].shift(2).fillna(1); wind2Hbefore.name = 'wind2Hbefore'\n",
"wind3Hbefore = data['windspeed'].shift(3).fillna(1); wind3Hbefore.name = 'wind3Hbefore'\n",
"wind6Hbefore = data['windspeed'].shift(6).fillna(1); wind6Hbefore.name = 'wind6Hbefore'\n",
"\n",
"# Average temp, humidity and windspeed of the past 2, 6, 24 hours\n",
"tempavg2Hbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),2,min_periods=2).fillna(1); tempavg2Hbefore.name = 'tempavg2Hbefore'\n",
"humidityavg2Hbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),2,min_periods=2).fillna(1); humidityavg2Hbefore.name = 'humidityavg2Hbefore'\n",
"windavg2Hbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),2,min_periods=2).fillna(1); windavg2Hbefore.name = 'windavg2Hbefore'\n",
"tempavg6Hbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),6,min_periods=6).fillna(1); tempavg6Hbefore.name = 'tempavg6Hbefore'\n",
"humidityavg6Hbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),6,min_periods=6).fillna(1); humidityavg6Hbefore.name = 'humidityavg6Hbefore'\n",
"windavg6Hbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),6,min_periods=6).fillna(1); windavg6Hbefore.name = 'windavg6Hbefore'\n",
"tempavg1Dbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),24,min_periods=24).fillna(1); tempavg1Dbefore.name = 'tempavg1Dbefore'\n",
"humidityavg1Dbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),24,min_periods=24).fillna(1); humidityavg1Dbefore.name = 'humidityavg1Dbefore'\n",
"windavg1Dbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),24,min_periods=24).fillna(1); windavg1Dbefore.name = 'windavg1Dbefore'\n",
"\n",
"TDiff = (data['atemp'] - data['atemp'].shift(1)).fillna(1); TDiff.name = 'TDiff'\n",
"TDiff10H = (data['atemp'] - data['atemp'].shift(10)).fillna(method = 'bfill'); TDiff10H.name = 'TDiff10H'\n",
"HDiff10H = (data['humidity'] - data['humidity'].shift(10)).fillna(method = 'bfill'); HDiff10H.name = 'HDiff10H'\n",
"\n",
"X = data.ix[:,(1,2,3,4,6,7,8,11)].join(hours).join(years).join(weekdays).join(months).join(sinHour).join(sinWeekDay).join(cosHour).join(cosWeekDay)\n",
"X = X.join(windavg6Hbefore).join(windavg2Hbefore).join(windavg1Dbefore).join(wind1Hbefore).join(wind2Hbefore).join(wind6Hbefore)\n",
"X = X.join(tempavg6Hbefore).join(tempavg2Hbefore).join(tempavg1Dbefore).join(temp1Hbefore).join(temp2Hbefore).join(temp6Hbefore)\n",
"X = X.join(humidityavg6Hbefore).join(humidityavg2Hbefore).join(humidityavg1Dbefore).join(humidity1Hbefore).join(humidity2Hbefore).join(humidity6Hbefore)\n",
"X = X.join(TDiff).join(TDiff10H).join(HDiff10H)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
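{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (added for illustration), the sin/cos pairs above encode the cyclical nature of hours and weekdays: hour 23 and hour 0 are adjacent in time, and the paired encoding reflects that, while the raw values are 23 apart. Note that scaling by 23 maps hour 0 and hour 23 onto exactly the same point; scaling by 24 would keep them distinct yet still adjacent."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hours 0 and 23 are neighbours in time; the (sin, cos) encoding keeps\n",
"# them close (here identical, since the notebook scales by 23, not 24).\n",
"h = np.array([0, 23])\n",
"enc = np.column_stack((np.sin(2*np.pi*h/23), np.cos(2*np.pi*h/23)))\n",
"print np.linalg.norm(enc[0] - enc[1])"
],
"language": "python",
"metadata": {},
"outputs": []
},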
{
"cell_type": "code",
"collapsed": false,
"input": [
"X.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>season</th>\n",
" <th>holiday</th>\n",
" <th>workingday</th>\n",
" <th>weather</th>\n",
" <th>atemp</th>\n",
" <th>humidity</th>\n",
" <th>windspeed</th>\n",
" <th>count</th>\n",
" <th>hours</th>\n",
" <th>years</th>\n",
" <th>...</th>\n",
" <th>temp6Hbefore</th>\n",
" <th>humidityavg6Hbefore</th>\n",
" <th>humidityavg2Hbefore</th>\n",
" <th>humidityavg1Dbefore</th>\n",
" <th>humidity1Hbefore</th>\n",
" <th>humidity2Hbefore</th>\n",
" <th>humidity6Hbefore</th>\n",
" <th>TDiff</th>\n",
" <th>TDiff10H</th>\n",
" <th>HDiff10H</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 81</td>\n",
" <td> 0</td>\n",
" <td> 16</td>\n",
" <td> 0</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 13.635</td>\n",
" <td> 80</td>\n",
" <td> 0</td>\n",
" <td> 40</td>\n",
" <td> 1</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.0</td>\n",
" <td> 1</td>\n",
" <td> 81</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td>-0.76</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 13.635</td>\n",
" <td> 80</td>\n",
" <td> 0</td>\n",
" <td> 32</td>\n",
" <td> 2</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 80.5</td>\n",
" <td> 1</td>\n",
" <td> 80</td>\n",
" <td> 81</td>\n",
" <td> 1</td>\n",
" <td> 0.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 75</td>\n",
" <td> 0</td>\n",
" <td> 13</td>\n",
" <td> 3</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 80.0</td>\n",
" <td> 1</td>\n",
" <td> 80</td>\n",
" <td> 80</td>\n",
" <td> 1</td>\n",
" <td> 0.76</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 75</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 77.5</td>\n",
" <td> 1</td>\n",
" <td> 75</td>\n",
" <td> 80</td>\n",
" <td> 1</td>\n",
" <td> 0.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 37 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
" season holiday workingday weather atemp humidity windspeed count \\\n",
"0 1 0 0 1 14.395 81 0 16 \n",
"1 1 0 0 1 13.635 80 0 40 \n",
"2 1 0 0 1 13.635 80 0 32 \n",
"3 1 0 0 1 14.395 75 0 13 \n",
"4 1 0 0 1 14.395 75 0 1 \n",
"\n",
" hours years ... temp6Hbefore humidityavg6Hbefore \\\n",
"0 0 2011 ... 1 1 \n",
"1 1 2011 ... 1 1 \n",
"2 2 2011 ... 1 1 \n",
"3 3 2011 ... 1 1 \n",
"4 4 2011 ... 1 1 \n",
"\n",
" humidityavg2Hbefore humidityavg1Dbefore humidity1Hbefore \\\n",
"0 1.0 1 1 \n",
"1 1.0 1 81 \n",
"2 80.5 1 80 \n",
"3 80.0 1 80 \n",
"4 77.5 1 75 \n",
"\n",
" humidity2Hbefore humidity6Hbefore TDiff TDiff10H HDiff10H \n",
"0 1 1 1.00 5.3 -5 \n",
"1 1 1 -0.76 5.3 -5 \n",
"2 81 1 0.00 5.3 -5 \n",
"3 80 1 0.76 5.3 -5 \n",
"4 80 1 0.00 5.3 -5 \n",
"\n",
"[5 rows x 37 columns]"
]
}
],
"prompt_number": 5
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Data Preparation: Converting to VW format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now convert the data into a format as required by vowpal wabbit. Basically VW requires each row of the dataset in the follwoing format:\n",
" \n",
" 100 |n var1:20 var2:20 var3:2 |c 1.0 2.0 0.0 1.0\n",
"The first number 100 is the dependent variable, the one we want to predict (in our case the ridership value). |n defines a namespace which denotes the begining of the numerical features of our dataset. So this example dataset has 3 numerical features, both the name of the feature and value are required. It is then followed by |c which indicates the categorical features of the dataset. Here only the values are required.\n",
"\n",
"We can generate this format from our raw dataset as follows."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from datetime import datetime\n",
"from csv import DictReader\n",
"\n",
"def csv_to_vw(input, loc_output, train=True):\n",
" \"\"\"\n",
" Munges our input dataset to a VW file (loc_output). Set \"train\"\n",
" to False when munging a test set.\n",
" \"\"\"\n",
" start = datetime.now()\n",
" \n",
" with open(loc_output,\"wb\") as outfile:\n",
" for e, row in input.iterrows():\n",
"\t\n",
"\t #Creating the features\n",
" numerical_features = \"\"\n",
" categorical_features = \"\"\n",
" for k in input.columns:\n",
" if k not in ['season', 'weather', 'workingday', 'holiday', 'count', 'hours','weekdays','sinHour', 'cosHour','sinWeekDay','cosWeekDay']:\n",
" if len(str(row[k])) > 0: #check for empty values\n",
" numerical_features += \" %s:%s\" % (k,row[k])\n",
" elif k not in ['count']:\n",
" if len(str(row[k])) > 0:\n",
" categorical_features += \" %s\" % row[k]\n",
"\t\t\t \n",
"\t #Creating the labels\t\t \n",
" if train: #we care about labels\n",
" count = np.log(row['count']+1)\n",
" outfile.write( \"%s |n%s |c%s\\n\" % (count,numerical_features,categorical_features) )\n",
"\t\t\n",
" else: #we dont care about labels\n",
" outfile.write( \"1 |n%s |c%s\\n\" % (numerical_features,categorical_features) )\n",
" \n",
"\t #Reporting progress\n",
" if e % 1000000 == 0:\n",
" print(\"%s\\t%s\"%(e, str(datetime.now() - start)))\n",
"\n",
" print(\"\\n %s Task execution time:\\n\\t%s\"%(e, str(datetime.now() - start)))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
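{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the log(count + 1) label above: this competition is scored with the Root Mean Squared Logarithmic Error (RMSLE), so training on log-transformed counts aligns VW's loss with the evaluation metric. For reference, a small sketch of RMSLE, where y holds the true counts and p the predicted counts (both numpy arrays):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# RMSLE, the competition's evaluation metric (reference sketch).\n",
"def rmsle(y, p):\n",
"    return np.sqrt(np.mean((np.log(p + 1) - np.log(y + 1)) ** 2))"
],
"language": "python",
"metadata": {},
"outputs": []
},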
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having written the function we now generate our training dataset in vw format. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv_to_vw(X, \"data/bikesharing_train.txt\",train=True)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0\t0:00:00.006713\n",
"\n",
" 10885 Task execution time:\n",
"\t0:00:09.618165"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 9
},
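{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can peek at the first line of the generated file to confirm it matches the label |n ... |c ... layout described above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Inspect the first generated line to verify the VW layout.\n",
"with open('data/bikesharing_train.txt') as f:\n",
"    print f.readline()"
],
"language": "python",
"metadata": {},
"outputs": []
},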
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Data Analyis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Picking a loss function**\n",
"\n",
"One of the most important aspects of using vowpal wabbit succefully is to pick the right loss function over which the algorithm optimizes to learn. Online machine learning with VW learns from samples one at a time. When our model is trained it iterates through the train dataset and optimizes this function.\n",
"\n",
"Vowpal Wabbit has five loss functions:\n",
"1. Squared loss. Useful for regression problems, when minimizing expectation. For example: Expected return on a stock.\n",
"2. Classic loss. Vanilla squared loss (without the importance weight aware update).\n",
"3. Quantile loss. Useful for regression problems, for example: predicting house pricing.\n",
"4. Hinge loss. Useful for classification problems, minimizing the yes/no question (closest 0-1 approximation). For example: Keyword_tag or not.\n",
"5. Log loss. Useful for classification problems, minimizer = probability, for example: Probability of click on ad.\n",
"\n",
"As we are predicting the hourly ridership, so the Quantile loss is more suitable for our purpose. "
]
},
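{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quantile loss takes a parameter, --quantile_tau, which defaults to 0.5 (the median). A sketch of the training command with this parameter spelled out explicitly would be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_train.txt -c --passes 25 -f bike.model.vw --loss_function quantile --quantile_tau 0.5"
],
"language": "python",
"metadata": {},
"outputs": []
},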
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now train a model with the following command from terminal:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_train.txt -c --passes 25 -f bike.model.vw --loss_function quantile"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the command, -d bikesharing_train.txt says to use our train dataset that we had created earlier. --passes 25 says to iterate over the train dataset 25 times for learning and -c allows VW to cache the data in a faster to handle format so that the 25 passes are quicker. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of the training process is as follows. The details explanation of various numbers can be found at the tutorial link shared in the begining. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"final_regressor = bike.model.vw\n",
"Num weight bits = 18\n",
"learning rate = 0.5\n",
"initial_t = 0\n",
"power_t = 0.5\n",
"decay_learning_rate = 1\n",
"using cache_file = bikesharing_train.txt.cache\n",
"ignoring text input in favor of cache input\n",
"num sources = 1\n",
"average since example example current current current\n",
"loss last counter weight label predict features\n",
"2.907065 2.907065 1 1.0 5.8141 0.0000 22\n",
"2.499122 2.091178 2 2.0 4.6913 0.5090 22\n",
"1.573641 0.648161 4 4.0 3.5264 4.1038 22\n",
"1.262409 0.951176 8 8.0 5.8435 3.7025 22\n",
"0.876005 0.489601 16 16.0 5.4638 4.4911 22\n",
"0.581526 0.287047 32 32.0 1.3863 3.3024 22\n",
"0.530048 0.478569 64 64.0 6.1420 4.0435 22\n",
"0.489496 0.448944 128 128.0 5.6560 6.3817 22\n",
"0.413665 0.337835 256 256.0 6.5221 6.7074 22\n",
"0.383470 0.353274 512 512.0 4.3820 4.8103 22\n",
"0.378252 0.373035 1024 1024.0 6.4953 4.2516 21\n",
"0.350751 0.323249 2048 2048.0 6.0822 5.8995 22\n",
"0.344446 0.338141 4096 4096.0 5.6733 5.3636 22\n",
"0.349611 0.349611 8192 8192.0 6.5596 4.3820 22 h\n",
"0.345637 0.341667 16384 16384.0 5.6595 4.5835 20 h\n",
"0.343242 0.340848 32768 32768.0 1.9459 3.1003 21 h\n",
"0.341692 0.340143 65536 65536.0 4.6728 4.7843 22 h\n",
"0.339989 0.338285 131072 131072.0 3.3673 3.0850 22 h\n",
"\n",
"finished run\n",
"number of examples per pass = 7348\n",
"passes used = 22\n",
"weighted example sum = 161657\n",
"weighted label sum = 743073\n",
"average loss = 0.338136 h\n",
"best constant = 4.5966\n",
"total feature number = 3475384\n",
"```"
]
},
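{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough back-of-the-envelope reading of this output (assuming the default tau = 0.5, where the quantile loss on one example is 0.5 * |y - p|): the holdout average loss of 0.338136 corresponds to a mean absolute error of about 0.68 on the log(count + 1) scale, i.e. predictions typically land within a multiplicative factor of roughly 2 of the true count."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# With tau = 0.5, quantile loss = 0.5 * |y - p|, so the holdout MAE on\n",
"# the log(count + 1) scale is about twice the reported average loss.\n",
"mae_log = 2 * 0.338136\n",
"print mae_log, np.exp(mae_log)  # multiplicative error factor of ~2"
],
"language": "python",
"metadata": {},
"outputs": []
},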
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now prepare the test data set in a similar way as we did with the training dataset. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"test = pd.read_csv('data/bikesharing_test.csv')\n",
"test_hours = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').hour)\n",
"test_hours.name = 'hours'\n",
"test_years = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').year)\n",
"test_years.name = 'years'\n",
"test_months = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').month)\n",
"test_months.name = 'months'\n",
"test_weekdays = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').weekday())\n",
"test_weekdays.name = 'weekdays'\n",
"\n",
"test_sinHour = np.sin(2*np.pi*test_hours/23); test_sinHour.name = 'sinHour'\n",
"test_sinWeekDay = np.sin(2*np.pi*test_weekdays/6); test_sinWeekDay.name = 'sinWeekDay'\n",
"test_cosHour = np.cos(2*np.pi*test_hours/23); test_cosHour.name = 'cosHour'\n",
"test_cosWeekDay = np.cos(2*np.pi*test_weekdays/6); test_cosWeekDay.name = 'cosWeekDay'\n",
"\n",
"Testtemp1Hbefore = test['atemp'].shift(1).fillna(1); Testtemp1Hbefore.name = 'temp1Hbefore'\n",
"Testtemp2Hbefore = test['atemp'].shift(2).fillna(1); Testtemp2Hbefore.name = 'temp2Hbefore'\n",
"Testtemp3Hbefore = test['atemp'].shift(3).fillna(1); Testtemp3Hbefore.name = 'temp3Hbefore'\n",
"Testtemp6Hbefore = test['atemp'].shift(6).fillna(1); Testtemp6Hbefore.name = 'temp6Hbefore'\n",
"Testhumidity1Hbefore = test['humidity'].shift(1).fillna(1); Testhumidity1Hbefore.name = 'humidity1Hbefore'\n",
"Testhumidity2Hbefore = test['humidity'].shift(2).fillna(1); Testhumidity2Hbefore.name = 'humidity2Hbefore'\n",
"Testhumidity3Hbefore = test['humidity'].shift(3).fillna(1); Testhumidity3Hbefore.name = 'humidity3Hbefore'\n",
"Testhumidity6Hbefore = test['humidity'].shift(6).fillna(1); Testhumidity6Hbefore.name = 'humidity6Hbefore'\n",
"Testwind1Hbefore = test['windspeed'].shift(1).fillna(1); Testwind1Hbefore.name = 'wind1Hbefore'\n",
"Testwind2Hbefore = test['windspeed'].shift(2).fillna(1); Testwind2Hbefore.name = 'wind2Hbefore'\n",
"Testwind3Hbefore = test['windspeed'].shift(3).fillna(1); Testwind3Hbefore.name = 'wind3Hbefore'\n",
"Testwind6Hbefore = test['windspeed'].shift(6).fillna(1); Testwind6Hbefore.name = 'wind6Hbefore'\n",
"\n",
"Testtempavg2Hbefore = pd.stats.moments.rolling_mean(test['atemp'],2,min_periods=2).fillna(1); Testtempavg2Hbefore.name = 'tempavg2Hbefore'\n",
"Testhumidityavg2Hbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),2,min_periods=2).fillna(1); Testhumidityavg2Hbefore.name = 'humidityavg2Hbefore'\n",
"Testwindavg2Hbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),2,min_periods=2).fillna(1); Testwindavg2Hbefore.name = 'windavg2Hbefore'\n",
"Testtempavg6Hbefore = pd.stats.moments.rolling_mean(test['atemp'].shift(1),6,min_periods=6).fillna(1); Testtempavg6Hbefore.name = 'tempavg6Hbefore'\n",
"Testhumidityavg6Hbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),6,min_periods=6).fillna(1); Testhumidityavg6Hbefore.name = 'humidityavg6Hbefore'\n",
"Testwindavg6Hbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),6,min_periods=6).fillna(1); Testwindavg6Hbefore.name = 'windavg6Hbefore'\n",
"Testtempavg1Dbefore = pd.stats.moments.rolling_mean(test['atemp'].shift(1),24,min_periods=24).fillna(1); Testtempavg1Dbefore.name = 'tempavg1Dbefore'\n",
"Testhumidityavg1Dbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),24,min_periods=24).fillna(1); Testhumidityavg1Dbefore.name = 'humidityavg1Dbefore'\n",
"Testwindavg1Dbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),24,min_periods=24).fillna(1); Testwindavg1Dbefore.name = 'windavg1Dbefore'\n",
"\n",
"TestTDiff = (test['atemp'] - test['atemp'].shift(1)).fillna(1); TestTDiff.name = 'TDiff'\n",
"TestTDiff10H = (test['atemp'] - test['atemp'].shift(10)).fillna(method = 'bfill'); TestTDiff10H.name = 'TDiff10H'\n",
"TestHDiff10H = (test['humidity'] - test['humidity'].shift(10)).fillna(method = 'bfill'); TestHDiff10H.name = 'HDiff10H'\n",
"\n",
"\n",
"newtest = test.ix[:,(1,2,3,4,6,7,8)].join(hours).join(years).join(weekdays).join(months).join(sinHour).join(sinWeekDay).join(cosHour).join(cosWeekDay)\n",
"newtest = newtest.join(windavg6Hbefore).join(windavg2Hbefore).join(windavg1Dbefore).join(wind1Hbefore).join(wind2Hbefore).join(wind6Hbefore)\n",
"newtest = newtest.join(tempavg6Hbefore).join(tempavg2Hbefore).join(tempavg1Dbefore).join(temp1Hbefore).join(temp2Hbefore).join(temp6Hbefore)\n",
"newtest = newtest.join(humidityavg6Hbefore).join(humidityavg2Hbefore).join(humidityavg1Dbefore).join(humidity1Hbefore).join(humidity2Hbefore).join(humidity6Hbefore)\n",
"newtest = newtest.join(TDiff).join(TDiff10H).join(HDiff10H)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then munge the test dataset into the VW format using the csv_to_vw function that we defined earlier. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv_to_vw(newtest, \"data/bikesharing_test.txt\",train=False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0\t0:00:00.039549\n",
"\n",
" 6492 Task execution time:\n",
"\t0:00:05.276548"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now predict the testset output with the following command from the terminal. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_test.txt -t -i bike.model.vw -p bikesharing_preds.txt "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The -t says to test only and not train. -i bike.model.vw says to use the model that we learned from the training process. -p saves our predictions to bikesharing_preds.txt. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Upon running the command VW gives the following output and generates the bikesharing_preds.txt file. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"only testing\n",
"Num weight bits = 18\n",
"learning rate = 10\n",
"initial_t = 1\n",
"power_t = 0.5\n",
"predictions = bikesharing_preds.txt\n",
"using no cache\n",
"Reading datafile = bikesharing_test.txt\n",
"num sources = 1\n",
"average since example example current current current\n",
"loss last counter weight label predict features\n",
"8.663692 8.663692 1 1.0 1.0000 3.9434 37\n",
"6.498619 4.333547 2 2.0 1.0000 3.0817 35\n",
"5.028621 3.558624 4 4.0 1.0000 2.6682 34\n",
"4.102480 3.176339 8 8.0 1.0000 2.8937 35\n",
"4.009334 3.916188 16 16.0 1.0000 3.0832 36\n",
"5.337796 6.666259 32 32.0 1.0000 3.7795 35\n",
"7.840713 10.343631 64 64.0 1.0000 4.5281 37\n",
"7.289603 6.738493 128 128.0 1.0000 3.0337 35\n",
"6.635136 5.980668 256 256.0 1.0000 4.7377 36\n",
"6.345960 6.056784 512 512.0 1.0000 3.0137 37\n",
"6.778207 7.210454 1024 1024.0 1.0000 4.7802 36\n",
"7.933761 9.089315 2048 2048.0 1.0000 3.6618 35\n",
"10.755411 13.577061 4096 4096.0 1.0000 3.9191 37\n",
"\n",
"finished run\n",
"number of examples per pass = 6493\n",
"passes used = 1\n",
"weighted example sum = 6493\n",
"weighted label sum = 6493\n",
"average loss = 10.4092\n",
"best constant = 1\n",
"total feature number = 233568\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then read in our predictions. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"predictions = pd.read_csv('data/bikesharing_preds.txt', header = None, index_col = False)\n",
"np.exp(predictions[0]), sum(np.exp(predictions[0]) < 0) # to ensure no negative outputs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"(0 51.594496\n",
" 1 21.795794\n",
" 2 21.799325\n",
" 3 14.414592\n",
" 4 15.894275\n",
" 5 12.949834\n",
" 6 17.952780\n",
" 7 18.060405\n",
" 8 17.731594\n",
" 9 18.306039\n",
" 10 18.681598\n",
" 11 19.245206\n",
" 12 21.042664\n",
" 13 19.939207\n",
" 14 20.742671\n",
" ...\n",
" 6478 43.846943\n",
" 6479 36.068648\n",
" 6480 36.442293\n",
" 6481 37.595298\n",
" 6482 40.524795\n",
" 6483 39.265372\n",
" 6484 39.983625\n",
" 6485 38.981766\n",
" 6486 41.348618\n",
" 6487 40.435010\n",
" 6488 39.964598\n",
" 6489 41.430901\n",
" 6490 45.402810\n",
" 6491 51.559114\n",
" 6492 50.929382\n",
" Name: 0, Length: 6493, dtype: float64, 0)"
]
}
],
"prompt_number": 20
},
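{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: since the labels were log(count + 1), the exact inverse transform is exp(prediction) - 1 rather than exp(prediction). The difference of 1 rider per hour is small, but a corrected sketch would be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Exact inverse of the log(count + 1) label transform used in csv_to_vw.\n",
"corrected = np.expm1(predictions[0])  # equivalent to np.exp(...) - 1"
],
"language": "python",
"metadata": {},
"outputs": []
},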
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally write it to a file in the format required by the Kaggle competition. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"output = pd.DataFrame({'datetime': test.ix[:,0], 'count': np.exp(predictions[0])})\n",
"output.to_csv('data/bikesharingVW_submission.csv', index = False)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 19
},
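{
"cell_type": "markdown",
"metadata": {},
"source": [
"One optional defensive tweak: a DataFrame built from a dict orders its columns alphabetically (count before datetime here), so the column order can be pinned explicitly when writing the submission:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Optional: pin the column order explicitly in the submission file.\n",
"output.to_csv('data/bikesharingVW_submission.csv', index=False,\n",
"              columns=['datetime', 'count'])"
],
"language": "python",
"metadata": {},
"outputs": []
},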
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Acknowledgements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I would like to thank MLWave and his blog post on using VW [here](http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/)."
]
}
],
"metadata": {}
}
]
}