{
"metadata": {
"name": "",
"signature": "sha256:9f31c32ee67d337553ec82fc0d8e5dc1f6d0ec1967b0bc44df8f5a78b6498bbe"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Kaggle: Bike Sharing Demand using [Vowpal Wabbit](http://hunch.net/~vw)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Vowpal Wabbit or VW is a machine learning algorithm developed by [John Langford](http://research.microsoft.com/en-us/people/jcl/). VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms. For a deeper understanding of large scale online machine learning watch this fine [video tutorial](http://techtalks.tv/talks/online-linear-learning-part-1/57924/) with John Langford. To install and get started follow this [tutorial](https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial).\n",
"\n",
"As mentioned, it is particulary suited for terafeature datasets as its [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick) reduces the feature space to a number of buts, greatly reducing the computing time. Though this current Kaggle competition is not an apt dataset for VW, but I am putting this here just to serve as a primer on getting started with this impressive tool i.e. VW. "
]
},
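{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the hashing trick concrete, here is a minimal illustrative sketch (not VW's actual implementation, which uses murmurhash and per-namespace hashing): each feature name is hashed into one of 2^b buckets, so the weight vector has a fixed size no matter how many distinct feature names appear. VW defaults to b = 18 bits, which is why the training output further below reports `Num weight bits = 18`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Toy sketch of the hashing trick (illustrative only; VW itself uses\n",
"# murmurhash). A feature name maps to one of 2**num_bits weight slots.\n",
"def hash_feature(name, num_bits=18):\n",
"    return hash(name) % (2 ** num_bits)\n",
"\n",
"print hash_feature('humidity1Hbefore'), hash_feature('windavg6Hbefore')"
],
"language": "python",
"metadata": {},
"outputs": []
},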
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The background, dataset exploration, feature engineering and dataset preparation stages of the analysis are similar to the one done earlier [here](https://hail-data.quora.com/Kaggle-Bike-Sharing-Demand)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = pd.read_csv('data/bikesharing_train.csv')\n",
"print data.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(10886, 12)\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from datetime import datetime\n",
"hours = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').hour) # 0-23\n",
"hours.name = 'hours'\n",
"years = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').year) #2011 & 2012\n",
"years.name = 'years'\n",
"weekdays = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').weekday()) # 0-6\n",
"weekdays.name = 'weekdays'\n",
"months = data.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').month) # 0-11\n",
"months.name = 'months'\n",
"\n",
"sinHour = np.sin(2*np.pi*hours/23); sinHour.name = 'sinHour'\n",
"sinWeekDay = np.sin(2*np.pi*weekdays/6); sinWeekDay.name = 'sinWeekDay'\n",
"cosHour = np.cos(2*np.pi*hours/23); cosHour.name = 'cosHour'\n",
"cosWeekDay = np.cos(2*np.pi*weekdays/6); cosWeekDay.name = 'cosWeekDay'\n",
"\n",
"temp1Hbefore = data['atemp'].shift(1).fillna(1); temp1Hbefore.name = 'temp1Hbefore'\n",
"temp2Hbefore = data['atemp'].shift(2).fillna(1); temp2Hbefore.name = 'temp2Hbefore'\n",
"temp3Hbefore = data['atemp'].shift(3).fillna(1); temp3Hbefore.name = 'temp3Hbefore'\n",
"temp6Hbefore = data['atemp'].shift(6).fillna(1); temp6Hbefore.name = 'temp6Hbefore'\n",
"humidity1Hbefore = data['humidity'].shift(1).fillna(1); humidity1Hbefore.name = 'humidity1Hbefore'\n",
"humidity2Hbefore = data['humidity'].shift(2).fillna(1); humidity2Hbefore.name = 'humidity2Hbefore'\n",
"humidity3Hbefore = data['humidity'].shift(3).fillna(1); humidity3Hbefore.name = 'humidity3Hbefore'\n",
"humidity6Hbefore = data['humidity'].shift(6).fillna(1); humidity6Hbefore.name = 'humidity6Hbefore'\n",
"wind1Hbefore = data['windspeed'].shift(1).fillna(1); wind1Hbefore.name = 'wind1Hbefore'\n",
"wind2Hbefore = data['windspeed'].shift(2).fillna(1); wind2Hbefore.name = 'wind2Hbefore'\n",
"wind3Hbefore = data['windspeed'].shift(3).fillna(1); wind3Hbefore.name = 'wind3Hbefore'\n",
"wind6Hbefore = data['windspeed'].shift(6).fillna(1); wind6Hbefore.name = 'wind6Hbefore'\n",
"\n",
"# Average temp, humidity and windspeed of the past 2, 6, 24 hours\n",
"tempavg2Hbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),2,min_periods=2).fillna(1); tempavg2Hbefore.name = 'tempavg2Hbefore'\n",
"humidityavg2Hbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),2,min_periods=2).fillna(1); humidityavg2Hbefore.name = 'humidityavg2Hbefore'\n",
"windavg2Hbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),2,min_periods=2).fillna(1); windavg2Hbefore.name = 'windavg2Hbefore'\n",
"tempavg6Hbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),6,min_periods=6).fillna(1); tempavg6Hbefore.name = 'tempavg6Hbefore'\n",
"humidityavg6Hbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),6,min_periods=6).fillna(1); humidityavg6Hbefore.name = 'humidityavg6Hbefore'\n",
"windavg6Hbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),6,min_periods=6).fillna(1); windavg6Hbefore.name = 'windavg6Hbefore'\n",
"tempavg1Dbefore = pd.stats.moments.rolling_mean(data['atemp'].shift(1),24,min_periods=24).fillna(1); tempavg1Dbefore.name = 'tempavg1Dbefore'\n",
"humidityavg1Dbefore = pd.stats.moments.rolling_mean(data['humidity'].shift(1),24,min_periods=24).fillna(1); humidityavg1Dbefore.name = 'humidityavg1Dbefore'\n",
"windavg1Dbefore = pd.stats.moments.rolling_mean(data['windspeed'].shift(1),24,min_periods=24).fillna(1); windavg1Dbefore.name = 'windavg1Dbefore'\n",
"\n",
"TDiff = (data['atemp'] - data['atemp'].shift(1)).fillna(1); TDiff.name = 'TDiff'\n",
"TDiff10H = (data['atemp'] - data['atemp'].shift(10)).fillna(method = 'bfill'); TDiff10H.name = 'TDiff10H'\n",
"HDiff10H = (data['humidity'] - data['humidity'].shift(10)).fillna(method = 'bfill'); HDiff10H.name = 'HDiff10H'\n",
"\n",
"X = data.ix[:,(1,2,3,4,6,7,8,11)].join(hours).join(years).join(weekdays).join(months).join(sinHour).join(sinWeekDay).join(cosHour).join(cosWeekDay)\n",
"X = X.join(windavg6Hbefore).join(windavg2Hbefore).join(windavg1Dbefore).join(wind1Hbefore).join(wind2Hbefore).join(wind6Hbefore)\n",
"X = X.join(tempavg6Hbefore).join(tempavg2Hbefore).join(tempavg1Dbefore).join(temp1Hbefore).join(temp2Hbefore).join(temp6Hbefore)\n",
"X = X.join(humidityavg6Hbefore).join(humidityavg2Hbefore).join(humidityavg1Dbefore).join(humidity1Hbefore).join(humidity2Hbefore).join(humidity6Hbefore)\n",
"X = X.join(TDiff).join(TDiff10H).join(HDiff10H)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
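{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (added for illustration), the sin/cos pairs above encode the cyclical nature of hours and weekdays: hour 23 and hour 0 are adjacent in time, and the paired encoding reflects that, while the raw values are 23 apart. Note that scaling by 23 maps hour 0 and hour 23 onto exactly the same point; scaling by 24 would keep them distinct yet still adjacent."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hours 0 and 23 are neighbours in time; the (sin, cos) encoding keeps\n",
"# them close (here identical, since the notebook scales by 23, not 24).\n",
"h = np.array([0, 23])\n",
"enc = np.column_stack((np.sin(2*np.pi*h/23), np.cos(2*np.pi*h/23)))\n",
"print np.linalg.norm(enc[0] - enc[1])"
],
"language": "python",
"metadata": {},
"outputs": []
},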
{
"cell_type": "code",
"collapsed": false,
"input": [
"X.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>season</th>\n",
" <th>holiday</th>\n",
" <th>workingday</th>\n",
" <th>weather</th>\n",
" <th>atemp</th>\n",
" <th>humidity</th>\n",
" <th>windspeed</th>\n",
" <th>count</th>\n",
" <th>hours</th>\n",
" <th>years</th>\n",
" <th>...</th>\n",
" <th>temp6Hbefore</th>\n",
" <th>humidityavg6Hbefore</th>\n",
" <th>humidityavg2Hbefore</th>\n",
" <th>humidityavg1Dbefore</th>\n",
" <th>humidity1Hbefore</th>\n",
" <th>humidity2Hbefore</th>\n",
" <th>humidity6Hbefore</th>\n",
" <th>TDiff</th>\n",
" <th>TDiff10H</th>\n",
" <th>HDiff10H</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 81</td>\n",
" <td> 0</td>\n",
" <td> 16</td>\n",
" <td> 0</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 13.635</td>\n",
" <td> 80</td>\n",
" <td> 0</td>\n",
" <td> 40</td>\n",
" <td> 1</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1.0</td>\n",
" <td> 1</td>\n",
" <td> 81</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td>-0.76</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 13.635</td>\n",
" <td> 80</td>\n",
" <td> 0</td>\n",
" <td> 32</td>\n",
" <td> 2</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 80.5</td>\n",
" <td> 1</td>\n",
" <td> 80</td>\n",
" <td> 81</td>\n",
" <td> 1</td>\n",
" <td> 0.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 75</td>\n",
" <td> 0</td>\n",
" <td> 13</td>\n",
" <td> 3</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 80.0</td>\n",
" <td> 1</td>\n",
" <td> 80</td>\n",
" <td> 80</td>\n",
" <td> 1</td>\n",
" <td> 0.76</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 14.395</td>\n",
" <td> 75</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 2011</td>\n",
" <td>...</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 77.5</td>\n",
" <td> 1</td>\n",
" <td> 75</td>\n",
" <td> 80</td>\n",
" <td> 1</td>\n",
" <td> 0.00</td>\n",
" <td> 5.3</td>\n",
" <td>-5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 37 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
" season holiday workingday weather atemp humidity windspeed count \\\n",
"0 1 0 0 1 14.395 81 0 16 \n",
"1 1 0 0 1 13.635 80 0 40 \n",
"2 1 0 0 1 13.635 80 0 32 \n",
"3 1 0 0 1 14.395 75 0 13 \n",
"4 1 0 0 1 14.395 75 0 1 \n",
"\n",
" hours years ... temp6Hbefore humidityavg6Hbefore \\\n",
"0 0 2011 ... 1 1 \n",
"1 1 2011 ... 1 1 \n",
"2 2 2011 ... 1 1 \n",
"3 3 2011 ... 1 1 \n",
"4 4 2011 ... 1 1 \n",
"\n",
" humidityavg2Hbefore humidityavg1Dbefore humidity1Hbefore \\\n",
"0 1.0 1 1 \n",
"1 1.0 1 81 \n",
"2 80.5 1 80 \n",
"3 80.0 1 80 \n",
"4 77.5 1 75 \n",
"\n",
" humidity2Hbefore humidity6Hbefore TDiff TDiff10H HDiff10H \n",
"0 1 1 1.00 5.3 -5 \n",
"1 1 1 -0.76 5.3 -5 \n",
"2 81 1 0.00 5.3 -5 \n",
"3 80 1 0.76 5.3 -5 \n",
"4 80 1 0.00 5.3 -5 \n",
"\n",
"[5 rows x 37 columns]"
]
}
],
"prompt_number": 5
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Data Preparation: Converting to VW format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now convert the data into a format as required by vowpal wabbit. Basically VW requires each row of the dataset in the follwoing format:\n",
" \n",
" 100 |n var1:20 var2:20 var3:2 |c 1.0 2.0 0.0 1.0\n",
"The first number 100 is the dependent variable, the one we want to predict (in our case the ridership value). |n defines a namespace which denotes the begining of the numerical features of our dataset. So this example dataset has 3 numerical features, both the name of the feature and value are required. It is then followed by |c which indicates the categorical features of the dataset. Here only the values are required.\n",
"\n",
"We can generate this format from our raw dataset as follows."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from datetime import datetime\n",
"from csv import DictReader\n",
"\n",
"def csv_to_vw(input, loc_output, train=True):\n",
" \"\"\"\n",
" Munges our input dataset to a VW file (loc_output). Set \"train\"\n",
" to False when munging a test set.\n",
" \"\"\"\n",
" start = datetime.now()\n",
" \n",
" with open(loc_output,\"wb\") as outfile:\n",
" for e, row in input.iterrows():\n",
"\t\n",
"\t #Creating the features\n",
" numerical_features = \"\"\n",
" categorical_features = \"\"\n",
" for k in input.columns:\n",
" if k not in ['season', 'weather', 'workingday', 'holiday', 'count', 'hours','weekdays','sinHour', 'cosHour','sinWeekDay','cosWeekDay']:\n",
" if len(str(row[k])) > 0: #check for empty values\n",
" numerical_features += \" %s:%s\" % (k,row[k])\n",
" elif k not in ['count']:\n",
" if len(str(row[k])) > 0:\n",
" categorical_features += \" %s\" % row[k]\n",
"\t\t\t \n",
"\t #Creating the labels\t\t \n",
" if train: #we care about labels\n",
" count = np.log(row['count']+1)\n",
" outfile.write( \"%s |n%s |c%s\\n\" % (count,numerical_features,categorical_features) )\n",
"\t\t\n",
" else: #we dont care about labels\n",
" outfile.write( \"1 |n%s |c%s\\n\" % (numerical_features,categorical_features) )\n",
" \n",
"\t #Reporting progress\n",
" if e % 1000000 == 0:\n",
" print(\"%s\\t%s\"%(e, str(datetime.now() - start)))\n",
"\n",
" print(\"\\n %s Task execution time:\\n\\t%s\"%(e, str(datetime.now() - start)))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
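{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the log(count + 1) label above: this competition is scored with the Root Mean Squared Logarithmic Error (RMSLE), so training on log-transformed counts aligns VW's loss with the evaluation metric. For reference, a small sketch of RMSLE, where y holds the true counts and p the predicted counts (both numpy arrays):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# RMSLE, the competition's evaluation metric (reference sketch).\n",
"def rmsle(y, p):\n",
"    return np.sqrt(np.mean((np.log(p + 1) - np.log(y + 1)) ** 2))"
],
"language": "python",
"metadata": {},
"outputs": []
},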
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having written the function we now generate our training dataset in vw format. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv_to_vw(X, \"data/bikesharing_train.txt\",train=True)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0\t0:00:00.006713\n",
"\n",
" 10885 Task execution time:\n",
"\t0:00:09.618165"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 9
},
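{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can peek at the first line of the generated file to confirm it matches the label |n ... |c ... layout described above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Inspect the first generated line to verify the VW layout.\n",
"with open('data/bikesharing_train.txt') as f:\n",
"    print f.readline()"
],
"language": "python",
"metadata": {},
"outputs": []
},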
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Data Analyis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Picking a loss function**\n",
"\n",
"One of the most important aspects of using vowpal wabbit succefully is to pick the right loss function over which the algorithm optimizes to learn. Online machine learning with VW learns from samples one at a time. When our model is trained it iterates through the train dataset and optimizes this function.\n",
"\n",
"Vowpal Wabbit has five loss functions:\n",
"1. Squared loss. Useful for regression problems, when minimizing expectation. For example: Expected return on a stock.\n",
"2. Classic loss. Vanilla squared loss (without the importance weight aware update).\n",
"3. Quantile loss. Useful for regression problems, for example: predicting house pricing.\n",
"4. Hinge loss. Useful for classification problems, minimizing the yes/no question (closest 0-1 approximation). For example: Keyword_tag or not.\n",
"5. Log loss. Useful for classification problems, minimizer = probability, for example: Probability of click on ad.\n",
"\n",
"As we are predicting the hourly ridership, so the Quantile loss is more suitable for our purpose. "
]
},
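{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quantile loss takes a parameter, --quantile_tau, which defaults to 0.5 (the median). A sketch of the training command with this parameter spelled out explicitly would be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_train.txt -c --passes 25 -f bike.model.vw --loss_function quantile --quantile_tau 0.5"
],
"language": "python",
"metadata": {},
"outputs": []
},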
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now train a model with the following command from terminal:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_train.txt -c --passes 25 -f bike.model.vw --loss_function quantile"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the command, -d bikesharing_train.txt says to use our train dataset that we had created earlier. --passes 25 says to iterate over the train dataset 25 times for learning and -c allows VW to cache the data in a faster to handle format so that the 25 passes are quicker. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of the training process is as follows. The details explanation of various numbers can be found at the tutorial link shared in the begining. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"final_regressor = bike.model.vw\n",
"Num weight bits = 18\n",
"learning rate = 0.5\n",
"initial_t = 0\n",
"power_t = 0.5\n",
"decay_learning_rate = 1\n",
"using cache_file = bikesharing_train.txt.cache\n",
"ignoring text input in favor of cache input\n",
"num sources = 1\n",
"average since example example current current current\n",
"loss last counter weight label predict features\n",
"2.907065 2.907065 1 1.0 5.8141 0.0000 22\n",
"2.499122 2.091178 2 2.0 4.6913 0.5090 22\n",
"1.573641 0.648161 4 4.0 3.5264 4.1038 22\n",
"1.262409 0.951176 8 8.0 5.8435 3.7025 22\n",
"0.876005 0.489601 16 16.0 5.4638 4.4911 22\n",
"0.581526 0.287047 32 32.0 1.3863 3.3024 22\n",
"0.530048 0.478569 64 64.0 6.1420 4.0435 22\n",
"0.489496 0.448944 128 128.0 5.6560 6.3817 22\n",
"0.413665 0.337835 256 256.0 6.5221 6.7074 22\n",
"0.383470 0.353274 512 512.0 4.3820 4.8103 22\n",
"0.378252 0.373035 1024 1024.0 6.4953 4.2516 21\n",
"0.350751 0.323249 2048 2048.0 6.0822 5.8995 22\n",
"0.344446 0.338141 4096 4096.0 5.6733 5.3636 22\n",
"0.349611 0.349611 8192 8192.0 6.5596 4.3820 22 h\n",
"0.345637 0.341667 16384 16384.0 5.6595 4.5835 20 h\n",
"0.343242 0.340848 32768 32768.0 1.9459 3.1003 21 h\n",
"0.341692 0.340143 65536 65536.0 4.6728 4.7843 22 h\n",
"0.339989 0.338285 131072 131072.0 3.3673 3.0850 22 h\n",
"\n",
"finished run\n",
"number of examples per pass = 7348\n",
"passes used = 22\n",
"weighted example sum = 161657\n",
"weighted label sum = 743073\n",
"average loss = 0.338136 h\n",
"best constant = 4.5966\n",
"total feature number = 3475384\n",
"```"
]
},
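{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough back-of-the-envelope reading of this output (assuming the default tau = 0.5, where the quantile loss on one example is 0.5 * |y - p|): the holdout average loss of 0.338136 corresponds to a mean absolute error of about 0.68 on the log(count + 1) scale, i.e. predictions typically land within a multiplicative factor of roughly 2 of the true count."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# With tau = 0.5, quantile loss = 0.5 * |y - p|, so the holdout MAE on\n",
"# the log(count + 1) scale is about twice the reported average loss.\n",
"mae_log = 2 * 0.338136\n",
"print mae_log, np.exp(mae_log)  # multiplicative error factor of ~2"
],
"language": "python",
"metadata": {},
"outputs": []
},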
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now prepare the test data set in a similar way as we did with the training dataset. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"test = pd.read_csv('data/bikesharing_test.csv')\n",
"test_hours = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').hour)\n",
"test_hours.name = 'hours'\n",
"test_years = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').year)\n",
"test_years.name = 'years'\n",
"test_months = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').month)\n",
"test_months.name = 'months'\n",
"test_weekdays = test.datetime.apply(lambda d: datetime.strptime(d, '%Y-%m-%d %H:%M:%S').weekday())\n",
"test_weekdays.name = 'weekdays'\n",
"\n",
"test_sinHour = np.sin(2*np.pi*test_hours/23); test_sinHour.name = 'sinHour'\n",
"test_sinWeekDay = np.sin(2*np.pi*test_weekdays/6); test_sinWeekDay.name = 'sinWeekDay'\n",
"test_cosHour = np.cos(2*np.pi*test_hours/23); test_cosHour.name = 'cosHour'\n",
"test_cosWeekDay = np.cos(2*np.pi*test_weekdays/6); test_cosWeekDay.name = 'cosWeekDay'\n",
"\n",
"Testtemp1Hbefore = test['atemp'].shift(1).fillna(1); Testtemp1Hbefore.name = 'temp1Hbefore'\n",
"Testtemp2Hbefore = test['atemp'].shift(2).fillna(1); Testtemp2Hbefore.name = 'temp2Hbefore'\n",
"Testtemp3Hbefore = test['atemp'].shift(3).fillna(1); Testtemp3Hbefore.name = 'temp3Hbefore'\n",
"Testtemp6Hbefore = test['atemp'].shift(6).fillna(1); Testtemp6Hbefore.name = 'temp6Hbefore'\n",
"Testhumidity1Hbefore = test['humidity'].shift(1).fillna(1); Testhumidity1Hbefore.name = 'humidity1Hbefore'\n",
"Testhumidity2Hbefore = test['humidity'].shift(2).fillna(1); Testhumidity2Hbefore.name = 'humidity2Hbefore'\n",
"Testhumidity3Hbefore = test['humidity'].shift(3).fillna(1); Testhumidity3Hbefore.name = 'humidity3Hbefore'\n",
"Testhumidity6Hbefore = test['humidity'].shift(6).fillna(1); Testhumidity6Hbefore.name = 'humidity6Hbefore'\n",
"Testwind1Hbefore = test['windspeed'].shift(1).fillna(1); Testwind1Hbefore.name = 'wind1Hbefore'\n",
"Testwind2Hbefore = test['windspeed'].shift(2).fillna(1); Testwind2Hbefore.name = 'wind2Hbefore'\n",
"Testwind3Hbefore = test['windspeed'].shift(3).fillna(1); Testwind3Hbefore.name = 'wind3Hbefore'\n",
"Testwind6Hbefore = test['windspeed'].shift(6).fillna(1); Testwind6Hbefore.name = 'wind6Hbefore'\n",
"\n",
"Testtempavg2Hbefore = pd.stats.moments.rolling_mean(test['atemp'],2,min_periods=2).fillna(1); Testtempavg2Hbefore.name = 'tempavg2Hbefore'\n",
"Testhumidityavg2Hbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),2,min_periods=2).fillna(1); Testhumidityavg2Hbefore.name = 'humidityavg2Hbefore'\n",
"Testwindavg2Hbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),2,min_periods=2).fillna(1); Testwindavg2Hbefore.name = 'windavg2Hbefore'\n",
"Testtempavg6Hbefore = pd.stats.moments.rolling_mean(test['atemp'].shift(1),6,min_periods=6).fillna(1); Testtempavg6Hbefore.name = 'tempavg6Hbefore'\n",
"Testhumidityavg6Hbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),6,min_periods=6).fillna(1); Testhumidityavg6Hbefore.name = 'humidityavg6Hbefore'\n",
"Testwindavg6Hbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),6,min_periods=6).fillna(1); Testwindavg6Hbefore.name = 'windavg6Hbefore'\n",
"Testtempavg1Dbefore = pd.stats.moments.rolling_mean(test['atemp'].shift(1),24,min_periods=24).fillna(1); Testtempavg1Dbefore.name = 'tempavg1Dbefore'\n",
"Testhumidityavg1Dbefore = pd.stats.moments.rolling_mean(test['humidity'].shift(1),24,min_periods=24).fillna(1); Testhumidityavg1Dbefore.name = 'humidityavg1Dbefore'\n",
"Testwindavg1Dbefore = pd.stats.moments.rolling_mean(test['windspeed'].shift(1),24,min_periods=24).fillna(1); Testwindavg1Dbefore.name = 'windavg1Dbefore'\n",
"\n",
"TestTDiff = (test['atemp'] - test['atemp'].shift(1)).fillna(1); TestTDiff.name = 'TDiff'\n",
"TestTDiff10H = (test['atemp'] - test['atemp'].shift(10)).fillna(method = 'bfill'); TestTDiff10H.name = 'TDiff10H'\n",
"TestHDiff10H = (test['humidity'] - test['humidity'].shift(10)).fillna(method = 'bfill'); TestHDiff10H.name = 'HDiff10H'\n",
"\n",
"\n",
"newtest = test.ix[:,(1,2,3,4,6,7,8)].join(hours).join(years).join(weekdays).join(months).join(sinHour).join(sinWeekDay).join(cosHour).join(cosWeekDay)\n",
"newtest = newtest.join(windavg6Hbefore).join(windavg2Hbefore).join(windavg1Dbefore).join(wind1Hbefore).join(wind2Hbefore).join(wind6Hbefore)\n",
"newtest = newtest.join(tempavg6Hbefore).join(tempavg2Hbefore).join(tempavg1Dbefore).join(temp1Hbefore).join(temp2Hbefore).join(temp6Hbefore)\n",
"newtest = newtest.join(humidityavg6Hbefore).join(humidityavg2Hbefore).join(humidityavg1Dbefore).join(humidity1Hbefore).join(humidity2Hbefore).join(humidity6Hbefore)\n",
"newtest = newtest.join(TDiff).join(TDiff10H).join(HDiff10H)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then munge the test dataset into the VW format using the csv_to_vw function that we defined earlier. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv_to_vw(newtest, \"data/bikesharing_test.txt\",train=False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0\t0:00:00.039549\n",
"\n",
" 6492 Task execution time:\n",
"\t0:00:05.276548"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now predict the testset output with the following command from the terminal. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#vw -d bikesharing_test.txt -t -i bike.model.vw -p bikesharing_preds.txt "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The -t says to test only and not train. -i bike.model.vw says to use the model that we learned from the training process. -p saves our predictions to bikesharing_preds.txt. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Upon running the command VW gives the following output and generates the bikesharing_preds.txt file. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"only testing\n",
"Num weight bits = 18\n",
"learning rate = 10\n",
"initial_t = 1\n",
"power_t = 0.5\n",
"predictions = bikesharing_preds.txt\n",
"using no cache\n",
"Reading datafile = bikesharing_test.txt\n",
"num sources = 1\n",
"average since example example current current current\n",
"loss last counter weight label predict features\n",
"8.663692 8.663692 1 1.0 1.0000 3.9434 37\n",
"6.498619 4.333547 2 2.0 1.0000 3.0817 35\n",
"5.028621 3.558624 4 4.0 1.0000 2.6682 34\n",
"4.102480 3.176339 8 8.0 1.0000 2.8937 35\n",
"4.009334 3.916188 16 16.0 1.0000 3.0832 36\n",
"5.337796 6.666259 32 32.0 1.0000 3.7795 35\n",
"7.840713 10.343631 64 64.0 1.0000 4.5281 37\n",
"7.289603 6.738493 128 128.0 1.0000 3.0337 35\n",
"6.635136 5.980668 256 256.0 1.0000 4.7377 36\n",
"6.345960 6.056784 512 512.0 1.0000 3.0137 37\n",
"6.778207 7.210454 1024 1024.0 1.0000 4.7802 36\n",
"7.933761 9.089315 2048 2048.0 1.0000 3.6618 35\n",
"10.755411 13.577061 4096 4096.0 1.0000 3.9191 37\n",
"\n",
"finished run\n",
"number of examples per pass = 6493\n",
"passes used = 1\n",
"weighted example sum = 6493\n",
"weighted label sum = 6493\n",
"average loss = 10.4092\n",
"best constant = 1\n",
"total feature number = 233568\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then read in our predictions. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"predictions = pd.read_csv('data/bikesharing_preds.txt', header = None, index_col = False)\n",
"np.exp(predictions[0]), sum(np.exp(predictions[0]) < 0) # to ensure no negative outputs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"(0 51.594496\n",
" 1 21.795794\n",
" 2 21.799325\n",
" 3 14.414592\n",
" 4 15.894275\n",
" 5 12.949834\n",
" 6 17.952780\n",
" 7 18.060405\n",
" 8 17.731594\n",
" 9 18.306039\n",
" 10 18.681598\n",
" 11 19.245206\n",
" 12 21.042664\n",
" 13 19.939207\n",
" 14 20.742671\n",
" ...\n",
" 6478 43.846943\n",
" 6479 36.068648\n",
" 6480 36.442293\n",
" 6481 37.595298\n",
" 6482 40.524795\n",
" 6483 39.265372\n",
" 6484 39.983625\n",
" 6485 38.981766\n",
" 6486 41.348618\n",
" 6487 40.435010\n",
" 6488 39.964598\n",
" 6489 41.430901\n",
" 6490 45.402810\n",
" 6491 51.559114\n",
" 6492 50.929382\n",
" Name: 0, Length: 6493, dtype: float64, 0)"
]
}
],
"prompt_number": 20
},
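{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: since the labels were log(count + 1), the exact inverse transform is exp(prediction) - 1 rather than exp(prediction). The difference of 1 rider per hour is small, but a corrected sketch would be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Exact inverse of the log(count + 1) label transform used in csv_to_vw.\n",
"corrected = np.expm1(predictions[0])  # equivalent to np.exp(...) - 1"
],
"language": "python",
"metadata": {},
"outputs": []
},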
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally write it to a file in the format required by the Kaggle competition. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"output = pd.DataFrame({'datetime': test.ix[:,0], 'count': np.exp(predictions[0])})\n",
"output.to_csv('data/bikesharingVW_submission.csv', index = False)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 19
},
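{
"cell_type": "markdown",
"metadata": {},
"source": [
"One optional defensive tweak: a DataFrame built from a dict orders its columns alphabetically (count before datetime here), so the column order can be pinned explicitly when writing the submission:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Optional: pin the column order explicitly in the submission file.\n",
"output.to_csv('data/bikesharingVW_submission.csv', index=False,\n",
"              columns=['datetime', 'count'])"
],
"language": "python",
"metadata": {},
"outputs": []
},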
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Acknowledgements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I would like to thank MLWave and his blog post on using VW [here](http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/)."
]
}
],
"metadata": {}
}
]
}