First parquet notebook

@martindurant
Last active October 19, 2016 20:27
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A new parquet reader for Python\n",
"\n",
"The parquet format has become ubiquitous in the Hadoop/big-data ecosystem as a compact and performant binary data store. It is the default output format of the popular Spark computation engine. Parquet provides a number of useful features:\n",
"- column-wise storage, allowing selective loading and parsing of only the necessary parts of the dataset\n",
"- a choice of simple encoding methods per data chunk, such as run-length encoding and delta encoding\n",
"- compression of data chunks, with a choice of algorithm per column\n",
"- column statistics per chunk, and partitioning of the data, allowing whole regions of the data to be skipped without loading them."
]
},
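{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a flavour of the per-chunk encodings listed above, here is a toy run-length encoder in plain Python. This is purely illustrative: the real parquet format packs runs into a compact binary layout rather than Python lists.\n",
"\n",
"```python\n",
"def rle_encode(values):\n",
"    # collapse consecutive repeats into [value, count] pairs\n",
"    out = []\n",
"    for v in values:\n",
"        if out and out[-1][0] == v:\n",
"            out[-1][1] += 1\n",
"        else:\n",
"            out.append([v, 1])\n",
"    return out\n",
"\n",
"# a sorted year column compresses extremely well under RLE\n",
"rle_encode([1994, 1994, 1994, 1995, 1995])  # [[1994, 3], [1995, 2]]\n",
"```\n",
"\n",
"A column such as `year`, with millions of identical consecutive values, reduces to a handful of pairs under this scheme."
]
},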
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common data format for big data is probably still CSV, because of its simplicity and because humans can inspect it directly. Here we have a large extract of the airlines dataset: flight departures and arrivals in the USA across several years."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 2641451323 19 Oct 12:52 bigone.csv\r\n"
]
}
],
"source": [
"ls -l bigone.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",year,month,day,dayofweek,dep_time,crs_dep_time,arr_time,crs_arr_time,carrier,flight_num,tail_num,actual_elapsed_time,crs_elapsed_time,airtime,arrdelay,depdelay,origin,dest,distance,taxi_in,taxi_out,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,late_aircraft_delay\r\n",
"0,1994,1,7,5,858.0,900,954.0,1003,b'US',227,,56.0,63.0,,-9.0,-2.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"1,1994,1,8,6,859.0,900,952.0,1003,b'US',227,,53.0,63.0,,-11.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"2,1994,1,10,1,935.0,900,1023.0,1003,b'US',227,,48.0,63.0,,20.0,35.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"3,1994,1,11,2,903.0,900,1131.0,1003,b'US',227,,148.0,63.0,,88.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"4,1994,1,12,3,933.0,900,1024.0,1003,b'US',227,,51.0,63.0,,21.0,33.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"5,1994,1,13,4,,900,,1003,b'US',227,,,63.0,,,,b'CLT',b'ORF',290.0,,,1,,0,,,,\r\n",
"6,1994,1,14,5,903.0,900,1005.0,1003,b'US',227,,62.0,63.0,,2.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"7,1994,1,15,6,859.0,900,1004.0,1003,b'US',227,,65.0,63.0,,1.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"8,1994,1,17,1,859.0,900,955.0,1003,b'US',227,,56.0,63.0,,-8.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n"
]
}
],
"source": [
"!head bigone.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load this easily with pandas. Note that there are many options to `read_csv()` that we do not investigate here."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<string>:2: DtypeWarning: Columns (23) have mixed types. Specify dtype option on import or set low_memory=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 7s, sys: 7.94 s, total: 1min 15s\n",
"Wall time: 1min 16s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In memory, with the standard datatypes inferred by pandas, the data occupies about double its on-disk size."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int64\n",
"month int64\n",
"day int64\n",
"dayofweek int64\n",
"dep_time float64\n",
"crs_dep_time int64\n",
"arr_time float64\n",
"crs_arr_time int64\n",
"carrier object\n",
"flight_num int64\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin object\n",
"dest object\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int64\n",
"cancellation_code object\n",
"diverted int64\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: float64(15), int64(9), object(4)\n",
"memory usage: 8.7 GB\n"
]
}
],
"source": [
"df.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas allows us to extract only those columns which are required. This makes the output dataframe much smaller in memory, and saves on parsing time, because pandas never has to consider the contents of the unused columns. (Note that we are not attempting to correct for OS-level caching of file data.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 31.3 s, sys: 2.12 s, total: 33.4 s\n",
"Wall time: 33.4 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, pandas still has to read every single line of the input file, so reducing the number of columns from 28 to one yields a speedup of only about 4×. The pandas parser runs in optimized C code, and is about as fast as CSV reading gets."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 18.3 s, sys: 838 ms, total: 19.1 s\n",
"Wall time: 19.1 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet file\n",
"\n",
"We also have the same data in parquet format. The file size on disk is about one tenth of the CSV version, which will be very important when the data transfer is a significant part of the workflow (e.g., pulling the data across a network). "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 265278855 9 Sep 2015 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq\r\n"
]
}
],
"source": [
"ls -l 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We read this data using the parquet library that we are developing for Python. Notice that the metadata (such as the number of rows and any partitions) is available immediately."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'categories': [],\n",
" 'columns': ['year',\n",
" 'month',\n",
" 'day',\n",
" 'dayofweek',\n",
" 'dep_time',\n",
" 'crs_dep_time',\n",
" 'arr_time',\n",
" 'crs_arr_time',\n",
" 'carrier',\n",
" 'flight_num',\n",
" 'tail_num',\n",
" 'actual_elapsed_time',\n",
" 'crs_elapsed_time',\n",
" 'airtime',\n",
" 'arrdelay',\n",
" 'depdelay',\n",
" 'origin',\n",
" 'dest',\n",
" 'distance',\n",
" 'taxi_in',\n",
" 'taxi_out',\n",
" 'cancelled',\n",
" 'cancellation_code',\n",
" 'diverted',\n",
" 'carrier_delay',\n",
" 'weather_delay',\n",
" 'nas_delay',\n",
" 'security_delay',\n",
" 'late_aircraft_delay'],\n",
" 'name': '4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq',\n",
" 'rows': 22651194}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pf = parquet.ParquetFile('4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq')\n",
"pf.info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading this file is about three times faster than reading the CSV version, above."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 18.7 s, sys: 5.35 s, total: 24.1 s\n",
"Wall time: 24.7 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest', 'cancellation_code'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual_elapsed_time</th>\n",
" <th>airtime</th>\n",
" <th>arr_time</th>\n",
" <th>arrdelay</th>\n",
" <th>cancellation_code</th>\n",
" <th>cancelled</th>\n",
" <th>carrier</th>\n",
" <th>carrier_delay</th>\n",
" <th>crs_arr_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>...</th>\n",
" <th>flight_num</th>\n",
" <th>late_aircraft_delay</th>\n",
" <th>month</th>\n",
" <th>nas_delay</th>\n",
" <th>origin</th>\n",
" <th>tail_num</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>weather_delay</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>56.0</td>\n",
" <td>NaN</td>\n",
" <td>954.0</td>\n",
" <td>-9.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>53.0</td>\n",
" <td>NaN</td>\n",
" <td>952.0</td>\n",
" <td>-11.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>48.0</td>\n",
" <td>NaN</td>\n",
" <td>1023.0</td>\n",
" <td>20.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>148.0</td>\n",
" <td>NaN</td>\n",
" <td>1131.0</td>\n",
" <td>88.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>51.0</td>\n",
" <td>NaN</td>\n",
" <td>1024.0</td>\n",
" <td>21.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" actual_elapsed_time airtime arr_time arrdelay cancellation_code \\\n",
"0 56.0 NaN 954.0 -9.0 NaN \n",
"1 53.0 NaN 952.0 -11.0 NaN \n",
"2 48.0 NaN 1023.0 20.0 NaN \n",
"3 148.0 NaN 1131.0 88.0 NaN \n",
"4 51.0 NaN 1024.0 21.0 NaN \n",
"\n",
" cancelled carrier carrier_delay crs_arr_time crs_dep_time ... \\\n",
"0 0 b'US' NaN 1003 900 ... \n",
"1 0 b'US' NaN 1003 900 ... \n",
"2 0 b'US' NaN 1003 900 ... \n",
"3 0 b'US' NaN 1003 900 ... \n",
"4 0 b'US' NaN 1003 900 ... \n",
"\n",
" flight_num late_aircraft_delay month nas_delay origin tail_num \\\n",
"0 227 NaN 1 NaN b'CLT' NaN \n",
"1 227 NaN 1 NaN b'CLT' NaN \n",
"2 227 NaN 1 NaN b'CLT' NaN \n",
"3 227 NaN 1 NaN b'CLT' NaN \n",
"4 227 NaN 1 NaN b'CLT' NaN \n",
"\n",
" taxi_in taxi_out weather_delay year \n",
"0 NaN NaN NaN 1994 \n",
"1 NaN NaN NaN 1994 \n",
"2 NaN NaN NaN 1994 \n",
"3 NaN NaN NaN 1994 \n",
"4 NaN NaN NaN 1994 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice also that columns which were stored with dictionary encoding in the file can be read directly into categorical pandas columns, without any intermediate representation. In this dataset, some columns with clear numerical meaning (such as month) were also dictionary-encoded; for these we have chosen to expand the dictionary back out to plain values. The use of categoricals and int32 columns makes this version of the dataframe substantially smaller in memory."
]
},
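{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saving from categoricals is easy to demonstrate with pandas alone. A small sketch with made-up carrier codes (the real column has 22 million rows, so the effect there is far larger):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# object dtype stores one Python string object per row\n",
"s = pd.Series(['US', 'AA', 'WN'] * 100000)\n",
"# categorical stores small integer codes plus one copy of each label\n",
"cat = s.astype('category')\n",
"print(s.memory_usage(deep=True), cat.memory_usage(deep=True))\n",
"```"
]
},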
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"actual_elapsed_time float64\n",
"airtime float64\n",
"arr_time float64\n",
"arrdelay float64\n",
"cancellation_code category\n",
"cancelled int32\n",
"carrier category\n",
"carrier_delay float64\n",
"crs_arr_time int32\n",
"crs_dep_time int32\n",
"crs_elapsed_time float64\n",
"day int32\n",
"dayofweek int32\n",
"dep_time float64\n",
"depdelay float64\n",
"dest category\n",
"distance float64\n",
"diverted int32\n",
"flight_num int32\n",
"late_aircraft_delay float64\n",
"month int32\n",
"nas_delay float64\n",
"origin category\n",
"tail_num float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"weather_delay float64\n",
"year int32\n",
"dtypes: category(4), float64(15), int32(9)\n",
"memory usage: 3.4 GB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start to see a more significant speedup when loading only a selection of the columns: this example is about six times faster than the CSV version."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.45 s, sys: 1.07 s, total: 5.52 s\n",
"Wall time: 5.53 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for a single column (in this case a categorical one), the speedup is a factor of over 50."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 205 ms, sys: 171 ms, total: 376 ms\n",
"Wall time: 374 ms\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions\n",
"\n",
"The parquet library is a work in progress. It can already handle multi-file and partitioned data, such as those produced by Hive/Spark, but does not yet allow filtering to avoid reading unnecessary sections of the data. We also have preliminary support for writing parquet files with simple data types.\n",
"\n",
"The plan is to be able to load separate row groups and files within a single dataset in parallel using dask, to enable maximal use of data bandwidth and processing power."
]
}
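,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mechanics of that dask plan can be sketched with `dask.delayed` and `dask.dataframe.from_delayed`. Here `load_rowgroup` is a hypothetical stand-in for a per-row-group `to_pandas` call; the real reader would pass file and row-group references rather than building frames in memory:\n",
"\n",
"```python\n",
"import dask\n",
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"\n",
"def load_rowgroup(i):\n",
"    # stand-in for reading one parquet row group into a DataFrame\n",
"    return pd.DataFrame({'carrier': ['US', 'AA'], 'arrdelay': [-9.0, 20.0]})\n",
"\n",
"# one lazy partition per row group; dask can materialize them in parallel\n",
"parts = [dask.delayed(load_rowgroup)(i) for i in range(4)]\n",
"ddf = dd.from_delayed(parts)\n",
"len(ddf.compute())  # 8\n",
"```"
]
}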
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
"The parquet format has become ubiquitous in the hadoop/big-data ecosystem as a compact and performant binary data store. It is the default output format in the popular spark computation engine. Parquet provides a number of usefult features:\n",
"- column-wise storage, allowing selective loading and parsing of only the necessary parts of the dataset\n",
"- a choice of simple encoding methods per data chunk, such as run-length encoding and delta-encoding\n",
"- compression of data chunks with a choice of algorithm per column\n",
"- column statistics per chunk and partitioning of the data, allowing for exclusing some region of columns without loading."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common data format for big data is probably still CSV, because of its simplicity and the ability for humans to directly view the data. Here we have a large extract of the airlines dataset, flight departures and arrivals in the USA across several years."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 2641451323 19 Oct 12:52 bigone.csv\r\n"
]
}
],
"source": [
"ls -l bigone.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",year,month,day,dayofweek,dep_time,crs_dep_time,arr_time,crs_arr_time,carrier,flight_num,tail_num,actual_elapsed_time,crs_elapsed_time,airtime,arrdelay,depdelay,origin,dest,distance,taxi_in,taxi_out,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,late_aircraft_delay\r\n",
"0,1994,1,7,5,858.0,900,954.0,1003,b'US',227,,56.0,63.0,,-9.0,-2.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"1,1994,1,8,6,859.0,900,952.0,1003,b'US',227,,53.0,63.0,,-11.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"2,1994,1,10,1,935.0,900,1023.0,1003,b'US',227,,48.0,63.0,,20.0,35.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"3,1994,1,11,2,903.0,900,1131.0,1003,b'US',227,,148.0,63.0,,88.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"4,1994,1,12,3,933.0,900,1024.0,1003,b'US',227,,51.0,63.0,,21.0,33.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"5,1994,1,13,4,,900,,1003,b'US',227,,,63.0,,,,b'CLT',b'ORF',290.0,,,1,,0,,,,\r\n",
"6,1994,1,14,5,903.0,900,1005.0,1003,b'US',227,,62.0,63.0,,2.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"7,1994,1,15,6,859.0,900,1004.0,1003,b'US',227,,65.0,63.0,,1.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"8,1994,1,17,1,859.0,900,955.0,1003,b'US',227,,56.0,63.0,,-8.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n"
]
}
],
"source": [
"!head bigone.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load this easily with pandas. Note that there are many options to `read_csv()` that we do not investigate here."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<string>:2: DtypeWarning: Columns (23) have mixed types. Specify dtype option on import or set low_memory=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 3s, sys: 6.94 s, total: 1min 10s\n",
"Wall time: 1min 10s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The size in memory is about double the original on-disc with the standard datatypes inferred by pandas."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int64\n",
"month int64\n",
"day int64\n",
"dayofweek int64\n",
"dep_time float64\n",
"crs_dep_time int64\n",
"arr_time float64\n",
"crs_arr_time int64\n",
"carrier object\n",
"flight_num int64\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin object\n",
"dest object\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int64\n",
"cancellation_code object\n",
"diverted int64\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: float64(15), int64(9), object(4)\n",
"memory usage: 8.7 GB\n"
]
}
],
"source": [
"df.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas allows us to extract only thos columns which are required. This maked the output dataframe much smaller in memory, and saves on parsing time, because pandas does not consider the contents of many columns. (Note that we are not attempting to correct for OS-level caching of file data.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 31.3 s, sys: 2.22 s, total: 33.5 s\n",
"Wall time: 33.6 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, pandas still has to read every single line of the input file, so even as we reduce the number of columns from 28 to one, we only get a speedup of about 4. The pandas CSV parser runs in optimized C code, and is about as fast as reading CSVs can be."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 19.1 s, sys: 875 ms, total: 19.9 s\n",
"Wall time: 20 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet file\n",
"\n",
"We also have the same data in parquet format. The file size on disk is about one tenth of the CSV version, which will be very important when the data transfer is a significant part of the workflow (e.g., pulling the data across a network). "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 265278855 9 Sep 2015 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq\r\n"
]
}
],
"source": [
"ls -l 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We read this data using the parquet library that we are developing for python. Notice that the metadata (such as the number of rows and any partitions) is available immediately."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'categories': [],\n",
" 'columns': ['year',\n",
" 'month',\n",
" 'day',\n",
" 'dayofweek',\n",
" 'dep_time',\n",
" 'crs_dep_time',\n",
" 'arr_time',\n",
" 'crs_arr_time',\n",
" 'carrier',\n",
" 'flight_num',\n",
" 'tail_num',\n",
" 'actual_elapsed_time',\n",
" 'crs_elapsed_time',\n",
" 'airtime',\n",
" 'arrdelay',\n",
" 'depdelay',\n",
" 'origin',\n",
" 'dest',\n",
" 'distance',\n",
" 'taxi_in',\n",
" 'taxi_out',\n",
" 'cancelled',\n",
" 'cancellation_code',\n",
" 'diverted',\n",
" 'carrier_delay',\n",
" 'weather_delay',\n",
" 'nas_delay',\n",
" 'security_delay',\n",
" 'late_aircraft_delay'],\n",
" 'name': '4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq',\n",
" 'rows': 22651194}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pf = parquet.ParquetFile('4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq')\n",
"pf.info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading this file is almost a factor of three faster than the CSV version, above."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 22.7 s, sys: 10.1 s, total: 32.8 s\n",
"Wall time: 33.5 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest', 'cancellation_code'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also notice that columns which were stored using dictionary encoding in the file can be read directly into categorical pandas columns without any intermediate representation. For this particular dataset, some columns (such as month) were also stored with dictionary encoding but clearly have numerical meaning, so for those we have chosen to expand the dictionary back to plain values. The use of categoricals and int32 columns makes this version of the dataframe much smaller in memory."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int32\n",
"month int32\n",
"day int32\n",
"dayofweek int32\n",
"dep_time float64\n",
"crs_dep_time int32\n",
"arr_time float64\n",
"crs_arr_time int32\n",
"carrier category\n",
"flight_num int32\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin category\n",
"dest category\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int32\n",
"cancellation_code category\n",
"diverted int32\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: category(4), float64(15), int32(9)\n",
"memory usage: 3.4 GB\n"
]
}
],
"source": [
"df.info()"
]
},
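{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate why categorical storage saves so much memory, here is a small self-contained sketch using plain pandas on synthetic data (not the airlines file): a repetitive string column stores every value as a separate Python object, while the categorical version stores each distinct value once plus a small integer code per row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# a repetitive string column, like 'carrier' in the airlines data\n",
"s = pd.Series(['US', 'AA', 'DL', 'WN'] * 250000)\n",
"obj_bytes = s.memory_usage(deep=True)\n",
"cat_bytes = s.astype('category').memory_usage(deep=True)\n",
"print(obj_bytes, cat_bytes, obj_bytes / cat_bytes)"
]
},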
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start to see a more significant speed up when loading only a selection of the columns. This example is about four and a half times faster than the CSV version."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.25 s, sys: 1.91 s, total: 8.15 s\n",
"Wall time: 8.18 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for a single column, in this case a categorical, we see a factor of over 10."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.28 s, sys: 595 ms, total: 1.87 s\n",
"Wall time: 1.88 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pf.to_pandas(cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions\n",
"\n",
"The parquet library is a work in progress. It can already handle multi-file and partitioned data, such as that produced by hive/spark, but does not yet allow filtering to avoid reading unnecessary sections of the data. We also have preliminary support for writing parquet files with simple data types.\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}