First parquet notebook

@martindurant
Last active October 19, 2016 20:27
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A new parquet reader for Python\n",
"\n",
"The parquet format has become ubiquitous in the Hadoop/big-data ecosystem as a compact and performant binary data store. It is the default output format of the popular Spark computation engine. Parquet provides a number of useful features:\n",
"- column-wise storage, allowing selective loading and parsing of only the necessary parts of the dataset\n",
"- a choice of simple encoding methods per data chunk, such as run-length encoding and delta encoding\n",
"- compression of data chunks, with a choice of algorithm per column\n",
"- column statistics per chunk, and partitioning of the data, allowing whole regions of the data to be skipped without loading them."
]
},
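{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a flavour of the per-chunk encodings listed above, here is a toy run-length encoder in plain Python. This is purely illustrative: the real parquet format packs runs into a compact binary layout rather than Python lists.\n",
"\n",
"```python\n",
"def rle_encode(values):\n",
"    # collapse consecutive repeats into [value, count] pairs\n",
"    out = []\n",
"    for v in values:\n",
"        if out and out[-1][0] == v:\n",
"            out[-1][1] += 1\n",
"        else:\n",
"            out.append([v, 1])\n",
"    return out\n",
"\n",
"# a sorted year column compresses extremely well under RLE\n",
"rle_encode([1994, 1994, 1994, 1995, 1995])  # [[1994, 3], [1995, 2]]\n",
"```\n",
"\n",
"A column such as `year`, with millions of identical consecutive values, reduces to a handful of pairs under this scheme."
]
},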
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common data format for big data is probably still CSV, because of its simplicity and because humans can inspect it directly. Here we have a large extract of the airlines dataset: flight departures and arrivals in the USA across several years."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 2641451323 19 Oct 12:52 bigone.csv\r\n"
]
}
],
"source": [
"ls -l bigone.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",year,month,day,dayofweek,dep_time,crs_dep_time,arr_time,crs_arr_time,carrier,flight_num,tail_num,actual_elapsed_time,crs_elapsed_time,airtime,arrdelay,depdelay,origin,dest,distance,taxi_in,taxi_out,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,late_aircraft_delay\r\n",
"0,1994,1,7,5,858.0,900,954.0,1003,b'US',227,,56.0,63.0,,-9.0,-2.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"1,1994,1,8,6,859.0,900,952.0,1003,b'US',227,,53.0,63.0,,-11.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"2,1994,1,10,1,935.0,900,1023.0,1003,b'US',227,,48.0,63.0,,20.0,35.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"3,1994,1,11,2,903.0,900,1131.0,1003,b'US',227,,148.0,63.0,,88.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"4,1994,1,12,3,933.0,900,1024.0,1003,b'US',227,,51.0,63.0,,21.0,33.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"5,1994,1,13,4,,900,,1003,b'US',227,,,63.0,,,,b'CLT',b'ORF',290.0,,,1,,0,,,,\r\n",
"6,1994,1,14,5,903.0,900,1005.0,1003,b'US',227,,62.0,63.0,,2.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"7,1994,1,15,6,859.0,900,1004.0,1003,b'US',227,,65.0,63.0,,1.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"8,1994,1,17,1,859.0,900,955.0,1003,b'US',227,,56.0,63.0,,-8.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n"
]
}
],
"source": [
"!head bigone.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load this easily with pandas. Note that there are many options to `read_csv()` that we do not investigate here."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<string>:2: DtypeWarning: Columns (23) have mixed types. Specify dtype option on import or set low_memory=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 7s, sys: 7.94 s, total: 1min 15s\n",
"Wall time: 1min 16s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In memory, with the standard datatypes inferred by pandas, the data occupies about double its on-disk size."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int64\n",
"month int64\n",
"day int64\n",
"dayofweek int64\n",
"dep_time float64\n",
"crs_dep_time int64\n",
"arr_time float64\n",
"crs_arr_time int64\n",
"carrier object\n",
"flight_num int64\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin object\n",
"dest object\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int64\n",
"cancellation_code object\n",
"diverted int64\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: float64(15), int64(9), object(4)\n",
"memory usage: 8.7 GB\n"
]
}
],
"source": [
"df.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas allows us to extract only those columns which are required. This makes the output dataframe much smaller in memory, and saves on parsing time, because pandas never has to consider the contents of the unused columns. (Note that we are not attempting to correct for OS-level caching of file data.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 31.3 s, sys: 2.12 s, total: 33.4 s\n",
"Wall time: 33.4 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, pandas still has to read every single line of the input file, so reducing the number of columns from 28 to one yields a speedup of only about 4×. The pandas parser runs in optimized C code, and is about as fast as CSV reading gets."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 18.3 s, sys: 838 ms, total: 19.1 s\n",
"Wall time: 19.1 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet file\n",
"\n",
"We also have the same data in parquet format. The file size on disk is about one tenth of the CSV version, which will be very important when the data transfer is a significant part of the workflow (e.g., pulling the data across a network). "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 265278855 9 Sep 2015 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq\r\n"
]
}
],
"source": [
"ls -l 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We read this data using the parquet library that we are developing for Python. Notice that the metadata (such as the number of rows and any partitions) is available immediately."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'categories': [],\n",
" 'columns': ['year',\n",
" 'month',\n",
" 'day',\n",
" 'dayofweek',\n",
" 'dep_time',\n",
" 'crs_dep_time',\n",
" 'arr_time',\n",
" 'crs_arr_time',\n",
" 'carrier',\n",
" 'flight_num',\n",
" 'tail_num',\n",
" 'actual_elapsed_time',\n",
" 'crs_elapsed_time',\n",
" 'airtime',\n",
" 'arrdelay',\n",
" 'depdelay',\n",
" 'origin',\n",
" 'dest',\n",
" 'distance',\n",
" 'taxi_in',\n",
" 'taxi_out',\n",
" 'cancelled',\n",
" 'cancellation_code',\n",
" 'diverted',\n",
" 'carrier_delay',\n",
" 'weather_delay',\n",
" 'nas_delay',\n",
" 'security_delay',\n",
" 'late_aircraft_delay'],\n",
" 'name': '4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq',\n",
" 'rows': 22651194}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pf = parquet.ParquetFile('4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq')\n",
"pf.info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading this file is about three times faster than reading the CSV version, above."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 18.7 s, sys: 5.35 s, total: 24.1 s\n",
"Wall time: 24.7 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest', 'cancellation_code'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual_elapsed_time</th>\n",
" <th>airtime</th>\n",
" <th>arr_time</th>\n",
" <th>arrdelay</th>\n",
" <th>cancellation_code</th>\n",
" <th>cancelled</th>\n",
" <th>carrier</th>\n",
" <th>carrier_delay</th>\n",
" <th>crs_arr_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>...</th>\n",
" <th>flight_num</th>\n",
" <th>late_aircraft_delay</th>\n",
" <th>month</th>\n",
" <th>nas_delay</th>\n",
" <th>origin</th>\n",
" <th>tail_num</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>weather_delay</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>56.0</td>\n",
" <td>NaN</td>\n",
" <td>954.0</td>\n",
" <td>-9.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>53.0</td>\n",
" <td>NaN</td>\n",
" <td>952.0</td>\n",
" <td>-11.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>48.0</td>\n",
" <td>NaN</td>\n",
" <td>1023.0</td>\n",
" <td>20.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>148.0</td>\n",
" <td>NaN</td>\n",
" <td>1131.0</td>\n",
" <td>88.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>51.0</td>\n",
" <td>NaN</td>\n",
" <td>1024.0</td>\n",
" <td>21.0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>b'US'</td>\n",
" <td>NaN</td>\n",
" <td>1003</td>\n",
" <td>900</td>\n",
" <td>...</td>\n",
" <td>227</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>b'CLT'</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" actual_elapsed_time airtime arr_time arrdelay cancellation_code \\\n",
"0 56.0 NaN 954.0 -9.0 NaN \n",
"1 53.0 NaN 952.0 -11.0 NaN \n",
"2 48.0 NaN 1023.0 20.0 NaN \n",
"3 148.0 NaN 1131.0 88.0 NaN \n",
"4 51.0 NaN 1024.0 21.0 NaN \n",
"\n",
" cancelled carrier carrier_delay crs_arr_time crs_dep_time ... \\\n",
"0 0 b'US' NaN 1003 900 ... \n",
"1 0 b'US' NaN 1003 900 ... \n",
"2 0 b'US' NaN 1003 900 ... \n",
"3 0 b'US' NaN 1003 900 ... \n",
"4 0 b'US' NaN 1003 900 ... \n",
"\n",
" flight_num late_aircraft_delay month nas_delay origin tail_num \\\n",
"0 227 NaN 1 NaN b'CLT' NaN \n",
"1 227 NaN 1 NaN b'CLT' NaN \n",
"2 227 NaN 1 NaN b'CLT' NaN \n",
"3 227 NaN 1 NaN b'CLT' NaN \n",
"4 227 NaN 1 NaN b'CLT' NaN \n",
"\n",
" taxi_in taxi_out weather_delay year \n",
"0 NaN NaN NaN 1994 \n",
"1 NaN NaN NaN 1994 \n",
"2 NaN NaN NaN 1994 \n",
"3 NaN NaN NaN 1994 \n",
"4 NaN NaN NaN 1994 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice also that columns which were stored with dictionary encoding in the file can be read directly into categorical pandas columns, without any intermediate representation. In this dataset, some columns with clear numerical meaning (such as month) were also dictionary-encoded; for these we have chosen to expand the dictionary back out to plain values. The use of categoricals and int32 columns makes this version of the dataframe substantially smaller in memory."
]
},
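{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saving from categoricals is easy to demonstrate with pandas alone. A small sketch with made-up carrier codes (the real column has 22 million rows, so the effect there is far larger):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# object dtype stores one Python string object per row\n",
"s = pd.Series(['US', 'AA', 'WN'] * 100000)\n",
"# categorical stores small integer codes plus one copy of each label\n",
"cat = s.astype('category')\n",
"print(s.memory_usage(deep=True), cat.memory_usage(deep=True))\n",
"```"
]
},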
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"actual_elapsed_time float64\n",
"airtime float64\n",
"arr_time float64\n",
"arrdelay float64\n",
"cancellation_code category\n",
"cancelled int32\n",
"carrier category\n",
"carrier_delay float64\n",
"crs_arr_time int32\n",
"crs_dep_time int32\n",
"crs_elapsed_time float64\n",
"day int32\n",
"dayofweek int32\n",
"dep_time float64\n",
"depdelay float64\n",
"dest category\n",
"distance float64\n",
"diverted int32\n",
"flight_num int32\n",
"late_aircraft_delay float64\n",
"month int32\n",
"nas_delay float64\n",
"origin category\n",
"tail_num float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"weather_delay float64\n",
"year int32\n",
"dtypes: category(4), float64(15), int32(9)\n",
"memory usage: 3.4 GB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start to see a more significant speedup when loading only a selection of the columns: this example is about six times faster than the CSV version."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.45 s, sys: 1.07 s, total: 5.52 s\n",
"Wall time: 5.53 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for a single column (in this case a categorical one), the speedup is a factor of over 50."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 205 ms, sys: 171 ms, total: 376 ms\n",
"Wall time: 374 ms\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions\n",
"\n",
"The parquet library is a work in progress. It can already handle multi-file and partitioned data, such as those produced by Hive/Spark, but does not yet allow filtering to avoid reading unnecessary sections of the data. We also have preliminary support for writing parquet files with simple data types.\n",
"\n",
"The plan is to be able to load separate row groups and files within a single dataset in parallel using dask, to enable maximal use of data bandwidth and processing power."
]
}
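,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mechanics of that dask plan can be sketched with `dask.delayed` and `dask.dataframe.from_delayed`. Here `load_rowgroup` is a hypothetical stand-in for a per-row-group `to_pandas` call; the real reader would pass file and row-group references rather than building frames in memory:\n",
"\n",
"```python\n",
"import dask\n",
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"\n",
"def load_rowgroup(i):\n",
"    # stand-in for reading one parquet row group into a DataFrame\n",
"    return pd.DataFrame({'carrier': ['US', 'AA'], 'arrdelay': [-9.0, 20.0]})\n",
"\n",
"# one lazy partition per row group; dask can materialize them in parallel\n",
"parts = [dask.delayed(load_rowgroup)(i) for i in range(4)]\n",
"ddf = dd.from_delayed(parts)\n",
"len(ddf.compute())  # 8\n",
"```"
]
}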
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
"The parquet format has become ubiquitous in the hadoop/big-data ecosystem as a compact and performant binary data store. It is the default output format in the popular spark computation engine. Parquet provides a number of usefult features:\n",
"- column-wise storage, allowing selective loading and parsing of only the necessary parts of the dataset\n",
"- a choice of simple encoding methods per data chunk, such as run-length encoding and delta-encoding\n",
"- compression of data chunks with a choice of algorithm per column\n",
"- column statistics per chunk and partitioning of the data, allowing for exclusing some region of columns without loading."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common data format for big data is probably still CSV, because of its simplicity and the ability for humans to directly view the data. Here we have a large extract of the airlines dataset, flight departures and arrivals in the USA across several years."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 2641451323 19 Oct 12:52 bigone.csv\r\n"
]
}
],
"source": [
"ls -l bigone.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",year,month,day,dayofweek,dep_time,crs_dep_time,arr_time,crs_arr_time,carrier,flight_num,tail_num,actual_elapsed_time,crs_elapsed_time,airtime,arrdelay,depdelay,origin,dest,distance,taxi_in,taxi_out,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,late_aircraft_delay\r\n",
"0,1994,1,7,5,858.0,900,954.0,1003,b'US',227,,56.0,63.0,,-9.0,-2.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"1,1994,1,8,6,859.0,900,952.0,1003,b'US',227,,53.0,63.0,,-11.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"2,1994,1,10,1,935.0,900,1023.0,1003,b'US',227,,48.0,63.0,,20.0,35.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"3,1994,1,11,2,903.0,900,1131.0,1003,b'US',227,,148.0,63.0,,88.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"4,1994,1,12,3,933.0,900,1024.0,1003,b'US',227,,51.0,63.0,,21.0,33.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"5,1994,1,13,4,,900,,1003,b'US',227,,,63.0,,,,b'CLT',b'ORF',290.0,,,1,,0,,,,\r\n",
"6,1994,1,14,5,903.0,900,1005.0,1003,b'US',227,,62.0,63.0,,2.0,3.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"7,1994,1,15,6,859.0,900,1004.0,1003,b'US',227,,65.0,63.0,,1.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n",
"8,1994,1,17,1,859.0,900,955.0,1003,b'US',227,,56.0,63.0,,-8.0,-1.0,b'CLT',b'ORF',290.0,,,0,,0,,,,\r\n"
]
}
],
"source": [
"!head bigone.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load this easily with pandas. Note that there are many options to `read_csv()` that we do not investigate here."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<string>:2: DtypeWarning: Columns (23) have mixed types. Specify dtype option on import or set low_memory=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 3s, sys: 6.94 s, total: 1min 10s\n",
"Wall time: 1min 10s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The size in memory is about double the original on-disc with the standard datatypes inferred by pandas."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int64\n",
"month int64\n",
"day int64\n",
"dayofweek int64\n",
"dep_time float64\n",
"crs_dep_time int64\n",
"arr_time float64\n",
"crs_arr_time int64\n",
"carrier object\n",
"flight_num int64\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin object\n",
"dest object\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int64\n",
"cancellation_code object\n",
"diverted int64\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: float64(15), int64(9), object(4)\n",
"memory usage: 8.7 GB\n"
]
}
],
"source": [
"df.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas allows us to extract only thos columns which are required. This maked the output dataframe much smaller in memory, and saves on parsing time, because pandas does not consider the contents of many columns. (Note that we are not attempting to correct for OS-level caching of file data.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 31.3 s, sys: 2.22 s, total: 33.5 s\n",
"Wall time: 33.6 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, pandas still has to read every single line of the input file, so even as we reduce the number of columns from 28 to one, we only get a speedup of about 4. The pandas CSV parser runs in optimized C code, and is about as fast as reading CSVs can be."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 19.1 s, sys: 875 ms, total: 19.9 s\n",
"Wall time: 20 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pd.read_csv('bigone.csv', usecols=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet file\n",
"\n",
"We also have the same data in parquet format. The file size on disk is about one tenth of the CSV version, which will be very important when the data transfer is a significant part of the workflow (e.g., pulling the data across a network). "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 mdurant staff 265278855 9 Sep 2015 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq\r\n"
]
}
],
"source": [
"ls -l 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We read this data using the parquet library that we are developing for python. Notice that the metadata (such as the number of rows and any partitions) is available immediately."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'categories': [],\n",
" 'columns': ['year',\n",
" 'month',\n",
" 'day',\n",
" 'dayofweek',\n",
" 'dep_time',\n",
" 'crs_dep_time',\n",
" 'arr_time',\n",
" 'crs_arr_time',\n",
" 'carrier',\n",
" 'flight_num',\n",
" 'tail_num',\n",
" 'actual_elapsed_time',\n",
" 'crs_elapsed_time',\n",
" 'airtime',\n",
" 'arrdelay',\n",
" 'depdelay',\n",
" 'origin',\n",
" 'dest',\n",
" 'distance',\n",
" 'taxi_in',\n",
" 'taxi_out',\n",
" 'cancelled',\n",
" 'cancellation_code',\n",
" 'diverted',\n",
" 'carrier_delay',\n",
" 'weather_delay',\n",
" 'nas_delay',\n",
" 'security_delay',\n",
" 'late_aircraft_delay'],\n",
" 'name': '4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq',\n",
" 'rows': 22651194}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pf = parquet.ParquetFile('4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq')\n",
"pf.info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading this file is almost a factor of three faster than the CSV version, above."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 22.7 s, sys: 10.1 s, total: 32.8 s\n",
"Wall time: 33.5 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'dayofweek', 'dep_time', 'crs_dep_time',\n",
" 'arr_time', 'crs_arr_time', 'carrier', 'flight_num', 'tail_num',\n",
" 'actual_elapsed_time', 'crs_elapsed_time', 'airtime', 'arrdelay',\n",
" 'depdelay', 'origin', 'dest', 'distance', 'taxi_in', 'taxi_out',\n",
" 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n",
" 'weather_delay', 'nas_delay', 'late_aircraft_delay']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest', 'cancellation_code'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dayofweek</th>\n",
" <th>dep_time</th>\n",
" <th>crs_dep_time</th>\n",
" <th>arr_time</th>\n",
" <th>crs_arr_time</th>\n",
" <th>carrier</th>\n",
" <th>flight_num</th>\n",
" <th>...</th>\n",
" <th>distance</th>\n",
" <th>taxi_in</th>\n",
" <th>taxi_out</th>\n",
" <th>cancelled</th>\n",
" <th>cancellation_code</th>\n",
" <th>diverted</th>\n",
" <th>carrier_delay</th>\n",
" <th>weather_delay</th>\n",
" <th>nas_delay</th>\n",
" <th>late_aircraft_delay</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>858.0</td>\n",
" <td>900</td>\n",
" <td>954.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>859.0</td>\n",
" <td>900</td>\n",
" <td>952.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>935.0</td>\n",
" <td>900</td>\n",
" <td>1023.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>903.0</td>\n",
" <td>900</td>\n",
" <td>1131.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1994</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>933.0</td>\n",
" <td>900</td>\n",
" <td>1024.0</td>\n",
" <td>1003</td>\n",
" <td>b'US'</td>\n",
" <td>227</td>\n",
" <td>...</td>\n",
" <td>290.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dayofweek dep_time crs_dep_time arr_time \\\n",
"0 1994 1 7 5 858.0 900 954.0 \n",
"1 1994 1 8 6 859.0 900 952.0 \n",
"2 1994 1 10 1 935.0 900 1023.0 \n",
"3 1994 1 11 2 903.0 900 1131.0 \n",
"4 1994 1 12 3 933.0 900 1024.0 \n",
"\n",
" crs_arr_time carrier flight_num ... distance taxi_in \\\n",
"0 1003 b'US' 227 ... 290.0 NaN \n",
"1 1003 b'US' 227 ... 290.0 NaN \n",
"2 1003 b'US' 227 ... 290.0 NaN \n",
"3 1003 b'US' 227 ... 290.0 NaN \n",
"4 1003 b'US' 227 ... 290.0 NaN \n",
"\n",
" taxi_out cancelled cancellation_code diverted carrier_delay \\\n",
"0 NaN 0 NaN 0 NaN \n",
"1 NaN 0 NaN 0 NaN \n",
"2 NaN 0 NaN 0 NaN \n",
"3 NaN 0 NaN 0 NaN \n",
"4 NaN 0 NaN 0 NaN \n",
"\n",
" weather_delay nas_delay late_aircraft_delay \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also notice that columns which were stored using dictionary encoding in the file can be read directly into categorical pandas columns without any intermediate representation. For this particular dataset, some columns (such as month) were also stored with dictionary encoding but clearly have numerical meaning, so for those we have chosen to expand the dictionary back to plain values. The use of categoricals and int32 columns makes this version of the dataframe much smaller in memory."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 22651194 entries, 0 to 22651193\n",
"Data columns (total 28 columns):\n",
"year int32\n",
"month int32\n",
"day int32\n",
"dayofweek int32\n",
"dep_time float64\n",
"crs_dep_time int32\n",
"arr_time float64\n",
"crs_arr_time int32\n",
"carrier category\n",
"flight_num int32\n",
"tail_num float64\n",
"actual_elapsed_time float64\n",
"crs_elapsed_time float64\n",
"airtime float64\n",
"arrdelay float64\n",
"depdelay float64\n",
"origin category\n",
"dest category\n",
"distance float64\n",
"taxi_in float64\n",
"taxi_out float64\n",
"cancelled int32\n",
"cancellation_code category\n",
"diverted int32\n",
"carrier_delay float64\n",
"weather_delay float64\n",
"nas_delay float64\n",
"late_aircraft_delay float64\n",
"dtypes: category(4), float64(15), int32(9)\n",
"memory usage: 3.4 GB\n"
]
}
],
"source": [
"df.info()"
]
},
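{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate why categorical storage saves so much memory, here is a small self-contained sketch using plain pandas on synthetic data (not the airlines file): a repetitive string column stores every value as a separate Python object, while the categorical version stores each distinct value once plus a small integer code per row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# a repetitive string column, like 'carrier' in the airlines data\n",
"s = pd.Series(['US', 'AA', 'DL', 'WN'] * 250000)\n",
"obj_bytes = s.memory_usage(deep=True)\n",
"cat_bytes = s.astype('category').memory_usage(deep=True)\n",
"print(obj_bytes, cat_bytes, obj_bytes / cat_bytes)"
]
},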
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start to see a more significant speed up when loading only a selection of the columns. This example is about four and a half times faster than the CSV version."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.25 s, sys: 1.91 s, total: 8.15 s\n",
"Wall time: 8.18 s\n"
]
}
],
"source": [
"cols = ['year', 'month', 'day', 'crs_dep_time',\n",
" 'carrier', 'flight_num', 'arrdelay',\n",
" 'origin', 'dest']\n",
"%time df = pf.to_pandas(cols, usecats=['carrier', 'origin', 'dest'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for a single column, in this case a categorical, we see a factor of over 10."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.28 s, sys: 595 ms, total: 1.87 s\n",
"Wall time: 1.88 s\n"
]
}
],
"source": [
"cols = ['carrier']\n",
"%time df = pf.to_pandas(cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions\n",
"\n",
"The parquet library is a work in progress. It can already handle multi-file and partitioned data, such as that produced by hive/spark, but does not yet allow filtering to avoid reading unnecessary sections of the data. We also have preliminary support for writing parquet files with simple data types.\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}