Skip to content

Instantly share code, notes, and snippets.

@wimsy
Created August 27, 2013 10:21
Show Gist options
  • Save wimsy/6351893 to your computer and use it in GitHub Desktop.
Save wimsy/6351893 to your computer and use it in GitHub Desktop.
iPython notebook to prep OpenPaths data for export to CartoDB with some analysis.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "OpenPaths Map Data Prep"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenPaths Map Data Prep\n",
"\n",
"*Prepared for [Maps and the Geospatial Revolution](https://www.coursera.org/course/maps/).*\n",
"\n",
"## Load the OpenPaths data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"df = pd.read_json('openpaths.json')\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>device</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>os</th>\n",
" <th>t</th>\n",
" <th>version</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 29.817305</td>\n",
" <td> iPhone5,1</td>\n",
" <td> 40.720699</td>\n",
" <td>-73.997940</td>\n",
" <td> 6.1.4</td>\n",
" <td> 1361142400</td>\n",
" <td> 1.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 84.690826</td>\n",
" <td> iPhone5,1</td>\n",
" <td> 42.782135</td>\n",
" <td>-73.759109</td>\n",
" <td> 6.1.4</td>\n",
" <td> 1371421952</td>\n",
" <td> 1.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 47.967743</td>\n",
" <td> iPhone5,1</td>\n",
" <td> 40.773243</td>\n",
" <td>-73.679054</td>\n",
" <td> 6.0.1</td>\n",
" <td> 1359071360</td>\n",
" <td> 1.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 28.381084</td>\n",
" <td> iPhone5,1</td>\n",
" <td> 40.766212</td>\n",
" <td>-73.266411</td>\n",
" <td> 6.1.4</td>\n",
" <td> 1372082048</td>\n",
" <td> 1.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 27.813568</td>\n",
" <td> iPhone5,1</td>\n",
" <td> 40.760025</td>\n",
" <td>-73.921509</td>\n",
" <td> 6.1.4</td>\n",
" <td> 1360717824</td>\n",
" <td> 1.1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 88,
"text": [
" alt device lat lon os t version\n",
"0 29.817305 iPhone5,1 40.720699 -73.997940 6.1.4 1361142400 1.1\n",
"1 84.690826 iPhone5,1 42.782135 -73.759109 6.1.4 1371421952 1.1\n",
"2 47.967743 iPhone5,1 40.773243 -73.679054 6.0.1 1359071360 1.1\n",
"3 28.381084 iPhone5,1 40.766212 -73.266411 6.1.4 1372082048 1.1\n",
"4 27.813568 iPhone5,1 40.760025 -73.921509 6.1.4 1360717824 1.1"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop data we don't need\n",
"\n",
"We don't need OS & phone info. Also, I'm keeping only data from the [OpenPaths](http://openpaths.cc/) iPhone app, and dropping some data points from [Foursquare](http://foursquare.com/)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = df[df['device'] == 'iPhone5,1']\n",
"del df['device']\n",
"del df['os']\n",
"del df['version']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 89
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 7376 entries, 0 to 7397\n",
"Data columns (total 4 columns):\n",
"alt 7376 non-null values\n",
"lat 7376 non-null values\n",
"lon 7376 non-null values\n",
"t 7376 non-null values\n",
"dtypes: float64(3), int64(1)\n",
"</pre>"
],
"output_type": "pyout",
"prompt_number": 90,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 7376 entries, 0 to 7397\n",
"Data columns (total 4 columns):\n",
"alt 7376 non-null values\n",
"lat 7376 non-null values\n",
"lon 7376 non-null values\n",
"t 7376 non-null values\n",
"dtypes: float64(3), int64(1)"
]
}
],
"prompt_number": 90
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert the Epoch timestamps to something useful\n",
"\n",
"The OpenPaths data comes with UNIX Epoch timestamps in the UTC timezone. Since we want to do some math based on time of day, etc., I'd like to make these readable."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"type(df['t'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 91,
"text": [
"pandas.core.series.Series"
]
}
],
"prompt_number": 91
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['ts'] = pd.to_datetime(df['t'],utc=True,unit='s')\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 29.817305</td>\n",
" <td> 40.720699</td>\n",
" <td>-73.997940</td>\n",
" <td> 1361142400</td>\n",
" <td>2013-02-17 23:06:40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 84.690826</td>\n",
" <td> 42.782135</td>\n",
" <td>-73.759109</td>\n",
" <td> 1371421952</td>\n",
" <td>2013-06-16 22:32:32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 47.967743</td>\n",
" <td> 40.773243</td>\n",
" <td>-73.679054</td>\n",
" <td> 1359071360</td>\n",
" <td>2013-01-24 23:49:20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 28.381084</td>\n",
" <td> 40.766212</td>\n",
" <td>-73.266411</td>\n",
" <td> 1372082048</td>\n",
" <td>2013-06-24 13:54:08</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 27.813568</td>\n",
" <td> 40.760025</td>\n",
" <td>-73.921509</td>\n",
" <td> 1360717824</td>\n",
" <td>2013-02-13 01:10:24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 92,
"text": [
" alt lat lon t ts\n",
"0 29.817305 40.720699 -73.997940 1361142400 2013-02-17 23:06:40\n",
"1 84.690826 42.782135 -73.759109 1371421952 2013-06-16 22:32:32\n",
"2 47.967743 40.773243 -73.679054 1359071360 2013-01-24 23:49:20\n",
"3 28.381084 40.766212 -73.266411 1372082048 2013-06-24 13:54:08\n",
"4 27.813568 40.760025 -73.921509 1360717824 2013-02-13 01:10:24"
]
}
],
"prompt_number": 92
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sort by timestamp and calculate dwell time for each point\n",
"\n",
"I'd like to represent the length of time at each location to get a sense for the proportion of time spent in various locales. To do that, I can use the Epoch timestamps, sorted in order, and calculate the seconds between each point - assuming the entire time was spent at the previous point. This should be roughly valid since the app triggered a new location whenever the phone moved outside a certain area."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = df.sort(['t'])\n",
"df['dwell'] = df['t'].diff().shift(-1)/3600.\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" <th>dwell</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4739</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td>2013-01-23 22:19:44</td>\n",
" <td> 0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5160</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td>2013-01-23 22:19:44</td>\n",
" <td> 0.035556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td>2013-01-23 22:21:52</td>\n",
" <td> 0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>877 </th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td>2013-01-23 22:21:52</td>\n",
" <td> 0.924444</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1522</th>\n",
" <td> 26.512102</td>\n",
" <td> 40.762688</td>\n",
" <td>-73.923630</td>\n",
" <td> 1358983040</td>\n",
" <td>2013-01-23 23:17:20</td>\n",
" <td> 0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 93,
"text": [
" alt lat lon t ts dwell\n",
"4739 26.049835 40.762707 -73.923622 1358979584 2013-01-23 22:19:44 0.000000\n",
"5160 26.049835 40.762707 -73.923622 1358979584 2013-01-23 22:19:44 0.035556\n",
"1738 21.747589 40.762657 -73.923492 1358979712 2013-01-23 22:21:52 0.000000\n",
"877 21.747589 40.762657 -73.923492 1358979712 2013-01-23 22:21:52 0.924444\n",
"1522 26.512102 40.762688 -73.923630 1358983040 2013-01-23 23:17:20 0.000000"
]
}
],
"prompt_number": 93
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, apparently some duplicates creeped into my data. So, I'll drop them by their timestamp."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = df.drop_duplicates(['t'])\n",
"df['dwell'] = df['t'].diff().shift(-1)/3600.\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" <th>dwell</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4739</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td>2013-01-23 22:19:44</td>\n",
" <td> 0.035556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td>2013-01-23 22:21:52</td>\n",
" <td> 0.924444</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1522</th>\n",
" <td> 26.512102</td>\n",
" <td> 40.762688</td>\n",
" <td>-73.923630</td>\n",
" <td> 1358983040</td>\n",
" <td>2013-01-23 23:17:20</td>\n",
" <td> 13.866667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>821 </th>\n",
" <td> 16.265961</td>\n",
" <td> 40.767021</td>\n",
" <td>-73.921288</td>\n",
" <td> 1359032960</td>\n",
" <td>2013-01-24 13:09:20</td>\n",
" <td> 0.106667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4791</th>\n",
" <td> -0.725677</td>\n",
" <td> 40.769920</td>\n",
" <td>-73.880081</td>\n",
" <td> 1359033344</td>\n",
" <td>2013-01-24 13:15:44</td>\n",
" <td> 0.071111</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 94,
"text": [
" alt lat lon t ts dwell\n",
"4739 26.049835 40.762707 -73.923622 1358979584 2013-01-23 22:19:44 0.035556\n",
"1738 21.747589 40.762657 -73.923492 1358979712 2013-01-23 22:21:52 0.924444\n",
"1522 26.512102 40.762688 -73.923630 1358983040 2013-01-23 23:17:20 13.866667\n",
"821 16.265961 40.767021 -73.921288 1359032960 2013-01-24 13:09:20 0.106667\n",
"4791 -0.725677 40.769920 -73.880081 1359033344 2013-01-24 13:15:44 0.071111"
]
}
],
"prompt_number": 94
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate weekend or weekday for each timestamp\n",
"\n",
"One way I'd like to classify points is by weekend or weekday. I'll make the simple assumption that the beginning of the period is sufficient. Also, I'll assume that weekends run from 1800 Friday until 0500 Monday, local time (mostly `US/Eastern` but some in `US/Pacific`). Technically, some times are in Arizone time, but that was the same as Pacific during the relevant period."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['tz'] = 'US/Eastern'\n",
"df.ix[ df['lon'] < -90.0, 'tz'] = 'US/Pacific'\n",
"df[df['lon'] < -90.0].head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" <th>dwell</th>\n",
" <th>tz</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>191 </th>\n",
" <td> 0.000000</td>\n",
" <td> 33.434605</td>\n",
" <td>-112.012856</td>\n",
" <td> 1367643520</td>\n",
" <td>2013-05-04 04:58:40</td>\n",
" <td> 0.568889</td>\n",
" <td> US/Pacific</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4009</th>\n",
" <td> 338.125336</td>\n",
" <td> 33.439400</td>\n",
" <td>-112.036972</td>\n",
" <td> 1367645568</td>\n",
" <td>2013-05-04 05:32:48</td>\n",
" <td> 0.106667</td>\n",
" <td> US/Pacific</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2694</th>\n",
" <td> 351.256317</td>\n",
" <td> 33.506985</td>\n",
" <td>-112.044670</td>\n",
" <td> 1367645952</td>\n",
" <td>2013-05-04 05:39:12</td>\n",
" <td> 0.071111</td>\n",
" <td> US/Pacific</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2204</th>\n",
" <td> 438.804352</td>\n",
" <td> 33.584988</td>\n",
" <td>-112.009651</td>\n",
" <td> 1367646208</td>\n",
" <td>2013-05-04 05:43:28</td>\n",
" <td> 0.071111</td>\n",
" <td> US/Pacific</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4903</th>\n",
" <td> 464.877533</td>\n",
" <td> 33.663094</td>\n",
" <td>-111.997353</td>\n",
" <td> 1367646464</td>\n",
" <td>2013-05-04 05:47:44</td>\n",
" <td> 0.106667</td>\n",
" <td> US/Pacific</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 95,
"text": [
" alt lat lon t ts \\\n",
"191 0.000000 33.434605 -112.012856 1367643520 2013-05-04 04:58:40 \n",
"4009 338.125336 33.439400 -112.036972 1367645568 2013-05-04 05:32:48 \n",
"2694 351.256317 33.506985 -112.044670 1367645952 2013-05-04 05:39:12 \n",
"2204 438.804352 33.584988 -112.009651 1367646208 2013-05-04 05:43:28 \n",
"4903 464.877533 33.663094 -111.997353 1367646464 2013-05-04 05:47:44 \n",
"\n",
" dwell tz \n",
"191 0.568889 US/Pacific \n",
"4009 0.106667 US/Pacific \n",
"2694 0.071111 US/Pacific \n",
"2204 0.071111 US/Pacific \n",
"4903 0.106667 US/Pacific "
]
}
],
"prompt_number": 95
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" <th>dwell</th>\n",
" <th>tz</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4739</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td>2013-01-23 22:19:44</td>\n",
" <td> 0.035556</td>\n",
" <td> US/Eastern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td>2013-01-23 22:21:52</td>\n",
" <td> 0.924444</td>\n",
" <td> US/Eastern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1522</th>\n",
" <td> 26.512102</td>\n",
" <td> 40.762688</td>\n",
" <td>-73.923630</td>\n",
" <td> 1358983040</td>\n",
" <td>2013-01-23 23:17:20</td>\n",
" <td> 13.866667</td>\n",
" <td> US/Eastern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>821 </th>\n",
" <td> 16.265961</td>\n",
" <td> 40.767021</td>\n",
" <td>-73.921288</td>\n",
" <td> 1359032960</td>\n",
" <td>2013-01-24 13:09:20</td>\n",
" <td> 0.106667</td>\n",
" <td> US/Eastern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4791</th>\n",
" <td> -0.725677</td>\n",
" <td> 40.769920</td>\n",
" <td>-73.880081</td>\n",
" <td> 1359033344</td>\n",
" <td>2013-01-24 13:15:44</td>\n",
" <td> 0.071111</td>\n",
" <td> US/Eastern</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 96,
"text": [
" alt lat lon t ts \\\n",
"4739 26.049835 40.762707 -73.923622 1358979584 2013-01-23 22:19:44 \n",
"1738 21.747589 40.762657 -73.923492 1358979712 2013-01-23 22:21:52 \n",
"1522 26.512102 40.762688 -73.923630 1358983040 2013-01-23 23:17:20 \n",
"821 16.265961 40.767021 -73.921288 1359032960 2013-01-24 13:09:20 \n",
"4791 -0.725677 40.769920 -73.880081 1359033344 2013-01-24 13:15:44 \n",
"\n",
" dwell tz \n",
"4739 0.035556 US/Eastern \n",
"1738 0.924444 US/Eastern \n",
"1522 13.866667 US/Eastern \n",
"821 0.106667 US/Eastern \n",
"4791 0.071111 US/Eastern "
]
}
],
"prompt_number": 96
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, I have to take a roundabout way to get there, because I'm not figuring out pandas' timezone support."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def weekday_or_weekend(dt):\n",
" day_type = 'weekday'\n",
" if dt.weekday() in [5,6]:\n",
" day_type = 'weekend'\n",
" elif dt.weekday() == 4 and dt.hour > 17:\n",
" day_type = 'weekend'\n",
" elif dt.weekday() == 0 and dt.hour < 5:\n",
" day_type = 'weekend'\n",
" return day_type"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 97
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_pac = df[ df['tz'] == 'US/Pacific' ]\n",
"df_east = df[ df['tz'] == 'US/Eastern' ]\n",
"\n",
"df_pac['dt'] = df_pac['ts'].map(lambda x: x.tz_localize('UTC').tz_convert('US/Pacific'))\n",
"df_pac['day_type'] = df_pac['dt'].map(weekday_or_weekend)\n",
"\n",
"df_east['dt'] = df_east['ts'].map(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))\n",
"df_east['day_type'] = df_east['dt'].map(weekday_or_weekend)\n",
"\n",
"df_east.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>ts</th>\n",
" <th>dwell</th>\n",
" <th>tz</th>\n",
" <th>dt</th>\n",
" <th>day_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4739</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td>2013-01-23 22:19:44</td>\n",
" <td> 0.035556</td>\n",
" <td> US/Eastern</td>\n",
" <td> 2013-01-23 17:19:44-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td>2013-01-23 22:21:52</td>\n",
" <td> 0.924444</td>\n",
" <td> US/Eastern</td>\n",
" <td> 2013-01-23 17:21:52-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1522</th>\n",
" <td> 26.512102</td>\n",
" <td> 40.762688</td>\n",
" <td>-73.923630</td>\n",
" <td> 1358983040</td>\n",
" <td>2013-01-23 23:17:20</td>\n",
" <td> 13.866667</td>\n",
" <td> US/Eastern</td>\n",
" <td> 2013-01-23 18:17:20-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>821 </th>\n",
" <td> 16.265961</td>\n",
" <td> 40.767021</td>\n",
" <td>-73.921288</td>\n",
" <td> 1359032960</td>\n",
" <td>2013-01-24 13:09:20</td>\n",
" <td> 0.106667</td>\n",
" <td> US/Eastern</td>\n",
" <td> 2013-01-24 08:09:20-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4791</th>\n",
" <td> -0.725677</td>\n",
" <td> 40.769920</td>\n",
" <td>-73.880081</td>\n",
" <td> 1359033344</td>\n",
" <td>2013-01-24 13:15:44</td>\n",
" <td> 0.071111</td>\n",
" <td> US/Eastern</td>\n",
" <td> 2013-01-24 08:15:44-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 98,
"text": [
" alt lat lon t ts \\\n",
"4739 26.049835 40.762707 -73.923622 1358979584 2013-01-23 22:19:44 \n",
"1738 21.747589 40.762657 -73.923492 1358979712 2013-01-23 22:21:52 \n",
"1522 26.512102 40.762688 -73.923630 1358983040 2013-01-23 23:17:20 \n",
"821 16.265961 40.767021 -73.921288 1359032960 2013-01-24 13:09:20 \n",
"4791 -0.725677 40.769920 -73.880081 1359033344 2013-01-24 13:15:44 \n",
"\n",
" dwell tz dt day_type \n",
"4739 0.035556 US/Eastern 2013-01-23 17:19:44-05:00 weekday \n",
"1738 0.924444 US/Eastern 2013-01-23 17:21:52-05:00 weekday \n",
"1522 13.866667 US/Eastern 2013-01-23 18:17:20-05:00 weekday \n",
"821 0.106667 US/Eastern 2013-01-24 08:09:20-05:00 weekday \n",
"4791 0.071111 US/Eastern 2013-01-24 08:15:44-05:00 weekday "
]
}
],
"prompt_number": 98
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df2 = df_pac.append(df_east).sort(['t'])\n",
"len(df2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 99,
"text": [
"5369"
]
}
],
"prompt_number": 99
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so that was ugly, but now I have a DataFrame with all the data I need for mapping."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = df2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 100
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pickle as pkl\n",
"\n",
"pkl.dump(df, open('openpaths.pkl', 'wb')) # Save this off in case I need it again"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 101
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"del df['ts']\n",
"del df['tz']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 102
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alt</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>t</th>\n",
" <th>dwell</th>\n",
" <th>dt</th>\n",
" <th>day_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4739</th>\n",
" <td> 26.049835</td>\n",
" <td> 40.762707</td>\n",
" <td>-73.923622</td>\n",
" <td> 1358979584</td>\n",
" <td> 0.035556</td>\n",
" <td> 2013-01-23 17:19:44-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td> 21.747589</td>\n",
" <td> 40.762657</td>\n",
" <td>-73.923492</td>\n",
" <td> 1358979712</td>\n",
" <td> 0.924444</td>\n",
" <td> 2013-01-23 17:21:52-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1522</th>\n",
" <td> 26.512102</td>\n",
" <td> 40.762688</td>\n",
" <td>-73.923630</td>\n",
" <td> 1358983040</td>\n",
" <td> 13.866667</td>\n",
" <td> 2013-01-23 18:17:20-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>821 </th>\n",
" <td> 16.265961</td>\n",
" <td> 40.767021</td>\n",
" <td>-73.921288</td>\n",
" <td> 1359032960</td>\n",
" <td> 0.106667</td>\n",
" <td> 2013-01-24 08:09:20-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4791</th>\n",
" <td> -0.725677</td>\n",
" <td> 40.769920</td>\n",
" <td>-73.880081</td>\n",
" <td> 1359033344</td>\n",
" <td> 0.071111</td>\n",
" <td> 2013-01-24 08:15:44-05:00</td>\n",
" <td> weekday</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 103,
"text": [
" alt lat lon t dwell \\\n",
"4739 26.049835 40.762707 -73.923622 1358979584 0.035556 \n",
"1738 21.747589 40.762657 -73.923492 1358979712 0.924444 \n",
"1522 26.512102 40.762688 -73.923630 1358983040 13.866667 \n",
"821 16.265961 40.767021 -73.921288 1359032960 0.106667 \n",
"4791 -0.725677 40.769920 -73.880081 1359033344 0.071111 \n",
"\n",
" dt day_type \n",
"4739 2013-01-23 17:19:44-05:00 weekday \n",
"1738 2013-01-23 17:21:52-05:00 weekday \n",
"1522 2013-01-23 18:17:20-05:00 weekday \n",
"821 2013-01-24 08:09:20-05:00 weekday \n",
"4791 2013-01-24 08:15:44-05:00 weekday "
]
}
],
"prompt_number": 103
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial upload to [CartoDB](http://cartodb.com/)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.to_csv('openpaths.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a look at my initial CartoDB Visualization of the data:\n",
"\n",
"<iframe width='100%' height='400' frameborder='0' src='http://wimsy.cartodb.com/viz/533e739c-0560-11e3-80b6-dfe95a3edaf2/embed_map?title=true&description=true&search=false&shareable=false&cartodb_logo=true&layer_selector=true&legends=true&scrollwheel=true&sublayer_options=1&sql=&sw_lat=40.46157664398329&sw_lon=-74.63836669921874&ne_lat=40.950862628132775&ne_lon=-73.32000732421875'></iframe>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I had to tweak for this visualization a bit myself using CartoDB's [CartoCSS](http://www.mapbox.com/carto/api/2.1.0/). Also, I used a color palette suggested at [Colorbrewer](http://colorbrewer2.org/) as good for classification and colorblind viewers. I especially like how the colors combine to become very intense in areas of high density - and brown or black where weekday and weekend data coincide. Can you guess where my home is in this visualization? Where I work? Where I go on my days off?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating distances and where I spend what percentage of my time\n",
"\n",
"Next I want to figure out where my time is spent. Where am I, on average, and how much time do I spend in the immediate vicinity of my home versus in far-flung locations? \n",
"\n",
"I found some useful equations to deal with polar coordinates on [StackExchange](http://math.stackexchange.com/questions/47854/calculate-average-latitude-longitude). So, let's go..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import cos, sin, arctan2, sqrt, radians, degrees\n",
"\n",
"R = 3959.0 # radius of the Earth in miles\n",
"\n",
"# Calculate Cartesian coordinates\n",
"# convert to radians first????\n",
"\n",
"df['x'] = R * cos(radians(df['lat'])) * cos(radians(df['lon']))\n",
"df['y'] = R * sin(radians(df['lat'])) * cos(radians(df['lon']))\n",
"df['z'] = R * sin(radians(df['lon']))\n",
"\n",
"dwell_tot = df['dwell'].sum()\n",
"xbar = (df['x'] * df['dwell']).sum() / dwell_tot\n",
"ybar = (df['y'] * df['dwell']).sum() / dwell_tot\n",
"zbar = (df['z'] * df['dwell']).sum() / dwell_tot\n",
"\n",
"latbar = degrees(arctan2(ybar, xbar))\n",
"lonbar = degrees(arctan2(zbar, sqrt(xbar**2 + ybar**2)))\n",
"dispersion_factor = sqrt(xbar**2 + ybar**2 + zbar**2) / R\n",
"\n",
"print latbar, lonbar\n",
"print dispersion_factor"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"41.3526280584 -75.7166194059\n",
"0.988620903172\n"
]
}
],
"prompt_number": 105
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the average position I occupied over the past six months or so - somewhere in south Scranton. I'm not surprised it's so far west, but I am a bit surprised it's so far north. I'd expect it to be - if anything - south of NYC. Maybe something's amiss in my code.\n",
"\n",
"<iframe width=\"425\" height=\"350\" frameborder=\"0\" scrolling=\"no\" marginheight=\"0\" marginwidth=\"0\" src=\"https://www.google.com/maps?sll=41.7076431,-76.1801158&amp;sspn=6.454518572823161,12.73713753091208&amp;t=m&amp;q=(41.352628058390124,+-75.716619405901)&amp;dg=opt&amp;ie=UTF8&amp;z=7&amp;ll=41.352628,-75.716619&amp;output=embed\"></iframe><br /><small><a href=\"https://www.google.com/maps?sll=41.7076431,-76.1801158&amp;sspn=6.454518572823161,12.73713753091208&amp;t=m&amp;q=(41.352628058390124,+-75.716619405901)&amp;dg=opt&amp;ie=UTF8&amp;z=7&amp;ll=41.352628,-75.716619&amp;source=embed\" style=\"color:#0000FF;text-align:left\">View Larger Map</a></small>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that I think about it, I think it will be more interesting to measure out from my home. Clearly the big black blob in the middle of Queens is generally where I live, so I'll pick that as the starting point and measure out from there."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lat_home = 40.762680\n",
"lon_home = -73.923665"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 106
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating Great Circle distance with the Haversine formula\n",
"\n",
"Now I need to calcuate the distance of every point from my home so I can iteratively calculate the percentage of time I spend within arbitrary radii of there. I will be using a [haversine formula](http://en.wikipedia.org/wiki/Haversine_formula) to do this."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import arcsin\n",
"\n",
"lat1, lon1, lat2, lon2 = map(radians, [lat_home, lon_home, df['lat'], df['lon']])\n",
"\n",
"# haversine formula\n",
"\n",
"d = 2 * R * arcsin( sqrt( sin((lat2-lat1)/2.)**2 + cos(lat1)*cos(lat2)*sin((lon2-lon1)/2.)**2 ) )\n",
"df['d'] = d"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 175
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def time_spent_in_radius(r):\n",
" return df.dwell.where(df.d < r).sum() / df.dwell.sum()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 176
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"radius_df = pd.DataFrame({'radius': [0.1, 0.25, 0.5, 5.0, 10.0, 40.0, 100.0, 200.0, 2600.0]})\n",
"radius_df['time_spent'] = radius_df.radius.map(lambda x: time_spent_in_radius(x))\n",
"radius_df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>radius</th>\n",
" <th>time_spent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 0.10</td>\n",
" <td> 0.278665</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 0.25</td>\n",
" <td> 0.462255</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 0.50</td>\n",
" <td> 0.526922</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 5.00</td>\n",
" <td> 0.585533</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 10.00</td>\n",
" <td> 0.600975</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td> 40.00</td>\n",
" <td> 0.833430</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td> 100.00</td>\n",
" <td> 0.909522</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td> 200.00</td>\n",
" <td> 0.954185</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td> 2600.00</td>\n",
" <td> 1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 184,
"text": [
" radius time_spent\n",
"0 0.10 0.278665\n",
"1 0.25 0.462255\n",
"2 0.50 0.526922\n",
"3 5.00 0.585533\n",
"4 10.00 0.600975\n",
"5 40.00 0.833430\n",
"6 100.00 0.909522\n",
"7 200.00 0.954185\n",
"8 2600.00 1.000000"
]
}
],
"prompt_number": 184
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Export the data to CartoDB\n",
"\n",
"Now that we have distance info for each point, I'll export it and use that data to highlight points with certain ranges."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.to_csv('openpaths2.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 186
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the results...\n",
"\n",
"<iframe width='100%' height='400' frameborder='0' src='http://wimsy.cartodb.com/viz/3bde937c-05ff-11e3-996c-4fa1c4c34f60/embed_map?title=true&description=true&search=false&shareable=false&cartodb_logo=true&layer_selector=false&legends=true&scrollwheel=true&sublayer_options=1&sql=&sw_lat=37.405073750176946&sw_lon=-85.166015625&ne_lat=45.1510532655634&ne_lon=-64.072265625'></iframe>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, about a quarter of my time I spend on my own block. Given that I should be sleeping about 1/3 of the time, that seems right (especially when accounting for nights away and some dwell points that didn't update to my home and ended up just outside the 0.1-mile radius).\n",
"\n",
"The one-mile radius marks what I'd call central Astoria - the neighborhod where I usually live and play. And I spend a little over half my time there.\n",
"\n",
"I'd say 10 miles is a decent approximation of New York City (sorry Staten Island). I spend about 60% of my time within the [four](http://cityroom.blogs.nytimes.com/2008/12/17/a-new-call-for-staten-island-to-secede/?_r=0) boroughs.\n",
"\n",
"The 40-mile line encompasses my normal office and about 83% of my life. Expanding to 100 miles includes nother office near Philadelphia, which I visit weekly, and most my other regional business travel - bumping me up to 91%.\n",
"\n",
"If we expand the radius to 200 miles, you catch most of my weekend trips home, upstate or elsewhere in the spring and summer. They add about 4pp to account for 95% of my time. \n",
"\n",
"The final 5%? One ten-day trip to Arizona and California in early May. You need to expand the radius to 2,600 miles to pick that one up and to capture all the time I've spent since late January this year."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment