Created
November 27, 2017 04:11
-
-
Save JnBrymn-EB/e28d48d5e71f3cc90c901b02bd40c430 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Joining Stuff\n", | |
"\n", | |
"* train is the main table possible join rows `store_nbr`, `item_nbr`, and `date`\n", | |
"* stores can be joined by `store_nbr`\n", | |
"* transactions can be joined by `store_nbr`\n", | |
"* items can be joined in by `item_nbr`\n", | |
"* oil can be joined in by `date`\n", | |
"--------\n", | |
"* holiday can be joined in by `date` and possibly `locale_name` but it might be complicated - maybe we can augment this dataset with mapzen WhosOnFirst." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 286, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"12550" | |
] | |
}, | |
"execution_count": 286, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# get a sub set of the data so that it's easier to play with `merge`\n", | |
"train_small = train.iloc[range(0,125497040, 10000)]\n", | |
"len(train_small)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## merge easy things" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 289, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"features = train_small \\\n", | |
" .merge(stores, on='store_nbr', how='left') \\\n", | |
" .merge(transactions, on=['store_nbr','date'], how='left') \\\n", | |
" .merge(items, on='item_nbr', how='left') \\\n", | |
" .merge(oil, on='date', how='left')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## merge holidays\n", | |
"Goal - to add 3 fields to each row which represent if the {store, date} is associated with a local, regional, or national holiday" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 290, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# These holidays matter because they exclude holidays \"transferred\" to another day and \n", | |
"# \"Work Days\" which aren't holidays at all\n", | |
"holidays_events_that_matter = holidays_events[\n", | |
" (holidays_events['type'] != 'Work Day') \n", | |
" | (holidays_events['transferred'] != True)\n", | |
"]\n", | |
"# drop redundant holidays (occasionally there are multiple local holidays on the same day)\n", | |
"holidays_events_that_matter.drop_duplicates(keep='first', subset=['date','locale'], inplace=True)\n", | |
"\n", | |
"# create dummy fields in the tables we're going to join\n", | |
"features['s'] = features.apply(lambda r: 's_{}'.format(r['state']), axis=1)\n", | |
"features['c'] = features.apply(lambda r: 'c_{}'.format(r['city']), axis=1)\n", | |
"def make_location_column(row):\n", | |
" if row['locale'] == \"Regional\":\n", | |
" return 's_{}'.format(row['locale_name'])\n", | |
" elif row['locale'] == \"Local\":\n", | |
" return 'c_{}'.format(row['locale_name'])\n", | |
"holidays_events_that_matter['place'] = holidays_events_that_matter.apply(make_location_column, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 295, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# make local, regional, and national holiday fields\n", | |
"temp = features.merge(holidays_events_that_matter, left_on=['c', 'date'], right_on=['place', 'date'], how='left')\n", | |
"features['local_holiday'] = temp['locale'] == 'Local'\n", | |
"\n", | |
"temp = features.merge(holidays_events_that_matter, left_on=['s', 'date'], right_on=['place', 'date'], how='left')\n", | |
"features['regional_holiday'] = temp['locale'] == 'Regional'\n", | |
"\n", | |
"temp = features.merge(\n", | |
" holidays_events_that_matter[holidays_events_that_matter['locale'] == 'National'],\n", | |
" on='date', how='left')\n", | |
"features['national_holiday'] = temp['locale'] == 'National'" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# todo\n", | |
"* figure out how to impute missing data in each set before joining them\n" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment