Skip to content

Instantly share code, notes, and snippets.

@JnBrymn-EB
Created November 27, 2017 04:11
Show Gist options
  • Save JnBrymn-EB/e28d48d5e71f3cc90c901b02bd40c430 to your computer and use it in GitHub Desktop.
Save JnBrymn-EB/e28d48d5e71f3cc90c901b02bd40c430 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Joining Stuff\n",
"\n",
"* train is the main table possible join rows `store_nbr`, `item_nbr`, and `date`\n",
"* stores can be joined by `store_nbr`\n",
"* transactions can be joined by `store_nbr`\n",
"* items can be joined in by `item_nbr`\n",
"* oil can be joined in by `date`\n",
"--------\n",
"* holiday can be joined in by `date` and possibly `locale_name` but it might be complicated - maybe we can augment this dataset with mapzen WhosOnFirst."
]
},
{
"cell_type": "code",
"execution_count": 286,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12550"
]
},
"execution_count": 286,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get a sub set of the data so that it's easier to play with `merge`\n",
"train_small = train.iloc[range(0,125497040, 10000)]\n",
"len(train_small)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## merge easy things"
]
},
{
"cell_type": "code",
"execution_count": 289,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"features = train_small \\\n",
" .merge(stores, on='store_nbr', how='left') \\\n",
" .merge(transactions, on=['store_nbr','date'], how='left') \\\n",
" .merge(items, on='item_nbr', how='left') \\\n",
" .merge(oil, on='date', how='left')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## merge holidays\n",
"Goal - to add 3 fields to each row which represent if the {store, date} is associated with a local, regional, or national holiday"
]
},
{
"cell_type": "code",
"execution_count": 290,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# These holidays matter because they exclude holidays \"transferred\" to another day and \n",
"# \"Work Days\" which aren't holidays at all\n",
"holidays_events_that_matter = holidays_events[\n",
" (holidays_events['type'] != 'Work Day') \n",
" | (holidays_events['transferred'] != True)\n",
"]\n",
"# drop redundant holidays (occasionally there are multiple local holidays on the same day)\n",
"holidays_events_that_matter.drop_duplicates(keep='first', subset=['date','locale'], inplace=True)\n",
"\n",
"# create dummy fields in the tables we're going to join\n",
"features['s'] = features.apply(lambda r: 's_{}'.format(r['state']), axis=1)\n",
"features['c'] = features.apply(lambda r: 'c_{}'.format(r['city']), axis=1)\n",
"def make_location_column(row):\n",
" if row['locale'] == \"Regional\":\n",
" return 's_{}'.format(row['locale_name'])\n",
" elif row['locale'] == \"Local\":\n",
" return 'c_{}'.format(row['locale_name'])\n",
"holidays_events_that_matter['place'] = holidays_events_that_matter.apply(make_location_column, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 295,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# make local, regional, and national holiday fields\n",
"temp = features.merge(holidays_events_that_matter, left_on=['c', 'date'], right_on=['place', 'date'], how='left')\n",
"features['local_holiday'] = temp['locale'] == 'Local'\n",
"\n",
"temp = features.merge(holidays_events_that_matter, left_on=['s', 'date'], right_on=['place', 'date'], how='left')\n",
"features['regional_holiday'] = temp['locale'] == 'Regional'\n",
"\n",
"temp = features.merge(\n",
" holidays_events_that_matter[holidays_events_that_matter['locale'] == 'National'],\n",
" on='date', how='left')\n",
"features['national_holiday'] = temp['locale'] == 'National'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# todo\n",
"* figure out how to impute missing data in each set before joining them\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment