Created
June 1, 2018 10:45
-
-
Save rutgerhofste/a75b588c36632d5c2913b4fd541434de to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"from sqlalchemy import *\n", | |
"import datetime\n", | |
"\n", | |
"\n", | |
"DATABASE_ENDPOINT = \"aqueduct30v05.cgpnumwmfcqc.eu-central-1.rds.amazonaws.com\"\n", | |
"DATABASE_NAME = \"database01\"\n", | |
"TABLE_NAME = \"y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01\"\n", | |
"\n", | |
"F = open(\"/.password\",\"r\")\n", | |
"password = F.read().splitlines()[0]\n", | |
"F.close()\n", | |
"\n", | |
"engine = create_engine(\"postgresql://rutgerhofste:{}@{}:5432/{}\".format(password,DATABASE_ENDPOINT,DATABASE_NAME))\n", | |
"connection = engine.connect()\n", | |
"\n", | |
"sql = \"SELECT * FROM {} \\\n", | |
" LIMIT 10\".format(TABLE_NAME)\n", | |
"\n", | |
"df = pd.read_sql(sql, connection)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Some Background\n", | |
"\n", | |
"the database is the result of running zonal statistics on a climate model. I am relatively new to postgreSQL and haven't set any indexes yet. The database is on AWS RDS on an x.large instance. None of the columns is unique. \"pfafid_30spfaf06\" is a zonal code for water basins. There are in total appr. 16000 unique pfaf_ids. year [1960-2014] month [1-12], temporal_resolution [\"year\",\"month\"]\n", | |
"\n", | |
"Result of (PgAdmin) : \n", | |
"`SELECT pg_size_pretty(pg_total_relation_size('\"public\".\"y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01\"'));`\n", | |
"\n", | |
"Successfully run. Total query runtime: 425 msec.\n", | |
"1 rows affected:\n", | |
"\n", | |
"11 GB\n", | |
"\n", | |
"\n", | |
"Result of (PgAdmin) :\n", | |
"`SELECT count(*) FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01`\n", | |
"\n", | |
"Successfully run. Total query runtime: 52 secs.\n", | |
"1 rows affected.\n", | |
"\n", | |
"11715275 i.e. 11,715,275 rows\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pfafid_30spfaf06</th>\n", | |
" <th>area_count_30spfaf06</th>\n", | |
" <th>area_m2_30spfaf06</th>\n", | |
" <th>month</th>\n", | |
" <th>pdomwn_count_30spfaf06</th>\n", | |
" <th>pdomwn_m_30spfaf06</th>\n", | |
" <th>pdomww_count_30spfaf06</th>\n", | |
" <th>pdomww_m_30spfaf06</th>\n", | |
" <th>pindwn_count_30spfaf06</th>\n", | |
" <th>pindwn_m_30spfaf06</th>\n", | |
" <th>...</th>\n", | |
" <th>plivww_count_30spfaf06</th>\n", | |
" <th>plivww_m_30spfaf06</th>\n", | |
" <th>riverdischarge_count_30spfaf06</th>\n", | |
" <th>riverdischarge_m_30spfaf06</th>\n", | |
" <th>temporal_resolution</th>\n", | |
" <th>year</th>\n", | |
" <th>input_file_name</th>\n", | |
" <th>datetime_stamp</th>\n", | |
" <th>ptotwn_m_30spfaf06</th>\n", | |
" <th>ptotww_m_30spfaf06</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>482012</td>\n", | |
" <td>15572</td>\n", | |
" <td>1.027443e+10</td>\n", | |
" <td>12</td>\n", | |
" <td>15572</td>\n", | |
" <td>0.000007</td>\n", | |
" <td>15572</td>\n", | |
" <td>0.000007</td>\n", | |
" <td>15572</td>\n", | |
" <td>1.803221e-07</td>\n", | |
" <td>...</td>\n", | |
" <td>15572</td>\n", | |
" <td>0.000004</td>\n", | |
" <td>15462.0</td>\n", | |
" <td>0.078879</td>\n", | |
" <td>year</td>\n", | |
" <td>1967</td>\n", | |
" <td>global_historical_merged_year_m_30sPfaf06_1960...</td>\n", | |
" <td>2018-05-31 16:42:33.339340</td>\n", | |
" <td>0.000089</td>\n", | |
" <td>0.000218</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>482013</td>\n", | |
" <td>2665</td>\n", | |
" <td>1.749258e+09</td>\n", | |
" <td>12</td>\n", | |
" <td>2665</td>\n", | |
" <td>0.000018</td>\n", | |
" <td>2665</td>\n", | |
" <td>0.000018</td>\n", | |
" <td>2665</td>\n", | |
" <td>0.000000e+00</td>\n", | |
" <td>...</td>\n", | |
" <td>2665</td>\n", | |
" <td>0.000005</td>\n", | |
" <td>2665.0</td>\n", | |
" <td>0.463177</td>\n", | |
" <td>year</td>\n", | |
" <td>1967</td>\n", | |
" <td>global_historical_merged_year_m_30sPfaf06_1960...</td>\n", | |
" <td>2018-05-31 16:42:33.339340</td>\n", | |
" <td>0.000058</td>\n", | |
" <td>0.000115</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>482014</td>\n", | |
" <td>7927</td>\n", | |
" <td>5.247636e+09</td>\n", | |
" <td>12</td>\n", | |
" <td>7927</td>\n", | |
" <td>0.000020</td>\n", | |
" <td>7927</td>\n", | |
" <td>0.000020</td>\n", | |
" <td>7927</td>\n", | |
" <td>7.687544e-06</td>\n", | |
" <td>...</td>\n", | |
" <td>7927</td>\n", | |
" <td>0.000016</td>\n", | |
" <td>7872.0</td>\n", | |
" <td>0.000123</td>\n", | |
" <td>year</td>\n", | |
" <td>1967</td>\n", | |
" <td>global_historical_merged_year_m_30sPfaf06_1960...</td>\n", | |
" <td>2018-05-31 16:42:33.339340</td>\n", | |
" <td>0.000115</td>\n", | |
" <td>0.000265</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>482015</td>\n", | |
" <td>6268</td>\n", | |
" <td>4.100086e+09</td>\n", | |
" <td>12</td>\n", | |
" <td>6268</td>\n", | |
" <td>0.000020</td>\n", | |
" <td>6268</td>\n", | |
" <td>0.000020</td>\n", | |
" <td>6268</td>\n", | |
" <td>0.000000e+00</td>\n", | |
" <td>...</td>\n", | |
" <td>6268</td>\n", | |
" <td>0.000004</td>\n", | |
" <td>6268.0</td>\n", | |
" <td>0.197547</td>\n", | |
" <td>year</td>\n", | |
" <td>1967</td>\n", | |
" <td>global_historical_merged_year_m_30sPfaf06_1960...</td>\n", | |
" <td>2018-05-31 16:42:33.339340</td>\n", | |
" <td>0.000023</td>\n", | |
" <td>0.000023</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>482016</td>\n", | |
" <td>51248</td>\n", | |
" <td>3.393310e+10</td>\n", | |
" <td>12</td>\n", | |
" <td>51248</td>\n", | |
" <td>0.000033</td>\n", | |
" <td>51248</td>\n", | |
" <td>0.000033</td>\n", | |
" <td>51248</td>\n", | |
" <td>2.805116e-05</td>\n", | |
" <td>...</td>\n", | |
" <td>51248</td>\n", | |
" <td>0.000027</td>\n", | |
" <td>51248.0</td>\n", | |
" <td>0.023862</td>\n", | |
" <td>year</td>\n", | |
" <td>1967</td>\n", | |
" <td>global_historical_merged_year_m_30sPfaf06_1960...</td>\n", | |
" <td>2018-05-31 16:42:33.339340</td>\n", | |
" <td>0.001410</td>\n", | |
" <td>0.003697</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 28 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pfafid_30spfaf06 area_count_30spfaf06 area_m2_30spfaf06 month \\\n", | |
"0 482012 15572 1.027443e+10 12 \n", | |
"1 482013 2665 1.749258e+09 12 \n", | |
"2 482014 7927 5.247636e+09 12 \n", | |
"3 482015 6268 4.100086e+09 12 \n", | |
"4 482016 51248 3.393310e+10 12 \n", | |
"\n", | |
" pdomwn_count_30spfaf06 pdomwn_m_30spfaf06 pdomww_count_30spfaf06 \\\n", | |
"0 15572 0.000007 15572 \n", | |
"1 2665 0.000018 2665 \n", | |
"2 7927 0.000020 7927 \n", | |
"3 6268 0.000020 6268 \n", | |
"4 51248 0.000033 51248 \n", | |
"\n", | |
" pdomww_m_30spfaf06 pindwn_count_30spfaf06 pindwn_m_30spfaf06 \\\n", | |
"0 0.000007 15572 1.803221e-07 \n", | |
"1 0.000018 2665 0.000000e+00 \n", | |
"2 0.000020 7927 7.687544e-06 \n", | |
"3 0.000020 6268 0.000000e+00 \n", | |
"4 0.000033 51248 2.805116e-05 \n", | |
"\n", | |
" ... plivww_count_30spfaf06 plivww_m_30spfaf06 \\\n", | |
"0 ... 15572 0.000004 \n", | |
"1 ... 2665 0.000005 \n", | |
"2 ... 7927 0.000016 \n", | |
"3 ... 6268 0.000004 \n", | |
"4 ... 51248 0.000027 \n", | |
"\n", | |
" riverdischarge_count_30spfaf06 riverdischarge_m_30spfaf06 \\\n", | |
"0 15462.0 0.078879 \n", | |
"1 2665.0 0.463177 \n", | |
"2 7872.0 0.000123 \n", | |
"3 6268.0 0.197547 \n", | |
"4 51248.0 0.023862 \n", | |
"\n", | |
" temporal_resolution year \\\n", | |
"0 year 1967 \n", | |
"1 year 1967 \n", | |
"2 year 1967 \n", | |
"3 year 1967 \n", | |
"4 year 1967 \n", | |
"\n", | |
" input_file_name \\\n", | |
"0 global_historical_merged_year_m_30sPfaf06_1960... \n", | |
"1 global_historical_merged_year_m_30sPfaf06_1960... \n", | |
"2 global_historical_merged_year_m_30sPfaf06_1960... \n", | |
"3 global_historical_merged_year_m_30sPfaf06_1960... \n", | |
"4 global_historical_merged_year_m_30sPfaf06_1960... \n", | |
"\n", | |
" datetime_stamp ptotwn_m_30spfaf06 ptotww_m_30spfaf06 \n", | |
"0 2018-05-31 16:42:33.339340 0.000089 0.000218 \n", | |
"1 2018-05-31 16:42:33.339340 0.000058 0.000115 \n", | |
"2 2018-05-31 16:42:33.339340 0.000115 0.000265 \n", | |
"3 2018-05-31 16:42:33.339340 0.000023 0.000023 \n", | |
"4 2018-05-31 16:42:33.339340 0.001410 0.003697 \n", | |
"\n", | |
"[5 rows x 28 columns]" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I know combining month and year in one datetime column is best practice but for future use, keeping them separate is easier." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"The query I like to run calculates a 10 year moving average for three columns: \n", | |
"1. ptotwn_m_30spfaf06\n", | |
"1. ptotww_m_30spfaf06\n", | |
"1. riverdischarge_m_30spfaf06\n", | |
"\n", | |
"For axample the 10y annual moving average of 1969 is the average of 1960 - 1969. For a monthly moving average the average is filtered by month: average of jan 1960 jan 1961 ... jan 1969. \n", | |
"\n", | |
"The query I have so far:\n", | |
"\n", | |
"`\n", | |
"SELECT year, ptotww_m_30spfaf06, temporal_resolution,\n", | |
" SUM(ptotww_m_30spfaf06)\n", | |
" OVER(ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as ptotwwma_m_30spfaf06 \n", | |
"FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01\n", | |
"WHERE temporal_resolution = 'year'\n", | |
"LIMIT 200`\n", | |
"\n", | |
"However this is slow (need to set index? Which columns? year, month?) and does not work for the monthly scores. \n", | |
"\n", | |
"Successfully run. Total query runtime: 52 secs.\n", | |
"200 rows affected.\n", | |
"\n", | |
"Which is quite slow. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 35", | |
"language": "python", | |
"name": "python35" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment