Created December 1, 2019 03:01
Testing memory and timing for obsplus index creation
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test Obsplus indexing speed and memory consumption\n",
    "\n",
    "Obsplus currently indexes event banks by reading all the events in the bank into memory, then creating a dataframe\n",
    "from these. This is probably the most efficient way to create the dataframe; however, storing all the events in\n",
    "memory can be quite memory intensive. While re-indexing my bank of ~540,000 events I used over 48 GB of memory\n",
    "(and my computer crashed).\n",
    "\n",
    "This notebook demonstrates some possible options for lighter memory usage.\n",
    "\n",
    "First, let's get a catalog to make a bank from and test that. I'm going to hack some `update_index` functions to\n",
    "demonstrate the differences in time and memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Catalog of 1652 events\n"
     ]
    }
   ],
   "source": [
    "from obspy import UTCDateTime\n",
    "from obspy.clients.fdsn import Client\n",
    "from obsplus import EventBank\n",
    "\n",
    "client = Client(\"GEONET\")\n",
    "catalog = client.get_events(\n",
    "    starttime=UTCDateTime(2019, 1, 1), endtime=UTCDateTime(2019, 2, 1))\n",
    "print(\"Catalog of {0} events\".format(len(catalog)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will make a throw-away `EventBank` for testing - delete this at the end!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "bank = EventBank(\"event_bank_to_delete\")\n",
    "bank.put_events(catalog, update_index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's hack in an `update_index` function that follows the current obsplus approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from obsplus.bank.eventbank import *\n",
    "import os\n",
    "\n",
    "def update_index(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
    "    \"\"\"\n",
    "    Iterate files in bank and add any modified since last update to index.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    {bar_parameter_description}\n",
    "    \"\"\"\n",
    "\n",
    "    def func(path):\n",
    "        \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
    "        cat = try_read_catalog(path, format=bank.format)\n",
    "        update_time = getmtime(path)\n",
    "        path = path.replace(bank.bank_path, \"\")\n",
    "        return cat, update_time, path\n",
    "\n",
    "    bank._enforce_min_version()  # delete index if schema has changed\n",
    "    # For this test, delete the index to force a fair full re-index\n",
    "    os.remove(bank.index_path)\n",
    "    # create iterator and lists for storing output\n",
    "    update_time = time.time()\n",
    "    iterator = bank._measured_unindexed_iterator(bar)\n",
    "    events, update_times, paths = [], [], []\n",
    "    for cat, mtime, path in bank._map(func, iterator):\n",
    "        if cat is None:\n",
    "            continue\n",
    "        for event in cat:\n",
    "            events.append(event)\n",
    "            update_times.append(mtime)\n",
    "            paths.append(path)\n",
    "    # add new events to the database in one go\n",
    "    df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
    "    df[\"updated\"] = update_times\n",
    "    df[\"path\"] = paths\n",
    "    if len(df):\n",
    "        df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
    "        bank._write_update(df_to_write, update_time)\n",
    "    return bank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit update_index(bank, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext memory_profiler\n",
    "%memit update_index(bank, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NB: Running this alone in IPython gives:\n",
    "```\n",
    "2min 17s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
    "peak memory: 1614.09 MiB, increment: 1457.95 MiB\n",
    "```\n",
    "\n",
    "Hack number two: making a list of dataframes, then concatenating them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def update_index_from_list(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
    "    \"\"\"\n",
    "    Iterate files in bank and add any modified since last update to index.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    {bar_parameter_description}\n",
    "    \"\"\"\n",
    "\n",
    "    def func(path):\n",
    "        \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
    "        cat = try_read_catalog(path, format=bank.format)\n",
    "        update_time = getmtime(path)\n",
    "        path = path.replace(bank.bank_path, \"\")\n",
    "        return cat, update_time, path\n",
    "\n",
    "    bank._enforce_min_version()  # delete index if schema has changed\n",
    "    # For this test, delete the index to force a fair full re-index\n",
    "    os.remove(bank.index_path)\n",
    "    # create iterator and a list for storing per-file dataframes\n",
    "    update_time = time.time()\n",
    "    iterator = bank._measured_unindexed_iterator(bar)\n",
    "    df = []\n",
    "    for cat, mtime, path in bank._map(func, iterator):\n",
    "        if cat is None:\n",
    "            continue\n",
    "        events, update_times, paths = [], [], []\n",
    "        for event in cat:\n",
    "            events.append(event)\n",
    "            update_times.append(mtime)\n",
    "            paths.append(path)\n",
    "        # convert this file's events to a small dataframe straight away\n",
    "        _df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
    "        _df[\"updated\"] = update_times\n",
    "        _df[\"path\"] = paths\n",
    "        df.append(_df)\n",
    "    if len(df):\n",
    "        df = pd.concat(df)\n",
    "        df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
    "        bank._write_update(df_to_write, update_time)\n",
    "    return bank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit update_index_from_list(bank, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%memit update_index_from_list(bank, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running this separately in IPython gives:\n",
    "```\n",
    "2min 35s ± 5.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
    "peak memory: 202.56 MiB, increment: 45.44 MiB\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean up: remove the throw-away bank created above\n",
    "import shutil\n",
    "shutil.rmtree(\"event_bank_to_delete\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
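The core idea of the notebook's second hack - build one small dataframe per file and concatenate once at the end, rather than holding every event object in memory until a single final conversion - can be sketched in isolation. Everything below (`index_by_concat`, the fake batches) is illustrative and not part of the obsplus API:

```python
import pandas as pd

def index_by_concat(batches):
    """Build one small DataFrame per batch, then concatenate once.

    Only one batch of raw records is alive at a time; everything already
    processed lives in compact DataFrames instead of heavy event objects.
    """
    frames = []
    for batch in batches:
        frames.append(pd.DataFrame(batch))
    return pd.concat(frames, ignore_index=True)

# Fake per-file batches standing in for catalogs read from disk.
batches = [
    [{"event_id": "a", "magnitude": 4.2}],
    [{"event_id": "b", "magnitude": 3.1}, {"event_id": "c", "magnitude": 5.0}],
]
df = index_by_concat(batches)
print(len(df))  # 3
```

Calling `pd.concat` once at the end, rather than concatenating inside the loop, also avoids the quadratic copying that repeated concatenation causes.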
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
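As a rough, stdlib-only alternative to the `%memit` magic used in the notebook, Python's `tracemalloc` module can report the peak allocation of a call. Note the caveat in the comment: it only sees Python-level allocations, so it under-counts memory held by C extensions such as obspy's event objects. A minimal sketch (the helper name is my own, not from the notebook):

```python
import tracemalloc

def peak_memory_mib(func, *args, **kwargs):
    """Return the peak Python-level allocation during func's call, in MiB.

    Caveat: tracemalloc only tracks allocations made through Python's
    allocator, so memory allocated inside C extensions is not counted.
    """
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024 ** 2

# Measure a throw-away million-element list as a quick sanity check.
print(f"peak: {peak_memory_mib(lambda: list(range(1_000_000))):.1f} MiB")
```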