@calum-chamberlain
Created December 1, 2019 03:01
Testing memory and timing for obsplus index creation
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test Obsplus indexing speed and memory consumption\n",
"\n",
"Obsplus currently indexes event banks by reading all the events in the bank into memory, then creating a dataframe from\n",
"these. This is proably the most efficient way to create the dataframe, however, storing events in memory can be quite\n",
"memory intensive. during reindexing my bank of ~540,000 events I used over 48GB of memory (and my computer crashed).\n",
"\n",
"This notebook attempts to demonstrate some possible options for lighter memory usage.\n",
"\n",
"First lets get a catalog to make a bank from and test that. I'm going to hack some `update_index` functions to\n",
"demonstrate the differences in time and memory."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Catalog of 1652 events\n"
]
}
],
"source": [
"from obspy import UTCDateTime\n",
"from obspy.clients.fdsn import Client\n",
"from obsplus import EventBank\n",
"\n",
"client = Client(\"GEONET\")\n",
"catalog = client.get_events(\n",
" starttime=UTCDateTime(2019, 1, 1), endtime=UTCDateTime(2019, 2, 1))\n",
"print(\"Catalog of {0} events\".format(len(catalog)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will make a throw-away `EventBank` for testing - delete this at the end!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"bank = EventBank(\"event_bank_to_delete\")\n",
"bank.put_events(catalog, update_index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets hack in an update index function that is the current obsplus way:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from obsplus.bank.eventbank import *\n",
"import os\n",
"\n",
"def update_index(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
" \"\"\"\n",
" Iterate files in bank and add any modified since last update to index.\n",
" Parameters\n",
" ----------\n",
" {bar_parameter_description}\n",
" \"\"\"\n",
"\n",
" def func(path):\n",
" \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
" cat = try_read_catalog(path, format=bank.format)\n",
" update_time = getmtime(path)\n",
" path = path.replace(bank.bank_path, \"\")\n",
" return cat, update_time, path\n",
"\n",
" bank._enforce_min_version() # delete index if schema has changed\n",
" # For this we will force deleting the index to get a fair full re-index\n",
" os.remove(bank.index_path)\n",
" # create iterator and lists for storing output\n",
" update_time = time.time()\n",
" iterator = bank._measured_unindexed_iterator(bar)\n",
" events, update_times, paths = [], [], []\n",
" for cat, mtime, path in bank._map(func, iterator):\n",
" if cat is None:\n",
" continue\n",
" for event in cat:\n",
" events.append(event)\n",
" update_times.append(mtime)\n",
" paths.append(path)\n",
" # add new events to database\n",
" df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
" df[\"updated\"] = update_times\n",
" df[\"path\"] = paths\n",
" if len(df):\n",
" df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
" bank._write_update(df_to_write, update_time)\n",
" return bank"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit update_index(bank, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext memory_profiler\n",
"%memit update_index(bank, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NB: Running this alone in iPython gives:\n",
"```\n",
"2min 17s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
"peak memory: 1614.09 MiB, increment: 1457.95 MiB\n",
"```\n",
"\n",
"Hack number two: making a list of data-frames then concatenating them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def update_index_from_list(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
" \"\"\"\n",
" Iterate files in bank and add any modified since last update to index.\n",
" Parameters\n",
" ----------\n",
" {bar_parameter_description}\n",
" \"\"\"\n",
"\n",
" def func(path):\n",
" \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
" cat = try_read_catalog(path, format=bank.format)\n",
" update_time = getmtime(path)\n",
" path = path.replace(bank.bank_path, \"\")\n",
" return cat, update_time, path\n",
"\n",
" bank._enforce_min_version() # delete index if schema has changed\n",
" # For this we will force deleting the index to get a fair full re-index\n",
" os.remove(bank.index_path)\n",
" # create iterator and lists for storing output\n",
" update_time = time.time()\n",
" iterator = bank._measured_unindexed_iterator(bar)\n",
" df = []\n",
" for cat, mtime, path in bank._map(func, iterator):\n",
" events, update_times, paths = [], [], []\n",
" if cat is None:\n",
" continue\n",
" for event in cat:\n",
" events.append(event)\n",
" update_times.append(mtime)\n",
" paths.append(path)\n",
" # add new events to database\n",
" _df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
" _df[\"updated\"] = update_times\n",
" _df[\"path\"] = paths\n",
" df.append(_df)\n",
" if len(df):\n",
" df = pd.concat(df)\n",
" df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
" bank._write_update(df_to_write, update_time)\n",
" return bank"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit update_index_from_list(bank, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%memit update_index_from_list(bank, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running this seperately in iPython gives:\n",
"```\n",
"2min 35s ± 5.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
"peak memory: 202.56 MiB, increment: 45.44 MiB\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
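
The notebook's memory difference comes from when the bulky event objects are released: the first hack keeps every event alive until the end, while the second reduces each file to a small table as soon as it is read. The following is a minimal, standard-library-only sketch of that idea, not obsplus code. `fake_read_event`, `summarise`, `index_all_in_memory`, `index_streaming`, and `peak_bytes` are hypothetical names made up for illustration, and `tracemalloc` only tracks Python-level allocations, so its numbers are not comparable to the `%memit` figures above.

```python
import tracemalloc


def fake_read_event(i):
    """Hypothetical stand-in for reading one event file: returns a bulky object."""
    return {"id": i, "magnitude": (i % 70) / 10, "payload": [float(i)] * 2_000}


def summarise(event):
    """Reduce a bulky event to the small row that an index would need."""
    return (event["id"], event["magnitude"])


def index_all_in_memory(n_events):
    """Hold every event at once, then summarise them all at the end."""
    events = [fake_read_event(i) for i in range(n_events)]
    return [summarise(event) for event in events]


def index_streaming(n_events):
    """Summarise each event as soon as it is read, then let it be freed."""
    return [summarise(fake_read_event(i)) for i in range(n_events)]


def peak_bytes(func, *args):
    """Return (result, peak traced allocation in bytes) for a single call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also clears the traces, so the next call starts fresh
    return result, peak


rows_bulk, peak_bulk = peak_bytes(index_all_in_memory, 500)
rows_stream, peak_stream = peak_bytes(index_streaming, 500)
print(f"all in memory: {peak_bulk / 1e6:.2f} MB peak")
print(f"streaming:     {peak_stream / 1e6:.2f} MB peak")
```

Both functions produce the same rows; only the peak allocation differs. As in the notebook, the streaming variant trades a little extra per-item work for a much smaller peak.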