Created December 1, 2019 03:01
Testing memory and timing for obsplus index creation
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test Obsplus indexing speed and memory consumption\n",
    "\n",
    "Obsplus currently indexes event banks by reading all the events in the bank into memory, then creating a dataframe\n",
    "from these. This is probably the most efficient way to create the dataframe; however, storing all the events in\n",
    "memory can be quite memory intensive. While re-indexing my bank of ~540,000 events I used over 48 GB of memory\n",
    "(and my computer crashed).\n",
    "\n",
    "This notebook demonstrates some possible options for lighter memory usage.\n",
    "\n",
    "First, let's get a catalog to make a bank from and test that. I'm going to hack some `update_index` functions to\n",
    "demonstrate the differences in time and memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Catalog of 1652 events\n"
     ]
    }
   ],
   "source": [
    "from obspy import UTCDateTime\n",
    "from obspy.clients.fdsn import Client\n",
    "from obsplus import EventBank\n",
    "\n",
    "client = Client(\"GEONET\")\n",
    "catalog = client.get_events(\n",
    "    starttime=UTCDateTime(2019, 1, 1), endtime=UTCDateTime(2019, 2, 1))\n",
    "print(\"Catalog of {0} events\".format(len(catalog)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will make a throw-away `EventBank` for testing - delete this at the end!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "bank = EventBank(\"event_bank_to_delete\")\n",
    "bank.put_events(catalog, update_index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's hack in an `update_index` function that follows the current obsplus approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from obsplus.bank.eventbank import *\n",
    "import os\n",
    "\n",
    "def update_index(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
    "    \"\"\"\n",
    "    Iterate files in bank and add any modified since last update to index.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    {bar_parameter_description}\n",
    "    \"\"\"\n",
    "\n",
    "    def func(path):\n",
    "        \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
    "        cat = try_read_catalog(path, format=bank.format)\n",
    "        update_time = getmtime(path)\n",
    "        path = path.replace(bank.bank_path, \"\")\n",
    "        return cat, update_time, path\n",
    "\n",
    "    bank._enforce_min_version()  # delete index if schema has changed\n",
    "    # For this test, delete the index to force a fair full re-index\n",
    "    os.remove(bank.index_path)\n",
    "    # create iterator and lists for storing output\n",
    "    update_time = time.time()\n",
    "    iterator = bank._measured_unindexed_iterator(bar)\n",
    "    events, update_times, paths = [], [], []\n",
    "    for cat, mtime, path in bank._map(func, iterator):\n",
    "        if cat is None:\n",
    "            continue\n",
    "        for event in cat:\n",
    "            events.append(event)\n",
    "            update_times.append(mtime)\n",
    "            paths.append(path)\n",
    "    # add new events to the database in one go\n",
    "    df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
    "    df[\"updated\"] = update_times\n",
    "    df[\"path\"] = paths\n",
    "    if len(df):\n",
    "        df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
    "        bank._write_update(df_to_write, update_time)\n",
    "    return bank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit update_index(bank, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext memory_profiler\n",
    "%memit update_index(bank, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NB: Running this alone in IPython gives:\n",
    "```\n",
    "2min 17s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
    "peak memory: 1614.09 MiB, increment: 1457.95 MiB\n",
    "```\n",
    "\n",
    "Hack number two: making a list of dataframes, then concatenating them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def update_index_from_list(bank: EventBank, bar: Optional[ProgressBar] = None) -> \"EventBank\":\n",
    "    \"\"\"\n",
    "    Iterate files in bank and add any modified since last update to index.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    {bar_parameter_description}\n",
    "    \"\"\"\n",
    "\n",
    "    def func(path):\n",
    "        \"\"\" Function to yield events, update_time and paths. \"\"\"\n",
    "        cat = try_read_catalog(path, format=bank.format)\n",
    "        update_time = getmtime(path)\n",
    "        path = path.replace(bank.bank_path, \"\")\n",
    "        return cat, update_time, path\n",
    "\n",
    "    bank._enforce_min_version()  # delete index if schema has changed\n",
    "    # For this test, delete the index to force a fair full re-index\n",
    "    os.remove(bank.index_path)\n",
    "    # create iterator and a list for storing per-file dataframes\n",
    "    update_time = time.time()\n",
    "    iterator = bank._measured_unindexed_iterator(bar)\n",
    "    df = []\n",
    "    for cat, mtime, path in bank._map(func, iterator):\n",
    "        if cat is None:\n",
    "            continue\n",
    "        events, update_times, paths = [], [], []\n",
    "        for event in cat:\n",
    "            events.append(event)\n",
    "            update_times.append(mtime)\n",
    "            paths.append(path)\n",
    "        # convert this file's events to a small dataframe straight away\n",
    "        _df = obsplus.events.pd._default_cat_to_df(obspy.Catalog(events=events))\n",
    "        _df[\"updated\"] = update_times\n",
    "        _df[\"path\"] = paths\n",
    "        df.append(_df)\n",
    "    if len(df):\n",
    "        df = pd.concat(df)\n",
    "        df_to_write = bank._prepare_dataframe(df, EVENT_TYPES_INPUT)\n",
    "        bank._write_update(df_to_write, update_time)\n",
    "    return bank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit update_index_from_list(bank, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%memit update_index_from_list(bank, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running this separately in IPython gives:\n",
    "```\n",
    "2min 35s ± 5.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
    "peak memory: 202.56 MiB, increment: 45.44 MiB\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean up: remove the throw-away bank created above\n",
    "import shutil\n",
    "shutil.rmtree(\"event_bank_to_delete\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
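The core idea of the notebook's second hack - build one small dataframe per file and concatenate once at the end, rather than holding every event object in memory until a single final conversion - can be sketched in isolation. Everything below (`index_by_concat`, the fake batches) is illustrative and not part of the obsplus API:

```python
import pandas as pd

def index_by_concat(batches):
    """Build one small DataFrame per batch, then concatenate once.

    Only one batch of raw records is alive at a time; everything already
    processed lives in compact DataFrames instead of heavy event objects.
    """
    frames = []
    for batch in batches:
        frames.append(pd.DataFrame(batch))
    return pd.concat(frames, ignore_index=True)

# Fake per-file batches standing in for catalogs read from disk.
batches = [
    [{"event_id": "a", "magnitude": 4.2}],
    [{"event_id": "b", "magnitude": 3.1}, {"event_id": "c", "magnitude": 5.0}],
]
df = index_by_concat(batches)
print(len(df))  # 3
```

Calling `pd.concat` once at the end, rather than concatenating inside the loop, also avoids the quadratic copying that repeated concatenation causes.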
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
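As a rough, stdlib-only alternative to the `%memit` magic used in the notebook, Python's `tracemalloc` module can report the peak allocation of a call. Note the caveat in the comment: it only sees Python-level allocations, so it under-counts memory held by C extensions such as obspy's event objects. A minimal sketch (the helper name is my own, not from the notebook):

```python
import tracemalloc

def peak_memory_mib(func, *args, **kwargs):
    """Return the peak Python-level allocation during func's call, in MiB.

    Caveat: tracemalloc only tracks allocations made through Python's
    allocator, so memory allocated inside C extensions is not counted.
    """
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024 ** 2

# Measure a throw-away million-element list as a quick sanity check.
print(f"peak: {peak_memory_mib(lambda: list(range(1_000_000))):.1f} MiB")
```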