Persisting lru_cache to disk while using hashable pandas objects for parallel experiments
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Persisting lru_cache to disk while using hashable pandas objects for parallel experiments",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/dsevero/252a5f280600c6b1118ed42826d188a9/persisting-lru_cache-to-disk-while-using-hashable-pandas-objects-for-parallel-experiments.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Oxoi-OCsdPcc",
"colab_type": "text"
},
"source": [
"Persisting lru_cache to disk while using hashable pandas objects for parallel experiments\n",
"---\n",
"\n",
"by [dsevero.com](https://dsevero.com)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TC-aZe2vOpRG",
"colab_type": "text"
},
"source": [
"This tutorial shows how to make your functions cache results in RAM (via `lru_cache`) and on disk (via `joblib.Memory`) simultaneously. The advantage of having both is that you gain speed with `lru_cache` while keeping persistence with `joblib.Memory`: if your session crashes, you don't lose the results of experiments you have already run. We also implement a custom `HashableDataFrame` class which, as the name suggests, allows pandas dataframes to be hashed and cached. Finally, we show how to use it to run parallel experiments with RAM/disk caching."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EEfZUZ62qxIA",
"colab_type": "text"
},
"source": [
"# Persisting lru_cache with joblib.Memory"
]
},
{
"cell_type": "code",
"metadata": {
"id": "2v_lLrjpYCVX",
"colab_type": "code",
"colab": {}
},
"source": [
"from joblib import Memory\n",
"from functools import lru_cache\n",
"\n",
"memory = Memory('cache/')"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "MvVlYRzJN6gN",
"colab_type": "text"
},
"source": [
"The disk cache is initially empty"
]
},
{
"cell_type": "code",
"metadata": {
"id": "tio--9LBy3Dy",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "5853cbaa-0b30-4d0b-eee6-e37178433b5a"
},
"source": [
"memory.store_backend.get_items()"
],
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[]"
]
},
"metadata": {
"tags": []
},
"execution_count": 2
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zfipsC_NOA6c",
"colab_type": "text"
},
"source": [
"Decorate your function (represented here by `f`) as shown below. Note that the order is important: `@lru_cache()` must be the outermost decorator, i.e. listed above `@memory.cache`.\n",
"\n",
"We are saying that results of `f` should be stored in RAM to avoid re-evaluation, and should also be stored on disk for persistence.\n",
"\n",
"To restate: calling `f` with some argument first triggers a lookup in the RAM cache (`lru_cache`). On a miss, the disk cache (`memory.cache`) is searched next. If the result is found in either cache, it is returned; otherwise, the function is evaluated and the result is stored in both caches."
]
},
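{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two-level lookup can be sketched in plain Python, with a dictionary standing in for joblib's on-disk store (`fake_disk_cache` below is a hypothetical stand-in, not part of joblib):\n",
"\n",
"```python\n",
"from functools import lru_cache\n",
"\n",
"disk = {}  # stands in for the on-disk store\n",
"\n",
"def fake_disk_cache(func):\n",
"    def wrapper(*args):\n",
"        if args not in disk:\n",
"            disk[args] = func(*args)\n",
"        return disk[args]\n",
"    return wrapper\n",
"\n",
"@lru_cache()       # level 1: RAM\n",
"@fake_disk_cache   # level 2: 'disk'\n",
"def g(x):\n",
"    return x * 2\n",
"\n",
"g(3)\n",
"g.cache_clear()    # simulate losing the RAM cache\n",
"assert g(3) == 6   # recovered from the dict, not re-evaluated\n",
"```"
]
},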
{
"cell_type": "code",
"metadata": {
"id": "kovBH6ixYHFc",
"colab_type": "code",
"colab": {}
},
"source": [
"@lru_cache()\n",
"@memory.cache\n",
"def f(*args, **kwargs):\n",
" print('evaluated!')\n",
" return args, kwargs"
],
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "uz6DF068OQC7",
"colab_type": "text"
},
"source": [
"Note below that the first call to `f` triggers an evaluation. The result will be stored to RAM and disk caches."
]
},
{
"cell_type": "code",
"metadata": {
"id": "3UVBNiDe9lpI",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 122
},
"outputId": "dac1fe6f-64ff-4f06-9576-b82af9da4be7"
},
"source": [
"f(0)"
],
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": [
"________________________________________________________________________________\n",
"[Memory] Calling __main__--content-__ipython-input__.f...\n",
"f(0)\n",
"evaluated!\n",
"________________________________________________________________f - 0.0s, 0.0min\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((0,), {})"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CIMIco7NPvXG",
"colab_type": "text"
},
"source": [
"Below, since `f` was already called with argument `0`, that result is retrieved from the RAM cache. The other two calls (with `1` and `2` as arguments) trigger evaluations."
]
},
{
"cell_type": "code",
"metadata": {
"id": "BThlcfnX97OG",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 246
},
"outputId": "a0bf5456-eebd-44ad-a461-e35cf616b08c"
},
"source": [
"for i in range(3):\n",
" print(f(i))"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"((0,), {})\n",
"________________________________________________________________________________\n",
"[Memory] Calling __main__--content-__ipython-input__.f...\n",
"f(1)\n",
"evaluated!\n",
"________________________________________________________________f - 0.0s, 0.0min\n",
"((1,), {})\n",
"________________________________________________________________________________\n",
"[Memory] Calling __main__--content-__ipython-input__.f...\n",
"f(2)\n",
"evaluated!\n",
"________________________________________________________________f - 0.0s, 0.0min\n",
"((2,), {})\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "glXIbi5DP-R1",
"colab_type": "text"
},
"source": [
"Notice how all 3 results (`f(0)`, `f(1)` and `f(2)`) are stored in disk cache."
]
},
{
"cell_type": "code",
"metadata": {
"id": "cTF0wkgM-JRn",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
},
"outputId": "a93642e3-3903-4fd9-fdb4-3cca10da4283"
},
"source": [
"memory.store_backend.get_items()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[CacheItemInfo(path='cache/joblib/__main__--content-__ipython-input__/f/b09ed3b4f43f553add9ca907ee381d4a', size=88, last_access=datetime.datetime(2020, 7, 17, 15, 49, 25, 636673)),\n",
" CacheItemInfo(path='cache/joblib/__main__--content-__ipython-input__/f/954d7a628712a79211bb7e941b74f75a', size=89, last_access=datetime.datetime(2020, 7, 17, 15, 49, 25, 648675)),\n",
" CacheItemInfo(path='cache/joblib/__main__--content-__ipython-input__/f/e3e67fc4b746637be28240261d89b7e3', size=89, last_access=datetime.datetime(2020, 7, 17, 15, 49, 25, 653676))]"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7wu3CSqgQVxj",
"colab_type": "text"
},
"source": [
"They are also stored in the RAM cache. Checking the stats, we see 4 calls to `lru_cache` with 1 hit and 3 misses; the hit comes from the second call to `f(0)`, inside the loop, whose result was cached by the first call."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Odl8eehOQgE8",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "f71d47f6-5b37-4624-9b12-0be23ae5ea24"
},
"source": [
"f.cache_info()"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CacheInfo(hits=1, misses=3, maxsize=128, currsize=3)"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Hg7Q9eINRaQ-",
"colab_type": "text"
},
"source": [
"Now, what happens if your session crashes or you restart the Jupyter kernel? All cached results in RAM are lost, but they have been persisted to disk. Therefore, as soon as `f` is called with previously seen arguments, the RAM cache is re-populated from disk without re-evaluating `f`.\n",
"\n",
"We can simulate this by clearing the RAM cache."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fA0AshMlRvgL",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "cd9ad394-ae46-4a31-f2ee-0ab205970a4c"
},
"source": [
"f.cache_clear()\n",
"f.cache_info()"
],
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CacheInfo(hits=0, misses=0, maxsize=128, currsize=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nHjY4G--Y-8E",
"colab_type": "text"
},
"source": [
"When `f(1)` is called, it is not re-evaluated (note that `\"evaluated!\"` is not printed). It is also not recovered from the RAM cache, as we can see from `misses=1` and `hits=0`. It was in fact recovered from the disk cache, re-populated into the RAM cache, and returned."
]
},
{
"cell_type": "code",
"metadata": {
"id": "uXMz17JWYVO5",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "d38da2b2-c3f7-4165-fa18-d550567e905f"
},
"source": [
"f(1)"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((1,), {})"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "6fJnsMktZKF2",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "57ca24a6-8236-4653-cca5-8248a635ca99"
},
"source": [
"f.cache_info()"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "N49SueQ_YWcn",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "e509b1a4-b83f-46f2-b05a-3dbb21d83a76"
},
"source": [
"f(1)"
],
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((1,), {})"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Efl-2lYSZdG7",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "7c8c95e4-67b8-4acd-e044-2286a70e14f4"
},
"source": [
"f.cache_info()"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "75oKYbKxkrY9",
"colab_type": "text"
},
"source": [
"# Making pandas objects cacheable"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yQwr3gqzeM4M",
"colab_type": "text"
},
"source": [
"Naturally, this setup is ideal for running experiments with pandas dataframes. Say we have a dataframe `df` and we wish to run a batch of parametrized experiments such as the one below."
]
},
{
"cell_type": "code",
"metadata": {
"id": "6cotk6LRe7wq",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"outputId": "87955329-5dd7-4dcd-9ef0-6fdc579e25ef"
},
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'a': [0, 1, 2, 3], \n",
" 'b': [0, 0, 1, 1]})\n",
"\n",
"def run_experiment(df, b_val):\n",
" # some heavy computation here\n",
" results = (df.query(\"b == @b_val\")\n",
" ['a']\n",
" .agg(['mean', 'count', 'sum'])\n",
" .to_dict())\n",
" return {'b_val': b_val, **results}\n",
"\n",
"[run_experiment(df, b_val) for b_val in [0, 1]]"
],
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'b_val': 0, 'count': 2.0, 'mean': 0.5, 'sum': 1.0},\n",
" {'b_val': 1, 'count': 2.0, 'mean': 2.5, 'sum': 5.0}]"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f1hFnudDgfAh",
"colab_type": "text"
},
"source": [
"If we try to decorate `run_experiment` with `lru_cache`, it will break since pandas objects are not hashable by default."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PVxCnuyxgmMW",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 349
},
"outputId": "ad61b337-ff68-472e-c6fd-97d98ac74bb2"
},
"source": [
"@lru_cache()\n",
"def run_experiment(df, b_val):\n",
" # some heavy computation here\n",
" results = (df.query(\"b == @b_val\")\n",
" ['a']\n",
" .agg(['mean', 'count', 'sum'])\n",
" .to_dict())\n",
" return {'b_val': b_val, **results}\n",
"\n",
"[run_experiment(df, b_val) for b_val in [0, 1]]"
],
"execution_count": 14,
"outputs": [
{
"output_type": "error",
"ename": "TypeError",
"evalue": "ignored",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-14-f2155ed45c55>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m'b_val'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mb_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mresults\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0mrun_experiment\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb_val\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mb_val\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-14-f2155ed45c55>\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m'b_val'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mb_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mresults\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0mrun_experiment\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb_val\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mb_val\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m__hash__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1797\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__hash__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1798\u001b[0m raise TypeError(\n\u001b[0;32m-> 1799\u001b[0;31m \u001b[0;34mf\"{repr(type(self).__name__)} objects are mutable, \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1800\u001b[0m \u001b[0;34mf\"thus they cannot be hashed\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1801\u001b[0m )\n",
"\u001b[0;31mTypeError\u001b[0m: 'DataFrame' objects are mutable, thus they cannot be hashed"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0PSOTSjOgys9",
"colab_type": "text"
},
"source": [
"What exactly does this mean? Dataframes are mutable, since you can change values in place, so pandas refuses to hash them. To make one hashable, we must implement `__hash__` and `__eq__` methods that tell Python how to compute the hash and how to test whether two dataframes are equal. Inspired by [this gist](https://gist.github.com/dsevero/3f3db7acb45d6cd8e945e8a32eaca168), we can do this easily.\n",
"\n",
"**Disclaimer**: If you are executing this in Jupyter, you will need to save the class to a file due to [a bug](https://github.com/joblib/joblib/issues/1035). The `%%writefile` magic command makes this easy."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ad9jCJZBhS9r",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "efeb19e3-d35b-4fed-dd71-bf6b7b369136"
},
"source": [
"%%writefile hashdf.py\n",
"\n",
"from hashlib import sha256\n",
"from pandas.util import hash_pandas_object\n",
"from pandas import DataFrame\n",
"\n",
"\n",
"class HashableDataFrame(DataFrame):\n",
" def __init__(self, obj):\n",
" super().__init__(obj)\n",
"\n",
" def __hash__(self):\n",
" hash_value = sha256(hash_pandas_object(self, index=True).values)\n",
" hash_value = hash(hash_value.hexdigest())\n",
" return hash_value\n",
"\n",
" def __eq__(self, other):\n",
" return self.equals(other)"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
"Writing hashdf.py\n"
],
"name": "stdout"
}
]
},
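{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (this cell is a sketch added for illustration, not part of the original gist), two `HashableDataFrame` objects built from equal data should hash identically and compare equal, since the hash is derived from the frame's contents rather than from object identity:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from hashlib import sha256\n",
"from pandas.util import hash_pandas_object\n",
"\n",
"class HashableDataFrame(pd.DataFrame):\n",
"    def __hash__(self):\n",
"        # hash the contents (values and index), not the object identity\n",
"        return hash(sha256(hash_pandas_object(self, index=True).values).hexdigest())\n",
"\n",
"    def __eq__(self, other):\n",
"        return self.equals(other)\n",
"\n",
"df = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [0, 0, 1, 1]})\n",
"assert hash(HashableDataFrame(df)) == hash(HashableDataFrame(df.copy()))\n",
"assert HashableDataFrame(df) == HashableDataFrame(df.copy())\n",
"```"
]
},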
{
"cell_type": "markdown",
"metadata": {
"id": "aZeaz8Lbhh9M",
"colab_type": "text"
},
"source": [
"We can now work with the RAM and disk caches seamlessly. Notice how the last two calls to `run_experiment` are recovered from `lru_cache`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "E0Pd_qMeheEf",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 369
},
"outputId": "265f5bfe-6a45-41a6-e902-578ab045e5fb"
},
"source": [
"from hashdf import HashableDataFrame\n",
"\n",
"@lru_cache()\n",
"@memory.cache\n",
"def run_experiment(df, b_val):\n",
" # some heavy computation here\n",
" results = (df.query(\"b == @b_val\")\n",
" ['a']\n",
" .agg(['mean', 'count', 'sum'])\n",
" .to_dict())\n",
" return {'b_val': b_val, **results}\n",
"\n",
"df_hashable = HashableDataFrame(df)\n",
"[run_experiment(df_hashable, b_val) for b_val in [0, 1, 0, 1]]"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": [
"________________________________________________________________________________\n",
"[Memory] Calling __main__--content-__ipython-input__.run_experiment...\n",
"run_experiment( a b\n",
"0 0 0\n",
"1 1 0\n",
"2 2 1\n",
"3 3 1, 0)\n",
"___________________________________________________run_experiment - 0.0s, 0.0min\n",
"________________________________________________________________________________\n",
"[Memory] Calling __main__--content-__ipython-input__.run_experiment...\n",
"run_experiment( a b\n",
"0 0 0\n",
"1 1 0\n",
"2 2 1\n",
"3 3 1, 1)\n",
"___________________________________________________run_experiment - 0.0s, 0.0min\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'b_val': 0, 'count': 2.0, 'mean': 0.5, 'sum': 1.0},\n",
" {'b_val': 1, 'count': 2.0, 'mean': 2.5, 'sum': 5.0},\n",
" {'b_val': 0, 'count': 2.0, 'mean': 0.5, 'sum': 1.0},\n",
" {'b_val': 1, 'count': 2.0, 'mean': 2.5, 'sum': 5.0}]"
]
},
"metadata": {
"tags": []
},
"execution_count": 16
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "MHgAleaDh2o_",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "0ae44c6e-f356-4e06-c081-c6fa52aed7c7"
},
"source": [
"run_experiment.cache_info()"
],
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CacheInfo(hits=2, misses=2, maxsize=128, currsize=2)"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gp91Y9Vgi_UR",
"colab_type": "text"
},
"source": [
"You can clear the disk and RAM caches as follows."
]
},
{
"cell_type": "code",
"metadata": {
"id": "FKXmjlA1jDjJ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "5e227599-4a1d-47d5-db45-73a469401227"
},
"source": [
"memory.clear()\n",
"run_experiment.cache_clear()\n",
"f.cache_clear()"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"WARNING:root:[Memory(location=cache/joblib)]: Flushing completely the cache\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h387hl9yq5Z5",
"colab_type": "text"
},
"source": [
"# Running Parallel Experiments with caching"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rtFRm0qotIdt",
"colab_type": "text"
},
"source": [
"We can easily run parallel executions of our experiment using `joblib.Parallel`, as illustrated below, with the added bonus of disk caching via `joblib.Memory`. Using `lru_cache` directly is not possible due to pickling issues, but it is not really needed, since results are always available in the disk cache."
]
},
{
"cell_type": "code",
"metadata": {
"id": "l3xLSojYss8v",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"outputId": "35db5f64-a213-4065-9040-18178dfe9cf6"
},
"source": [
"from joblib import Parallel, delayed\n",
"from multiprocessing import cpu_count\n",
"\n",
"parallel = Parallel(n_jobs=cpu_count())\n",
"\n",
"@delayed\n",
"@memory.cache\n",
"def run_experiment(df, b_val):\n",
" # some heavy computation here\n",
" results = (df.query(\"b == @b_val\")\n",
" ['a']\n",
" .agg(['mean', 'count', 'sum'])\n",
" .to_dict())\n",
" return {'b_val': b_val, **results}\n",
"\n",
"parallel(run_experiment(df_hashable, b_val) for b_val in [0, 1])"
],
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'b_val': 0, 'count': 2.0, 'mean': 0.5, 'sum': 1.0},\n",
" {'b_val': 1, 'count': 2.0, 'mean': 2.5, 'sum': 5.0}]"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ERV-Ty1MdfN9",
"colab_type": "text"
},
"source": [
"by [dsevero.com](https://dsevero.com)"
]
}
]
}