Created
September 30, 2022 05:16
-
-
Save zonca/37b4423e507d5dc2fc3b8b5d4db6d316 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{"cells": [{"cell_type": "markdown", "source": "## Astrophysics background\n\nIt is very common in Astrophysics to work with sky pixels. The sky is tassellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is [HEALPix](http://healpix.jpl.nasa.gov).", "metadata": {}}, {"cell_type": "markdown", "source": "Measurements from telescopes are then represented as an array of pixels that encode the pointing of the instrument at each timestamp and the measurement output.", "metadata": {}}, {"cell_type": "markdown", "source": "## Sample timeline", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "import pandas as pd\nimport numba\nimport numpy as np", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "For simplicity let's assume we have a sky with 50K pixels:", "metadata": {}}, {"execution_count": 2, "cell_type": "code", "source": "NPIX = 50000", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "And we have 50 million measurement from our instrument:", "metadata": {}}, {"execution_count": 3, "cell_type": "code", "source": "NTIME = int(50 * 1e6)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "The pointing of our instrument is an array of pixels, random in our sample case:", "metadata": {}}, {"execution_count": 4, "cell_type": "code", "source": "pixels = np.random.randint(0, NPIX-1, NTIME)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "Our data are also random:", "metadata": {}}, {"execution_count": 5, "cell_type": "code", "source": "timeline = np.random.randn(NTIME)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "## Create a map of the sky with pandas", "metadata": {}}, {"cell_type": "markdown", "source": "One of the most common operations is to sum all of our measurements in a sky map, so the value of each pixel in our sky map will be the sum of each individual measurement.\nThe easiest way is to use the `groupby` operation in `pandas`:", "metadata": {}}, {"execution_count": 6, "cell_type": "code", "source": "timeline_pandas = pd.Series(timeline, index=pixels)", "metadata": {"collapsed": false, "trusted": true}, "outputs": []}, {"execution_count": 7, "cell_type": "code", "source": "timeline_pandas.head()", "metadata": {"collapsed": false, "trusted": true}, "outputs": [{"execution_count": 7, "output_type": "execute_result", "metadata": {}, "data": {"text/plain": "46889 0.407097\n3638 1.300001\n6345 0.174931\n15742 -0.255958\n34308 1.147338\ndtype: float64"}}]}, {"execution_count": 8, "cell_type": "code", "source": "%time m = timeline_pandas.groupby(level=0).sum()", "metadata": {"collapsed": false, "trusted": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": "CPU times: user 4.09 s, sys: 471 ms, total: 4.56 s\nWall time: 4.55 s\n"}]}, {"cell_type": "markdown", "source": "## Create a map of the sky with numba", "metadata": {}}, {"cell_type": "markdown", "source": "We would like to improve the performance of this operation using `numba`, which allows to produce automatically C-speed compiled code from pure python functions.", "metadata": {}}, {"cell_type": "markdown", "source": "First we need to develop a pure python version of the code, test it, and then have `numba` optimize it:", "metadata": {}}, {"execution_count": 9, "cell_type": "code", "source": "def groupby_python(index, value, output):\n for i in range(index.shape[0]):\n output[index[i]] += value[i]", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"execution_count": 10, "cell_type": "code", "source": "m_python = np.zeros_like(m)", "metadata": {"collapsed": false, "trusted": true}, "outputs": []}, {"execution_count": 11, "cell_type": "code", "source": "%time groupby_python(pixels, timeline, m_python)", "metadata": {"collapsed": false, "trusted": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": "CPU times: user 37.5 s, sys: 0 ns, total: 37.5 s\nWall time: 37.6 s\n"}]}, {"execution_count": 12, "cell_type": "code", "source": "np.testing.assert_allclose(m_python, m)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "Pure Python is slower than the `pandas` version implemented in `cython`.", "metadata": {}}, {"cell_type": "markdown", "source": "### Optimize the function with numba.jit\n\n`numba.jit` gets an input function and creates an compiled version with does not depend on slow Python calls, this is enforced by `nopython=True`, `numba` would throw an error if it would not be possible to run in `nopython` mode.", "metadata": {}}, {"execution_count": 13, "cell_type": "code", "source": "groupby_numba = numba.jit(groupby_python, nopython=True)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"execution_count": 14, "cell_type": "code", "source": "m_numba = np.zeros_like(m)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"execution_count": 15, "cell_type": "code", "source": "%time groupby_numba(pixels, timeline, m_numba)", "metadata": {"collapsed": false, "trusted": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": "CPU times: user 274 ms, sys: 5 ms, total: 279 ms\nWall time: 278 ms\n"}]}, {"execution_count": 16, "cell_type": "code", "source": "np.testing.assert_allclose(m_numba, m)", "metadata": {"collapsed": true, "trusted": true}, "outputs": []}, {"cell_type": "markdown", "source": "Performance improvement is about 100x compared to Python and 20x compared to Pandas, pretty good!", "metadata": {}}, {"cell_type": "markdown", "source": "## Use numba.jit as a decorator", "metadata": {}}, {"cell_type": "markdown", "source": "The exact same result is obtained if we use `numba.jit` as a decorator:", "metadata": {}}, {"execution_count": 18, "cell_type": "code", "source": "@numba.jit(nopython=True)\ndef groupby_numba(index, value, output):\n for i in range(index.shape[0]):\n output[index[i]] += value[i]", "metadata": {"collapsed": false, "trusted": true}, "outputs": []}], "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}}, "nbformat": 4, "nbformat_minor": 0} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment