@AndiH
Created November 18, 2019 16:58
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hands-On: Performance Counters\n",
"\n",
"This Notebook is part of the exercises for the SC19 Tutorial »Application Porting and Optimization on GPU-accelerated POWER Architectures«. It is to be run on a POWER9 machine; in the tutorial: on Ascent, the POWER9 training cluster of Oak Ridge National Lab.\n",
"\n",
"This Notebook can be run interactively on Ascent. If this capability is unavailable to you, use it as a guide for executing the tasks on Ascent via shell access. During data evaluation, the Notebook mentions the corresponding commands to execute in case you cannot run it interactively on Ascent.\n",
"\n",
"## Table of Contents\n",
"<a name=\"toc\"></a>\n",
"\n",
"* [Task 1: Measuring Cycles and Instructions](#task1)\n",
"* [Task 2: Measuring Loads and Stores](#task2)\n",
" - [A: Loads and Stores](#task2-a)\n",
" - [B: More Loads and Stores](#task2-b)\n",
" - [C: Bandwidth](#task2-c)\n",
"* [Task E1: Measuring FlOps](#taske1)\n",
"* [Task E2: Measuring a Larger Range](#taske2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 1: Measuring Cycles and Instructions\n",
"<a name=\"task1\"></a>\n",
"\n",
"Throughout this exercise, the core loop of the Jacobi algorithm is instrumented and analyzed. The part in question is\n",
"\n",
"```c\n",
"for (int iy = iy_start; iy < iy_end; iy++)\n",
"{\n",
" for( int ix = ix_start; ix < ix_end; ix++ )\n",
" {\n",
" Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - (A[ iy *nx+ix+1] + A[ iy *nx+ix-1]\n",
" + A[(iy-1)*nx+ix ] + A[(iy+1)*nx+ix ]));\n",
" error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));\n",
" }\n",
"}\n",
"```\n",
"\n",
"The code is instrumented using PAPI. The API routine `PAPI_add_named_event()` is used to add *named* PMU events outside of the relaxation iteration. After that, calls to `PAPI_start()`\n",
"and `PAPI_stop()` can be used to count how often a PMU event is incremented.\n",
"\n",
"For the first task, we will measure quantities often used to characterize an application: cycles and instructions.\n",
"\n",
"**TASK**: Please measure counters for completed instructions and run cycles. See the TODOs in file [`poisson2d.ins_cyc.c`](poisson2d.ins_cyc.c). You can either edit the files with Jupyter's capabilities by clicking on the link of the file or selecting it in the file drawer on the left; or use a dedicated editor on the system (`vim` is available). The names of the counters to be implemented are `PM_INST_CMPL` and `PM_RUN_CYC`.\n",
"\n",
"After changing the source code, compile it with `make task1` or by executing the following cell (we need to change directories first, though). \n",
"*(Using the `Makefile` we have hidden quite a few intricacies from you in order to focus on the relevant content at hand. Don't worry too much about it right now – we'll un-hide it gradually during the course of the tutorial.)*\n",
"\n",
"[Back to top](#toc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pwd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%cd Tasks/\n",
"# Use `%cd Solutions` to look at the solutions for each task"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!make task1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we launch our measurement campaign, we should make sure that the program is measuring correctly. Let's invoke it, for instance, with these arguments: `./poisson2d.ins_cyc.bin 100 64 32` – see the next cell. The `100` specifies the number of iterations to perform; `64` and `32` are the sizes of the grid in the y and x directions, respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!./poisson2d.ins_cyc.bin 100 64 32\n",
"# alternatively call !make run_task1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alright! That should return a comma-separated list of measurements.\n",
"\n",
"For the following runs, we are going to use Ascent's compute backend nodes which are not shared amongst users and also have six GPUs available (each!). We use the available batch scheduler *IBM Spectrum LSF* for this. For convenience, a call to the batch submission system is stored in the environment variable `$SC19_SUBMIT_CMD`. You are welcome to adapt it once you get more familiar with the system.\n",
"\n",
"For now, we want to run our first benchmarking run and measure cycles and instructions for different data sizes, as a function of `nx`. The Makefile holds a target for this, call it with `make bench_task1`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!make bench_task1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the run is completed, let's study the data!\n",
"\n",
"This can be done best in the interactive version of the Jupyter Notebook. In case this version of the description is unavailable to you, call the Makefile target `make graph_task1` (either with X forwarding, or download the resulting PDF)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import common\n",
"%matplotlib inline\n",
"sns.set()\n",
"plt.rcParams['figure.figsize'] = [14, 6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the following cell if you want to switch to color-blind-safer colors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.set_palette(\"colorblind\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.rcParams['figure.figsize'] = [14, 6]\n",
"df = pd.read_csv(\"poisson2d.ins_cyc.bin.csv\", skiprows=range(2, 50000, 2)) # Read in the CSV file from the bench run; parse with Pandas\n",
"df[\"Grid Points\"] = df[\"nx\"] * df[\"ny\"] # Add a new column of the number of grid points (the product of nx and ny)\n",
"df.head() # Display the head of the Pandas dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at the counters we've just measured and see how they scale with an increasing number of grid points.\n",
"\n",
"*In the following, we are always using the minimal value of the counter (indicated by »(min)«) as this should give us an estimate of the best achievable result of the architecture.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"df.set_index(\"Grid Points\")[\"PM_RUN_CYC (min)\"].plot(ax=ax1, legend=True);\n",
"df.set_index(\"Grid Points\")[\"PM_INST_CMPL (min)\"].plot(ax=ax2, legend=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the run cycles show some slight variations at large numbers of grid points, the correlation looks quite linear (as one would naively expect). Let's test that by fitting a linear function!\n",
"\n",
"*The details of the fitting have been extracted into a dedicated function, `print_and_return_fit()`, in the `common.py` helper file. If you're interested, [go have a look at it](common.py).*"
]
},
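{
"cell_type": "markdown",
"metadata": {},
"source": [
"*If you are curious what such a fit boils down to, here is a minimal sketch using `scipy.optimize.curve_fit` on synthetic data. The numbers below are made up purely for illustration; `print_and_return_fit()` presumably adds printing and bookkeeping on top of something like this:*\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.optimize import curve_fit\n",
"\n",
"def linear_function(x, a, b):\n",
"    return a * x + b\n",
"\n",
"x = np.array([2048, 4096, 8192, 16384], dtype=float)  # e.g. grid points\n",
"y = 42.0 * x + 1000.0                                 # synthetic counter values\n",
"popt, pcov = curve_fit(linear_function, x, y)\n",
"print(popt)  # fitted slope a and intercept b\n",
"```"
]
},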
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def linear_function(x, a, b):\n",
" return a*x+b"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_parameters, fit_covariance = common.print_and_return_fit(\n",
" [\"PM_RUN_CYC (min)\", \"PM_INST_CMPL (min)\"], \n",
" df.set_index(\"Grid Points\"), \n",
" linear_function,\n",
" format_uncertainty=\".4f\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's overlay our fits to the graphs from before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"for ax, pmu_counter in zip([ax1, ax2], [\"PM_RUN_CYC (min)\", \"PM_INST_CMPL (min)\"]):\n",
" df.set_index(\"Grid Points\")[pmu_counter].plot(ax=ax, legend=True);\n",
" ax.plot(\n",
" df[\"Grid Points\"], \n",
" linear_function(df[\"Grid Points\"], *fit_parameters[pmu_counter]), \n",
" linestyle=\"--\", \n",
" label=\"Fit: {:.2f} * x + {:.2f}\".format(*fit_parameters[pmu_counter])\n",
" )\n",
" ax.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please execute the next cell to summarize the first task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"The algorithm under investigation runs about {:.0f} cycles and executes about {:.0f} instructions per grid point\".format(\n",
" *[fit_parameters[pmu_counter][0] for pmu_counter in [\"PM_RUN_CYC (min)\", \"PM_INST_CMPL (min)\"]]\n",
"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Bonus:**\n",
"\n",
"The linear fits also determine a y-intercept (»`b`«). How do you interpret this value?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will revisit this graph in a little while.\n",
"\n",
"[Back to top](#toc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2: Measuring Loads and Stores\n",
"<a name=\"task2\"></a>\n",
"\n",
"Looking at the source code, how many loads and stores from / to memory do you expect? Have a look at the loop which we instrumented.\n",
"\n",
"Let's compare your estimate to what the system actually does!\n",
"\n",
"### Task A\n",
"<a name=\"task2-a\"></a>\n",
"\n",
"Please measure counters for loads and stores. See the TODOs in [`poisson2d.ld_st.c`](/edit/Tasks/poisson2d.ld_st.c). This time, implement `PM_LD_CMPL` and `PM_ST_CMPL`.\n",
"\n",
"Compile with `make task2`, test your program with a single run with `make run_task2`, and then finally submit a benchmarking run to the batch system with `make bench_task2`. The following cell will take care of all this.\n",
"\n",
"[Back to top](#toc)"
]
},
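{
"cell_type": "markdown",
"metadata": {},
"source": [
"*As a starting point for your estimate, here is one naive tally of the array accesses in the loop body, assuming every array reference goes to memory once (this is an illustration of how one might count, not the measured answer – registers, caches, and the compiler will change the picture):*\n",
"\n",
"```python\n",
"# Naive per-grid-point count for the Jacobi loop body:\n",
"# rhs, four A stencil neighbours, plus Anew and A in the error term\n",
"loads = 1 + 4 + 2\n",
"stores = 1  # Anew\n",
"print(loads, stores)\n",
"```"
]
},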
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!make bench_task2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the run has finished, let's plot it again over the course of the following cells (non-interactive: `make graph_task2a`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_ldst = pd.read_csv(\"poisson2d.ld_st.bin.csv\", skiprows=range(2, 50000, 2))\n",
"df_ldst[\"Grid Points\"] = df_ldst[\"nx\"] * df_ldst[\"ny\"] \n",
"df_ldst.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"df_ldst.set_index(\"Grid Points\")[\"PM_LD_CMPL (min)\"].plot(ax=ax1, legend=True);\n",
"df_ldst.set_index(\"Grid Points\")[\"PM_ST_CMPL (min)\"].plot(ax=ax2, legend=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This behaviour, too, looks linear at first glance. We can again fit a first-order polynomial (re-using our previously defined `linear_function`)!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"_fit, _cov = common.print_and_return_fit(\n",
" [\"PM_LD_CMPL (min)\", \"PM_ST_CMPL (min)\"], \n",
" df_ldst.set_index(\"Grid Points\"), \n",
" linear_function,\n",
" format_value=\".4f\"\n",
")\n",
"fit_parameters = {**fit_parameters, **_fit}\n",
"fit_covariance = {**fit_covariance, **_cov}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's overlay this in one common plot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"for ax, pmu_counter in zip([ax1, ax2], [\"PM_LD_CMPL (min)\", \"PM_ST_CMPL (min)\"]):\n",
" df_ldst.set_index(\"Grid Points\")[pmu_counter].plot(ax=ax, legend=True);\n",
" ax.plot(\n",
" df_ldst[\"Grid Points\"], \n",
" linear_function(df_ldst[\"Grid Points\"], *fit_parameters[pmu_counter]), \n",
" linestyle=\"--\", \n",
" label=\"Fit: {:.2f} * x + {:.2f}\".format(*fit_parameters[pmu_counter])\n",
" )\n",
" ax.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Did you expect more?\n",
"\n",
"The reason is simple: Among the load and store instructions counted by `PM_LD_CMPL` and `PM_ST_CMPL` are vector instructions which can load and store multiple (in this case: two) values at a time. To see how many *bytes* are loaded and stored, we need to measure counters for vectorized loads and stores as well.\n",
"\n",
"### TASK B\n",
"<a name=\"task2-b\"></a>\n",
"\n",
"Please measure counters for _vectorized_ loads and _vectorized_ stores. See the TODOs in [`poisson2d.vld.c`](poisson2d.vld.c) and [`poisson2d.vst.c`](poisson2d.vst.c) (*Note: These vector counters cannot be measured together and need separate files and runs*). Can you find out the names of the counters yourself, using `papi_native_avail | grep VECTOR_`?\n",
"\n",
"Compile, test, and bench-run your program again.\n",
"\n",
"[Back to top](#toc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!papi_native_avail | grep VECTOR_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`make bench_task3` will submit benchmark runs of both vectorized counters to the batch system (as two subsequent runs of the individual files)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!make bench_task3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot it again, as soon as the run finishes! Non-interactively, call `make graph_task2b`.\n",
"\n",
"*Because we couldn't measure the two vector counters at the same time, we have two CSV files to read in now. We combine them into one common dataframe `df_vldvst` in the following.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_vld = pd.read_csv(\"poisson2d.vld.bin.csv\", skiprows=range(2, 50000, 2))\n",
"df_vst = pd.read_csv(\"poisson2d.vst.bin.csv\", skiprows=range(2, 50000, 2))\n",
"df_vldvst = pd.concat([df_vld.set_index(\"nx\"), df_vst.set_index(\"nx\")[['PM_VECTOR_ST_CMPL (total)', 'PM_VECTOR_ST_CMPL (min)', ' PM_VECTOR_ST_CMPL (max)']]], axis=1).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_vldvst[\"Grid Points\"] = df_vldvst[\"nx\"] * df_vldvst[\"ny\"] \n",
"df_vldvst.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"df_vldvst.set_index(\"Grid Points\")[\"PM_VECTOR_LD_CMPL (min)\"].plot(ax=ax1, legend=True);\n",
"df_vldvst.set_index(\"Grid Points\")[\"PM_VECTOR_ST_CMPL (min)\"].plot(ax=ax2, legend=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, too, the correlation appears linear. Let's do our fitting and plot directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"_fit, _cov = common.print_and_return_fit(\n",
" [\"PM_VECTOR_LD_CMPL (min)\", \"PM_VECTOR_ST_CMPL (min)\"], \n",
" df_vldvst.set_index(\"Grid Points\"), \n",
" linear_function,\n",
" format_value=\".4f\",\n",
")\n",
"fit_parameters = {**fit_parameters, **_fit}\n",
"fit_covariance = {**fit_covariance, **_cov}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"for ax, pmu_counter in zip([ax1, ax2], [\"PM_VECTOR_LD_CMPL (min)\", \"PM_VECTOR_ST_CMPL (min)\"]):\n",
" df_vldvst.set_index(\"Grid Points\")[pmu_counter].plot(ax=ax, legend=True);\n",
" ax.plot(\n",
" df_vldvst[\"Grid Points\"], \n",
" linear_function(df_vldvst[\"Grid Points\"], *fit_parameters[pmu_counter]), \n",
" linestyle=\"--\", \n",
" label=\"Fit: {:.2f} * x + {:.2f}\".format(*fit_parameters[pmu_counter])\n",
" )\n",
" ax.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to make sense of those numbers.\n",
"\n",
"Vector loads and vector stores move two 8-Byte values at a time. When we measured loads and stores with `PM_LD_CMPL` and `PM_ST_CMPL` in part A of this task, we measured the total number of loads and stores, that is: both the vector and the scalar versions of the instructions. In order to convert the load and store instructions into **Bytes** loaded and stored, we need to separate them. The difference between total instructions and vector instructions yields the scalar instructions. We multiply the scalar instructions by 8 Byte (double precision) and the vector instructions by 16 Byte (two loads or stores of double precision). That yields the loaded or stored data (or, more precisely, the instruction-equivalent data).\n",
"\n",
"To formalize it, see the following equations, as an example for loads ($ld$), with $b$ denoting the data loaded in Bytes and $n$ denoting the number of instructions.\n",
"\n",
"\\begin{align}\n",
"b_\\text{ld} &= b_\\text{ld}^\\text{scalar} + b_\\text{ld}^\\text{vector}\\\\\n",
"b_\\text{ld}^\\text{scalar} &= n_\\text{ld}^\\text{scalar} * 8\\,\\text{Byte} \\\\\n",
"b_\\text{ld}^\\text{vector} &= n_\\text{ld}^\\text{vector} * 16\\,\\text{Byte} \\\\\n",
"n_\\text{ld}^\\text{scalar} &= n_\\text{ld}^\\text{total} - n_\\text{ld}^\\text{vector}\\\\\n",
"\\Rightarrow b_\\text{ld} &= n_\\text{ld}^\\text{scalar}* 8 \\,\\text{Byte} + n_\\text{ld}^\\text{vector} * 16\\,\\text{Byte} \\\\\n",
"& = (n_\\text{ld}^\\text{scalar}+2 n_\\text{ld}^\\text{vector}) * 8\\,Byte \\\\\n",
"& = (n_\\text{ld}^\\text{total} - n_\\text{ld}^\\text{vector} + 2 n_\\text{ld}^\\text{vector}) * 8\\,Byte \\\\\n",
"& = (n_\\text{ld}^\\text{total} + n_\\text{ld}^\\text{vector}) *8\\,Byte \n",
"\\end{align}\n",
"\n",
"We are going to plot this in the next cell. In case you look at this Notebook non-interactively, call `make graph_task2b-2`."
]
},
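{
"cell_type": "markdown",
"metadata": {},
"source": [
"*The arithmetic above condenses into a one-line helper. This is just an illustration of the formula with made-up counter values, not part of the tutorial code:*\n",
"\n",
"```python\n",
"def bytes_from_counts(n_total, n_vector, element_size=8):\n",
"    # scalar instructions move one element, vector instructions two:\n",
"    # (n_total - n_vector) * 8 B + n_vector * 16 B = (n_total + n_vector) * 8 B\n",
"    return (n_total + n_vector) * element_size\n",
"\n",
"print(bytes_from_counts(n_total=1000, n_vector=400))  # 11200 Bytes\n",
"```"
]
},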
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_byte = pd.DataFrame()\n",
"df_byte[\"Loads\"] = (df_vldvst.set_index(\"Grid Points\")[\"PM_VECTOR_LD_CMPL (min)\"] + df_ldst.set_index(\"Grid Points\")[\"PM_LD_CMPL (min)\"])*8\n",
"df_byte[\"Stores\"] = (df_vldvst.set_index(\"Grid Points\")[\"PM_VECTOR_ST_CMPL (min)\"] + df_ldst.set_index(\"Grid Points\")[\"PM_ST_CMPL (min)\"])*8\n",
"ax = df_byte.plot()\n",
"ax.set_ylabel(\"Bytes\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's quantify the difference by, again, fitting a linear function to the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"_fit, _cov = common.print_and_return_fit(\n",
" [\"Loads\", \"Stores\"], \n",
" df_byte, \n",
" linear_function\n",
")\n",
"fit_parameters = {**fit_parameters, **_fit}\n",
"fit_covariance = {**fit_covariance, **_cov}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analogously to before, the proportionality factors tell us how many Bytes are loaded and stored per grid point."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Not really a* <a name=\"task2-c\"></a>**TASK C**: We can combine this information with the cycles measured in Task 1 to compute a bandwidth: Bytes exchanged per cycle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_bandwidth = pd.DataFrame()\n",
"df_bandwidth[\"Bandwidth / Byte/Cycle\"] = (df_byte[\"Loads\"] + df_byte[\"Stores\"]) / df.set_index(\"Grid Points\")[\"PM_RUN_CYC (min)\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's display it as a function of grid points. And also compare it to the available L1 cache bandwidth in a second (sub-)plot. Non-interactive users, call `make graph_task2c`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)\n",
"for ax in [ax1, ax2]:\n",
" df_bandwidth[\"Bandwidth / Byte/Cycle\"].plot(ax=ax, legend=True, label=\"Jacobi Bandwidth\")\n",
" ax.set_ylabel(\"Byte/Cycle\")\n",
"ax2.axhline(2*16, color=sns.color_palette()[1], label=\"L1 Bandwidth\");\n",
"ax2.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, we stay well below the available L1 cache bandwidth. Can you think of reasons why?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task E1: Measuring FlOps\n",
"<a name=\"taske1\"></a>\n",
"\n",
"If you still have time, feel free to work on the following extended task.\n",
"\n",
"\n",
"**TASK**: Please measure counters for _vectorized_ floating point operations and _scalar_ floating point operations. The two counters can also not be measured during the same run. So please see the TODOs in [`poisson2d.sflops.c`](/edit/Tasks/poisson2d.sflops.c) and [`poisson2d.vflops.c`](/edit/Tasks/poisson2d.vflops.c). By now you should be able to find out the names of the counters by yourself (*Hint: they include the words »scalar« and »vector«…*).\n",
"\n",
"As usual, compile, test, and bench-run your program.\n",
"\n",
"[Back to top](#toc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!make bench_task4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_sflop = pd.read_csv(\"poisson2d.sflop.bin.csv\", skiprows=range(2, 50000, 2))\n",
"df_vflop = pd.read_csv(\"poisson2d.vflop.bin.csv\", skiprows=range(2, 50000, 2))\n",
"df_flop = pd.concat([df_sflop.set_index(\"nx\"), df_vflop.set_index(\"nx\")[['PM_VECTOR_FLOP_CMPL (total)', 'PM_VECTOR_FLOP_CMPL (min)', ' PM_VECTOR_FLOP_CMPL (max)']]], axis=1).reset_index()\n",
"df_flop.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, the name of the vector counter is a bit misleading: it counts floating point *instructions*, not floating point *operations*. To get *real* floating point operations, each value needs to be multiplied by the vector width (2). We can plot the values afterwards (non-interactive: `make graph_task4`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_flop[\"Grid Points\"] = df_flop[\"nx\"] * df_flop[\"ny\"]\n",
"df_flop[\"Vector FlOps (min)\"] = df_flop[\"PM_VECTOR_FLOP_CMPL (min)\"] * 2\n",
"df_flop[\"Scalar FlOps (min)\"] = df_flop[\"PM_SCALAR_FLOP_CMPL (min)\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_flop.set_index(\"Grid Points\")[[\"Scalar FlOps (min)\", \"Vector FlOps (min)\"]].plot();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"_fit, _cov = common.print_and_return_fit(\n",
" [\"Scalar FlOps (min)\", \"Vector FlOps (min)\"], \n",
" df_flop.set_index(\"Grid Points\"), \n",
" linear_function\n",
")\n",
"fit_parameters = {**fit_parameters, **_fit}\n",
"fit_covariance = {**fit_covariance, **_cov}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With that measured, we can determine the Arithmetic Intensity, i.e. the ratio of floating point operations to Bytes transferred:\n",
"\n",
"\\begin{align}\n",
"\\text{AI}^\\text{emp} = I_\\text{flop} / I_\\text{mem} \\text{,}\n",
"\\end{align}\n",
"\n",
"with $I$ denoting the respective amounts. This is the empirically determined Arithmetic Intensity.\n",
"\n",
"In the non-interactive version of the Notebook, please plot the graph calling `make graph_task4-2` in the terminal."
]
},
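{
"cell_type": "markdown",
"metadata": {},
"source": [
"*As a quick plausibility check with made-up numbers (these are not the measured values): 5 floating point operations and 40 Bytes of traffic per grid point would give an Arithmetic Intensity of 0.125 FlOp/Byte:*\n",
"\n",
"```python\n",
"def arithmetic_intensity(flops, bytes_moved):\n",
"    # AI = floating point operations per Byte of memory traffic\n",
"    return flops / bytes_moved\n",
"\n",
"print(arithmetic_intensity(5, 40))  # 0.125 FlOp/Byte\n",
"```"
]
},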
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"I_flop_scalar = df_flop.set_index(\"Grid Points\")[\"Scalar FlOps (min)\"]\n",
"I_flop_vector = df_flop.set_index(\"Grid Points\")[\"Vector FlOps (min)\"]\n",
"I_mem_load = df_byte[\"Loads\"]\n",
"I_mem_store = df_byte[\"Stores\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_ai = pd.DataFrame()\n",
"df_ai[\"Arithmetic Intensity\"] = (I_flop_scalar + I_flop_vector) / (I_mem_load + I_mem_store)\n",
"ax = df_ai.plot();\n",
"ax.set_ylabel(\"FlOp/Byte\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thinking back to the first lecture of the tutorial, what Arithmetic Intensity did you expect?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task E2: Measuring a Larger Range\n",
"<a name=\"taske2\"></a>\n",
"\n",
"If you still have time, you might venture into your own benchmarking adventure.\n",
"\n",
"Maybe you noticed it already, for instance in Task 2 C: at the very right, towards very large numbers of grid points, the behaviour of the graph changed. Something is happening there!\n",
"\n",
"\n",
"**TASK**: Revisit the counters measured above for a larger range of `nx`. So far, we have only studied `nx` up to 1000. New effects appear above that value – partly only well above, though ($nx > 15000$).\n",
"\n",
"You're on your own here. Edit the `bench.sh` script to change the range and the stepping increments.\n",
"\n",
"**Good luck!**\n",
"\n",
"[Back to top](#toc)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "SC19 S2 (Python 3 + PAPI)",
"language": "python",
"name": "session-2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}