schicks/v2.ipynb

## v2.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working Title\n",
    "\n",
    "Transforming data into the form you want to work with as it enters your system is an essential part of software design. One of the mistakes I have made most often is keeping too much information from the outside, thinking that if I have it already then when I need it later I can avoid a painful reworking of the code, but instead the increased data and the structure it carries with it leads to complex, tightly coupled code, and when requirements do change the inevitable refactoring is significantly more painful. Instead, it helps to take a minimalist approach, building no more structure than you need, which leaves more options open for extension later on. The other way to look at this is that your domain type should be as _specific_ as is possible, which can be a useful frame for when you are trying to work out what you can discard. This example is in OCaml, because I believe that nominal types and pure functions help a lot in disciplining this sort of approach.\n",
    "\n",
    "\n",
    "In pluralsight search, we want to be able to slice and dice the search results that came back in three different ways; by content type, by content level, and by timeframe. When we apply a filter using one of these 'facets' we want it to affect the counts when we are breaking down along some other facet, but not when we are looking for the counts on that facet. For example, when filtering to intermediate courses, we don't want that filter to appear to be applied when trying to find out how many beginner courses there were, or we would just always get zero.\n",
    "\n",
    "To get this data, we can use [elasticsearch subaggregations](https://www.elastic.co/blog/intro-to-aggregations-pt-2-sub-aggregations) to break down the counts. Aggregations in elasticsearch allow you to break down the counts of your content based on the value of some field; for instance, we could break down along the content type field and get the count of each content type in our index. Subaggregations allow you to repeat that process within each bucket of your original aggregation, so we could see the counts broken down by level within each of our content type. This can continue as far as you want, forming a tree of aggregations that break down the count. \n",
    "\n",
    "```\n",
    "Root\n",
    "|-Video(253)\n",
    "|   |-Beginner(87)\n",
    "|   |   |-LastSixMonths(4)\n",
    "|   |   |-LastYear(7)\n",
    "|   |   |-LastTwoYears(32)\n",
    "|   |   |-AllTime(87)\n",
    "|   |   |\n",
    "|   |-Intermediate(93)\n",
    "|   |   |-LastSixMonths(3)\n",
    "|   |   |-LastYear(56)\n",
    "|   |   |-LastTwoYears(74)\n",
    "|   |   |-AllTime(93)\n",
    "|   |   |\n",
    "|   |-Advanced(73)\n",
    "|   |   |-LastSixMonths(2)\n",
    "|   |   |-LastYear(52)\n",
    "|   |   |-LastTwoYears(73)\n",
    "|   |   |-AllTime(73)\n",
    "|   |   |\n",
    "|-Interactive(156)\n",
    "|   |-Beginner(47)\n",
    "|   |   |...And so on\n",
    "```\n",
    "\n",
    "\n",
    "This aggregation compactly contains all of the information we need to get the counts given any sets of filters, which means that we could do this once and apply different filters to the result for each filtering widget we have on the page. However, an aggregation tree also has way more than we need. There is lots of unneccessary structure to handle the possibility of multiple different subaggregations under each aggregation, even though we only ever break down by one facet at each layer. There are intermediate counts stored at every layer of the tree that we have no use for, because all of the same information and more is encoded in the bottom layer.\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"aggregations\": {\n",
    "        \"contentType\": {\n",
    "            \"buckets\": [\n",
    "                {\n",
    "                    \"key\": \"videoCourse\",\n",
    "                    \"doc_count\": 253,\n",
    "                    \"levels\": {\n",
    "                        \"buckets\": [\n",
    "                            {\n",
    "                                \"key\": \"beginner\",\n",
    "                                \"doc_count\": 87,\n",
    "                                \"timeFrame\": {\n",
    "                                    \"buckets\": [\n",
    "                                        {\n",
    "                                            \"key\": \"LastSixMonths\",\n",
    "                                            \"doc_count\" 4\n",
    "                                        },\n",
    "                                        ...\n",
    "                                    ]\n",
    "                                }\n",
    "                            }\n",
    "                            ...\n",
    "                        ]\n",
    "                    }\n",
    "                },\n",
    "                {\n",
    "                    \"key\": \"interactiveCourse\",\n",
    "                    ...\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "First, lets model this as directly as possible as a type, without worrying too much about carrying extra information with us. Right away we're going to start squeezing out some of the unimportant details though; we know in our case that we only have one sub aggregation, so we don't need to allow for arbitrarily many keys, and the name `buckets` which occurs at every level of this structure is completely useless."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "type elasticAggregation = {\n",
       "  key : string;\n",
       "  docCount : int;\n",
       "  subAggregation : (string * elasticAggregation list) option;\n",
       "}\n"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "val aggregatedData : elasticAggregation list =\n",
       "  [{key = \"videoCourse\"; docCount = 253;\n",
       "    subAggregation =\n",
       "     Some\n",
       "      (\"levels\",\n",
       "       [{key = \"beginner\"; docCount = 87;\n",
       "         subAggregation =\n",
       "          Some\n",
       "           (\"timeFrame\",\n",
       "            [{key = \"LastSixMonths\"; docCount = 4; subAggregation = None}])}])}]\n"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(*The type*)\n",
    "type elasticAggregation = {\n",
    "    key: string;\n",
    "    docCount: int;\n",
    "    subAggregation: (string * (elasticAggregation list)) option\n",
    "}\n",
    "(*The data*)\n",
    "let aggregatedData = [{\n",
    "    key = \"videoCourse\";\n",
    "    docCount = 253;\n",
    "    subAggregation = Some (\"levels\", [{\n",
    "        key = \"beginner\";\n",
    "        docCount = 87;\n",
    "        subAggregation = Some (\"timeFrame\", [{\n",
    "            key = \"LastSixMonths\";\n",
    "            docCount = 4;\n",
    "            subAggregation = None;\n",
    "        }]);\n",
    "    }]);\n",
    "}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_It's important to remember that while I'm only showing a single branch here, in practice there would be a whole tree; the top level would be a list of every content type, each of which would contain a list of every level, each of which would contain a list of every timeframe. That would just be a lot to type out here._\n",
    "\n",
    "This is a record type that has the same string key that appears in the JSON. It already strips down to only information we will need; the value of a facet represented by a given aggregation, the next facet to split by, and the document count at this point. It is also already significantly more specific than the JSON was; it doesn't allow multiple different sub aggregations, which means that the keys on a given aggregation are always the same. However, it still is significantly less specific than it could be. The keys and subaggregation names are represented as strings, but that's way too general; we don't want to write code that deals with every possible string, we want to write code that deals with our particular domain. We can do this with union types.\n",
    "\n",
    "Union types allow you to specify a given type as being one of a small set of possibilities, rather than the infinite sets of possibilities represented by things like `string` and `list`. We can use them to describe the possible values of all of our facets, as well as the different facets that we have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "type contentType = Video | Interactive | Project | Path | Guide | Assessment\n"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "type level = Beginner | Intermediate | Advanced\n"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "type timeFrame = LastSixMonths | LastYear | LastTwoYears | AllTime\n"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "type facet =\n",
       "    ContentType of contentType\n",
       "  | Level of level option\n",
       "  | TimeFrame of timeFrame option\n"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type contentType = Video | Interactive | Project | Path | Guide | Assessment\n",
    "type level = Beginner | Intermediate | Advanced\n",
    "type timeFrame = LastSixMonths | LastYear | LastTwoYears | AllTime\n",
    "\n",
    "type facet = \n",
    "    | ContentType of contentType\n",
    "    | Level of level option\n",
    "    | TimeFrame of timeFrame option"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Union types can do way more than this, because each of the small set of possibilites can themselves contain arbitrary other types, but the new structure that they introduce is a way to disambiguate between a few options, even if those options are infinite. For our purposes, we don't need our unions to contain any infinite type, so they truly are finite._\n",
    "\n",
    "Here we have a union type for each of our facets, representing the possible values that facet can take, and one union representing the facets themselves. That one looks a little different; the variants are all `Something of something`. This is because each variant of that type contains some data, namely what the value of that facet is. In our case, everything has a content type, but some things might not have a level or a timeframe. Because of that, those two facet variants contain `option` types, which could either be a particular value (`Some beginner`) or nothing at all (`None`). With these types defining our domain, we can redefine our `elasticAggregation` type to be even more specific."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "type elasticAggregation = {\n",
       "  label : facet;\n",
       "  docCount : int;\n",
       "  subAggregation : elasticAggregation list option;\n",
       "}\n"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "val aggregatedData : elasticAggregation list =\n",
       "  [{label = ContentType Video; docCount = 253;\n",
       "    subAggregation =\n",
       "     Some\n",
       "      [{label = Level (Some Beginner); docCount = 87;\n",
       "        subAggregation =\n",
       "         Some\n",
       "          [{label = TimeFrame (Some LastSixMonths); docCount = 4;\n",
       "            subAggregation = None}]}]}]\n"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(*The type*)\n",
    "type elasticAggregation = {\n",
    "    label: facet;\n",
    "    docCount: int;\n",
    "    subAggregation: elasticAggregation list option\n",
    "}\n",
    "\n",
    "(*The data*)\n",
    "let aggregatedData = [\n",
    "    {\n",
    "        label = ContentType Video;\n",
    "        docCount = 253;\n",
    "        subAggregation = Some [{\n",
    "            label = Level (Some Beginner);\n",
    "            docCount = 87;\n",
    "            subAggregation = Some [{\n",
    "                label = TimeFrame (Some LastSixMonths);\n",
    "                docCount = 4;\n",
    "                subAggregation = None;\n",
    "            }]\n",
    "        }]\n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This only allows for aggregation names and facet values that we already know about. It also organizes than information better, putting all of the information about the facet in the `label` field rather than spreading it between two strings. \n",
    "\n",
    "We could stop here if we wanted, but this still has a few problems. While this type looks nice and compact in its definition, the fact that `subAggregation` is always present but optional means that all the way down the tree we need to keep checking whether there is a subaggregation of a given node. We're also keeping docCounts at every level of the tree, even though all the information we need is down at the leaf level. We can change this structure so that nodes at the bottom of the tree have a different shape than nodes higher in the tree, using another union type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "type elasticAggregation = Inner of inner | Leaf of leaf\n",
       "and inner = { label : facet; subAggregation : elasticAggregation list; }\n",
       "and leaf = { label : facet; docCount : int; }\n"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "val aggregatedData : elasticAggregation list =\n",
       "  [Inner\n",
       "    {label = ContentType Video;\n",
       "     subAggregation =\n",
       "      [Inner\n",
       "        {label = Level (Some Beginner);\n",
       "         subAggregation =\n",
       "          [Leaf {label = TimeFrame (Some LastYear); docCount = 200}]}]}]\n"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "File \"[12]\", line 9, characters 4-17:\n",
      "Warning 30: the label label is defined in both types inner and leaf.\n"
     ]
    }
   ],
   "source": [
    "(*The type*)\n",
    "type elasticAggregation =\n",
    "    | Inner of inner\n",
    "    | Leaf of leaf\n",
    "and inner = {\n",
    "    label: facet;\n",
    "    subAggregation: elasticAggregation list\n",
    "}\n",
    "and leaf = {\n",
    "    label: facet;\n",
    "    docCount: int;\n",
    "}\n",
    "\n",
    "(*The data*)\n",
    "let aggregatedData = [\n",
    "    Inner {\n",
    "        label = ContentType Video;\n",
    "        subAggregation = [Inner {\n",
    "            label = Level (Some Beginner);\n",
    "            subAggregation = [Leaf {\n",
    "                label = TimeFrame (Some LastYear);\n",
    "                docCount = 200\n",
    "            }]\n",
    "        }]\n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_This introduces one more piece of new syntax. Types are not allowed to be **corecursive** by default, meaning that type `a` can't reference type `b` if type `b` also references type `a`. Using `and` to create types lifts this restriction, so we can reference `aggregationTree` in the types `inner` and `leaf` as well as referencing `inner` and `leaf` in `aggregationTree`._\n",
    "\n",
    "This is a more major departure from our original structure, but once again strips away unneccessary information; we no longer need to know about the document counts on aggregations other than at the bottom of the tree. It also exposes whether we are at the bottom or the tree as the outermost label that we can see about an aggregation. This is extremely helpful when for implementing the logic of counting up the data in the tre."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "val sumAggregation :\n",
       "  (elasticAggregation -> bool) -> elasticAggregation -> int = <fun>\n"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "let rec sumAggregation filter aggregation = match (aggregation, (filter aggregation)) with\n",
    "    | _, false -> 0\n",
    "    | Leaf {docCount}, _ -> docCount    \n",
    "    | Inner {subAggregation}, _ -> (List.map (sumAggregation filter) subAggregation) |> List.fold_left (+) 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Functions in OCaml can be a little baffling if you're not used to their structure; where is the argument list? where is the return statement? The intuition here is that declaring a function is the same as declaring any other variable, except functions have some free parameters (their arguments). To represent that, the names of the arguments are on the left side of the equals sign, after the name of the function. The body of the function is then just a statement that can reference the arguments as well as anything else that is in scope. Our function here is marked as `rec` so that it is allowed by the compiler to reference itself recursively. \n",
    "The other syntax here that might seem unfamiliar is the `match variable with` syntax, which is called pattern matching. Pattern matching allows the code to do different things depending on the structure of the variable._\n",
    "\n",
    "In this function, we are matching on the tuple of the aggregation itself and the result of applying the filter to the aggregation. The first pattern says \"regardless of what the aggregation is, if the filter returns false the result is 0\". The second pattern says \"If the aggregation is a leaf node, return the `docCount`\". The third pattern is where most of the meat of the function is, and says \"If the aggregation is an inner node, apply this function to every element of the `subAggregation` and sum up the results\".\n",
    "\n",
    "Apart from new syntax though, the logic expressed by this function is now extremely straightforward; we have three possible cases, and expressing each of them is a single line. However, part of the reason this is so simple is that it leaves defining which nodes should be filtered to the caller, by asking for a function that takes an aggregation and returns a boolean as an argument. How would we define how filters would work in our domain? Well, a set of filters would be made up of a filter on content type, level and timeframe. That should be all we need to think about when we call in to this function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "type filters = {\n",
       "  contentType : contentType option;\n",
       "  level : level option;\n",
       "  timeFrame : timeFrame;\n",
       "}\n"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type filters = {\n",
    "    contentType: contentType option;\n",
    "    level: level option;\n",
    "    timeFrame: timeFrame;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_timeFrame is the only filter that isn't optional; it makes sense to talk about any content type being allowed, but because our time frames overlap we only ever want to specify one, even if that one is `AllTime`._\n",
    "\n",
    "This type is the point of all of this work. Now that we have a sufficiently specific representation of our data, we can ask a question about that data with this extremely simple representation. We just need to define the logic of how these filters should be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "val getLabel : elasticAggregation -> facet = <fun>\n"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "ename": "error",
     "evalue": "compile_error",
     "output_type": "error",
     "traceback": [
      "\u001b[32mFile \"[19]\", line 10, characters 51-52:\n\u001b[31mError: This expression has type contentType/1894\n       but an expression was expected of type contentType/1712\n\u001b[36m   9: \u001b[30m    | TimeFrame None, _ -> false\n\u001b[36m  10: \u001b[30m    | ContentType t, {contentType = Some f} -> t = \u001b[4mf\u001b[0m\u001b[30m\n\u001b[36m  11: \u001b[30m    | Level l, {level = f} -> l = f\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "let getLabel aggregation = match aggregation with\n",
    "    | Inner {label} -> label\n",
    "    | Leaf {label} -> label\n",
    "\n",
    "let filterBy filters aggregation = match (getLabel aggregation), filters with\n",
    "    | ContentType _, {contentType = None} -> true\n",
    "    | Level _, {level = None} -> true\n",
    "    | TimeFrame None, {timeFrame = AllTime} -> true\n",
    "    | TimeFrame None, _ -> false\n",
    "    | ContentType t, {contentType = Some f} -> t = f\n",
    "    | Level l, {level = f} -> l = f\n",
    "    | TimeFrame (Some t), {timeFrame = f} -> t = f"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This `filterBy` function again makes heavy use of pattern matching, to define all of the possible cases that our code could fall into. Once again though, the function is just a single list of rules that maps our cases to simple boolean statements. And now we can use these together to create the final form of our function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "val getCountFor : filters -> elasticAggregation -> int = <fun>\n"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "let getCountFor filters aggregation = sumAggregation (filterBy filters) aggregation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_It may look strange to be giving `filterBy` one argument, even though it was defined with two. In OCaml, functions can be partially applied automatically, which means that given their first argument they return another function on the remaining arguments. This is hinted at by the signature; `filters -> elasticAggregation -> bool`. The result of this is a function `elasticAggregation -> bool`, which is exactly what we needed as an argument to `sumAggregation`._\n",
    "\n",
    "This function is what we were looking for. It takes in a record describing the filters we want to apply and a tree containing the information from elasticsearch, and returns us the number of records that satisfy the filter. We could have done this more directly, writing something similar to `sumAggregation` using our initial type of aggregations, or even directly against the JSON response, but this approach not only allows us to be decoupled from the source of information, it also gives all of the information we need about filters in an extremely compact way.\n",
    "\n",
    "The simplicity of this representation has three benefits that make it worthwhile. It allows us to describe all of the logic of how we will answer the question asked by the `getCountFor` function in simple, flat tables. These flat tables are inherently extensible; if we wanted to add a new facet to filter by, it would be as simple as adding the new facet to our domain types and defining how it should be handled in the `filterBy` function. Lastly, it describes our particular problem in the frame of a much more general one; namely, aggregating over a tree. Finding ways to represent your domain as a special case of such a general problem leads to common patterns and more reusable code.\n",
    "\n",
    "Nothing about this pattern requires OCaml, but the OCaml type system has two structural features that make it easier. Union types encourage us to build our domain up from nothing, rather than trying to describe it with general purpose data structures, and pattern matching allows us to see the logic over these unions as flat tables rather than the nested binary choices that `if/else` statements cause. I don't use OCaml at pluralsight in production, but these advantages are large enough that when I want to understand a part of my domain better I model it with types because they clarify my ideas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "OCaml /Users/sam-schick/Documents/code/notebooks",
   "language": "OCaml",
   "name": "ocaml-jupyter"
  },
  "language_info": {
   "codemirror_mode": "text/x-ocaml",
   "file_extension": ".ml",
   "mimetype": "text/x-ocaml",
   "name": "OCaml",
   "nbconverter_exporter": null,
   "pygments_lexer": "OCaml",
   "version": "4.07.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}