jpivarski/avro-to-awkward.ipynb

## avro-to-awkward.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ace9f31e-be89-422d-8e29-c6ee3ed5b6c4",
   "metadata": {},
   "source": [
    "Concrete types, such as an instance of an Avro Schema or an Awkward Form, can be expressed as trees.\n",
    "\n",
    "Each tree (concrete type) is an element of a set generated by some rules. The set of trees (a type system) is typically infinite because the rules can be applied arbitrarily many times.\n",
    "\n",
    "Programming languages also have this structure, and a programming language represents trees in a human-readable/writable way.\n",
    "\n",
    "Lark is a parsing library; I used it below to define two small languages, one that represents Avro Schemas and another that represents Awkward Forms. Both Avro Schemas and Awkward Forms have JSON representations, so you will be transforming JSON to JSON (without Lark), but I wanted to give you the gist of how it should work without doing the whole thing for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f37bc317-eefc-497a-960f-605aca563147",
   "metadata": {},
   "outputs": [],
   "source": [
    "import lark"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8896f313-bbb6-480a-a47f-225ebb006baa",
   "metadata": {},
   "source": [
    "The `show` helper parses a concrete type written in one of these languages and pretty-prints the result as a nested tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "df20d135-f9d8-4065-9735-868cd507824e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def show(parser, unparsed):\n",
    "    print(parser.parse(unparsed).pretty())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abd11797-f41b-4fa5-aaa7-85c0d8b3aae0",
   "metadata": {},
   "source": [
    "The following defines a language for Avro schemas. The grammar is written in Lark's EBNF-style notation (an extended Backus-Naur form), which is itself a small language.\n",
    "\n",
    "A language can be thought of as a mapping from input strings to trees.\n",
    "\n",
    "It can also be thought of as a mapping from input strings to \"error\" or \"no error,\" to indicate whether an input string is in the language or not. In this way, the following specifies a space that is isomorphic to Avro Schemas. (The mapping between parsed trees in this language and valid Avro Schema JSON is the isomorphism.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "443180ec-8dbc-42c2-a9dc-76ff8b37f2b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "avro = lark.Lark('''\n",
    "start: _type\n",
    "\n",
    "_type: PRIMITIVE | record | enum | array | map | union | fixed\n",
    "\n",
    "PRIMITIVE: \"null\" | \"boolean\" | \"int\" | \"long\" | \"float\" | \"double\" | \"bytes\" | \"string\"\n",
    "\n",
    "field: ESCAPED_STRING \":\" _type\n",
    "fields: \"{\" [field (\",\" field)*] \"}\"\n",
    "record: \"record\" \"(\" ESCAPED_STRING \",\" fields \")\"   // name, zero or more fields\n",
    "\n",
    "symbols: \"[\" ESCAPED_STRING (\",\" ESCAPED_STRING)* \"]\"\n",
    "enum: \"enum\" \"(\" ESCAPED_STRING \",\" symbols \")\"   // name, symbols\n",
    "\n",
    "array: \"array\" \"(\" _type \")\"\n",
    "\n",
    "map: \"map\" \"(\" _type \")\"\n",
    "\n",
    "union: \"union\" \"(\" _type (\",\" _type)* \")\"   // one or more possibilities\n",
    "\n",
    "fixed: \"fixed\" \"(\" ESCAPED_STRING \",\" INT \")\"   // name, size\n",
    "\n",
    "%import common.WS\n",
    "%import common.INT\n",
    "%import common.ESCAPED_STRING\n",
    "%ignore WS\n",
    "''', maybe_placeholders=False)  # leave absent optional [...] items out of the tree instead of inserting None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5da5cf7-fad1-41c0-abcc-c9f4fd19ff95",
   "metadata": {},
   "source": [
    "Here are some examples of Avro types (in this language) and their parsed trees."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "83502768-f1b6-497b-94f4-0bd5809a0277",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\tint\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'int')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2cbd343f-212b-4beb-ae9f-7feb79af9570",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  record\n",
      "    \"empty\"\n",
      "    fields\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'record(\"empty\", {})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "5b800c3e-5127-4c38-9c16-cb765b08e782",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  record\n",
      "    \"triple\"\n",
      "    fields\n",
      "      field\n",
      "        \"x\"\n",
      "        int\n",
      "      field\n",
      "        \"y\"\n",
      "        double\n",
      "      field\n",
      "        \"z\"\n",
      "        string\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'record(\"triple\", {\"x\": int, \"y\": double, \"z\": string})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e8f8579c-af0b-4af4-bf11-00bdbcd1ab8f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  enum\n",
      "    \"name\"\n",
      "    symbols\n",
      "      \"one\"\n",
      "      \"two\"\n",
      "      \"three\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'enum(\"name\", [\"one\", \"two\", \"three\"])')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "06046a7f-3818-4c1b-a0d8-9e1330277c96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  array\tint\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'array(int)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "28771555-be95-40ed-8487-570cbfbb3796",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  array\n",
      "    array\n",
      "      array\n",
      "        array\tint\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'array(array(array(array(int))))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d029a94e-d706-4938-9326-dd0da20e7b99",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  map\tint\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'map(int)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "92243cfc-e073-4716-bd0b-f43f38383107",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  union\n",
      "    null\n",
      "    int\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'union(null, int)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7dbb2f5d-9bcb-44a3-88dc-293ebf66c672",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  union\n",
      "    null\n",
      "    int\n",
      "    array\tint\n",
      "    record\n",
      "      \"pair\"\n",
      "      fields\n",
      "        field\n",
      "          \"left\"\n",
      "          int\n",
      "        field\n",
      "          \"right\"\n",
      "          int\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'union(null, int, array(int), record(\"pair\", {\"left\": int, \"right\": int}))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0fc72afa-c535-4abd-be21-5ff0095a22f3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  fixed\n",
      "    \"name\"\n",
      "    16\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(avro, 'fixed(\"name\", 16)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9854feb-e291-42a9-9c72-00a8f8e251ff",
   "metadata": {},
   "source": [
    "The following defines a language for Awkward Forms (the subset we'll be using). Some of these have parameters to specify that a given list should be interpreted as a string, bytes, map, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "815c200f-870d-46e4-a1fd-1ba758fe6e87",
   "metadata": {},
   "outputs": [],
   "source": [
    "awkward = lark.Lark('''\n",
    "start: _form\n",
    "\n",
    "_form: numpy | listoffset | regular | bytemasked | indexedoption | indexed | record | union\n",
    "\n",
    "numpy: \"NumpyForm\" \"(\" ESCAPED_STRING \")\"\n",
    "\n",
    "is_string: \"string\"\n",
    "is_bytes: \"bytes\"\n",
    "is_map: \"map\"\n",
    "listoffset: \"ListOffsetForm\" \"(\" _form [\",\" (is_string | is_bytes | is_map)] \")\"\n",
    "regular: \"RegularForm\" \"(\" _form \",\" INT [\",\" is_bytes] \")\"\n",
    "\n",
    "bytemasked: \"ByteMaskedForm\" \"(\" _form \")\"\n",
    "indexedoption: \"IndexedOptionForm\" \"(\" _form \")\"\n",
    "\n",
    "is_categorical: \"categorical\"\n",
    "indexed: \"IndexedForm\" \"(\" _form [\",\" is_categorical] \")\"\n",
    "\n",
    "fields: \"[\" [_form (\",\" _form)*] \"]\"\n",
    "names: \"[\" [ESCAPED_STRING (\",\" ESCAPED_STRING)*] \"]\"\n",
    "record: \"RecordForm\" \"(\" ESCAPED_STRING \",\" fields \",\" names \")\"\n",
    "\n",
    "union: \"UnionForm\" \"(\" _form (\",\" _form)* \")\"\n",
    "\n",
    "%import common.WS\n",
    "%import common.INT\n",
    "%import common.ESCAPED_STRING\n",
    "%ignore WS\n",
    "''', maybe_placeholders=False)  # leave absent optional [...] items out of the tree instead of inserting None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9a22255-f48c-4a16-a669-ec381a1e8a4e",
   "metadata": {},
   "source": [
    "Below, it will be useful to reverse the parsing process: `unparse_awkward` turns a tree back into a one-line string in the Awkward mini-language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "7285798a-4e73-458c-8ded-f7c7227f34c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def unparse_awkward(node):\n",
    "    if node.data == \"numpy\":\n",
    "        return f\"NumpyForm({node.children[0]})\"\n",
    "\n",
    "    elif node.data == \"listoffset\":\n",
    "        if len(node.children) == 2 and node.children[1].data == \"is_string\":\n",
    "            extra = \", string\"\n",
    "        elif len(node.children) == 2 and node.children[1].data == \"is_bytes\":\n",
    "            extra = \", bytes\"\n",
    "        elif len(node.children) == 2 and node.children[1].data == \"is_map\":\n",
    "            extra = \", map\"\n",
    "        else:\n",
    "            extra = \"\"\n",
    "        return f\"ListOffsetForm({unparse_awkward(node.children[0])}{extra})\"\n",
    "\n",
    "    elif node.data == \"regular\":\n",
    "        if len(node.children) == 3 and node.children[2].data == \"is_bytes\":\n",
    "            extra = \", bytes\"\n",
    "        else:\n",
    "            extra = \"\"\n",
    "        return f\"RegularForm({unparse_awkward(node.children[0])}, {node.children[1]}{extra})\"\n",
    "\n",
    "    elif node.data == \"bytemasked\":\n",
    "        return f\"ByteMaskedForm({unparse_awkward(node.children[0])})\"\n",
    "\n",
    "    elif node.data == \"indexedoption\":\n",
    "        return f\"IndexedOptionForm({unparse_awkward(node.children[0])})\"\n",
    "\n",
    "    elif node.data == \"indexed\":\n",
    "        if len(node.children) == 2 and node.children[1].data == \"is_categorical\":\n",
    "            extra = \", categorical\"\n",
    "        else:\n",
    "            extra = \"\"\n",
    "        return f\"IndexedForm({unparse_awkward(node.children[0])}{extra})\"\n",
    "\n",
    "    elif node.data == \"record\":\n",
    "        return f\"RecordForm({node.children[0]}, [{', '.join(unparse_awkward(x) for x in node.children[1].children)}], [{', '.join(node.children[2].children)}])\"\n",
    "\n",
    "    elif node.data == \"union\":\n",
    "        return f\"UnionForm({', '.join(unparse_awkward(x) for x in node.children)})\"\n",
    "\n",
    "    elif node.data == \"start\":\n",
    "        return unparse_awkward(node.children[0])\n",
    "\n",
    "    else:\n",
    "        raise AssertionError(node)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07e5da4f-a797-4b88-8f6d-a29922677e14",
   "metadata": {},
   "source": [
    "Here are some examples of this language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "1df161a9-181c-4698-ab6c-fa5a95f9cf56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  numpy\t\"<f8\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'NumpyForm(\"<f8\")')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "f7281146-9b46-4f8b-8a30-74c336f089ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  listoffset\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'ListOffsetForm(NumpyForm(\"<i4\"))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4d34172f-9872-4c86-8f72-1cb94c618cd9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  listoffset\n",
      "    listoffset\n",
      "      listoffset\n",
      "        numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'ListOffsetForm(ListOffsetForm(ListOffsetForm(NumpyForm(\"<i4\"))))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "ce690dd0-6915-49ac-a0e1-bab3594c36d0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  listoffset\n",
      "    numpy\t\"u1\"\n",
      "    is_string\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'ListOffsetForm(NumpyForm(\"u1\"), string)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "7a7f87a7-8038-4d6f-a1e0-9d763c6f2322",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  regular\n",
      "    numpy\t\"<i4\"\n",
      "    10\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'RegularForm(NumpyForm(\"<i4\"), 10)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "61c21431-676d-4f2e-8a15-aca30d5db245",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  bytemasked\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'ByteMaskedForm(NumpyForm(\"<i4\"))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "dc49c66c-9b62-47b3-a7ec-4013ca5a286a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  indexed\n",
      "    listoffset\n",
      "      numpy\t\"u1\"\n",
      "      is_string\n",
      "    is_categorical\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'IndexedForm(ListOffsetForm(NumpyForm(\"u1\"), string), categorical)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "cc010f83-ad51-48b2-853e-606a9cec53cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  record\n",
      "    \"triple\"\n",
      "    fields\n",
      "      numpy\t\"<i4\"\n",
      "      numpy\t\"<f8\"\n",
      "      listoffset\n",
      "        numpy\t\"u1\"\n",
      "    names\n",
      "      \"x\"\n",
      "      \"y\"\n",
      "      \"z\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, 'RecordForm(\"triple\", [NumpyForm(\"<i4\"), NumpyForm(\"<f8\"), ListOffsetForm(NumpyForm(\"u1\"))], [\"x\", \"y\", \"z\"])')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "ea2437dc-13e7-46ba-a287-ebe2faa32bfc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start\n",
      "  union\n",
      "    numpy\t\"bool\"\n",
      "    numpy\t\"<i4\"\n",
      "    listoffset\n",
      "      record\n",
      "        \"pair\"\n",
      "        fields\n",
      "          listoffset\n",
      "            numpy\t\"u1\"\n",
      "            is_string\n",
      "          numpy\t\"<f8\"\n",
      "        names\n",
      "          \"key\"\n",
      "          \"value\"\n",
      "      is_map\n",
      "\n"
     ]
    }
   ],
   "source": [
    "show(awkward, '''UnionForm(\n",
    "    NumpyForm(\"bool\"),\n",
    "    NumpyForm(\"<i4\"),\n",
    "    ListOffsetForm(RecordForm(\"pair\", [ListOffsetForm(NumpyForm(\"u1\"), string), NumpyForm(\"<f8\")], [\"key\", \"value\"]), map)\n",
    ")''')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb3636b-4ea8-4c63-b805-051ab7651ba8",
   "metadata": {},
   "source": [
    "The thing you'll want to do is transform Avro Schemas into Awkward Forms.\n",
    "\n",
    "Some conversions are impossible (a bare Avro `\"null\"`), some don't preserve structure (Avro unions → Awkward option-types and/or unions), and some Awkward Forms are not in the image of this transformation: no Avro Schema maps to them.\n",
    "\n",
    "Note that the transformation function is recursive and mostly just a case statement. This is a good format for most of these sorts of transformations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "2803816a-775c-4d46-a3cd-4def8085df41",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "\n",
    "def avro_to_awkward(node):\n",
    "    if node == \"null\":\n",
    "        raise ValueError(\"no equivalent\")\n",
    "\n",
    "    elif node == \"boolean\":\n",
    "        return lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"bool\"')])\n",
    "\n",
    "    elif node == \"int\":\n",
    "        return lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"<i4\"')])\n",
    "\n",
    "    elif node == \"long\":\n",
    "        return lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"<i8\"')])\n",
    "\n",
    "    elif node == \"float\":\n",
    "        return lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"<f4\"')])\n",
    "\n",
    "    elif node == \"double\":\n",
    "        return lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"<f8\"')])\n",
    "\n",
    "    elif node == \"bytes\":\n",
    "        return lark.Tree(\"listoffset\", [lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"u1\"')]), lark.Tree(\"is_bytes\", [])])\n",
    "\n",
    "    elif node == \"string\":\n",
    "        return lark.Tree(\"listoffset\", [lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"u1\"')]), lark.Tree(\"is_string\", [])])\n",
    "\n",
    "    elif node.data == \"record\":\n",
    "        fields, names = [], []\n",
    "        for child in node.children[1].children:\n",
    "            fields.append(avro_to_awkward(child.children[1]))\n",
    "            names.append(child.children[0])\n",
    "        return lark.Tree(\"record\", [node.children[0], lark.Tree(\"fields\", fields), lark.Tree(\"names\", names)])\n",
    "\n",
    "    elif node.data == \"enum\":\n",
    "        # ast.literal_eval unquotes the string tokens without executing arbitrary code\n",
    "        print(f\"Create a subtree of data containing strings: {[ast.literal_eval(str(x)) for x in node.children[1].children]}\")\n",
    "        return lark.Tree(\"indexed\", [lark.Tree(\"listoffset\", [lark.Tree(\"listoffset\", [lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"u1\"')]), lark.Tree(\"is_string\", [])])]), lark.Tree(\"is_categorical\", [])])\n",
    "\n",
    "    elif node.data == \"array\":\n",
    "        return lark.Tree(\"listoffset\", [avro_to_awkward(node.children[0])])\n",
    "\n",
    "    elif node.data == \"map\":\n",
    "        fields, names = [], []\n",
    "        fields.append(lark.Tree(\"listoffset\", [lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"u1\"')]), lark.Tree(\"is_string\", [])]))\n",
    "        fields.append(avro_to_awkward(node.children[0]))\n",
    "        names.append(lark.Token(\"ESCAPED_STRING\", '\"key\"'))\n",
    "        names.append(lark.Token(\"ESCAPED_STRING\", '\"value\"'))\n",
    "        return lark.Tree(\"listoffset\", [lark.Tree(\"record\", [lark.Token(\"ESCAPED_STRING\", '\"pair\"'), lark.Tree(\"fields\", fields), lark.Tree(\"names\", names)]), lark.Tree(\"is_map\", [])])\n",
    "\n",
    "    elif node.data == \"union\":\n",
    "        is_option, possibilities = False, []\n",
    "        for child in node.children:\n",
    "            if child == \"null\":\n",
    "                is_option = True\n",
    "            else:\n",
    "                possibilities.append(avro_to_awkward(child))\n",
    "\n",
    "        if is_option and len(possibilities) == 1 and possibilities[0].data == \"record\":\n",
    "            return lark.Tree(\"indexedoption\", possibilities)\n",
    "        \n",
    "        elif is_option and len(possibilities) == 1:\n",
    "            return lark.Tree(\"bytemasked\", possibilities)\n",
    "        \n",
    "        else:\n",
    "            out = lark.Tree(\"union\", possibilities)\n",
    "            if is_option:\n",
    "                return lark.Tree(\"bytemasked\", [out])\n",
    "            else:\n",
    "                return out\n",
    "\n",
    "    elif node.data == \"fixed\":\n",
    "        return lark.Tree(\"regular\", [lark.Tree(\"numpy\", [lark.Token(\"ESCAPED_STRING\", '\"u1\"')]), node.children[1], lark.Tree(\"is_bytes\", [])])\n",
    "\n",
    "    elif node.data == \"start\":\n",
    "        return lark.Tree(\"start\", [avro_to_awkward(node.children[0])])\n",
    "\n",
    "    else:\n",
    "        raise AssertionError(node)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "f7c0cd7f-5221-407a-bb59-f7ede5639226",
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert(unparsed):\n",
    "    as_awkward = avro_to_awkward(avro.parse(unparsed))\n",
    "    print(unparse_awkward(as_awkward))\n",
    "    print()\n",
    "    print(as_awkward.pretty())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c247cd1e-9914-4b68-b9af-90f126b00e50",
   "metadata": {},
   "source": [
    "The Avro primitives `boolean`, `int`, `long`, `float`, and `double` each become an Awkward `NumpyForm` with the corresponding NumPy dtype (`bool`, `<i4`, `<i8`, `<f4`, `<f8`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "a65a04e8-d70b-492d-8c2e-924a977531ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NumpyForm(\"<i4\")\n",
      "\n",
      "start\n",
      "  numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('int')"
   ]
  },
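  {
   "cell_type": "markdown",
   "id": "sketch-primitive-json-0001",
   "metadata": {},
   "source": [
    "Since the real task is JSON → JSON, here is a minimal sketch of the same primitive mapping written on plain Python values instead of Lark trees. The names `AVRO_TO_PRIMITIVE` and `primitive_to_form` are hypothetical, and the dict layout (`class` and `primitive` keys) is an assumption about the Awkward Form JSON, not something defined in this notebook; check it against the Form JSON you actually target.\n",
    "\n",
    "```python\n",
    "# Hypothetical lookup table: Avro primitive name -> NumPy-style primitive name.\n",
    "# These correspond to the dtypes bool, <i4, <i8, <f4, <f8 used above.\n",
    "AVRO_TO_PRIMITIVE = {\n",
    "    'boolean': 'bool',\n",
    "    'int': 'int32',\n",
    "    'long': 'int64',\n",
    "    'float': 'float32',\n",
    "    'double': 'float64',\n",
    "}\n",
    "\n",
    "def primitive_to_form(avro_type):\n",
    "    # A bare 'null' has no equivalent, mirroring avro_to_awkward.\n",
    "    if avro_type == 'null':\n",
    "        raise ValueError('no equivalent')\n",
    "    # Assumed layout of a NumpyForm in JSON-like dict form.\n",
    "    return {'class': 'NumpyArray', 'primitive': AVRO_TO_PRIMITIVE[avro_type]}\n",
    "\n",
    "print(primitive_to_form('int'))\n",
    "```"
   ]
  },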
  {
   "cell_type": "markdown",
   "id": "e6760778-4af4-4ec6-85cc-47cfbcb8a922",
   "metadata": {},
   "source": [
    "Avro's `array` is an Awkward `ListOffsetForm` (`ListForm` will never be reached by this transformation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "b14b291d-b76e-41ea-b529-8089be0c6b9d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ListOffsetForm(NumpyForm(\"<i4\"))\n",
      "\n",
      "start\n",
      "  listoffset\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('array(int)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "97e8da5b-cc50-4da0-82c1-4390029134ba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ListOffsetForm(ListOffsetForm(ListOffsetForm(NumpyForm(\"<f8\"))))\n",
      "\n",
      "start\n",
      "  listoffset\n",
      "    listoffset\n",
      "      listoffset\n",
      "        numpy\t\"<f8\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('array(array(array(double)))')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ede3a74-0803-44b0-bd32-7131a8b65efd",
   "metadata": {},
   "source": [
    "Avro's `string` is an Awkward `ListOffsetForm` with special `parameters`. In the mini-language, it's just a token to indicate that we'll need to add `parameters={\"__array__\": \"string\"}` to the `ListOffsetForm` and `parameters={\"__array__\": \"char\"}` to the `NumpyForm`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "eed94821-7e2e-440e-a4bf-901a4072aec8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ListOffsetForm(NumpyForm(\"u1\"), string)\n",
      "\n",
      "start\n",
      "  listoffset\n",
      "    numpy\t\"u1\"\n",
      "    is_string\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('string')"
   ]
  },
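  {
   "cell_type": "markdown",
   "id": "sketch-string-json-0002",
   "metadata": {},
   "source": [
    "As a rough illustration of what the `string` token stands for, the sketch below builds a JSON-like dict carrying the two `parameters` described above. The function name `string_form` is hypothetical, and the `class`, `offsets`, `primitive`, and `content` keys are assumptions about the Awkward Form JSON layout; only the two `__array__` parameters come from the text.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "def string_form():\n",
    "    # Outer list node is tagged as a string, inner byte node as char.\n",
    "    return {\n",
    "        'class': 'ListOffsetArray',   # assumed key and value\n",
    "        'offsets': 'i64',             # assumed key and value\n",
    "        'content': {\n",
    "            'class': 'NumpyArray',    # assumed key and value\n",
    "            'primitive': 'uint8',     # the u1 dtype used above\n",
    "            'parameters': {'__array__': 'char'},\n",
    "        },\n",
    "        'parameters': {'__array__': 'string'},\n",
    "    }\n",
    "\n",
    "print(json.dumps(string_form(), indent=2))\n",
    "```"
   ]
  },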
  {
   "cell_type": "markdown",
   "id": "3111726d-a134-42ac-8950-ef527977c645",
   "metadata": {},
   "source": [
    "The JSON form for Avro's records is an ordered list of pairs (JSON objects with `name` and `type` fields). Awkward's constructor takes them as two equal-length lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "42a40a1c-1088-4ec0-a68d-ba272453e514",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RecordForm(\"name\", [NumpyForm(\"<i4\"), NumpyForm(\"bool\")], [\"x\", \"y\"])\n",
      "\n",
      "start\n",
      "  record\n",
      "    \"name\"\n",
      "    fields\n",
      "      numpy\t\"<i4\"\n",
      "      numpy\t\"bool\"\n",
      "    names\n",
      "      \"x\"\n",
      "      \"y\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('record(\"name\", {\"x\": int, \"y\": boolean})')"
   ]
  },
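  {
   "cell_type": "markdown",
   "id": "sketch-record-fields-0003",
   "metadata": {},
   "source": [
    "On the JSON side, going from Avro's list of `name`/`type` pairs to two equal-length lists could look like the sketch below. The helper name `split_fields` is hypothetical, and the field types are left unconverted here; a full converter would recurse on each one.\n",
    "\n",
    "```python\n",
    "def split_fields(avro_record):\n",
    "    # Avro keeps fields as an ordered list of objects with name and type.\n",
    "    names = [field['name'] for field in avro_record['fields']]\n",
    "    types = [field['type'] for field in avro_record['fields']]\n",
    "    return names, types\n",
    "\n",
    "schema = {\n",
    "    'type': 'record',\n",
    "    'name': 'name',\n",
    "    'fields': [\n",
    "        {'name': 'x', 'type': 'int'},\n",
    "        {'name': 'y', 'type': 'boolean'},\n",
    "    ],\n",
    "}\n",
    "names, types = split_fields(schema)\n",
    "print(names, types)\n",
    "```\n",
    "\n",
    "A record with zero fields simply yields two empty lists."
   ]
  },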
  {
   "cell_type": "markdown",
   "id": "85261fbb-7ab6-412a-97a1-57ea9a3c4d2a",
   "metadata": {},
   "source": [
    "The transformation function's recursion allows us to access the whole infinite space of Avro Schemas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "daa00c08-a575-41ff-9bf0-005ca4f07338",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RecordForm(\"name\", [NumpyForm(\"<i4\"), ListOffsetForm(NumpyForm(\"<f8\"))], [\"x\", \"y\"])\n",
      "\n",
      "start\n",
      "  record\n",
      "    \"name\"\n",
      "    fields\n",
      "      numpy\t\"<i4\"\n",
      "      listoffset\n",
      "        numpy\t\"<f8\"\n",
      "    names\n",
      "      \"x\"\n",
      "      \"y\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('record(\"name\", {\"x\": int, \"y\": array(double)})')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf2125ab-4177-4210-98d7-2970a8e94ca5",
   "metadata": {},
   "source": [
    "Remember that records can have zero fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "c7099ad3-b629-4a41-aa2d-ebf0f5aaaeef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RecordForm(\"empty\", [], [])\n",
      "\n",
      "start\n",
      "  record\n",
      "    \"empty\"\n",
      "    fields\n",
      "    names\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('record(\"empty\", {})')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "295bbd89-af4b-4e92-a9b0-23c32b9cd38e",
   "metadata": {},
   "source": [
    "Avro's `enum` is a strange case. We'd want to make that an Awkward `IndexedForm` of strings, but the strings themselves have to go into the data, rather than the type.\n",
    "\n",
    "That will have to be spliced in after the Awkward Array has been constructed.\n",
    "\n",
    "Also, we don't use the `enum` name. It exists so that a type can be created in Java, but in Awkward Array, a set of categorical strings is not a new type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "88bfd79a-e26f-430f-9b9e-cd89601aabaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Create a subtree of data containing strings: ['one', 'two', 'three']\n",
      "IndexedForm(ListOffsetForm(ListOffsetForm(NumpyForm(\"u1\"), string)), categorical)\n",
      "\n",
      "start\n",
      "  indexed\n",
      "    listoffset\n",
      "      listoffset\n",
      "        numpy\t\"u1\"\n",
      "        is_string\n",
      "    is_categorical\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('enum(\"unused name\", [\"one\", \"two\", \"three\"])')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee57a7d2-e5d6-46b9-898d-5db09b14da16",
   "metadata": {},
   "source": [
    "Avro's `map` is an Awkward list of records with two fields.\n",
    "\n",
    "Although in this case I've named the record `\"pair\"` and the fields `\"key\"` and `\"value\"`, we'll probably want to leave them unnamed. The `parameters={\"__array__\": \"map\"}` will be enough to indicate that it's a map."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "798b5e8b-36c1-4e6b-97df-8fd9db47dd38",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ListOffsetForm(RecordForm(\"pair\", [ListOffsetForm(NumpyForm(\"u1\"), string), NumpyForm(\"<i4\")], [\"key\", \"value\"]), map)\n",
      "\n",
      "start\n",
      "  listoffset\n",
      "    record\n",
      "      \"pair\"\n",
      "      fields\n",
      "        listoffset\n",
      "          numpy\t\"u1\"\n",
      "          is_string\n",
      "        numpy\t\"<i4\"\n",
      "      names\n",
      "        \"key\"\n",
      "        \"value\"\n",
      "    is_map\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('map(int)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e498dd7e-349d-4ef1-a00e-2e9505b069f6",
   "metadata": {},
   "source": [
    "Avro represents option-type data as a union of anything and `\"null\"`. The `\"null\"` type isn't useful outside of a union, so this has strange corner-cases.\n",
    "\n",
    "Awkward Array, however, has node types dedicated to option-type. We'd use `ByteMaskedForm` for every type except records, which get an `IndexedOptionForm` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "26297ff7-cdbc-4964-9223-41b4e63c86e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ByteMaskedForm(NumpyForm(\"<i4\"))\n",
      "\n",
      "start\n",
      "  bytemasked\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(null, int)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ba289ee-ec50-43dc-af34-5c21c7d1a6cc",
   "metadata": {},
   "source": [
    "We need to be insensitive to where `\"null\"` appears in the union. Avro unions preserve order, but the position of `\"null\"` is lost in the transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "80f4826d-139f-41e1-8618-1c8980e9b7ed",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ByteMaskedForm(NumpyForm(\"<i4\"))\n",
      "\n",
      "start\n",
      "  bytemasked\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(int, null)')"
   ]
  },
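  {
   "cell_type": "markdown",
   "id": "da89aa39-c032-441e-91af-25832d5b7854",
   "metadata": {},
   "source": [
    "This mini-language also accepts more than one `\"null\"` in a union, even though the Avro specification does not allow a union to repeat an unnamed type. The extra `\"null\"` adds no meaning, so it is simply ignored."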
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "0f128a00-8900-45f7-9916-a82a7f4415cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ByteMaskedForm(NumpyForm(\"<i4\"))\n",
      "\n",
      "start\n",
      "  bytemasked\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(null, int, null)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "562deff6-7105-4270-8929-6c375773d1f6",
   "metadata": {},
   "source": [
    "Remember that the other type can be anything. That's why the function has to recurse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "1edc3954-a46f-4ac8-b9a6-a5c2feada233",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ByteMaskedForm(ListOffsetForm(NumpyForm(\"u1\"), string))\n",
      "\n",
      "start\n",
      "  bytemasked\n",
      "    listoffset\n",
      "      numpy\t\"u1\"\n",
      "      is_string\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(null, string)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21c55f27-60f6-4bba-b035-ee11d8e44a92",
   "metadata": {},
   "source": [
    "If the other type is a record, we put it in an `IndexedOptionForm` instead of a `ByteMaskedForm`. In the long run, this is a simplifying choice (we don't have to invent fake record values to mask out), and it saves space (assuming an individual record is larger than 8 bytes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "1b99a9d4-e576-4b26-9a7e-9a64df1929bf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IndexedOptionForm(RecordForm(\"name\", [NumpyForm(\"<i4\"), NumpyForm(\"<f8\")], [\"x\", \"y\"]))\n",
      "\n",
      "start\n",
      "  indexedoption\n",
      "    record\n",
      "      \"name\"\n",
      "      fields\n",
      "        numpy\t\"<i4\"\n",
      "        numpy\t\"<f8\"\n",
      "      names\n",
      "        \"x\"\n",
      "        \"y\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(null, record(\"name\", {\"x\": int, \"y\": double}))')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b0f5315-c145-40fa-8715-06d628ca4f80",
   "metadata": {},
   "source": [
    "A union of just one type is degenerate. We'll eventually want to convert `\"union(int)\"` to just `\"int\"`, but that should happen after the Avro → Awkward transformation (by just calling `UnionArray.simplify_union()`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "90288195-744a-45d1-84c2-c0af726f1d86",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "UnionForm(NumpyForm(\"<i4\"))\n",
      "\n",
      "start\n",
      "  union\n",
      "    numpy\t\"<i4\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(int)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00781051-5f6f-4879-96ea-c3a7b4b76f33",
   "metadata": {},
   "source": [
    "When there are multiple types in the union, the transformation is straightforward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "320f100a-698b-4690-af7c-9d486083ec3f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "UnionForm(NumpyForm(\"<i4\"), NumpyForm(\"<f8\"), NumpyForm(\"bool\"))\n",
      "\n",
      "start\n",
      "  union\n",
      "    numpy\t\"<i4\"\n",
      "    numpy\t\"<f8\"\n",
      "    numpy\t\"bool\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(int, double, boolean)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01ce0170-cc08-4d5b-ae63-28fb8f889777",
   "metadata": {},
   "source": [
    "When one of them is `\"null\"`, we can put the `UnionForm` inside of a `ByteMaskedForm`.\n",
    "\n",
    "An alternative would be to turn this into `union(bytemasked(numpy(\"<i4\")), bytemasked(numpy(\"<f8\")), bytemasked(numpy(\"bool\")))`, which is what Arrow does. I don't know which is better. Either choice can be defined in the transformation function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "8ae2811b-c545-4aec-8cb9-7a8ccf3e2d5e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ByteMaskedForm(UnionForm(NumpyForm(\"<i4\"), NumpyForm(\"<f8\"), NumpyForm(\"bool\")))\n",
      "\n",
      "start\n",
      "  bytemasked\n",
      "    union\n",
      "      numpy\t\"<i4\"\n",
      "      numpy\t\"<f8\"\n",
      "      numpy\t\"bool\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "convert('union(null, int, double, boolean)')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}