@tonyfast
Created April 15, 2021 18:10
{
"cells": [
{
"cell_type": "markdown",
"id": "possible-november",
"metadata": {},
"source": [
"# encoding jupyter notebooks as `jsonlines`\n",
"\n",
"the current notebook format only works by loading an entire document at once. this limits performance and user experience when loading large documents. in this document, we consider an alternative serialization of the notebook format, specifically the cell format, represented as `jsonlines`.\n",
"\n",
"`jsonlines` is a line delimited `json` format, where each line is a json object. effectively, `jsonlines` represents a list of `json` objects.\n",
"\n",
"this approach takes advantage of the new `id` feature in the notebook cell format. specifically, we will use `uuid1` to capture timestamp metadata about the execution. we get the added benefit of an id and temporal metadata."
]
},
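{
"cell_type": "markdown",
"id": "sketch-jsonlines-roundtrip",
"metadata": {},
"source": [
"a minimal sketch of the idea, separate from the workflow in this notebook: a list of `json` objects is written one per line and read back one line at a time. the `records` list is made-up sample data.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# hypothetical sample records standing in for notebook cells\n",
"records = [{\"id\": \"a\", \"source\": \"1 + 1\"}, {\"id\": \"b\", \"source\": \"print(2)\"}]\n",
"\n",
"# with default settings json.dumps escapes newlines, so each object stays on one line\n",
"text = \"\\n\".join(map(json.dumps, records))\n",
"\n",
"# each line parses on its own, independent of the lines around it\n",
"assert [json.loads(line) for line in text.splitlines()] == records\n",
"```"
]
},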
{
"cell_type": "code",
"execution_count": 1,
"id": "engaging-window",
"metadata": {},
"outputs": [],
"source": [
" import typing, nbformat, json, uuid, pathlib, freezegun, datetime, bz2, gzip, pandas, operator, collections, frozendict"
]
},
{
"cell_type": "markdown",
"id": "short-israeli",
"metadata": {},
"source": [
"for this discussion, we use this essay `\"lines-nb-format.ipynb\"` as test data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "portable-average",
"metadata": {},
"outputs": [],
"source": [
"with open(\"lines-nb-format.ipynb\") as file:\n",
"    original: nbformat.NotebookNode = nbformat.v4.reads(file.read())"
]
},
{
"cell_type": "markdown",
"id": "broke-triumph",
"metadata": {},
"source": [
"we are going to replace the existing cell ids because their values are effectively meaningless. by adding a `uuid1` id we now have ids that serve the dual purpose of timekeeping and identification. time tracking can be turned off by using a different `uuid` format."
]
},
{
"cell_type": "markdown",
"id": "structured-marine",
"metadata": {},
"source": [
"we'll use `freezegun` to simulate cell execution at different times, and replace our existing ids with proper uuids."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "golden-sunglasses",
"metadata": {},
"outputs": [],
"source": [
"for i, cell in enumerate(original[\"cells\"]):\n",
"    # freeze the clock at a different minute for each cell to simulate execution times\n",
"    @freezegun.freeze_time(datetime.datetime(2021, 4, 15, 11, i))\n",
"    def frozen_uuid():\n",
"        return str(uuid.uuid1())\n",
"\n",
"    original[\"cells\"][i] = {**cell, \"id\": frozen_uuid()}"
]
},
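{
"cell_type": "markdown",
"id": "sketch-uuid1-timestamp",
"metadata": {},
"source": [
"the ids assigned above double as timestamps. as a minimal sketch, assuming a hypothetical helper named `uuid1_datetime` that is not part of this notebook's workflow, the time can be read back from a `uuid1` string. the `time` attribute of a `uuid.UUID` counts 100-nanosecond intervals since `datetime(1582, 10, 15)`.\n",
"\n",
"```python\n",
"import datetime, uuid\n",
"\n",
"# uuid1 timestamps count 100-nanosecond intervals from this date\n",
"epoch = datetime.datetime(1582, 10, 15)\n",
"\n",
"def uuid1_datetime(value: str) -> datetime.datetime:\n",
"    \"\"\"hypothetical helper: recover the utc time stored in a uuid1 string\"\"\"\n",
"    parsed = uuid.UUID(value)\n",
"    if parsed.version != 1:\n",
"        raise ValueError(\"only uuid1 ids carry a timestamp\")\n",
"    return epoch + datetime.timedelta(microseconds=parsed.time // 10)\n",
"\n",
"# a freshly generated uuid1 should decode to roughly the current utc time\n",
"now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)\n",
"assert abs(uuid1_datetime(str(uuid.uuid1())) - now) < datetime.timedelta(seconds=5)\n",
"```"
]
},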
{
"cell_type": "markdown",
"id": "pressing-characteristic",
"metadata": {},
"source": [
"the lead cell uses the special [NIL uuid]. read as a `uuid1` timestamp, this value refers to the date `datetime(1582, 10, 15)`; the date of the Gregorian reform to the Christian calendar. we will use this special identifier to hold the metadata of the notebook. the notebook metadata goes in the cell's `metadata` field because no additional properties are allowed at the top level of a cell.\n",
"\n",
"[NIL uuid]: https://tools.ietf.org/html/rfc4122#section-4.1.7"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "balanced-football",
"metadata": {},
"outputs": [],
"source": [
" nil = \"00000000-0000-0000-0000-000000000000\""
]
},
{
"cell_type": "markdown",
"id": "constant-match",
"metadata": {},
"source": [
"the statement below converts all of our `original` cells into a single string in which every line is valid `json`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "immune-dominican",
"metadata": {},
"outputs": [],
"source": [
"jsonlines: typing.Text = \"\\n\".join(\n",
"    map(\n",
"        json.dumps, [\n",
"            nbformat.v4.new_code_cell(id=nil, metadata=original[\"metadata\"])\n",
"        ] + original[\"cells\"]\n",
"    )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "incident-honey",
"metadata": {},
"source": [
"from the string representation we can create a `replica` by loading each line as json."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "aboriginal-parts",
"metadata": {},
"outputs": [],
"source": [
"replica: nbformat.NotebookNode = nbformat.v4.new_notebook(cells=list(\n",
"    map(json.loads, jsonlines.splitlines())\n",
"))"
]
},
{
"cell_type": "markdown",
"id": "fatal-tutorial",
"metadata": {},
"source": [
"our replica notebook has one extra cell relative to the `original` notebook. this is because we added the `nil` `uuid` to represent our notebook's metadata."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "permanent-delay",
"metadata": {},
"outputs": [],
"source": [
" assert len(replica.cells) == len(original.cells) + 1"
]
},
{
"cell_type": "markdown",
"id": "crude-surname",
"metadata": {},
"source": [
"our last task is to merge the `jsonlines` metadata representation with the conventional `nbformat`. we do this by promoting the metadata from the cell with the `nil` `uuid` to the notebook metadata."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "backed-tucson",
"metadata": {},
"outputs": [],
"source": [
"def reconstitute_metadata_from_nil(nb):\n",
"    \"\"\"perform an in place modification of the notebook metadata\"\"\"\n",
"    for i, cell in enumerate(nb[\"cells\"]):\n",
"        if cell[\"id\"] == nil:\n",
"            cell = nb[\"cells\"].pop(i)\n",
"            nb[\"metadata\"].update(cell[\"metadata\"])\n",
"            break\n",
"    return nb\n",
"reconstitute_metadata_from_nil(replica); # in place operation"
]
},
{
"cell_type": "markdown",
"id": "digital-generator",
"metadata": {},
"source": [
"from `jsonlines` we can recover the original with little custom work. the only extra effort is to promote the metadata of the cell with the `nil` id to the notebook metadata."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "labeled-nylon",
"metadata": {},
"outputs": [],
"source": [
" assert replica == original"
]
},
{
"cell_type": "markdown",
"id": "approved-surprise",
"metadata": {},
"source": [
"## on disk formats"
]
},
{
"cell_type": "markdown",
"id": "worthy-violin",
"metadata": {},
"source": [
"it is recommended to use the `.jsonl` extension for `jsonlines`"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "increased-solomon",
"metadata": {},
"outputs": [],
"source": [
"with open(\"lines-nb-format.ipynb.jsonl\", \"w\") as file:\n",
"    file.write(jsonlines)"
]
},
{
"cell_type": "markdown",
"id": "curious-embassy",
"metadata": {},
"source": [
"`jsonlines` can also use `bz2` or `gzip` compression"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "black-cyprus",
"metadata": {},
"outputs": [],
"source": [
"with bz2.open(\"lines-nb-format.ipynb.jsonl.bz2\", \"w\") as file:\n",
"    file.write(jsonlines.encode(\"utf-8\"))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "common-guitar",
"metadata": {},
"outputs": [],
"source": [
"with gzip.open(\"lines-nb-format.ipynb.jsonl.gz\", \"w\") as file:\n",
"    file.write(jsonlines.encode(\"utf-8\"))"
]
},
{
"cell_type": "markdown",
"id": "consistent-winner",
"metadata": {},
"source": [
"## relative on disk sizes of formats"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "photographic-luther",
"metadata": {},
"outputs": [],
"source": [
" files = list(pathlib.Path().glob(\"lines-nb-format.ipynb*\"))\n",
" size = pandas.Series(data=files, index=files, name=\"kb\")"
]
},
{
"cell_type": "markdown",
"id": "postal-affiliate",
"metadata": {},
"source": [
"the table below shows the relative on disk sizes of the different formats."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "coral-boating",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>kb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>lines-nb-format.ipynb.jsonl.bz2</th>\n",
" <td>3.480469</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lines-nb-format.ipynb.jsonl.gz</th>\n",
" <td>3.559570</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lines-nb-format.ipynb.jsonl</th>\n",
" <td>12.794922</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lines-nb-format.ipynb</th>\n",
" <td>14.484375</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                                        kb\n",
"lines-nb-format.ipynb.jsonl.bz2   3.480469\n",
"lines-nb-format.ipynb.jsonl.gz    3.559570\n",
"lines-nb-format.ipynb.jsonl      12.794922\n",
"lines-nb-format.ipynb            14.484375"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" size.apply(pathlib.PosixPath.stat).apply(\n",
" operator.attrgetter(\"st_size\")).sort_values().divide(2**10).to_frame()"
]
},
{
"cell_type": "markdown",
"id": "detected-sacramento",
"metadata": {},
"source": [
"## a comparison of the content of the files"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "extended-retreat",
"metadata": {},
"outputs": [],
"source": [
" from_disk = collections.defaultdict(nbformat.v4.new_notebook)"
]
},
{
"cell_type": "markdown",
"id": "opposed-lawrence",
"metadata": {},
"source": [
"the original document has to be loaded all at once."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "cutting-alexander",
"metadata": {},
"outputs": [],
"source": [
"with open(\"lines-nb-format.ipynb\") as file:\n",
"    from_disk[file.name].update(json.loads(file.read()))\n",
"\n",
"from_disk[file.name] = original # because this is the one with the uuid ids"
]
},
{
"cell_type": "markdown",
"id": "right-shopping",
"metadata": {},
"source": [
"meanwhile, the jsonlines format can load in cells line by line."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "inside-wales",
"metadata": {},
"outputs": [],
"source": [
"with open(\"lines-nb-format.ipynb.jsonl\") as file:\n",
"    for line in file:\n",
"        from_disk[file.name][\"cells\"].append(json.loads(line))"
]
},
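{
"cell_type": "markdown",
"id": "sketch-lazy-cell-reader",
"metadata": {},
"source": [
"because each line stands alone, a reader does not have to parse every cell up front. a minimal sketch, assuming a hypothetical generator named `iter_cells` and made-up in-memory data in place of the file:\n",
"\n",
"```python\n",
"import io, itertools, json\n",
"\n",
"def iter_cells(lines):\n",
"    \"\"\"hypothetical helper: lazily yield one cell per non-empty line\"\"\"\n",
"    for line in lines:\n",
"        if line.strip():\n",
"            yield json.loads(line)\n",
"\n",
"# a made-up in-memory stand-in for an open `.jsonl` file\n",
"stream = io.StringIO('{\"id\": \"a\"}\\n{\"id\": \"b\"}\\n{\"id\": \"c\"}\\n')\n",
"\n",
"# only the first two lines are parsed; the third is never decoded\n",
"first_two = list(itertools.islice(iter_cells(stream), 2))\n",
"assert first_two == [{\"id\": \"a\"}, {\"id\": \"b\"}]\n",
"```"
]
},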
{
"cell_type": "markdown",
"id": "missing-ancient",
"metadata": {},
"source": [
"and there is a polymorphic api for loading lines of json from compressed files."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "separate-compiler",
"metadata": {},
"outputs": [],
"source": [
"name = \"lines-nb-format.ipynb.jsonl.bz2\"\n",
"with bz2.open(name) as file:\n",
"    for line in file:\n",
"        from_disk[name][\"cells\"].append(json.loads(line))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "former-entrance",
"metadata": {},
"outputs": [],
"source": [
"name = \"lines-nb-format.ipynb.jsonl.gz\"\n",
"with gzip.open(name) as file:\n",
"    for line in file:\n",
"        from_disk[name][\"cells\"].append(json.loads(line))"
]
},
{
"cell_type": "markdown",
"id": "preliminary-assumption",
"metadata": {},
"source": [
"remember we need to take our `nil` `uuid` and make that the notebook metadata."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "changing-workplace",
"metadata": {},
"outputs": [],
"source": [
"for value in from_disk.values():\n",
"    reconstitute_metadata_from_nil(value)"
]
},
{
"cell_type": "markdown",
"id": "imported-performance",
"metadata": {},
"source": [
"we can now test that all of the forms of the notebook are the same."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "front-bishop",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
".\n",
"----------------------------------------------------------------------\n",
"Ran 1 test in 0.001s\n",
"\n",
"OK\n"
]
}
],
"source": [
"import unittest\n",
"\n",
"class Compose(unittest.TestCase):\n",
"    def test_compare_files(x):\n",
"        x.assertDictEqual(\n",
"            from_disk['lines-nb-format.ipynb'],\n",
"            from_disk['lines-nb-format.ipynb.jsonl']\n",
"        )\n",
"        x.assertDictEqual(\n",
"            from_disk['lines-nb-format.ipynb'],\n",
"            from_disk['lines-nb-format.ipynb.jsonl.gz']\n",
"        )\n",
"        x.assertDictEqual(\n",
"            from_disk['lines-nb-format.ipynb'],\n",
"            from_disk['lines-nb-format.ipynb.jsonl.bz2']\n",
"        )\n",
"\n",
"unittest.main(argv=[\"\"], exit=False);"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}