Which orientation is the most memory-efficient for table-to-JSON transformation? A comparison of the different transformation orientations.
{
"cells": [
{
"cell_type": "markdown",
"id": "5f2121dd-4f24-4ef1-ab48-422fbb5530c4",
"metadata": {},
"source": [
"# The considerations when converting a 2D table to JSON format.\n",
"\n",
"When we try to convert a SQL database, which is a record based 2D data table, to a NoSQL database, which is document based JSON format, we shall consider different ways of representation. Because your choice will impact retrieval latency and memory space. [Pandas Documentation for to_json](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html)\n",
"\n",
"Using pandas.DataFrame.to_json, you can easily convert the data to json format. Let’s take a look at the different ways to cover. The key argument is “orient”. It allows several values: split, records, index, columns, values, table. \n",
"\n",
"\n",
"Let's make an example dataframe: "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "26963078-06db-43a4-9433-8da842116305",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import pandas as pd\n",
"import sys"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5a7fea7c-6f95-4f13-8975-01bf0af36b23",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" col 1 col 2\n",
"row 1 a b\n",
"row 2 c d\n"
]
}
],
"source": [
"df = pd.DataFrame(\n",
" [[\"a\", \"b\"], [\"c\", \"d\"]],\n",
" index=[\"row 1\", \"row 2\"],\n",
" columns=[\"col 1\", \"col 2\"],\n",
")\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "436b74de-105f-4334-8589-db58f9eeaf03",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"columns\": [\n",
" \"col 1\",\n",
" \"col 2\"\n",
" ],\n",
" \"index\": [\n",
" \"row 1\",\n",
" \"row 2\"\n",
" ],\n",
" \"data\": [\n",
" [\n",
" \"a\",\n",
" \"b\"\n",
" ],\n",
" [\n",
" \"c\",\n",
" \"d\"\n",
" ]\n",
" ]\n",
"}\n",
"Memory usage of split: 133 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"split\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
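{
"cell_type": "markdown",
"id": "note-getsizeof-vs-len",
"metadata": {},
"source": [
"**A note on the numbers:** `sys.getsizeof(result)` reports the size of the Python string object, which in CPython includes a fixed per-object overhead (roughly 49 bytes for an ASCII string) on top of the JSON text itself. If you only care about the size of the serialized payload, `len(result)` is the more direct measure. The cell below is a minimal sketch of the difference, using the same `df` and the hypothetical variable name `split_json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "demo-getsizeof-vs-len",
"metadata": {},
"outputs": [],
"source": [
"# Compare the JSON payload length with the Python string object size.\n",
"# In CPython, getsizeof(str) = payload length + a fixed object overhead for ASCII strings.\n",
"split_json = df.to_json(orient=\"split\")\n",
"print(\"len(split_json):          \", len(split_json))\n",
"print(\"sys.getsizeof(split_json):\", sys.getsizeof(split_json))"
]
},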
{
"cell_type": "code",
"execution_count": 15,
"id": "0545d0cd-ba51-4f58-98e1-3796f0f08369",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" {\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" {\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
"]\n",
"Memory usage of split: 102 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"records\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "markdown",
"id": "e1867723-a774-4d8c-a419-82f2774fca86",
"metadata": {},
"source": [
"\n",
"**\"Index\" option is basically use the row name as the key, and each row is a JSON object.It seems to be the most logical one to convert a 2D structure to JSON. A row is a record. and the primary key is the row name, which we can change to the column we want to make as primary key. And memory usage is managable**\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b0112c19-4248-44a4-9944-07f0913dbb46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"row 1\": {\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" \"row 2\": {\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
"}\n",
"Memory usage of split: 118 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"index\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "markdown",
"id": "62553a8e-162c-4198-ac74-63d3277cbc20",
"metadata": {},
"source": [
"**\"columns\" option is similar to \"index\", except it takes each column as a JSON object.**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "60d6c5d3-6736-45c2-93b1-91c8c697b917",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"col 1\": {\n",
" \"row 1\": \"a\",\n",
" \"row 2\": \"c\"\n",
" },\n",
" \"col 2\": {\n",
" \"row 1\": \"b\",\n",
" \"row 2\": \"d\"\n",
" }\n",
"}\n",
"Memory usage of split: 118 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"columns\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b0171589-295e-4833-ae93-4d17f623249e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" [\n",
" \"a\",\n",
" \"b\"\n",
" ],\n",
" [\n",
" \"c\",\n",
" \"d\"\n",
" ]\n",
"]\n",
"Memory usage of split: 70 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"values\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3b3a882e-ed53-4aa1-bf38-11f3cbfcc161",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"schema\": {\n",
" \"fields\": [\n",
" {\n",
" \"name\": \"index\",\n",
" \"type\": \"string\"\n",
" },\n",
" {\n",
" \"name\": \"col 1\",\n",
" \"type\": \"string\"\n",
" },\n",
" {\n",
" \"name\": \"col 2\",\n",
" \"type\": \"string\"\n",
" }\n",
" ],\n",
" \"primaryKey\": [\n",
" \"index\"\n",
" ],\n",
" \"pandas_version\": \"0.20.0\"\n",
" },\n",
" \"data\": [\n",
" {\n",
" \"index\": \"row 1\",\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" {\n",
" \"index\": \"row 2\",\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
" ]\n",
"}\n",
"Memory usage of split: 313 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"table\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
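{
"cell_type": "markdown",
"id": "orient-comparison-note",
"metadata": {},
"source": [
"**Putting it together:** to compare all orientations side by side, we can loop over the supported `orient` values and report the size of each serialization. This is a minimal sketch on the same toy dataframe; on real data the relative ordering can shift, because the repeated keys in `records`, `index`, and `columns` grow with the number of rows and columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "orient-comparison-loop",
"metadata": {},
"outputs": [],
"source": [
"# Serialize the same dataframe with every supported orientation and\n",
"# report the JSON payload length and the Python string object size.\n",
"for orient in [\"split\", \"records\", \"index\", \"columns\", \"values\", \"table\"]:\n",
"    result = df.to_json(orient=orient)\n",
"    print(\"{o:>8}: {n:>4} chars, {s:>4} bytes\".format(o=orient, n=len(result), s=sys.getsizeof(result)))"
]
},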
{
"cell_type": "code",
"execution_count": null,
"id": "1042642b-8bcc-4c6c-a1dc-7105a2c97366",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}