Which orientation is the most memory-efficient for table-to-JSON transformation? A comparison of the different transformation orientations.
{
"cells": [
{
"cell_type": "markdown",
"id": "5f2121dd-4f24-4ef1-ab48-422fbb5530c4",
"metadata": {},
"source": [
"# The considerations when converting a 2D table to JSON format.\n",
"\n",
"When we try to convert a SQL database, which is a record based 2D data table, to a NoSQL database, which is document based JSON format, we shall consider different ways of representation. Because your choice will impact retrieval latency and memory space. [Pandas Documentation for to_json](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html)\n",
"\n",
"Using pandas.DataFrame.to_json, you can easily convert the data to json format. Let’s take a look at the different ways to cover. The key argument is “orient”. It allows several values: split, records, index, columns, values, table. \n",
"\n",
"\n",
"Let's make an example dataframe: "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "26963078-06db-43a4-9433-8da842116305",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import pandas as pd\n",
"import sys"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5a7fea7c-6f95-4f13-8975-01bf0af36b23",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" col 1 col 2\n",
"row 1 a b\n",
"row 2 c d\n"
]
}
],
"source": [
"df = pd.DataFrame(\n",
" [[\"a\", \"b\"], [\"c\", \"d\"]],\n",
" index=[\"row 1\", \"row 2\"],\n",
" columns=[\"col 1\", \"col 2\"],\n",
")\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "436b74de-105f-4334-8589-db58f9eeaf03",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"columns\": [\n",
" \"col 1\",\n",
" \"col 2\"\n",
" ],\n",
" \"index\": [\n",
" \"row 1\",\n",
" \"row 2\"\n",
" ],\n",
" \"data\": [\n",
" [\n",
" \"a\",\n",
" \"b\"\n",
" ],\n",
" [\n",
" \"c\",\n",
" \"d\"\n",
" ]\n",
" ]\n",
"}\n",
"Memory usage of split: 133 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"split\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
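{
"cell_type": "markdown",
"id": "note-getsizeof-vs-len",
"metadata": {},
"source": [
"**A note on the numbers:** `sys.getsizeof(result)` reports the size of the Python string object, which in CPython includes a fixed per-object overhead (roughly 49 bytes for an ASCII string) on top of the JSON text itself. If you only care about the size of the serialized payload, `len(result)` is the more direct measure. The cell below is a minimal sketch of the difference, using the same `df` and the hypothetical variable name `split_json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "demo-getsizeof-vs-len",
"metadata": {},
"outputs": [],
"source": [
"# Compare the JSON payload length with the Python string object size.\n",
"# In CPython, getsizeof(str) = payload length + a fixed object overhead for ASCII strings.\n",
"split_json = df.to_json(orient=\"split\")\n",
"print(\"len(split_json):          \", len(split_json))\n",
"print(\"sys.getsizeof(split_json):\", sys.getsizeof(split_json))"
]
},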
{
"cell_type": "code",
"execution_count": 15,
"id": "0545d0cd-ba51-4f58-98e1-3796f0f08369",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" {\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" {\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
"]\n",
"Memory usage of split: 102 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"records\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "markdown",
"id": "e1867723-a774-4d8c-a419-82f2774fca86",
"metadata": {},
"source": [
"\n",
"**\"Index\" option is basically use the row name as the key, and each row is a JSON object.It seems to be the most logical one to convert a 2D structure to JSON. A row is a record. and the primary key is the row name, which we can change to the column we want to make as primary key. And memory usage is managable**\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b0112c19-4248-44a4-9944-07f0913dbb46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"row 1\": {\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" \"row 2\": {\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
"}\n",
"Memory usage of split: 118 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"index\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "markdown",
"id": "62553a8e-162c-4198-ac74-63d3277cbc20",
"metadata": {},
"source": [
"**\"columns\" option is similar to \"index\", except it takes each column as a JSON object.**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "60d6c5d3-6736-45c2-93b1-91c8c697b917",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"col 1\": {\n",
" \"row 1\": \"a\",\n",
" \"row 2\": \"c\"\n",
" },\n",
" \"col 2\": {\n",
" \"row 1\": \"b\",\n",
" \"row 2\": \"d\"\n",
" }\n",
"}\n",
"Memory usage of split: 118 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"columns\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b0171589-295e-4833-ae93-4d17f623249e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" [\n",
" \"a\",\n",
" \"b\"\n",
" ],\n",
" [\n",
" \"c\",\n",
" \"d\"\n",
" ]\n",
"]\n",
"Memory usage of split: 70 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"values\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3b3a882e-ed53-4aa1-bf38-11f3cbfcc161",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"schema\": {\n",
" \"fields\": [\n",
" {\n",
" \"name\": \"index\",\n",
" \"type\": \"string\"\n",
" },\n",
" {\n",
" \"name\": \"col 1\",\n",
" \"type\": \"string\"\n",
" },\n",
" {\n",
" \"name\": \"col 2\",\n",
" \"type\": \"string\"\n",
" }\n",
" ],\n",
" \"primaryKey\": [\n",
" \"index\"\n",
" ],\n",
" \"pandas_version\": \"0.20.0\"\n",
" },\n",
" \"data\": [\n",
" {\n",
" \"index\": \"row 1\",\n",
" \"col 1\": \"a\",\n",
" \"col 2\": \"b\"\n",
" },\n",
" {\n",
" \"index\": \"row 2\",\n",
" \"col 1\": \"c\",\n",
" \"col 2\": \"d\"\n",
" }\n",
" ]\n",
"}\n",
"Memory usage of split: 313 bytes\n"
]
}
],
"source": [
"result = df.to_json(orient=\"table\")\n",
"parsed = json.loads(result)\n",
"print(json.dumps(parsed, indent=4) )\n",
"print(\"Memory usage of split: {mem_used} bytes\".format(mem_used=sys.getsizeof(result)))"
]
},
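{
"cell_type": "markdown",
"id": "orient-comparison-note",
"metadata": {},
"source": [
"**Putting it together:** to compare all orientations side by side, we can loop over the supported `orient` values and report the size of each serialization. This is a minimal sketch on the same toy dataframe; on real data the relative ordering can shift, because the repeated keys in `records`, `index`, and `columns` grow with the number of rows and columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "orient-comparison-loop",
"metadata": {},
"outputs": [],
"source": [
"# Serialize the same dataframe with every supported orientation and\n",
"# report the JSON payload length and the Python string object size.\n",
"for orient in [\"split\", \"records\", \"index\", \"columns\", \"values\", \"table\"]:\n",
"    result = df.to_json(orient=orient)\n",
"    print(\"{o:>8}: {n:>4} chars, {s:>4} bytes\".format(o=orient, n=len(result), s=sys.getsizeof(result)))"
]
},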
{
"cell_type": "code",
"execution_count": null,
"id": "1042642b-8bcc-4c6c-a1dc-7105a2c97366",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}