Skip to content

Instantly share code, notes, and snippets.

@psychemedia
Last active June 12, 2023 08:34
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save psychemedia/7cf7cf56f3178126df4e7a29d8621623 to your computer and use it in GitHub Desktop.
Save psychemedia/7cf7cf56f3178126df4e7a29d8621623 to your computer and use it in GitHub Desktop.
Convert a pandas dataframe describing a hierarchy to a JSON tree
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "a9b57a94",
"metadata": {},
"source": [
"# Utility — Table to Tree\n",
"\n",
"This scripts in this notebook are utility scripts for generating a hierarchical JSON tree structure from a *pandas* dataframe.\n",
"\n",
"Consider the following dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "28ad4f4b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>WD17CD</th>\n",
" <th>WD17NM</th>\n",
" <th>LAD17CD</th>\n",
" <th>LAD17NM</th>\n",
" <th>GOR10CD</th>\n",
" <th>GOR10NM</th>\n",
" <th>CTRY17CD</th>\n",
" <th>CTRY17NM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>E05001678</td>\n",
" <td>Newington</td>\n",
" <td>E06000010</td>\n",
" <td>Kingston upon Hull, City of</td>\n",
" <td>E12000003</td>\n",
" <td>Yorkshire and The Humber</td>\n",
" <td>E92000001</td>\n",
" <td>England</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>E05001779</td>\n",
" <td>Mickleover</td>\n",
" <td>E06000015</td>\n",
" <td>Derby</td>\n",
" <td>E12000004</td>\n",
" <td>East Midlands</td>\n",
" <td>E92000001</td>\n",
" <td>England</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>E05001628</td>\n",
" <td>Higher Croft</td>\n",
" <td>E06000008</td>\n",
" <td>Blackburn with Darwen</td>\n",
" <td>E12000002</td>\n",
" <td>North West</td>\n",
" <td>E92000001</td>\n",
" <td>England</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>E05008942</td>\n",
" <td>Burn Valley</td>\n",
" <td>E06000001</td>\n",
" <td>Hartlepool</td>\n",
" <td>E12000001</td>\n",
" <td>North East</td>\n",
" <td>E92000001</td>\n",
" <td>England</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>E05001679</td>\n",
" <td>Newland</td>\n",
" <td>E06000010</td>\n",
" <td>Kingston upon Hull, City of</td>\n",
" <td>E12000003</td>\n",
" <td>Yorkshire and The Humber</td>\n",
" <td>E92000001</td>\n",
" <td>England</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" WD17CD WD17NM LAD17CD LAD17NM GOR10CD \\\n",
"0 E05001678 Newington E06000010 Kingston upon Hull, City of E12000003 \n",
"1 E05001779 Mickleover E06000015 Derby E12000004 \n",
"2 E05001628 Higher Croft E06000008 Blackburn with Darwen E12000002 \n",
"3 E05008942 Burn Valley E06000001 Hartlepool E12000001 \n",
"4 E05001679 Newland E06000010 Kingston upon Hull, City of E12000003 \n",
"\n",
" GOR10NM CTRY17CD CTRY17NM \n",
"0 Yorkshire and The Humber E92000001 England \n",
"1 East Midlands E92000001 England \n",
"2 North West E92000001 England \n",
"3 North East E92000001 England \n",
"4 Yorkshire and The Humber E92000001 England "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"wards_csv_fn = 'wards_data.csv'\n",
"\n",
"wards_df = pd.read_csv('wards_data.csv')\n",
"wards_df.head()"
]
},
{
"cell_type": "markdown",
"id": "7f47b4bd",
"metadata": {},
"source": [
"Regions are contained within countries, local authorities within regions, wards within local authorities.\n",
"\n",
"The Python [`treelib`](https://github.com/caesar0301/treelib) provides support for creating simple tree structures. The following explicit code fragment will generate a `treelib.Tree()` object from a dataframe when passed the column names corresponding to the ID value and label required for each level of the tree:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "347e4931",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Country\n",
"└── England\n",
" ├── East Midlands\n",
" │ └── Derby\n",
" │ └── Mickleover\n",
" ├── North East\n",
" │ └── Hartlepool\n",
" │ └── Burn Valley\n",
" ├── North West\n",
" │ └── Blackburn with Darwen\n",
" │ └── Higher Croft\n",
" └── Yorkshire and The Humber\n",
" └── Kingston upon Hull, City of\n",
" ├── Newington\n",
" └── Newland\n",
"\n"
]
}
],
"source": [
"#%pip install treelib\n",
"from treelib import Tree\n",
"\n",
"country_tree = Tree()\n",
"# Create a root node\n",
"country_tree.create_node(\"Country\", \"countries\")\n",
"\n",
"# Group by country\n",
"for country, regions in wards_df.head(5).groupby([\"CTRY17NM\", \"CTRY17CD\"]):\n",
" # Generate a node for each country\n",
" country_tree.create_node(country[0], country[1], parent=\"countries\")\n",
" # Group by region\n",
" for region, las in regions.groupby([\"GOR10NM\", \"GOR10CD\"]):\n",
" # Generate a node for each region\n",
" country_tree.create_node(region[0], region[1], parent=country[1])\n",
" # Group by local authority\n",
" for la, wards in las.groupby(['LAD17NM', 'LAD17CD']):\n",
" # Create a node for each local authority\n",
" country_tree.create_node(la[0], la[1], parent=region[1])\n",
" for ward, _ in wards.groupby(['WD17NM', 'WD17CD']):\n",
" # Create a leaf node for each ward\n",
" country_tree.create_node(ward[0], ward[1], parent=la[1])\n",
"\n",
"# Output the hierarchical data\n",
"country_tree.show()"
]
},
{
"cell_type": "markdown",
"id": "2b9be398",
"metadata": {},
"source": [
"Whilst the code works, it is a little messy. More generally, we can create a recursive function to traverse the tree for us:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "d59a75f3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting table2tree.py\n"
]
}
],
"source": [
"%%writefile table2tree.py\n",
"from treelib import Tree\n",
"\n",
"def create_tree(df, items, parent, root=None, tree=None, i=0):\n",
" \"\"\"Create a tree from a dataframe.\"\"\"\n",
" if tree is None:\n",
" tree = Tree()\n",
" root = root if root else parent\n",
" tree.create_node(root, parent)\n",
"\n",
" i = i + 1\n",
"\n",
" for parental, group_df in df.groupby(items[i-1]):\n",
" tree.create_node(parental[0], parental[1], parent=parent)\n",
" if i <= len(items)-1: \n",
" create_tree(group_df, items, parental[1], tree=tree, i=i)\n",
" \n",
" return tree"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "898eec1c",
"metadata": {},
"outputs": [],
"source": [
"# Run the file as if we had run the code cell\n",
"%run table2tree.py"
]
},
{
"cell_type": "markdown",
"id": "1289ee84",
"metadata": {},
"source": [
"We can now specify a list of column pairs (label and ID) for each level of the tree and generate the tree from that:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "188882c2",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Country\n",
"└── England\n",
" ├── East Midlands\n",
" │ └── Derby\n",
" │ ├── Mickleover\n",
" │ ├── Normanton\n",
" │ └── Oakwood\n",
" ├── North East\n",
" │ └── Hartlepool\n",
" │ └── Burn Valley\n",
" ├── North West\n",
" │ └── Blackburn with Darwen\n",
" │ ├── Higher Croft\n",
" │ ├── Little Harwood\n",
" │ └── Livesey with Pleasington\n",
" └── Yorkshire and The Humber\n",
" └── Kingston upon Hull, City of\n",
" ├── Newington\n",
" ├── Newland\n",
" └── Orchard Park and Greenwood\n",
"\n"
]
}
],
"source": [
"# The items specify the label and the ID columns for each node in the tree\n",
"items = [[\"CTRY17NM\", \"CTRY17CD\"],\n",
" [\"GOR10NM\", \"GOR10CD\"],\n",
" ['LAD17NM', 'LAD17CD'],\n",
" ['WD17NM', 'WD17CD']]\n",
"\n",
"tree = create_tree(wards_df.head(10), items, 'countries', 'Country' )\n",
"\n",
"tree.show()"
]
},
{
"cell_type": "markdown",
"id": "e4b8a26f",
"metadata": {},
"source": [
"We can also export the tree as a JSON file:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "ba42bab4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Country': {'children': [{'England': {'children': [{'East Midlands': {'children': [{'Derby': {'children': ['Mickleover',\n",
" 'Normanton',\n",
" 'Oakwood']}}]}},\n",
" {'North East': {'children': [{'Hartlepool': {'children': ['Burn Valley']}}]}},\n",
" {'North West': {'children': [{'Blackburn with Darwen': {'children': ['Higher Croft',\n",
" 'Little Harwood',\n",
" 'Livesey with Pleasington']}}]}},\n",
" {'Yorkshire and The Humber': {'children': [{'Kingston upon Hull, City of': {'children': ['Newington',\n",
" 'Newland',\n",
" 'Orchard Park and Greenwood']}}]}}]}}]}}"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import json\n",
"\n",
"tree_json = json.loads(tree.to_json())\n",
"tree_json"
]
},
{
"cell_type": "markdown",
"id": "ac054ad6",
"metadata": {},
"source": [
"The format of the JSON has interstitial `children` elements that make be convenient in some cases, but that may be surplus to requirements in other cases.\n",
"\n",
"Naively, and explicitly, we could start to remove these elements from the tree using something like following code snippet:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "59dfc543",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Country': {'England': {'East Midlands': [{'Derby': {'children': ['Mickleover',\n",
" 'Normanton',\n",
" 'Oakwood']}}],\n",
" 'North East': [{'Hartlepool': {'children': ['Burn Valley']}}],\n",
" 'North West': [{'Blackburn with Darwen': {'children': ['Higher Croft',\n",
" 'Little Harwood',\n",
" 'Livesey with Pleasington']}}],\n",
" 'Yorkshire and The Humber': [{'Kingston upon Hull, City of': {'children': ['Newington',\n",
" 'Newland',\n",
" 'Orchard Park and Greenwood']}}]}}}"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tmp_pruned_tree = {'Country':{}}\n",
"\n",
"for region in tree_json['Country']['children']:\n",
" for region_key in region.keys():\n",
" tmp_pruned_tree['Country'][region_key] = {}\n",
" for la in region[region_key]['children']:\n",
" for la_key in la.keys():\n",
" tmp_pruned_tree['Country'][region_key][la_key] = la[la_key]['children']\n",
"\n",
"\n",
"tmp_pruned_tree"
]
},
{
"cell_type": "markdown",
"id": "470da9e0",
"metadata": {},
"source": [
"Once again, we can take inspiration from the literal code to come up with a recursive function that will prune the child nodes for us for any depth tree:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "a5f220e5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Appending to table2tree.py\n"
]
}
],
"source": [
"%%writefile -a table2tree.py\n",
"\n",
"def prune_tree(tree, pruned=None, path=None):\n",
" \"\"\"Prune 'children' nodes from tree.\"\"\"\n",
" \n",
" # Create a new pruned tree if we haven't yet started...\n",
" pruned = {} if pruned is None else pruned\n",
"\n",
" # Convert the tree to a dict if it isn't already in dict form\n",
" if isinstance(tree, type(Tree())):\n",
" tree = json.loads(tree.to_json())\n",
" \n",
" # Get the first (root) node\n",
" path = path if path else next(iter(tree))\n",
" \n",
" # This will be our pruned tree dictionary\n",
" pruned[path] = {}\n",
" \n",
" # Now start to check the subtrees...\n",
" for subtree in tree[path]['children']:\n",
" # If we find into another subtree...\n",
" if isinstance(subtree, dict):\n",
" # Descend into it...\n",
" for subtree_key in subtree.keys():\n",
" # Create a new key node for this subtree\n",
" pruned[path][subtree_key] = {}\n",
" # And carry on pruning down into the tree\n",
" prune_tree(subtree, pruned[path], subtree_key)\n",
" else:\n",
" # We've reached the leaves which add as a list\n",
" pruned[path] = tree[path]['children']\n",
" \n",
" return pruned"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "749cabec",
"metadata": {},
"outputs": [],
"source": [
"# Run the file as if we had run the code cell\n",
"%run table2tree.py"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "a4122d3a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Country': {'England': {'East Midlands': {'Derby': ['Mickleover',\n",
" 'Normanton',\n",
" 'Oakwood']},\n",
" 'North East': {'Hartlepool': ['Burn Valley']},\n",
" 'North West': {'Blackburn with Darwen': ['Higher Croft',\n",
" 'Little Harwood',\n",
" 'Livesey with Pleasington']},\n",
" 'Yorkshire and The Humber': {'Kingston upon Hull, City of': ['Newington',\n",
" 'Newland',\n",
" 'Orchard Park and Greenwood']}}}}"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pruned_tree = prune_tree(tree_json)\n",
"pruned_tree"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment