-
-
Save AndiH/4d4ef85e2dec395a0ae5343c648565eb to your computer and use it in GitHub Desktop.
{ | |
"page": 1, | |
"pages": 2270, | |
"limit": 10, | |
"total": 22693, | |
"items": [ | |
{ | |
"address": { | |
"city": "cityname first dataset", | |
"company_name": "companyname first dataset" | |
}, | |
"amount": 998, | |
"items": [ | |
{ | |
"description": "first part of first dataset", | |
"number": "part number of first part of first dataset" | |
} | |
], | |
"number": "number of first dataset", | |
"service_date": { | |
"type": "DEFAULT", | |
"date": "2015-11-18" | |
}, | |
"vat_option": null | |
}, | |
{ | |
"address": { | |
"city": "cityname second dataset", | |
"company_name": "companyname second dataset" | |
}, | |
"amount": 998, | |
"items": [ | |
{ | |
"description": "first part of second dataset", | |
"number": "part number of first part of second dataset" | |
}, | |
{ | |
"description": "second part of second dataset", | |
"number": "part number of second part of second dataset" | |
} | |
], | |
"number": "number of second dataset", | |
"service_date": { | |
"type": "DEFAULT", | |
"date": "2015-11-18" | |
}, | |
"vat_option": null | |
} | |
] | |
} |
item_address_city | item_address_company_name | items_amount | ||
---|---|---|---|---|
0 | cityname first dataset | companyname first dataset | 998 | |
1 | cityname second dataset | companyname second dataset | 998 |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Helping to solve StackOverflow 46724816 with Pandas" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import json" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Load your JSON file into variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"with open(\"input.json\") as f:\n", | |
" rawjson = json.load(f)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I want to convert your data to a dictonary (or a list of dictionary) to import into a Pandas DataFrame, visualize (and possible massage) it and export it to CSV.\n", | |
"\n", | |
"In case you only want to have a single element (`[0]`) you could setup your dictionary the following" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'item_address_city': 'cityname first dataset',\n", | |
" 'item_address_company_name': 'companyname first dataset',\n", | |
" 'items_amount': 998}" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"masterDict = {\n", | |
" \"item_address_city\": rawjson[\"items\"][0][\"address\"][\"city\"],\n", | |
" \"item_address_company_name\": rawjson[\"items\"][0][\"address\"][\"company_name\"],\n", | |
" \"items_amount\": rawjson[\"items\"][0][\"amount\"]\n", | |
"}\n", | |
"masterDict" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I reckon you don't want a single element but probably more of the `item`s. So let's create a list of dictionaries." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"data = []\n", | |
"for element in rawjson[\"items\"]:\n", | |
" data.append({\n", | |
" \"item_address_city\": element[\"address\"][\"city\"],\n", | |
" \"item_address_company_name\": element[\"address\"][\"company_name\"],\n", | |
" \"items_amount\": element[\"amount\"]\n", | |
" })" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This we can easily import into Pandas (and have a look at it)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>item_address_city</th>\n", | |
" <th>item_address_company_name</th>\n", | |
" <th>items_amount</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>cityname first dataset</td>\n", | |
" <td>companyname first dataset</td>\n", | |
" <td>998</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>cityname second dataset</td>\n", | |
" <td>companyname second dataset</td>\n", | |
" <td>998</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" item_address_city item_address_company_name items_amount\n", | |
"0 cityname first dataset companyname first dataset 998\n", | |
"1 cityname second dataset companyname second dataset 998" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df = pd.DataFrame(data)\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Finally, export to CSV, if still needed." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df.to_csv(\"output.csv\")" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
This here is a good start when using pandas, showing the flattening of arrays: https://medium.com/towards-data-science/flattening-json-objects-in-python-f5343c794b10
The result shown there after normalization is, that the hobbies are put each to an own column.
OK so far - got that.
I'd like to have only one column "hobbies" and then create new lines:
https://www.dropbox.com/s/nm24xqgvdrslkl0/john.jpg?dl=0
The this is, that the information like "John" or "Los Angeles" need to be copied to the next line then.
Whoooho... think I got it: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization
Now adapting it to my data set... keep you posted 💃
Okay, so far so good - in general, it is working, currently I have a slight problem with a "distinguishing prefix" - i think because I am in conflict with 'numbers' - seems to be used in pandas itself?
json_normalize is a great way to "massage" the json data - great tip using pandas!
Now also with working code example... :D
Thank you Andi! The loop helps a lot and will collect the data like I need it.
The dictionaries like address->city and address->company are now clear to me.
The problem now is the multiple item arrays like:
If I include that into the loop like:
"items_items": element["items"]["description"]
Of course that doesn’t work, because the whole "items": [ ] set is a string.