Skip to content

Instantly share code, notes, and snippets.

@AndiH
Last active October 13, 2017 16:07
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save AndiH/4d4ef85e2dec395a0ae5343c648565eb to your computer and use it in GitHub Desktop.
Save AndiH/4d4ef85e2dec395a0ae5343c648565eb to your computer and use it in GitHub Desktop.
{
"page": 1,
"pages": 2270,
"limit": 10,
"total": 22693,
"items": [
{
"address": {
"city": "cityname first dataset",
"company_name": "companyname first dataset"
},
"amount": 998,
"items": [
{
"description": "first part of first dataset",
"number": "part number of first part of first dataset"
}
],
"number": "number of first dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
},
{
"address": {
"city": "cityname second dataset",
"company_name": "companyname second dataset"
},
"amount": 998,
"items": [
{
"description": "first part of second dataset",
"number": "part number of first part of second dataset"
},
{
"description": "second part of second dataset",
"number": "part number of second part of second dataset"
}
],
"number": "number of second dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
}
]
}
item_address_city item_address_company_name items_amount
0 cityname first dataset companyname first dataset 998
1 cityname second dataset companyname second dataset 998
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helping to solve StackOverflow 46724816 with Pandas"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load your JSON file into variable"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with open(\"input.json\") as f:\n",
" rawjson = json.load(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I want to convert your data to a dictonary (or a list of dictionary) to import into a Pandas DataFrame, visualize (and possible massage) it and export it to CSV.\n",
"\n",
"In case you only want to have a single element (`[0]`) you could setup your dictionary the following"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'item_address_city': 'cityname first dataset',\n",
" 'item_address_company_name': 'companyname first dataset',\n",
" 'items_amount': 998}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"masterDict = {\n",
" \"item_address_city\": rawjson[\"items\"][0][\"address\"][\"city\"],\n",
" \"item_address_company_name\": rawjson[\"items\"][0][\"address\"][\"company_name\"],\n",
" \"items_amount\": rawjson[\"items\"][0][\"amount\"]\n",
"}\n",
"masterDict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I reckon you don't want a single element but probably more of the `item`s. So let's create a list of dictionaries."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = []\n",
"for element in rawjson[\"items\"]:\n",
" data.append({\n",
" \"item_address_city\": element[\"address\"][\"city\"],\n",
" \"item_address_company_name\": element[\"address\"][\"company_name\"],\n",
" \"items_amount\": element[\"amount\"]\n",
" })"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This we can easily import into Pandas (and have a look at it)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>item_address_city</th>\n",
" <th>item_address_company_name</th>\n",
" <th>items_amount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>cityname first dataset</td>\n",
" <td>companyname first dataset</td>\n",
" <td>998</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>cityname second dataset</td>\n",
" <td>companyname second dataset</td>\n",
" <td>998</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" item_address_city item_address_company_name items_amount\n",
"0 cityname first dataset companyname first dataset 998\n",
"1 cityname second dataset companyname second dataset 998"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(data)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, export to CSV, if still needed."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.to_csv(\"output.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@thowi84
Copy link

thowi84 commented Oct 13, 2017

Thank you Andi! The loop helps a lot and will collect the data like I need it.
The dictionaries like address->city and address->company are now clear to me.
The problem now is the multiple item arrays like:

      "items": [
        {
          "description": "first part of second dataset",
          "number": "part number of first part of second dataset"
        },
        {
          "description": "second part of second dataset",
          "number": "part number of second part of second dataset"
        }
      ]

If I include that into the loop like:
"items_items": element["items"]["description"]
Of course that doesn’t work, because the whole "items": [ ] set is a string.

@thowi84
Copy link

thowi84 commented Oct 13, 2017

This here is a good start when using pandas, showing the flattening of arrays: https://medium.com/towards-data-science/flattening-json-objects-in-python-f5343c794b10
The result shown there after normalization is, that the hobbies are put each to an own column.

OK so far - got that.
I'd like to have only one column "hobbies" and then create new lines:
https://www.dropbox.com/s/nm24xqgvdrslkl0/john.jpg?dl=0
The this is, that the information like "John" or "Los Angeles" need to be copied to the next line then.

@thowi84
Copy link

thowi84 commented Oct 13, 2017

Whoooho... think I got it: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization
Now adapting it to my data set... keep you posted 💃

@thowi84
Copy link

thowi84 commented Oct 13, 2017

Okay, so far so good - in general, it is working, currently I have a slight problem with a "distinguishing prefix" - i think because I am in conflict with 'numbers' - seems to be used in pandas itself?
json_normalize is a great way to "massage" the json data - great tip using pandas!

https://stackoverflow.com/questions/46731658/json-normalize-delivers-valueerror-need-of-distinguishing-prefix

Now also with working code example... :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment