Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save danyx23/a9491fb86a63514e6f4b4ee7ac03cb6c to your computer and use it in GitHub Desktop.
Save danyx23/a9491fb86a63514e6f4b4ee7ac03cb6c to your computer and use it in GitHub Desktop.
Jupyter notebook for investigating grapher json configs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"source": [
"## Investigate json config unique values\n",
"\n",
"Working on the json schema for the schema config. For this I often want to know what are the actually used values in our graphers for certain fields - this is what we do here\n",
"\n",
"This notebook uses Lars' [mysql jupyter notebook magic](https://gist.github.com/larsyencken/935452e59246a58c647dd3953de2ab07)"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 95,
"source": [
"import pandas as pd\n",
"import json\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 96,
"source": [
"# Declare the default dataframe so autocomplete works\n",
"df : pd.DataFrame = pd.DataFrame()"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Grab all charts from the database\n",
"This requires the mysql magic to be set up and access configured to a running Mysql db filled with grapher configs"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 97,
"source": [
"%%mysql\n",
"select * from charts"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 98,
"source": [
"df"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" id config \\\n",
"0 16 {\"id\": 16, \"map\": {\"colorScale\": {\"baseColorSc... \n",
"1 20 {\"id\": 20, \"map\": {\"colorScale\": {\"baseColorSc... \n",
"2 26 {\"id\": 26, \"map\": {\"colorScale\": {\"baseColorSc... \n",
"3 27 {\"id\": 27, \"map\": {\"colorScale\": {\"baseColorSc... \n",
"4 29 {\"id\": 29, \"map\": {\"mode\": \"specific\", \"maxYea... \n",
"... ... ... \n",
"4184 5068 {\"map\": {\"colorScale\": {\"baseColorScheme\": \"Yl... \n",
"4185 5069 {\"map\": {\"colorScale\": {\"baseColorScheme\": \"Yl... \n",
"4186 5070 {\"data\": {\"availableEntities\": [\"United Kingdo... \n",
"4187 5071 {\"id\": 5071, \"data\": {\"availableEntities\": [\"U... \n",
"4188 5072 {\"id\": 5072, \"data\": {\"availableEntities\": [\"S... \n",
"\n",
" createdAt updatedAt lastEditedAt starred \\\n",
"0 2015-06-30 08:11:02 2019-01-16 14:00:48 2019-01-16 14:00:48 0 \n",
"1 2015-07-02 07:05:32 2019-02-08 19:56:21 2019-02-08 19:56:21 0 \n",
"2 2015-07-07 18:57:10 2019-10-26 17:05:41 2019-10-26 17:05:41 0 \n",
"3 2015-07-07 19:07:38 2018-03-15 18:12:48 2017-10-11 06:47:31 0 \n",
"4 2015-07-09 18:22:14 2018-03-15 18:12:48 2016-09-01 21:29:30 0 \n",
"... ... ... ... ... \n",
"4184 2021-07-01 14:18:50 2021-07-01 14:18:50 2021-07-01 14:18:50 0 \n",
"4185 2021-07-01 14:23:15 2021-07-01 14:23:15 2021-07-01 14:23:15 0 \n",
"4186 2021-07-12 18:07:12 2021-07-12 18:07:12 2021-07-12 18:07:12 0 \n",
"4187 2021-07-12 18:07:16 2021-07-14 19:15:22 2021-07-14 19:15:22 0 \n",
"4188 2021-07-19 12:18:25 2021-07-19 13:03:00 2021-07-19 13:03:00 0 \n",
"\n",
" publishedAt lastEditedByUserId publishedByUserId is_indexable \\\n",
"0 2015-06-30 08:11:02 14 NaN 1 \n",
"1 2015-07-02 07:05:32 9 NaN 1 \n",
"2 2015-07-07 18:57:10 19 NaN 1 \n",
"3 2015-07-07 19:07:38 9 NaN 1 \n",
"4 2015-07-09 18:22:14 7 NaN 1 \n",
"... ... ... ... ... \n",
"4184 2021-07-01 14:18:50 14 14.0 1 \n",
"4185 2021-07-01 14:23:15 14 14.0 1 \n",
"4186 NaT 57 NaN 0 \n",
"4187 2021-07-12 18:07:16 57 57.0 0 \n",
"4188 NaT 57 NaN 0 \n",
"\n",
" isExplorable \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"... ... \n",
"4184 0 \n",
"4185 0 \n",
"4186 0 \n",
"4187 0 \n",
"4188 0 \n",
"\n",
"[4189 rows x 11 columns]"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>config</th>\n",
" <th>createdAt</th>\n",
" <th>updatedAt</th>\n",
" <th>lastEditedAt</th>\n",
" <th>starred</th>\n",
" <th>publishedAt</th>\n",
" <th>lastEditedByUserId</th>\n",
" <th>publishedByUserId</th>\n",
" <th>is_indexable</th>\n",
" <th>isExplorable</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16</td>\n",
" <td>{\"id\": 16, \"map\": {\"colorScale\": {\"baseColorSc...</td>\n",
" <td>2015-06-30 08:11:02</td>\n",
" <td>2019-01-16 14:00:48</td>\n",
" <td>2019-01-16 14:00:48</td>\n",
" <td>0</td>\n",
" <td>2015-06-30 08:11:02</td>\n",
" <td>14</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>{\"id\": 20, \"map\": {\"colorScale\": {\"baseColorSc...</td>\n",
" <td>2015-07-02 07:05:32</td>\n",
" <td>2019-02-08 19:56:21</td>\n",
" <td>2019-02-08 19:56:21</td>\n",
" <td>0</td>\n",
" <td>2015-07-02 07:05:32</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>26</td>\n",
" <td>{\"id\": 26, \"map\": {\"colorScale\": {\"baseColorSc...</td>\n",
" <td>2015-07-07 18:57:10</td>\n",
" <td>2019-10-26 17:05:41</td>\n",
" <td>2019-10-26 17:05:41</td>\n",
" <td>0</td>\n",
" <td>2015-07-07 18:57:10</td>\n",
" <td>19</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>27</td>\n",
" <td>{\"id\": 27, \"map\": {\"colorScale\": {\"baseColorSc...</td>\n",
" <td>2015-07-07 19:07:38</td>\n",
" <td>2018-03-15 18:12:48</td>\n",
" <td>2017-10-11 06:47:31</td>\n",
" <td>0</td>\n",
" <td>2015-07-07 19:07:38</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>29</td>\n",
" <td>{\"id\": 29, \"map\": {\"mode\": \"specific\", \"maxYea...</td>\n",
" <td>2015-07-09 18:22:14</td>\n",
" <td>2018-03-15 18:12:48</td>\n",
" <td>2016-09-01 21:29:30</td>\n",
" <td>0</td>\n",
" <td>2015-07-09 18:22:14</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4184</th>\n",
" <td>5068</td>\n",
" <td>{\"map\": {\"colorScale\": {\"baseColorScheme\": \"Yl...</td>\n",
" <td>2021-07-01 14:18:50</td>\n",
" <td>2021-07-01 14:18:50</td>\n",
" <td>2021-07-01 14:18:50</td>\n",
" <td>0</td>\n",
" <td>2021-07-01 14:18:50</td>\n",
" <td>14</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4185</th>\n",
" <td>5069</td>\n",
" <td>{\"map\": {\"colorScale\": {\"baseColorScheme\": \"Yl...</td>\n",
" <td>2021-07-01 14:23:15</td>\n",
" <td>2021-07-01 14:23:15</td>\n",
" <td>2021-07-01 14:23:15</td>\n",
" <td>0</td>\n",
" <td>2021-07-01 14:23:15</td>\n",
" <td>14</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4186</th>\n",
" <td>5070</td>\n",
" <td>{\"data\": {\"availableEntities\": [\"United Kingdo...</td>\n",
" <td>2021-07-12 18:07:12</td>\n",
" <td>2021-07-12 18:07:12</td>\n",
" <td>2021-07-12 18:07:12</td>\n",
" <td>0</td>\n",
" <td>NaT</td>\n",
" <td>57</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4187</th>\n",
" <td>5071</td>\n",
" <td>{\"id\": 5071, \"data\": {\"availableEntities\": [\"U...</td>\n",
" <td>2021-07-12 18:07:16</td>\n",
" <td>2021-07-14 19:15:22</td>\n",
" <td>2021-07-14 19:15:22</td>\n",
" <td>0</td>\n",
" <td>2021-07-12 18:07:16</td>\n",
" <td>57</td>\n",
" <td>57.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4188</th>\n",
" <td>5072</td>\n",
" <td>{\"id\": 5072, \"data\": {\"availableEntities\": [\"S...</td>\n",
" <td>2021-07-19 12:18:25</td>\n",
" <td>2021-07-19 13:03:00</td>\n",
" <td>2021-07-19 13:03:00</td>\n",
" <td>0</td>\n",
" <td>NaT</td>\n",
" <td>57</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>4189 rows × 11 columns</p>\n",
"</div>"
]
},
"metadata": {},
"execution_count": 98
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 99,
"source": [
"df.shape"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4189, 11)"
]
},
"metadata": {},
"execution_count": 99
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Parse grapher JSON into python dicts"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 100,
"source": [
"df[\"configjson\"] = df.config.map(lambda val: json.loads(val))"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Now create a new dataframe where we explode the top level json keys into one column each"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 101,
"source": [
"json_content_df = df.configjson.apply(pd.Series)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 102,
"source": [
"json_content_df.columns\n"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Index(['id', 'map', 'tab', 'data', 'slug', 'type', 'title', 'xAxis', 'yAxis',\n",
" 'maxTime', 'minTime', 'version', 'subtitle', 'hasMapTab', 'originUrl',\n",
" 'stackMode', 'dimensions', 'entityType', 'hideLegend', 'hasChartTab',\n",
" 'isPublished', 'variantName', 'selectedData', 'internalNotes',\n",
" 'addCountryMode', 'hideRelativeToggle', 'note', 'sourceDesc', 'unit',\n",
" 'logos', 'units', 'margins', 'variables', '/charts/29', 'iframe-width',\n",
" 'iframe-height', 'onlyEntityMatch', 'activeLegendKeys',\n",
" 'group-by-variables', 'colorScale', 'hideTimeline', 'baseColorScheme',\n",
" 'invertColorScheme', 'isExplorable', 'timeline', 'hideTitleAnnotation',\n",
" 'compareEndPointsOnly', 'chart-name', 'identityLine',\n",
" 'add-country-control', 'useV2', 'chart-slug', 'chart-notes',\n",
" 'timelineMaxTime', 'isAutoTitle', 'matchingEntitiesOnly',\n",
" 'comparisonLines', 'relatedQuestions', 'selectedEntityIds',\n",
" 'selectedEntityNames', 'excludedEntities', '/charts/255', 'published',\n",
" 'hideConnectedScatterLines', '/charts/300', 'timelineMinTime',\n",
" 'selectedEntityColors', 'scatterPointLabelStrategy',\n",
" 'hideLinesOutsideTolerance', 'highlightToggle', 'lastEditedAt',\n",
" 'minPopulationFilter', 'isAutoSlug', 'zoomToSelection', 'logo',\n",
" 'hideLogo', 'showYearLabels', 'overlay'],\n",
" dtype='object')"
]
},
"metadata": {},
"execution_count": 102
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Explore NA values and usage counts\n",
"\n",
"Here we count the NA values in each column and investigate them"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 103,
"source": [
"pd.set_option('display.max_rows', 100)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 104,
"source": [
"len(json_content_df.id)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"4189"
]
},
"metadata": {},
"execution_count": 104
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 105,
"source": [
"map_na = pd.DataFrame()\n",
"map_na[\"missing\"] = json_content_df.isna().sum(axis=0)\n",
"map_na[\"existing\"] = map_na.missing.max() - map_na[\"missing\"]\n",
"# map_na.sort_values(by=\"existing\", inplace=True)\n",
"map_na.sort_index(inplace=True)\n",
"map_na\n",
"# map_na.to_markdown(\"rare-keys.md\")"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" missing existing\n",
"/charts/255 4188 1\n",
"/charts/29 4188 1\n",
"/charts/300 4188 1\n",
"activeLegendKeys 4188 1\n",
"add-country-control 4187 2\n",
"addCountryMode 1279 2910\n",
"baseColorScheme 3972 217\n",
"chart-name 4170 19\n",
"chart-notes 4184 5\n",
"chart-slug 4184 5\n",
"colorScale 3018 1171\n",
"compareEndPointsOnly 4153 36\n",
"comparisonLines 3920 269\n",
"data 4 4185\n",
"dimensions 0 4189\n",
"entityType 3425 764\n",
"excludedEntities 4084 105\n",
"group-by-variables 4011 178\n",
"hasChartTab 1169 3020\n",
"hasMapTab 610 3579\n",
"hideConnectedScatterLines 4160 29\n",
"hideLegend 3440 749\n",
"hideLinesOutsideTolerance 4183 6\n",
"hideLogo 4100 89\n",
"hideRelativeToggle 1154 3035\n",
"hideTimeline 4010 179\n",
"hideTitleAnnotation 2800 1389\n",
"highlightToggle 4189 0\n",
"id 136 4053\n",
"identityLine 4083 106\n",
"iframe-height 4145 44\n",
"iframe-width 4145 44\n",
"internalNotes 2764 1425\n",
"invertColorScheme 4135 54\n",
"isAutoSlug 3780 409\n",
"isAutoTitle 4043 146\n",
"isExplorable 3570 619\n",
"isPublished 126 4063\n",
"lastEditedAt 4178 11\n",
"logo 4029 160\n",
"logos 4011 178\n",
"map 160 4029\n",
"margins 4011 178\n",
"matchingEntitiesOnly 4066 123\n",
"maxTime 2017 2172\n",
"minPopulationFilter 4176 13\n",
"minTime 1922 2267\n",
"note 3014 1175\n",
"onlyEntityMatch 4188 1\n",
"originUrl 1728 2461\n",
"overlay 4188 1\n",
"published 4188 1\n",
"relatedQuestions 4147 42\n",
"scatterPointLabelStrategy 4173 16\n",
"selectedData 277 3912\n",
"selectedEntityColors 4174 15\n",
"selectedEntityIds 3965 224\n",
"selectedEntityNames 3812 377\n",
"showYearLabels 4188 1\n",
"slug 1 4188\n",
"sourceDesc 2098 2091\n",
"stackMode 1687 2502\n",
"subtitle 681 3508\n",
"tab 759 3430\n",
"timeline 4176 13\n",
"timelineMaxTime 4166 23\n",
"timelineMinTime 4135 54\n",
"title 0 4189\n",
"type 785 3404\n",
"unit 4188 1\n",
"units 4011 178\n",
"useV2 3958 231\n",
"variables 4153 36\n",
"variantName 4023 166\n",
"version 0 4189\n",
"xAxis 616 3573\n",
"yAxis 222 3967\n",
"zoomToSelection 4186 3"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>missing</th>\n",
" <th>existing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>/charts/255</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>/charts/29</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>/charts/300</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>activeLegendKeys</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>add-country-control</th>\n",
" <td>4187</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>addCountryMode</th>\n",
" <td>1279</td>\n",
" <td>2910</td>\n",
" </tr>\n",
" <tr>\n",
" <th>baseColorScheme</th>\n",
" <td>3972</td>\n",
" <td>217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>chart-name</th>\n",
" <td>4170</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>chart-notes</th>\n",
" <td>4184</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>chart-slug</th>\n",
" <td>4184</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>colorScale</th>\n",
" <td>3018</td>\n",
" <td>1171</td>\n",
" </tr>\n",
" <tr>\n",
" <th>compareEndPointsOnly</th>\n",
" <td>4153</td>\n",
" <td>36</td>\n",
" </tr>\n",
" <tr>\n",
" <th>comparisonLines</th>\n",
" <td>3920</td>\n",
" <td>269</td>\n",
" </tr>\n",
" <tr>\n",
" <th>data</th>\n",
" <td>4</td>\n",
" <td>4185</td>\n",
" </tr>\n",
" <tr>\n",
" <th>dimensions</th>\n",
" <td>0</td>\n",
" <td>4189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>entityType</th>\n",
" <td>3425</td>\n",
" <td>764</td>\n",
" </tr>\n",
" <tr>\n",
" <th>excludedEntities</th>\n",
" <td>4084</td>\n",
" <td>105</td>\n",
" </tr>\n",
" <tr>\n",
" <th>group-by-variables</th>\n",
" <td>4011</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hasChartTab</th>\n",
" <td>1169</td>\n",
" <td>3020</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hasMapTab</th>\n",
" <td>610</td>\n",
" <td>3579</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideConnectedScatterLines</th>\n",
" <td>4160</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideLegend</th>\n",
" <td>3440</td>\n",
" <td>749</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideLinesOutsideTolerance</th>\n",
" <td>4183</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideLogo</th>\n",
" <td>4100</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideRelativeToggle</th>\n",
" <td>1154</td>\n",
" <td>3035</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideTimeline</th>\n",
" <td>4010</td>\n",
" <td>179</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hideTitleAnnotation</th>\n",
" <td>2800</td>\n",
" <td>1389</td>\n",
" </tr>\n",
" <tr>\n",
" <th>highlightToggle</th>\n",
" <td>4189</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <td>136</td>\n",
" <td>4053</td>\n",
" </tr>\n",
" <tr>\n",
" <th>identityLine</th>\n",
" <td>4083</td>\n",
" <td>106</td>\n",
" </tr>\n",
" <tr>\n",
" <th>iframe-height</th>\n",
" <td>4145</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>iframe-width</th>\n",
" <td>4145</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>internalNotes</th>\n",
" <td>2764</td>\n",
" <td>1425</td>\n",
" </tr>\n",
" <tr>\n",
" <th>invertColorScheme</th>\n",
" <td>4135</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>isAutoSlug</th>\n",
" <td>3780</td>\n",
" <td>409</td>\n",
" </tr>\n",
" <tr>\n",
" <th>isAutoTitle</th>\n",
" <td>4043</td>\n",
" <td>146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>isExplorable</th>\n",
" <td>3570</td>\n",
" <td>619</td>\n",
" </tr>\n",
" <tr>\n",
" <th>isPublished</th>\n",
" <td>126</td>\n",
" <td>4063</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lastEditedAt</th>\n",
" <td>4178</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>logo</th>\n",
" <td>4029</td>\n",
" <td>160</td>\n",
" </tr>\n",
" <tr>\n",
" <th>logos</th>\n",
" <td>4011</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>map</th>\n",
" <td>160</td>\n",
" <td>4029</td>\n",
" </tr>\n",
" <tr>\n",
" <th>margins</th>\n",
" <td>4011</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>matchingEntitiesOnly</th>\n",
" <td>4066</td>\n",
" <td>123</td>\n",
" </tr>\n",
" <tr>\n",
" <th>maxTime</th>\n",
" <td>2017</td>\n",
" <td>2172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>minPopulationFilter</th>\n",
" <td>4176</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>minTime</th>\n",
" <td>1922</td>\n",
" <td>2267</td>\n",
" </tr>\n",
" <tr>\n",
" <th>note</th>\n",
" <td>3014</td>\n",
" <td>1175</td>\n",
" </tr>\n",
" <tr>\n",
" <th>onlyEntityMatch</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>originUrl</th>\n",
" <td>1728</td>\n",
" <td>2461</td>\n",
" </tr>\n",
" <tr>\n",
" <th>overlay</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>published</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>relatedQuestions</th>\n",
" <td>4147</td>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>scatterPointLabelStrategy</th>\n",
" <td>4173</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>selectedData</th>\n",
" <td>277</td>\n",
" <td>3912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>selectedEntityColors</th>\n",
" <td>4174</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>selectedEntityIds</th>\n",
" <td>3965</td>\n",
" <td>224</td>\n",
" </tr>\n",
" <tr>\n",
" <th>selectedEntityNames</th>\n",
" <td>3812</td>\n",
" <td>377</td>\n",
" </tr>\n",
" <tr>\n",
" <th>showYearLabels</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>slug</th>\n",
" <td>1</td>\n",
" <td>4188</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sourceDesc</th>\n",
" <td>2098</td>\n",
" <td>2091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>stackMode</th>\n",
" <td>1687</td>\n",
" <td>2502</td>\n",
" </tr>\n",
" <tr>\n",
" <th>subtitle</th>\n",
" <td>681</td>\n",
" <td>3508</td>\n",
" </tr>\n",
" <tr>\n",
" <th>tab</th>\n",
" <td>759</td>\n",
" <td>3430</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timeline</th>\n",
" <td>4176</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timelineMaxTime</th>\n",
" <td>4166</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timelineMinTime</th>\n",
" <td>4135</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <td>0</td>\n",
" <td>4189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>type</th>\n",
" <td>785</td>\n",
" <td>3404</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unit</th>\n",
" <td>4188</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>units</th>\n",
" <td>4011</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>useV2</th>\n",
" <td>3958</td>\n",
" <td>231</td>\n",
" </tr>\n",
" <tr>\n",
" <th>variables</th>\n",
" <td>4153</td>\n",
" <td>36</td>\n",
" </tr>\n",
" <tr>\n",
" <th>variantName</th>\n",
" <td>4023</td>\n",
" <td>166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>version</th>\n",
" <td>0</td>\n",
" <td>4189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>xAxis</th>\n",
" <td>616</td>\n",
" <td>3573</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yAxis</th>\n",
" <td>222</td>\n",
" <td>3967</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zoomToSelection</th>\n",
" <td>4186</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
"metadata": {},
"execution_count": 105
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### and now for the nested \"map\" property the same"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 106,
"source": [
"map_content = json_content_df[json_content_df.map.notna()]\n",
"map_content = map_content.map.apply(pd.Series)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Figure out unique values"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 107,
"source": [
"json_content_df.maxTime.value_counts(dropna=False)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"NaN 2017\n",
"latest 628\n",
"2017 521\n",
"2015 205\n",
"2016 193\n",
"2018 149\n",
"2014 89\n",
"2019 74\n",
"2013 53\n",
"2010 38\n",
"2020 26\n",
"2012 22\n",
"1996 20\n",
"2100 13\n",
"2009 12\n",
"2000 12\n",
"1990 11\n",
"2011 11\n",
"2099 7\n",
"2030 7\n",
"2050 6\n",
"2005 6\n",
"2008 4\n",
"2002 4\n",
"1980 4\n",
"10 3\n",
"2006 3\n",
"1992 3\n",
"1870 3\n",
"2004 2\n",
"2007 2\n",
"1960 2\n",
"222 2\n",
"315 2\n",
"1820 2\n",
"1900 2\n",
"56 2\n",
"177 1\n",
"1700 1\n",
"80 1\n",
"1600 1\n",
"59 1\n",
"1914 1\n",
"1751 1\n",
"44 1\n",
"2089 1\n",
"40 1\n",
"476 1\n",
"526 1\n",
"496 1\n",
"48 1\n",
"237 1\n",
"1.5 1\n",
"1801 1\n",
"1806 1\n",
"1827 1\n",
"2001 1\n",
"1850 1\n",
"1995 1\n",
"1860 1\n",
"1991 1\n",
"1989 1\n",
"1902 1\n",
"1975 1\n",
"1961 1\n",
"2035 1\n",
"Name: maxTime, dtype: int64"
]
},
"metadata": {},
"execution_count": 107
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 108,
"source": [
"map_content.minYear.value_counts(dropna=False)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"NaN 3996\n",
"1980 11\n",
"2000 2\n",
"1995 2\n",
"1990 2\n",
"1981 2\n",
"1978 2\n",
"1956 1\n",
"1970 1\n",
"1950 1\n",
"-3000 1\n",
"1945 1\n",
"1920 1\n",
"1909 1\n",
"1870 1\n",
"1816 1\n",
"1800 1\n",
"1749 1\n",
"2004 1\n",
"Name: minYear, dtype: int64"
]
},
"metadata": {},
"execution_count": 108
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 109,
"source": [
"color_scale_content = map_content[map_content.colorScale.notna()]\n",
"color_scale_content = color_scale_content.colorScale.apply(pd.Series)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 110,
"source": [
"json_content_df.entityType.value_counts()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"country 538\n",
"region 39\n",
"country/region 31\n",
"food 17\n",
"state 11\n",
"disaster category 9\n",
" 8\n",
"taxonomic group 7\n",
"city 6\n",
"entity 6\n",
"technology 5\n",
"society 5\n",
"group 5\n",
"data 5\n",
"age group/region 4\n",
"industry 4\n",
"world region 4\n",
"energy source 3\n",
"product 3\n",
"start year 2\n",
"species 2\n",
"year 2\n",
"category 2\n",
"sex or race demographic 2\n",
"Industry 2\n",
"intervention 2\n",
"Region 2\n",
"river 1\n",
"transport mode 1\n",
"commodity 1\n",
"or hide relationship groups 1\n",
"agreement 1\n",
"country & age cohort 1\n",
"region/group 1\n",
"country group 1\n",
"metal 1\n",
"air pollutant 1\n",
"oil blend 1\n",
"number/rate 1\n",
"global 1\n",
"UN Revision 1\n",
"territory/region 1\n",
"disaster 1\n",
"butterfly species 1\n",
"projection 1\n",
"world 1\n",
"basin 1\n",
"Technology 1\n",
"mode 1\n",
"occupation 1\n",
"age 1\n",
"country published titles 1\n",
"US/Southwest US 1\n",
"source 1\n",
"size of city (in 000s) 1\n",
"production zone 1\n",
"measure 1\n",
"threshold 1\n",
"series 1\n",
"country or region 1\n",
"region/country 1\n",
"country, region or grouping 1\n",
"gas 1\n",
"travel mode 1\n",
"species group 1\n",
"Name: entityType, dtype: int64"
]
},
"metadata": {},
"execution_count": 110
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Dimensions is an array of json objects - use DF.explode to turn arrays into repeated rows with a single element, then turn it into a dataframe as per usual"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 111,
"source": [
"exploded = json_content_df.dimensions.explode()"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 112,
"source": [
"dimensions_content = exploded.apply(pd.Series)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 114,
"source": [
"dimensions_content"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" display property targetYear \\\n",
"0 {'name': 'suicide rate', 'unit': 'age-adjusted... y 2000.0 \n",
"1 {'name': '', 'unit': '', 'tolerance': 5, 'isPr... y NaN \n",
"2 {'unit': '%', 'shortUnit': '%'} y NaN \n",
"3 {'name': 'Historical European trade', 'toleran... y NaN \n",
"4 {'name': 'OECD PISA education score', 'unit': ... y NaN \n",
"... ... ... ... \n",
"4187 NaN x NaN \n",
"4187 NaN color NaN \n",
"4188 NaN y NaN \n",
"4188 NaN x NaN \n",
"4188 NaN color NaN \n",
"\n",
" variableId id order chartId numDecimalPlaces \n",
"0 19 NaN NaN NaN NaN \n",
"1 27 NaN NaN NaN NaN \n",
"2 105690 NaN NaN NaN NaN \n",
"3 35 8329.0 0.0 27.0 NaN \n",
"4 37 169.0 0.0 29.0 NaN \n",
"... ... ... ... ... ... \n",
"4187 72 NaN NaN NaN NaN \n",
"4187 123 NaN NaN NaN NaN \n",
"4188 145609 NaN NaN NaN NaN \n",
"4188 72 NaN NaN NaN NaN \n",
"4188 123 NaN NaN NaN NaN \n",
"\n",
"[9957 rows x 8 columns]"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>display</th>\n",
" <th>property</th>\n",
" <th>targetYear</th>\n",
" <th>variableId</th>\n",
" <th>id</th>\n",
" <th>order</th>\n",
" <th>chartId</th>\n",
" <th>numDecimalPlaces</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>{'name': 'suicide rate', 'unit': 'age-adjusted...</td>\n",
" <td>y</td>\n",
" <td>2000.0</td>\n",
" <td>19</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>{'name': '', 'unit': '', 'tolerance': 5, 'isPr...</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>27</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>{'unit': '%', 'shortUnit': '%'}</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>105690</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>{'name': 'Historical European trade', 'toleran...</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>35</td>\n",
" <td>8329.0</td>\n",
" <td>0.0</td>\n",
" <td>27.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>{'name': 'OECD PISA education score', 'unit': ...</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>37</td>\n",
" <td>169.0</td>\n",
" <td>0.0</td>\n",
" <td>29.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4187</th>\n",
" <td>NaN</td>\n",
" <td>x</td>\n",
" <td>NaN</td>\n",
" <td>72</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4187</th>\n",
" <td>NaN</td>\n",
" <td>color</td>\n",
" <td>NaN</td>\n",
" <td>123</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4188</th>\n",
" <td>NaN</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>145609</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4188</th>\n",
" <td>NaN</td>\n",
" <td>x</td>\n",
" <td>NaN</td>\n",
" <td>72</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4188</th>\n",
" <td>NaN</td>\n",
" <td>color</td>\n",
" <td>NaN</td>\n",
" <td>123</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9957 rows × 8 columns</p>\n",
"</div>"
]
},
"metadata": {},
"execution_count": 114
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 116,
"source": [
"dimensions_content.property.value_counts()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"y 7827\n",
"color 794\n",
"x 793\n",
"size 543\n",
"Name: property, dtype: int64"
]
},
"metadata": {},
"execution_count": 116
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 117,
"source": [
"dimensions_display_df = dimensions_content.display.apply(pd.Series)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 118,
"source": [
"dimensions_display_df.zeroDay.value_counts()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2020-01-01 23\n",
"2020-01-21 21\n",
"2021-01-01 11\n",
"2020-12-18 1\n",
"Name: zeroDay, dtype: int64"
]
},
"metadata": {},
"execution_count": 118
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 119,
"source": [
"json_content_df.xAxis.dropna()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 {'scaleType': 'linear'}\n",
"1 {'scaleType': 'linear'}\n",
"2 {'scaleType': 'linear'}\n",
"3 {'scaleType': 'linear'}\n",
"4 {'scaleType': 'linear'}\n",
" ... \n",
"4074 {'canChangeScaleType': True}\n",
"4083 {'scaleType': 'log', 'canChangeScaleType': True}\n",
"4119 {'scaleType': 'log', 'canChangeScaleType': True}\n",
"4153 {'scaleType': 'log', 'canChangeScaleType': True}\n",
"4154 {'scaleType': 'log', 'canChangeScaleType': True}\n",
"Name: xAxis, Length: 3573, dtype: object"
]
},
"metadata": {},
"execution_count": 119
}
],
"metadata": {}
}
],
"metadata": {
"interpreter": {
"hash": "40c188d7f09e73b6baff016c76fe93234271a5f8c00107ec837ff90530bba4cc"
},
"kernelspec": {
"display_name": "Python 3.9.1 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment