GregHilston/test.ipynb

## test.ipynb
{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Visualizing Geospatial Data"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["# ensure our graphs are displayed inline\n", "%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["import os\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "import numpy as np\n", "import folium\n", "from folium import plugins\n", "from folium.plugins import HeatMap\n", "from folium.plugins import MarkerCluster"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["# useful to define where we'll be storing our data\n", "data_directory = \"data/\"\n", "\n", "# useful to define where we'll be storing our output\n", "output_directory = \"output/\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Introduction\n", "\n", "Our goal today is to create some visualizations for some geospatial data. We'll do that by first acquiring the data itself, quickly looking at the data set and doing a very minor cleanup.\n", "\n", "Then we'll walk through creating multiple visualizations, which can be applied to many data sets. 
Specifically, we'll be doing the following:\n", "\n", "* display geospatial data\n", "* cluster close points\n", "* generate a heat map\n", "* overlay census population data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Data Acquisition"]}, {"cell_type": "markdown", "metadata": {}, "source": ["First we'll create a `pandas.DataFrame` from a JSON file published by NASA."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["# Data from NASA on meteorite landings\n", "df = pd.read_json(data_directory + \"y77d-th95.json\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now we'll take a high-level look at the data."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Initial Data High-Level View\n", "\n", "I always like to start with the thirty-thousand-foot view of any data set."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 1000 entries, 0 to 999\n", "Data columns (total 12 columns):\n", ":@computed_region_cbhk_fwbd    133 non-null float64\n", ":@computed_region_nnqa_25f4    134 non-null float64\n", "fall                           1000 non-null object\n", "geolocation                    988 non-null object\n", "id                             1000 non-null int64\n", "mass                           972 non-null float64\n", "name                           1000 non-null object\n", "nametype                       1000 non-null object\n", "recclass                       1000 non-null object\n", "reclat                         988 non-null float64\n", "reclong                        988 non-null float64\n", "year                           999 non-null object\n", "dtypes: float64(5), int64(1), object(6)\n", "memory usage: 101.6+ KB\n"]}], "source": ["df.info()"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": 
{"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>:@computed_region_cbhk_fwbd</th>\n", "      <th>:@computed_region_nnqa_25f4</th>\n", "      <th>id</th>\n", "      <th>mass</th>\n", "      <th>reclat</th>\n", "      <th>reclong</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>count</th>\n", "      <td>133.000000</td>\n", "      <td>134.000000</td>\n", "      <td>1000.00000</td>\n", "      <td>9.720000e+02</td>\n", "      <td>988.000000</td>\n", "      <td>988.000000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>mean</th>\n", "      <td>26.939850</td>\n", "      <td>1537.888060</td>\n", "      <td>15398.72800</td>\n", "      <td>5.019020e+04</td>\n", "      <td>29.691592</td>\n", "      <td>19.151208</td>\n", "    </tr>\n", "    <tr>\n", "      <th>std</th>\n", "      <td>12.706929</td>\n", "      <td>899.826915</td>\n", "      <td>10368.70402</td>\n", "      <td>7.539857e+05</td>\n", "      <td>23.204399</td>\n", "      <td>68.644015</td>\n", "    </tr>\n", "    <tr>\n", "      <th>min</th>\n", "      <td>1.000000</td>\n", "      <td>10.000000</td>\n", "      <td>1.00000</td>\n", "      <td>1.500000e-01</td>\n", "      <td>-44.116670</td>\n", "      <td>-157.866670</td>\n", "    </tr>\n", "    <tr>\n", "      <th>25%</th>\n", "      <td>17.000000</td>\n", "      <td>650.250000</td>\n", "      <td>7770.50000</td>\n", "      <td>6.795000e+02</td>\n", "      <td>21.300000</td>\n", "      <td>-5.195832</td>\n", "    </tr>\n", "    <tr>\n", "      <th>50%</th>\n", "      <td>24.000000</td>\n", "      <td>1647.000000</td>\n", "      
<td>12757.50000</td>\n", "      <td>2.870000e+03</td>\n", "      <td>35.916665</td>\n", "      <td>17.325000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>75%</th>\n", "      <td>37.000000</td>\n", "      <td>2234.250000</td>\n", "      <td>18831.25000</td>\n", "      <td>1.005000e+04</td>\n", "      <td>45.817835</td>\n", "      <td>76.004167</td>\n", "    </tr>\n", "    <tr>\n", "      <th>max</th>\n", "      <td>50.000000</td>\n", "      <td>3190.000000</td>\n", "      <td>57168.00000</td>\n", "      <td>2.300000e+07</td>\n", "      <td>66.348330</td>\n", "      <td>174.400000</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["       :@computed_region_cbhk_fwbd  :@computed_region_nnqa_25f4           id  \\\n", "count                   133.000000                   134.000000   1000.00000   \n", "mean                     26.939850                  1537.888060  15398.72800   \n", "std                      12.706929                   899.826915  10368.70402   \n", "min                       1.000000                    10.000000      1.00000   \n", "25%                      17.000000                   650.250000   7770.50000   \n", "50%                      24.000000                  1647.000000  12757.50000   \n", "75%                      37.000000                  2234.250000  18831.25000   \n", "max                      50.000000                  3190.000000  57168.00000   \n", "\n", "               mass      reclat     reclong  \n", "count  9.720000e+02  988.000000  988.000000  \n", "mean   5.019020e+04   29.691592   19.151208  \n", "std    7.539857e+05   23.204399   68.644015  \n", "min    1.500000e-01  -44.116670 -157.866670  \n", "25%    6.795000e+02   21.300000   -5.195832  \n", "50%    2.870000e+03   35.916665   17.325000  \n", "75%    1.005000e+04   45.817835   76.004167  \n", "max    2.300000e+07   66.348330  174.400000  "]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": 
["df.describe()"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>:@computed_region_cbhk_fwbd</th>\n", "      <th>:@computed_region_nnqa_25f4</th>\n", "      <th>fall</th>\n", "      <th>geolocation</th>\n", "      <th>id</th>\n", "      <th>mass</th>\n", "      <th>name</th>\n", "      <th>nametype</th>\n", "      <th>recclass</th>\n", "      <th>reclat</th>\n", "      <th>reclong</th>\n", "      <th>year</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>{'type': 'Point', 'coordinates': [6.08333, 50....</td>\n", "      <td>1</td>\n", "      <td>21.0</td>\n", "      <td>Aachen</td>\n", "      <td>Valid</td>\n", "      <td>L5</td>\n", "      <td>50.77500</td>\n", "      <td>6.08333</td>\n", "      <td>1880-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>{'type': 'Point', 'coordinates': [10.23333, 56...</td>\n", "      <td>2</td>\n", "      <td>720.0</td>\n", "      <td>Aarhus</td>\n", "      <td>Valid</td>\n", "      <td>H6</td>\n", "      <td>56.18333</td>\n", "      <td>10.23333</td>\n", "      <td>1951-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>{'type': 'Point', 'coordinates': [-113, 54.216...</td>\n", "      <td>6</td>\n", "    
  <td>107000.0</td>\n", "      <td>Abee</td>\n", "      <td>Valid</td>\n", "      <td>EH4</td>\n", "      <td>54.21667</td>\n", "      <td>-113.00000</td>\n", "      <td>1952-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>{'type': 'Point', 'coordinates': [-99.9, 16.88...</td>\n", "      <td>10</td>\n", "      <td>1914.0</td>\n", "      <td>Acapulco</td>\n", "      <td>Valid</td>\n", "      <td>Acapulcoite</td>\n", "      <td>16.88333</td>\n", "      <td>-99.90000</td>\n", "      <td>1976-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>{'type': 'Point', 'coordinates': [-64.95, -33....</td>\n", "      <td>370</td>\n", "      <td>780.0</td>\n", "      <td>Achiras</td>\n", "      <td>Valid</td>\n", "      <td>L6</td>\n", "      <td>-33.16667</td>\n", "      <td>-64.95000</td>\n", "      <td>1902-01-01T00:00:00.000</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   :@computed_region_cbhk_fwbd  :@computed_region_nnqa_25f4  fall  \\\n", "0                          NaN                          NaN  Fell   \n", "1                          NaN                          NaN  Fell   \n", "2                          NaN                          NaN  Fell   \n", "3                          NaN                          NaN  Fell   \n", "4                          NaN                          NaN  Fell   \n", "\n", "                                         geolocation   id      mass      name  \\\n", "0  {'type': 'Point', 'coordinates': [6.08333, 50....    1      21.0    Aachen   \n", "1  {'type': 'Point', 'coordinates': [10.23333, 56...    2     720.0    Aarhus   \n", "2  {'type': 'Point', 'coordinates': [-113, 54.216...    6  107000.0      Abee   \n", "3  {'type': 'Point', 'coordinates': [-99.9, 16.88...   
10    1914.0  Acapulco   \n", "4  {'type': 'Point', 'coordinates': [-64.95, -33....  370     780.0   Achiras   \n", "\n", "  nametype     recclass    reclat    reclong                     year  \n", "0    Valid           L5  50.77500    6.08333  1880-01-01T00:00:00.000  \n", "1    Valid           H6  56.18333   10.23333  1951-01-01T00:00:00.000  \n", "2    Valid          EH4  54.21667 -113.00000  1952-01-01T00:00:00.000  \n", "3    Valid  Acapulcoite  16.88333  -99.90000  1976-01-01T00:00:00.000  \n", "4    Valid           L6 -33.16667  -64.95000  1902-01-01T00:00:00.000  "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We see twelve columns:\n", "* five floats\n", "* six strings or mixed data\n", "* one int64\n", "\n", "Additionally, the `geolocation` column holds JSON, which is something I've never worked with inside a pandas DataFrame. Also, we may be able to leverage pandas' datetime `dtype` for the `year` column."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Removing Redundant Data\n", "\n", "As `geolocation`'s data is already represented in `reclat` and `reclong`, we'll simply remove it. 
We're dropping this column rather than the other two because it's a nested JSON value, whereas `reclat` and `reclong` are already separate, flat columns."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["df.drop(labels=\"geolocation\", axis=1, inplace=True)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## `NaN` Inspection\n", "\n", "Let's look at all columns that have at least one `NaN` value."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["[':@computed_region_cbhk_fwbd',\n", " ':@computed_region_nnqa_25f4',\n", " 'mass',\n", " 'reclat',\n", " 'reclong',\n", " 'year']"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["nan_columns = df.columns[df.isna().any()].tolist()\n", "nan_columns"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We see that six of the eleven remaining columns have at least one `NaN` value. Let's look into how many `NaN` values are in each column so we can get an idea of how to proceed with cleaning."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["{':@computed_region_cbhk_fwbd': 867,\n", " ':@computed_region_nnqa_25f4': 866,\n", " 'mass': 28,\n", " 'reclat': 12,\n", " 'reclong': 12,\n", " 'year': 1}"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["nan_column_counts = {}\n", "\n", "for nan_column in nan_columns:\n", "    nan_column_counts[nan_column] = sum(pd.isnull(df[nan_column]))\n", "    \n", "nan_column_counts"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We see here that the number of `NaN` values ranges from as high as 867 to as low as 1. 
Recall that there are 1000 rows in this data set, so most of the rows have `NaN` for both `:@computed_region_cbhk_fwbd` and `:@computed_region_nnqa_25f4`.\n", "\n", "We'll have to handle these after performing some more data inspection."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Unique Values Inspection\n", "\n", "We'll now look at the unique values.\n", "\n", "_The following cell has been made a raw cell to keep its large output from printing._"]}, {"cell_type": "raw", "metadata": {}, "source": ["for column in list(df):\n", "    print(f\"{column} has {df[column].nunique()} unique values:\")\n", "    print(df[column].unique())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## `NaN` Handling\n", "\n", "Since we're not building any specific model, we're going to leave the `NaN` values as they are. I just want to note that usually you'll have to handle the `NaN` values in a data set, or at the very least, be aware that they exist. There are many techniques for handling `NaN` values, but they won't be discussed here."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Geospatial Visualizations\n", "\n", "Now we're going to work on creating geospatial visualizations for our data set. These can be incredibly helpful for exploring your data, as well as when it comes time to present or share your work.\n", "\n", "These visualizations can be handy as they can help you quickly answer questions. For example, currently we don't know how many meteorites land in the oceans. We'd expect many to, in fact probably more often than on land, but we don't have an easy way to determine this. 
Once we have our visualizations created, we can quickly answer this question."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Preparation\n", "\n", "First, we'll prepare a dataframe containing only the rows that have both a latitude and a longitude."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["# Keep every column, but drop the rows that are missing a latitude or a longitude\n", "geo_df = df.dropna(axis=0, how=\"any\", subset=['reclat', 'reclong'])\n", "geo_df = geo_df.set_index(\"id\") # we'll preserve the id from the data set"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>:@computed_region_cbhk_fwbd</th>\n", "      <th>:@computed_region_nnqa_25f4</th>\n", "      <th>fall</th>\n", "      <th>mass</th>\n", "      <th>name</th>\n", "      <th>nametype</th>\n", "      <th>recclass</th>\n", "      <th>reclat</th>\n", "      <th>reclong</th>\n", "      <th>year</th>\n", "    </tr>\n", "    <tr>\n", "      <th>id</th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>21.0</td>\n", "      <td>Aachen</td>\n", "      <td>Valid</td>\n", "      <td>L5</td>\n", "      <td>50.77500</td>\n", "      <td>6.08333</td>\n", "      
<td>1880-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>720.0</td>\n", "      <td>Aarhus</td>\n", "      <td>Valid</td>\n", "      <td>H6</td>\n", "      <td>56.18333</td>\n", "      <td>10.23333</td>\n", "      <td>1951-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>6</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>107000.0</td>\n", "      <td>Abee</td>\n", "      <td>Valid</td>\n", "      <td>EH4</td>\n", "      <td>54.21667</td>\n", "      <td>-113.00000</td>\n", "      <td>1952-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>10</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>1914.0</td>\n", "      <td>Acapulco</td>\n", "      <td>Valid</td>\n", "      <td>Acapulcoite</td>\n", "      <td>16.88333</td>\n", "      <td>-99.90000</td>\n", "      <td>1976-01-01T00:00:00.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>370</th>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>Fell</td>\n", "      <td>780.0</td>\n", "      <td>Achiras</td>\n", "      <td>Valid</td>\n", "      <td>L6</td>\n", "      <td>-33.16667</td>\n", "      <td>-64.95000</td>\n", "      <td>1902-01-01T00:00:00.000</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["     :@computed_region_cbhk_fwbd  :@computed_region_nnqa_25f4  fall      mass  \\\n", "id                                                                              \n", "1                            NaN                          NaN  Fell      21.0   \n", "2                            NaN                          NaN  Fell     720.0   \n", "6                            NaN                          NaN  Fell  107000.0   \n", "10                           NaN                          NaN  Fell    1914.0   \n", "370                          NaN       
                   NaN  Fell     780.0   \n", "\n", "         name nametype     recclass    reclat    reclong  \\\n", "id                                                         \n", "1      Aachen    Valid           L5  50.77500    6.08333   \n", "2      Aarhus    Valid           H6  56.18333   10.23333   \n", "6        Abee    Valid          EH4  54.21667 -113.00000   \n", "10   Acapulco    Valid  Acapulcoite  16.88333  -99.90000   \n", "370   Achiras    Valid           L6 -33.16667  -64.95000   \n", "\n", "                        year  \n", "id                            \n", "1    1880-01-01T00:00:00.000  \n", "2    1951-01-01T00:00:00.000  \n", "6    1952-01-01T00:00:00.000  \n", "10   1976-01-01T00:00:00.000  \n", "370  1902-01-01T00:00:00.000  "]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["geo_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Creation of the Visualizations"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Everything looks good.\n", "\n", "Now we'll create our visualizations. First let's make one with every row as a single marker. 
This may be overkill."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/html": ["<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"></div></div>"], "text/plain": ["<folium.folium.Map at 0x1098f51d0>"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["markers_map = folium.Map(zoom_start=4, tiles=\"CartoDB dark_matter\")\n", "\n", "# create an individual marker for each meteorite;\n", "# geo_df is indexed by the meteorite id, so iterrows() yields (id, row)\n", "for index, row in geo_df.iterrows():\n", "    latitude = row[\"reclat\"]\n", "    longitude = row[\"reclong\"]\n", "    mass = row[\"mass\"]\n", "    name = row[\"name\"]\n", "    rec_class = row[\"recclass\"]\n", "    \n", "    html = f\"\"\"\n", "    <table border=\"1\">\n", "        <tr>\n", "            <th> Index </th>\n", "            <th> Latitude </th>\n", "            <th> Longitude </th>\n", "            <th> Mass </th>\n", "            <th> Name </th>\n", "            <th> Recclass </th>\n", "        </tr>\n", "        <tr> \n", "            <td> {index} </td> \n", "            <td> {latitude} </td> \n", "            <td> {longitude} </td> \n", "            <td> {mass} </td>\n", "            <td> {name} </td>\n", "            <td> {rec_class} </td>\n", "        </tr>\n", "    </table>\"\"\"\n", "    iframe = folium.IFrame(html=html, width=375, height=125)\n", "    popup = folium.Popup(iframe, max_width=375)\n", "    \n", "    folium.Marker(location=[latitude, longitude], popup=popup).add_to(markers_map)\n", "\n", "markers_map.save(output_directory + \"markers_map.html\")\n", "markers_map"]}, {"cell_type": "markdown", "metadata": {}, "source": ["After seeing the visualization, I don't believe showing a single marker for every row is a good idea, as we have so much data that zooming out pretty far makes it difficult to understand what we're 
looking at.\n", "\n", "Let's cluster nearby rows to improve readability."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"></div></div>"], "text/plain": ["<folium.folium.Map at 0x10bab1ac8>"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["clusters_map = folium.Map(zoom_start=4, tiles=\"CartoDB dark_matter\")\n", "\n", "clusters_map_cluster = MarkerCluster().add_to(clusters_map)\n", "\n", "# create an individual marker for each meteorite, adding it to a cluster;\n", "# geo_df is indexed by the meteorite id, so iterrows() yields (id, row)\n", "for index, row in geo_df.iterrows():\n", "    latitude = row[\"reclat\"]\n", "    longitude = row[\"reclong\"]\n", "    mass = row[\"mass\"]\n", "    name = row[\"name\"]\n", "    rec_class = row[\"recclass\"]\n", "    \n", "    html = f\"\"\"\n", "    <table border=\"1\">\n", "        <tr>\n", "            <th> Index </th>\n", "            <th> Latitude </th>\n", "            <th> Longitude </th>\n", "            <th> Mass </th>\n", "            <th> Name </th>\n", "            <th> Recclass </th>\n", "        </tr>\n", "        <tr> \n", "            <td> {index} </td> \n", "            <td> {latitude} </td> \n", "            <td> {longitude} </td> \n", "            <td> {mass} </td>\n", "            <td> {name} </td>\n", "            <td> {rec_class} </td>\n", "        </tr>\n", "    </table>\"\"\"\n", "    iframe = folium.IFrame(html=html, width=375, height=125)\n", "    popup = folium.Popup(iframe, max_width=375)\n", "    \n", "    folium.Marker(location=[latitude, longitude], popup=popup).add_to(clusters_map_cluster)\n", "\n", "clusters_map.save(output_directory + \"clusters_map.html\")\n", "clusters_map"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This looks much better.\n", "\n", "Just 
for kicks, let's make a heat map as well!"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"></div></div>"], "text/plain": ["<folium.folium.Map at 0x1098f23c8>"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["heat_map = folium.Map(zoom_start=4, tiles=\"CartoDB dark_matter\") \n", "\n", "# Ensure you're handing it floats\n", "geo_df['latitude'] = geo_df[\"reclat\"].astype(float)\n", "geo_df['longitude'] = geo_df[\"reclong\"].astype(float)\n", "\n", "# Select the two coordinate columns, then drop any rows that still contain NaNs\n", "heat_df = geo_df[['latitude', 'longitude']]\n", "heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude'])\n", "\n", "# List comprehension to make our list of [latitude, longitude] lists\n", "heat_data = [[row['latitude'],row['longitude']] for index, row in heat_df.iterrows()]\n", "\n", "# Plot it on the map\n", "HeatMap(heat_data).add_to(heat_map)\n", "\n", "# Display the map\n", "heat_map.save(output_directory + \"heat_map.html\")\n", "heat_map"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We see here that most of the recorded meteorites are on land. My guess is that meteorites do in fact land in water, probably more often than on land since water covers more of the Earth's surface, but every meteorite in the data set had to be reported by a human, which explains why the data points cluster on land.\n", "\n", "I'd go as far as to say that more highly populated areas are more likely to report meteorites, as well as non-first-world countries."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Census Data\n", "\n", "Now we'll visually explore the notion that more highly populated areas are more likely to report meteorites. We'll do that by combining our map with population data gathered from a census. 
The logic for coloring our countries is rudimentary, as it uses raw population rather than a density that factors in land area, but it is good enough for this example.\n", "\n", "Let's start by redoing our first graphic, where every meteorite got its own marker, and we'll overlay the population of the world by country."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"></div></div>"], "text/plain": ["<folium.folium.Map at 0x10381a978>"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["markers_census_layered_map = folium.Map(zoom_start=4, tiles='Mapbox bright')\n", "\n", "fg = folium.FeatureGroup(name=\"Meteorites\")\n", "\n", "# create an individual marker for each meteorite, adding it to a layer;\n", "# geo_df is indexed by the meteorite id, so iterrows() yields (id, row)\n", "for index, row in geo_df.iterrows():\n", "    latitude = row[\"reclat\"]\n", "    longitude = row[\"reclong\"]\n", "    mass = row[\"mass\"]\n", "    name = row[\"name\"]\n", "    rec_class = row[\"recclass\"]\n", "    \n", "    html = f\"\"\"\n", "    <table border=\"1\">\n", "        <tr>\n", "            <th> Index </th>\n", "            <th> Latitude </th>\n", "            <th> Longitude </th>\n", "            <th> Mass </th>\n", "            <th> Name </th>\n", "            <th> Recclass </th>\n", "        </tr>\n", "        <tr> \n", "            <td> {index} </td> \n", "            <td> {latitude} </td> \n", "            <td> {longitude} </td> \n", "            <td> {mass} </td>\n", "            <td> {name} </td>\n", "            <td> {rec_class} </td>\n", "        </tr>\n", "    </table>\"\"\"\n", "    iframe = folium.IFrame(html=html, width=375, height=125)\n", "    popup = folium.Popup(iframe, max_width=375)\n", "    \n", "    fg.add_child(folium.Marker(location=[latitude, 
longitude], popup=popup))\n", "    \n", "# add our markers to the map\n", "markers_census_layered_map.add_child(fg)\n", "\n", "# add the census population outlined and colored countries to our map\n", "world_geojson = os.path.join(data_directory, \"world_geojson_from_ogr.json\")\n", "with open(world_geojson, \"r\", encoding=\"utf-8\") as world_geojson_file:\n", "    world_geojson_data = world_geojson_file.read()\n", "markers_census_layered_map.add_child(folium.GeoJson(world_geojson_data, name=\"Population\", style_function=lambda x: {\"fillColor\":\"green\" if x[\"properties\"][\"POP2005\"] <= 10000000 else \"orange\" if 10000000 < x[\"properties\"][\"POP2005\"] < 20000000 else \"red\"}))\n", "\n", "# add a toggleable menu for all the layers\n", "markers_census_layered_map.add_child(folium.LayerControl())\n", "\n", "# save our map as a separate HTML file\n", "markers_census_layered_map.save(outfile=output_directory + \"markers_census_layered_map.html\")\n", "\n", "# display our map inline\n", "markers_census_layered_map"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This is a good start, but again it's difficult to see what's truly going on due to the sheer number of markers. 
Let's cluster our markers again:"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"></div></div>"], "text/plain": ["<folium.folium.Map at 0x103840278>"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["markers_census_layered_map = folium.Map(zoom_start=4, tiles='Mapbox bright')\n", "\n", "clusters_map_cluster = MarkerCluster(\n", "    name=\"Meteorites\",\n", "    overlay=True,\n", "    control=False,\n", "    icon_create_function=None\n", ")\n", "\n", "# create an individual marker for each meteorite, adding it to a cluster;\n", "# geo_df is indexed by the meteorite id, so iterrows() yields (id, row)\n", "for index, row in geo_df.iterrows():\n", "    latitude = row[\"reclat\"]\n", "    longitude = row[\"reclong\"]\n", "    mass = row[\"mass\"]\n", "    name = row[\"name\"]\n", "    rec_class = row[\"recclass\"]\n", "    \n", "    html = f\"\"\"\n", "    <table border=\"1\">\n", "        <tr>\n", "            <th> Index </th>\n", "            <th> Latitude </th>\n", "            <th> Longitude </th>\n", "            <th> Mass </th>\n", "            <th> Name </th>\n", "            <th> Recclass </th>\n", "        </tr>\n", "        <tr> \n", "            <td> {index} </td> \n", "            <td> {latitude} </td> \n", "            <td> {longitude} </td> \n", "            <td> {mass} </td>\n", "            <td> {name} </td>\n", "            <td> {rec_class} </td>\n", "        </tr>\n", "    </table>\"\"\"\n", "    iframe = folium.IFrame(html=html, width=375, height=125)\n", "    popup = folium.Popup(iframe, max_width=375)\n", "    \n", "    clusters_map_cluster.add_child(folium.Marker(location=[latitude, longitude], popup=popup))\n", "    \n", "# add our cluster to the map\n", "markers_census_layered_map.add_child(clusters_map_cluster)\n", "\n", "# 
add the census population outlined and colored countries to our map\n", "world_geojson = os.path.join(data_directory, \"world_geojson_from_ogr.json\")\n", "with open(world_geojson, \"r\", encoding=\"utf-8\") as world_geojson_file:\n", "    world_geojson_data = world_geojson_file.read()\n", "markers_census_layered_map.add_child(folium.GeoJson(world_geojson_data, name=\"Population\", style_function=lambda x: {\"fillColor\":\"green\" if x[\"properties\"][\"POP2005\"] <= 10000000 else \"orange\" if 10000000 < x[\"properties\"][\"POP2005\"] < 20000000 else \"red\"}))\n", "\n", "# add a toggleable menu for all the layers\n", "markers_census_layered_map.add_child(folium.LayerControl())\n", "\n", "# save our map as a separate HTML file\n", "markers_census_layered_map.save(outfile=output_directory + \"markers_census_layered_map.html\")\n", "\n", "# display our map inline\n", "markers_census_layered_map"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now we can see all the data much more clearly thanks to the clustering. At this point we could start adding additional layers, like the gross domestic product of each country, to try to inspect other relationships."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Conclusion\n", "\n", "We've demonstrated how to:\n", "* display geospatial data\n", "* cluster close points\n", "* generate a heat map\n", "* overlay census population data\n", "\n", "With these skills, one can quickly generate helpful visualizations for geospatial data. 
Moving forward, we could use different markers depending on some value, like `recclass` in this instance, or improve the popup text.\n", "\n", "I hope this helps you with working with geospatial data."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## References\n", "\n", "To do this work, I had help from:\n", "\n", "* [the official folium documentation](https://python-visualization.github.io/folium/docs-v0.4.0/modules.html)\n", "* [a guide from PythonHow](https://pythonhow.com/web-mapping-with-python-and-folium/)"]}], "metadata": {"front-matter": {"title": "Visualizing Geospatial Data", "subtitle": "Using Meteorites!", "date": "2018-04-24", "slug": "visualizing-geospatial-data"}, "kernelspec": {"display_name": "meetup", "language": "python", "name": "meetup"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}, "hugo-jupyter": {"render-to": "content/post/"}}, "nbformat": 4, "nbformat_minor": 2}
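
The two census cells in the notebook color each country with an inline `style_function` lambda. As a minimal sketch, not part of the original notebook, that logic could be pulled out into a named function so the thresholds are easier to read and to check on their own. The names `population_fill_color` and `population_style`, and the gray fallback for a country without a `POP2005` value, are assumptions made for illustration; the green/orange/red thresholds mirror the notebook's lambda.

```python
def population_fill_color(population, low=10_000_000, high=20_000_000):
    """Return a fill color for a country given its population.

    Same buckets as the notebook's lambda:
    green up to and including `low`, orange strictly between
    `low` and `high`, red from `high` upward.
    """
    if population is None:
        # Assumption: use a neutral color when no population is recorded.
        return "gray"
    if population <= low:
        return "green"
    if population < high:
        return "orange"
    return "red"


def population_style(feature):
    """Style callback: takes one GeoJSON feature dict, returns a style dict."""
    population = feature.get("properties", {}).get("POP2005")
    return {"fillColor": population_fill_color(population)}
```

Under these assumptions, `population_style` could be handed to the existing call as `folium.GeoJson(..., style_function=population_style)` in place of the lambda.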