Getting data from shapefiles into GeoJSON

Finding Data

Open geo data can be found in a lot of places. Open city data is a great source of geo data in many jurisdictions. Searching for "open data <cityname>" can yield a lot of results. For example, https://datasf.org/opendata/ is San Francisco's open data portal. Some jurisdictions will have dedicated GIS portals.

You'll often find geo data in a few formats:

  1. A CSV of geo points
  2. Shapefiles for geo points
  3. Shapefiles for geo shapes
  4. WKT (Well-known text)
  5. GeoJSON

Elasticsearch natively supports WKT and GeoJSON, and I'll leave importing CSVs as an exercise for the reader for now; I'm going to focus on how to import shapefiles. Sometimes GeoJSON comes as a full FeatureCollection, which does need to be converted to a list of Features, and I cover that below in Breaking a GeoJSON FeatureCollection Up.
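To see the difference between the two natively supported formats, here is the same illustrative point near Atlanta expressed both ways (note that both put longitude before latitude):

WKT: POINT (-84.39 33.75)
GeoJSON: { "type": "Point", "coordinates": [-84.39, 33.75] }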

In this example, we'll use the counties in the Atlanta region, which can be found at http://gisdata.fultoncountyga.gov/datasets/53ca7db14b8f4a9193c1883247886459_67. Go to Download -> Shapefile to get the shapefile zip file. For this counties example, the unzipped contents look like this:

$ ls  
Counties_Atlanta_Region.cpg Counties_Atlanta_Region.dbf Counties_Atlanta_Region.prj Counties_Atlanta_Region.shp Counties_Atlanta_Region.shx

Converting Shapefiles to GeoJSON

After you have a shapefile, the next step is to get the data into GeoJSON format.

Looking at the Atlanta counties again, the .shp file is the one that's interesting to us, and the ogr2ogr tool can be used to convert .shp files to GeoJSON. ogr2ogr is part of GDAL and, on a Mac with Homebrew installed, can be installed with:

brew install gdal

Alternatively, you can install it manually. ogr2ogr is a wonderful tool to have on your laptop for using/testing geo data. Once you have it, continuing with our example, you should be able to run:

ogr2ogr -f GeoJSON -t_srs crs:84 output_counties.json Counties_Atlanta_Region.shp

This means:

  • -f GeoJSON: Output to GeoJSON format
  • -t_srs crs:84: Transform the output to WGS84, which is the same coordinate reference system that GPS uses. There are a lot of coordinate reference systems, and if you know you need the data in a different one you can override this, though that's generally a highly specialized case.
  • output_counties.json: the output file
  • Counties_Atlanta_Region.shp: the input file

After you run this, you now have a GeoJSON file.
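If you want to sanity-check the result before moving on, ogrinfo (installed alongside ogr2ogr as part of GDAL) can summarize the converted file; something like the following should report the feature count and field names:

ogrinfo -so -al output_counties.json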

Breaking a GeoJSON FeatureCollection Up

If we look at the resulting GeoJSON file, we see at the top of it:

"type": "FeatureCollection", "name": "Counties_Atlanta_Region", "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } }, "features": [ ...

Elasticsearch handles most GeoJSON, but FeatureCollections are composed of an array of Feature objects (Features can be geo points or shapes). FeatureCollections are sort of like a "bulk" dataset and we need to get individual points/shapes (Features) so that Elasticsearch can index them. In this example, the individual features are individual counties in Atlanta. This is where jq comes in handy.
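Concretely, each element of that features array is a single Feature that looks roughly like this (coordinates trimmed here; the geometry type may be Polygon or MultiPolygon depending on the shape, and the properties shown are specific to this dataset):

{ "type": "Feature", "properties": { "NAME10": "Barrow", "totpop10": 69367 }, "geometry": { "type": "Polygon", "coordinates": [ ... ] } }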

jq can also be installed via homebrew:

brew install jq

Afterwards, you can select each element of the features[] array from output_counties.json and output one feature per line with:

jq -c '.features[]' output_counties.json

The -c flag means "compact": it outputs one feature per line, which will be useful for what we're about to do next...
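Before moving on, a quick sanity check: count how many features (and therefore how many Elasticsearch documents) you're about to create:

jq '.features | length' output_counties.json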

Simultaneously Extracting Features and Converting to Bulk Format

We can do one step better than just extracting the features array by simultaneously converting the output to Elasticsearch's bulk format with sed:

jq -c '.features[]' output_counties.json | sed -e 's/^/{ "index" : { "_index" : "geodata", "_type" : "_doc" } }\
/' > output_counties_bulk.json && echo "" >> output_counties_bulk.json

The sed bit just prepends a bulk action header line (followed by a newline) to each record, and the echo "" >> output_counties_bulk.json makes sure the file ends in a newline, which the Elasticsearch bulk API requires.
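After this runs, the first two lines of output_counties_bulk.json should look roughly like the following (the feature line is truncated here):

{ "index" : { "_index" : "geodata", "_type" : "_doc" } }
{"type":"Feature","properties":{"OBJECTID":28,"NAME10":"Barrow", ... },"geometry":{ ... }}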

Change geodata to an index name of your choosing.

Set Up Elasticsearch Mappings

At this point, I'd set up the Elasticsearch mappings for this "geodata" index (or whatever name you want to give it). Metadata related to the shape is often in .properties and geo shape data is often in .geometry. The county data here looks typical:

jq -c '.features[].properties' output_counties.json

This shows us a list of properties like:

{"OBJECTID":28,"STATEFP10":"13","COUNTYFP10":"013","GEOID10":"13013","NAME10":"Barrow","NAMELSAD10":"Barrow County","totpop10":69367,"WFD":"N","RDC_AAA":"N","MNGWPD":"N","MPO":"Partial","MSA":"Y","F1HR_NA":"N","F8HR_NA":"N","Reg_Comm":"Northeast Georgia","Acres":104266,"Sq_Miles":162.914993,"Label":"BARROW","GlobalID":"{36E2EA48-1481-44D7-91C9-7C51AC8AB9E9}","last_edite":"2015-10-14T17:19:34.000Z"}

At this point, you can add any mappings around these fields and/or use an ingest node pipeline to manipulate the data prior to indexing. For now, I'm just going to set up the geo_shape field, but you can add extras.

PUT /geodata  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}
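If you also want to search or aggregate on attributes like the county name, you could extend the mapping with explicit fields under the properties object. This is just a sketch, assuming the NAME10 field from this particular dataset (swap in your own field names):

PUT /geodata
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "geometry": {
          "type": "geo_shape"
        },
        "properties": {
          "properties": {
            "NAME10": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}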

Bulk Loading Data to Elasticsearch

And at this point, you can bulk-load the data:

curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@output_counties_bulk.json"
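To confirm the documents made it in, a quick count against the index should match the number of features you saw from jq earlier:

curl localhost:9200/geodata/_count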

And then you can set up or reload Kibana index patterns for your index to make sure it shows up. Make sure any time filters are appropriate for the visualizations you use; I often turn off the "time" field for quick demos because dates can be inconsistent or missing (as I found this data to be).

Recap / Short Form

Get a shapefile

ogr2ogr -f GeoJSON -t_srs crs:84 your_geojson.json your_shapefile.shp

jq -c '.features[]' your_geojson.json | sed -e 's/^/{ "index" : { "_index" : "your_index", "_type" : "_doc" } }\
/' > your_geojson_bulk.json && echo "" >> your_geojson_bulk.json

Set up your mappings. Often the following works, but you may need to check field names:

PUT /your_index  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}

curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@your_geojson_bulk.json"

Set up (or refresh) Kibana index patterns to include your_index
