@kgjenkins
Last active February 20, 2020 17:30
Splitting MS Buildings into county-based files

Splitting Microsoft Buildings into county-based files

Microsoft US Building Footprints are available at:
https://github.com/Microsoft/USBuildingFootprints

The state-based downloads are in .geojson format, which is a poor choice for datasets this large: .geojson has no spatial index, so the files are very slow to load and render. It is therefore useful to split each state and save a separate file (geopackage or shapefile) for each county. Some of the polygons also have invalid geometries that we should fix along the way.

Below are instructions for splitting New York into counties using QGIS.

1. Download the data

Be sure to unzip both .zip files (the New York buildings and the tl_2019_us_county county boundaries), then load both into QGIS (the buildings may take a couple of minutes).

2. Fix geometries

The MS Buildings polygons may have some invalid geometries that will cause problems later on, so let's fix them.

  • Processing toolbox > Fix Geometries (just output to temp layer)

3. Join the county names to the buildings

We can make the join process much faster by selecting just the New York counties.

  • In the toolbar, click the yellowish "Select Features by Value" tool

    • select where STATEFP = 36 (the FIPS code for New York)
  • Processing toolbox > Join Attributes by Location

    • Input layer = Fixed Geometries
    • Join layer = tl_2019_us_county
    • Check "Selected features only" for the counties
    • Geometric predicate = "intersects"
    • Fields to add -- click "..." and select "NAME" (which is the county name)
    • Join type = "Create separate feature for each located feature (one-to-many)"
    • Output to temp layer
    • Click "Run" -- you'll see a warning that "no spatial index exists for input layer", but that's okay: the join has to loop through every building anyway, and the time spent creating the index would cancel out most of the savings.

By choosing a one-to-many join, any buildings that cross a boundary will be copied, so that the building will be included in both county files in the end.
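For readers who prefer scripting, steps 2 and 3 can be sketched for the QGIS Python console. This is a rough sketch, not the method used above: it assumes a recent QGIS 3 release in which the algorithm ids are `native:fixgeometries` and `native:joinattributesbylocation` (some older releases used a `qgis:` prefix), and the helper names and layer variables are hypothetical. The functions only build the parameter dictionaries, so they can be checked outside QGIS; the `processing.run` calls are shown as comments. The Processing history panel shows the exact command your own QGIS version generates.

```python
# Sketch only: parameter dictionaries for the QGIS Processing framework.
# Algorithm ids and parameter keys are assumed for a recent QGIS 3.x.


def fix_geometries_params(buildings):
    """Parameters for native:fixgeometries (step 2)."""
    return {
        "INPUT": buildings,              # layer object, layer id, or file path
        "OUTPUT": "TEMPORARY_OUTPUT",    # same as "output to temp layer"
    }


def join_params(fixed_buildings, selected_counties):
    """Parameters for native:joinattributesbylocation (step 3)."""
    return {
        "INPUT": fixed_buildings,
        "JOIN": selected_counties,
        "PREDICATE": [0],                # 0 = intersects
        "JOIN_FIELDS": ["NAME"],         # the county name
        "METHOD": 0,                     # 0 = one-to-many
        "DISCARD_NONMATCHING": False,    # keep buildings with no county match
        "PREFIX": "",
        "OUTPUT": "TEMPORARY_OUTPUT",
    }


# Assumed usage inside the QGIS Python console (not runnable elsewhere);
# buildings_layer and county_layer are hypothetical layer variables:
#   import processing
#   from qgis.core import QgsProcessingFeatureSourceDefinition
#   counties = QgsProcessingFeatureSourceDefinition(
#       county_layer.id(), selectedFeaturesOnly=True)
#   fixed = processing.run("native:fixgeometries",
#                          fix_geometries_params(buildings_layer))["OUTPUT"]
#   joined = processing.run("native:joinattributesbylocation",
#                           join_params(fixed, counties))["OUTPUT"]
```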

4. Add an id column

It will be helpful to have an arbitrary id column, so that we'll still have at least one column left if we want to remove the county column after splitting. By saving to a geopackage file, an "fid" column will automatically be added.

  • Right-click "Joined layer" > Make Permanent...
    • Format = GeoPackage
    • File name -- click the "..." to specify the output location and save as "joined.gpkg"
    • leave the other settings as they are and click "OK"
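Because a GeoPackage is an SQLite database, the new "fid" column can be confirmed without opening QGIS. Here is a minimal sketch using Python's standard `sqlite3` module. The helper name `column_names` is hypothetical, and the table name "joined" is an assumption based on the file name, so check the layer name in your own file.

```python
import sqlite3


def column_names(gpkg_path, table):
    """Return the column names of one table in a GeoPackage (SQLite) file."""
    con = sqlite3.connect(gpkg_path)
    try:
        # PRAGMA statements cannot take bound parameters, so quote the name.
        quoted = '"' + table.replace('"', '""') + '"'
        rows = con.execute(f"PRAGMA table_info({quoted})").fetchall()
    finally:
        con.close()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in rows]


# Assumed usage; "joined" as the table name is a guess based on the file name:
# print(column_names("joined.gpkg", "joined"))
```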

5. Split into separate files by county

  • Processing toolbox > Split vector layer
    • Input layer = Joined layer
    • Unique ID field = "NAME"
    • Click the "..." to specify the output directory (create and save to a directory called "split")

Before running, notice that there is no place to specify the output file type. It will be whatever the default vector extension is set to in your QGIS settings.

  • Settings menu > Options
    • Click the "processing" tab on the left
    • Search (at top left) for "default"
    • Set the "Default output vector layer extension" to whatever you want (probably gpkg or shp)

Finally, go back to your "Split vector layer" dialog and click "Run".

Watch out -- there might be an extra, empty file for a "NULL" county, which can be deleted.

6. Remove the "NAME" column

Now that the buildings are split into multiple files, the "NAME" column that contains the county name is rather unnecessary. We can delete this column from all the files using a batch process.

  • Processing toolbox > Drop field(s)
    • Click "Run as Batch Process..." (in the bottom left)
    • Under "Input layer", click "Autofill..." > "Add Files by Pattern..."
    • File pattern = *.gpkg (or *.shp if using shapefiles)
    • Look in = select your "split" directory
    • Click "Find Files"
    • OK

This adds a row for each file that will be processed.

  • Under "Fields to drop", click the "..." in the first row
  • Select the "NAME" column
  • Click "Autofill..." > Fill Down

This copies that option to all the rows. Now we need to set the output filename.

  • Under "Remaining fields" (the output column), click "Autofill..." > Calculate by expression
  • Use this expression: regexp_replace( @INPUT , 'NAME_', 'msbuildings_ny_')
  • Click OK, then Run

This should give us a nicely-named file for each county.
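To see what the expression does to each input path, here is a rough Python equivalent using `re.sub`, which, like `regexp_replace`, replaces every match. The example paths and the helper name are hypothetical. One caveat follows from the pattern itself: `NAME_` is matched anywhere in the path, so a parent folder whose name contains `NAME_` would be rewritten too.

```python
import re


def output_path(input_path):
    """Mimic regexp_replace(@INPUT, 'NAME_', 'msbuildings_ny_')."""
    return re.sub("NAME_", "msbuildings_ny_", input_path)


print(output_path("C:/data/split/NAME_Albany.gpkg"))
# C:/data/split/msbuildings_ny_Albany.gpkg
```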

  • Delete all the old files like "NAME_Albany.gpkg"
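The cleanup can also be scripted. Here is a minimal sketch with `pathlib`, assuming the old files all start with "NAME_" and sit directly in the "split" directory. The function name is hypothetical. It defaults to a dry run so the list can be inspected before anything is deleted. For shapefiles, the sidecar files (.dbf, .shx, .prj) match the same pattern.

```python
from pathlib import Path


def remove_old_files(split_dir, dry_run=True):
    """List (and optionally delete) files named NAME_* in split_dir."""
    old_files = sorted(
        path for path in Path(split_dir).glob("NAME_*") if path.is_file()
    )
    if not dry_run:
        for path in old_files:
            path.unlink()
    # Return the names that were (or would be) removed.
    return [path.name for path in old_files]


# Assumed usage: inspect the list first, then delete.
# print(remove_old_files("split"))
# remove_old_files("split", dry_run=False)
```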