@alexmi256
Created April 15, 2024 02:02
Shrink GeoJSON Attribute Properties

What

I wanted to extract buildings from OpenStreetMap so I could display them on a map using Folium/Leaflet.

The first recommendation online was to use QGIS with the QuickOSM plugin.

While this was annoying to do via the QuickOSM GUI (I should've downloaded the files locally and used SQL), I was able to output a GeoJSON file of buildings.

Unfortunately, due to data quality issues and the format QGIS outputs, I had a ton of null feature properties that I wanted to get rid of.

How

tl;dr

  • 1.8G -> 101M
  • pretty print JSON using jq
  • use sed to delete lines with null values in place
  • remove remaining null values using jq
  • compact the JSON using jq

jq to remove nulls

The first idea was to use jq to delete null values, and this seemed simple enough:

jq 'del(..|nulls)' buildings.geojson > buildings_nn.geojson
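To make it concrete what that filter does, here is a rough Python equivalent run on a made-up feature. The function name strip_nulls and the sample properties are hypothetical; this is a sketch of the same idea, not how jq implements it, and it also holds the whole document in memory.

```python
import json


def strip_nulls(value):
    """Recursively drop null members from objects and null items from arrays."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value if v is not None]
    return value


# Hypothetical feature, loosely shaped like an exported building (values made up).
feature = {
    "type": "Feature",
    "properties": {
        "building": "yes",
        "name": None,
        "addr:street": None,
        "height": "12",
    },
    "geometry": {"type": "Point", "coordinates": [-73.57, 45.5]},
}

cleaned = strip_nulls(feature)
print(json.dumps(cleaned))
```

Only the null-valued keys disappear; everything else, including the geometry, is left alone.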

While I thought an AMD Ryzen 7 7840HS and 32GB of RAM would be plenty for a 1.8G JSON file, I turned out to be wrong: Linux ended up terminating the process. Apparently jq loads the whole document into memory, and I confirmed this via htop, which showed memory usage climbing to its max after a while.

After some searching, I found out jq can stream data (the --stream option), but the examples for this aren't great and some users mentioned speed issues. I was also lazy and didn't feel like trying too hard at this since I had a dumber solution.

regex the nulls

We should be able to remove null values using regex.

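By pure luck, this GeoJSON file does not have any "important" nulls. (The format does allow null for a feature's geometry or properties, so check for those before trying this on other data.)

I wanted to see what the GeoJSON data looked like, so I ended up pretty printing it: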

jq . buildings.geojson > buildings_pp.geojson

Here we go from 1.8G to 2.5G.

Now we can use sed to delete lines with nulls:

sed -i -E '/\s+"[[:alnum:]_:]+": null,/d' buildings_pp.geojson

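We're using sed's d command to delete the whole line, which works because pretty printing puts each key on its own line. You may be able to get away without pretty printing, but then deleting whole lines would likely remove too much, so you'd need a substitution instead.

You might also be able to remove the \s+ since we're deleting the whole line anyway and only need to match the "key": null, part.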

This brings us from 2.5G to 291M.

However, our regex didn't remove cases where the last key had a null value, since the pattern requires a trailing comma (, and not ,?). I did the former because the latter could mess up the JSON by leaving a dangling comma on the previous line, and I also didn't want to write a longer regex.
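The trailing-comma problem is easy to see on a tiny example. Below is a small Python sketch of the same line filter. The sample text is made up, and [A-Za-z0-9_:] stands in for sed's [[:alnum:]_:] because Python's re module has no POSIX character classes, so treat it as an approximation of the sed command rather than a drop-in replacement.

```python
import json
import re

# Made-up pretty-printed properties object; "height" is the last key and is null.
pretty = """{
  "building": "yes",
  "name": null,
  "addr:street": null,
  "levels": "3",
  "height": null
}"""

# Same idea as the sed pattern: a key, then null, then a required comma.
strict = re.compile(r'\s+"[A-Za-z0-9_:]+": null,')
# Variant with an optional comma, which also catches the last key.
loose = re.compile(r'\s+"[A-Za-z0-9_:]+": null,?')


def drop_matching_lines(text, pattern):
    """Delete every line the pattern matches, like sed's d command."""
    return "\n".join(
        line for line in text.split("\n") if not pattern.search(line)
    )


strict_out = drop_matching_lines(pretty, strict)
loose_out = drop_matching_lines(pretty, loose)

# Still valid JSON, but "height": null survives because it has no comma.
print(json.loads(strict_out))

try:
    json.loads(loose_out)
except json.JSONDecodeError as err:
    # "levels": "3", is now followed directly by }, which JSON does not allow.
    print("invalid JSON:", err)
```

So the stricter pattern trades a few leftover nulls for output that still parses, which is why a final jq pass is needed.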

Now that the file is smaller (291M), let's use jq to actually remove the last null values that are at the end of each feature's properties.

jq 'del(..|nulls)' buildings_pp.geojson > buildings_ppnn.geojson

This brings us down to 284M.

Finally we can compact the JSON and remove the pretty printing.

jq -c . buildings_ppnn.geojson > buildings_ppnnc.geojson

This results in a more manageable 101M GeoJSON file.
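Most of that last drop is just whitespace. As a rough illustration in Python (the sample data is made up, and indent=2 with tight separators is my approximation of jq . and jq -c ., not a byte-for-byte match), the same object serialises much smaller without indentation and newlines:

```python
import json

# Made-up feature collection with a few small features.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"building": "yes"},
            "geometry": {"type": "Point", "coordinates": [i, i]},
        }
        for i in range(3)
    ],
}

# Roughly what jq . produces: one key or array item per line, indented.
pretty = json.dumps(collection, indent=2)
# Roughly what jq -c . produces: no spaces or newlines between tokens.
compact = json.dumps(collection, separators=(",", ":"))

print(len(pretty), "bytes pretty vs", len(compact), "bytes compact")

# Both strings decode to the same data; only the whitespace differs.
same_data = json.loads(pretty) == json.loads(compact) == collection
print("same data:", same_data)
```

Geometry-heavy files benefit the most, since every coordinate ends up on its own indented line when pretty printed.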
