@mheberger
Created November 15, 2022 13:17
Python script for delineating watersheds in the continental United States with the USGS NLDI API
#!/usr/bin/env python
# coding: utf-8
# # Demo of watershed delineation using the USGS NLDI API
#
# This demo shows you how to use Python to find the watershed for any point in the continental United States. We'll use
# a new water data service created by the US Geological Survey (USGS), called the Hydro Network Linked Data Index (NLDI).
# For more info, see: https://labs.waterdata.usgs.gov/about-nldi/index.html
#
# The service is an application programming interface (API) that lets you query geodata and other information from the
# National Hydrography Dataset (NHD), version 2. The NHD is "the nation's water data backbone," and contains tons of
# useful information about rivers and drainage areas (or catchments) that has been compiled by state and federal
# agencies over the course of decades.
#
# Despite the USGS's stated goal of making "the data in the NHD available to new audiences,"
# I found that getting the data I wanted was not a completely easy process, although it is probably much easier than
# it was a few years ago! Follow along, or download the Jupyter notebook and try it for yourself.
# First, import the libraries we'll be using
import pynhd
import pandas
import geopandas
import requests
import matplotlib.pyplot as plt
import contextily
from shapely.geometry import Polygon
# We'll use pynhd, a library that makes it easy to get data from the USGS NLDI API and
# puts the results into GeoPandas dataframes.
# https://github.com/hyriver/pynhd
pygeoapi = pynhd.PyGeoAPI()
# ## Specify the watershed outlet
# Here are the coordinates for the watershed outlet, in latitude and longitude. In my experimenting, I found it is
# really hard to get good results for watersheds with the API. When you enter coordinates on a large river and expect a
# big watershed, it often gives you a small watershed of a little local drainage area. I think I found the reason in
# this [USGS blog post](https://waterdata.usgs.gov/blog/nldi-processes/):
#
# ...the split catchment algorithm requires that a **precise location** along a flowline be provided (as can be
# derived from the raindrop trace algorithm). [emphasis added]
#
# It turns out that we need to do our own "snap to river centerline" in order to get our expected results. In truth, we
# have no way of knowing exactly where this will be on their digital representation of the world. The good news is that
# the API provides a function for this, called `hydrolocation`.
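# As an aside, instead of formatting the query string by hand (as below), you can let the requests
# library handle the URL encoding via its params argument. A minimal sketch, assuming only that the
# endpoint accepts the same f and coords query parameters:

```python
import requests

# Build the hydrolocation request and let requests encode the query string
lat, lng = 46.666, -117.44
req = requests.Request(
    "GET",
    "https://labs.waterdata.usgs.gov/api/nldi/linked-data/hydrolocation",
    params={"f": "json", "coords": "POINT({} {})".format(lng, lat)},
)
prepared = req.prepare()
print(prepared.url)  # the fully encoded request URL
```

# This avoids subtle bugs with spaces and parentheses in hand-built URLs.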
# This is our REQUESTED point. We are going to use their service to nudge it onto a nearby river centerline.
lat, lng = 46.666, -117.44
# Here is how to use the API to snap to the river centerline
url = "https://labs.waterdata.usgs.gov/api/nldi/linked-data/hydrolocation?f=json&coords=POINT({} {})".format(lng, lat)
r = requests.get(url=url)
r.json()
# If we inspect the JSON returned from their server, we can see a few things. It found a river 83.5 meters away. That
# river reach has the comid (for "common id") of 23449539. And the new coordinates are the ridiculously precise
# [-117.44090682783197, 46.66579934880307]. (See the relevant XKCD.) So we'll use these coordinates going forward.
data = r.json()
coordinates = data['features'][0]['geometry']['coordinates']
lng = coordinates[0]
lat = coordinates[1]
# Make a little GeoPandas dataframe for the point, just to make plotting super easy later on
point_df = pandas.DataFrame({'id': [1], 'name': ['Watershed outlet'], 'latitude': lat, 'longitude': lng})
geometry = geopandas.points_from_xy(x=[lng], y=[lat], crs="EPSG:4326")
point_gdf = geopandas.GeoDataFrame(point_df, geometry=geometry)
point_gdf.head()
# ## Get the watershed
# pynhd's split_catchment() function gives the upstream watershed boundary as a polygon (all of
# the upstream unit catchments merged and dissolved).
watershed_gdf = pygeoapi.split_catchment((lng, lat), crs="epsg:4326", upstream=True)
watershed_gdf.head()
# Plot the watershed boundary -- GeoPandas makes this easy using Matplotlib
ax = watershed_gdf.plot(alpha=0.5)
# Add the outlet point to the map
point_gdf.plot(ax=ax, edgecolor="red", facecolor="red", markersize=10)
plt.title("Watershed boundary")
plt.show()
# Calculate the area in "square decimal degrees" (!) and put it in a Python list
areas = geopandas.GeoSeries(watershed_gdf['geometry']).area.tolist()
print("Area of the two features: 1: {0[0]:.5f}, 2: {0[1]:.5f}".format(areas))
# GeoPandas gives us a scary-looking warning. It's true: areas calculated from unprojected lat, lng coordinates are
# mostly meaningless. They can't be converted to square meters or square miles. But, for our purposes,
# it's good enough, and can tell us which polygon is bigger than the other. We'll use this one-liner to get the
# index (list position) of the feature with the smallest area. Then we'll delete (drop) the row with that index or
# row number. Remember that indexes start at _zero_ in Python!
idx_small_polygon = areas.index(min(areas))
# Drop the smallest polygon in the watershed boundaries dataframe
watershed_gdf = watershed_gdf.drop(watershed_gdf.index[idx_small_polygon])
watershed_gdf.plot()
plt.title("Watershed boundary, large part only")
# Our watershed has lots of little donut holes in it. These appear where there are "sinks" inside the watershed,
# where water drains to a depression and gets stuck there, rather than flowing into a river or stream and toward the
# watershed outlet.
# I believe that how to deal with these is an unsettled question in hydrology. My view is that they are usually
# unwanted. So in this case, let's fill them in. I adapted this code snippet from Stack Overflow that does the job:
# https://stackoverflow.com/questions/61427797/filling-a-hole-in-a-multipolygon-with-shapely-netherlands-2-digit-postal-codes
def close_holes(poly: Polygon, area_max: float) -> Polygon:
    """
    Close polygon holes by limitation to the exterior ring.

    Args:
        poly: Input shapely Polygon
        area_max: keep holes that are larger than this;
            fill any holes less than or equal to this.
            We're working with unprojected lat, lng,
            so this needs to be in square decimal degrees...
            Setting area_max=0 fills *all* holes.

    Example:
        df.geometry.apply(lambda p: close_holes(p, area_max=0))
    """
    if area_max == 0:
        if poly.interiors:
            return Polygon(list(poly.exterior.coords))
        return poly
    list_interiors = []
    for interior in poly.interiors:
        p = Polygon(interior)
        if p.area > area_max:
            list_interiors.append(interior)
    return Polygon(poly.exterior.coords, holes=list_interiors)
watershed_gdf = watershed_gdf.geometry.apply(lambda p: close_holes(p, 0))
watershed_gdf.plot()
plt.title("Watershed with holes filled in")
plt.show()
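# As a quick sanity check, here is the hole-filling idea on a toy polygon (a 10x10 square with a
# 1x1 hole). This sketch only assumes shapely: rebuilding a polygon from its exterior ring drops
# all of its holes, which is exactly what close_holes() does when area_max=0.

```python
from shapely.geometry import Polygon

# A 10x10 square with a 1x1 hole punched in the middle
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (5, 4), (5, 5), (4, 5)]
poly_with_hole = Polygon(outer, holes=[hole])

# Rebuild the polygon from its exterior ring only, dropping the hole
filled = Polygon(list(poly_with_hole.exterior.coords))

print(poly_with_hole.area)    # 99.0 (hole subtracted)
print(filled.area)            # 100.0
print(len(filled.interiors))  # 0
```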
# ## Get the upstream river reaches
# I mistakenly thought you could use pynhd's `flow_trace()` function to get all of
# the upstream river reaches. It turns out it only gives you one river reach, the most immediately upstream (or
# downstream).
rivers_gdf = pygeoapi.flow_trace((lng, lat), crs="epsg:4326", direction="up")
rivers_gdf.head()
rivers_gdf.plot()
plt.title("River reaches returned by the flow_trace() function")
plt.show()
# It looks like this function is not working the way we want it to. It is only returning one upstream river reach
# rather than all of the reaches in the upstream watershed. It turns out that pynhd has a function navigate_byloc().
# See: https://docs.hyriver.io/autoapi/pynhd/pynhd/index.html#pynhd.pynhd.NLDI.navigate_byloc
#
# But this function queries a different, older service at USGS: an ESRI ArcGIS map server. I want to keep using the
# newer NLDI data, so we can query the USGS API on our own. It just requires a few more lines of code.
#
# First, we need to find out what **unit catchment** our watershed outlet is in.
# We'll use the "linked data" API. We are looking for its comid, or "common id"
url = "https://labs.waterdata.usgs.gov/api/nldi/linked-data/comid/position?f=json&coords=POINT({} {})".format(lng, lat)
r = requests.get(url=url)
# The response is a JSON string representing the river reach.
# It has a unique ID (comid) and polyline (LineString) geometry
# Convert the server response into a Python object (dictionary)
data = r.json()
# Get the comid of the returned unit catchment. It's buried in the dictionary.
comid = data['features'][0]['properties']['comid']
print("Watershed outlet is in unit catchment with comid = {}".format(comid))
# The comid is kind of the starting point of our watershed: it is the unique identifier for the unit catchment that
# contains our outlet. Now that we have the comid, get all of the upstream river reaches using the navigation API.
url = "https://labs.waterdata.usgs.gov/api/nldi/linked-data/comid/{}/navigation/UT/flowlines?f=json&distance=9999".\
format(comid)
r = requests.get(url=url)
# Convert the tributaries GeoJSON into a GeoPandas dataframe
tribs_gdf = geopandas.read_file(r.text, driver='GeoJSON')
# Plot the tributaries
ax = tribs_gdf.plot(edgecolor="blue")
plt.title("Rivers returned by the navigation API")
# Add the outlet point so we can keep our bearings
point_gdf.plot(ax=ax, edgecolor="red", facecolor="red", markersize=15, zorder=2)
plt.show()
# We now have the watershed boundary and the rivers. We could stop here. Another thing you'll notice though is that
# for a large watershed, there are too many rivers to see them all. We need a way to "prune" the river network. In
# the next step, we'll add extra information or attributes to each river reach to help us do this.
#
# A strength of the NHD is that it contains _lots_ of useful information about the rivers. At present, we only have
# the **comid** and the **geometry** for each river reach. But the USGS has tabulated lots of other attributes, like
# stream order, name, length, slope, and many more.
#
# They call these "Value Added Attributes" (VAA). You can get these in the GIS data from the USGS, but downloading
# all of these data files would be overkill for us. I found this online resource compiled by scientist J. Michael
# Johnson at UC Santa Barbara: https://www.hydroshare.org/resource/6092c8a62fac45be97a09bfd0b0bf726/
#
# Here, I have downloaded the file and unzipped it to my desktop. The important file is `nhdplusVAA.parquet`.
# This "parquet" file is a highly-compressed binary file that contains the tabular data for the entire NHD catalog
# (about 2.7 million records). Of course, it is not efficient at all to read this huge file just to extract a few rows.
# This would be worth optimizing if you are delineating many watersheds... for example, by loading it into a database
# and using a query to get the rows you need, or by loading it once and looping over several watershed outlets.
fname = 'nhdplusVAA.parquet'
df_attribs = pandas.read_parquet(fname)
# Suppose you only wanted a few of the columns:
# df_attribs = pandas.read_parquet(fname, columns=['comid', 'streamorde', 'lengthkm', 'gnis_name'])
df_attribs.head()
# It was a little hard to find documentation for what all these fields mean, but I found them in a table on page 6 of
# this document:
# https://edap-ow-data-commons.s3.amazonaws.com/NHDPlusV21/Data/NationalData/0Release_Notes_NationalData_Seamless_GeoDatabase.pdf
#
# Next, we will join this information to the tributaries geodataframe, `tribs_gdf`. But first, we need to make sure
# the join fields have the same name and data type.
# For convenience, rename the column "nhdplus_comid" to "comid"
# and convert it from a float to an int
tribs_gdf['comid'] = tribs_gdf['nhdplus_comid'].astype('int64')
tribs_gdf = tribs_gdf.drop(['nhdplus_comid'], axis=1)
# We need to convert the comid field from a float to an integer
df_attribs['comid'] = df_attribs['comid'].astype('int64')
# GeoPandas makes it easy to do a table join and add the attributes columns to our tributaries geodataframe
tribs_attrs_gdf = tribs_gdf.merge(df_attribs, on='comid')
tribs_attrs_gdf.head()
# ## Prune the river network
# Next we will prune the river network by removing smaller tributaries. I have found that
# keeping a total of 4 or 5 orders of streams results in maps that are visually pleasing and easier to read.
# Find the minimum stream order in our river network
min_order = tribs_attrs_gdf.streamorde.min()
max_order = tribs_attrs_gdf.streamorde.max()
min_order_to_keep = max_order - 4
print("Stream order ranges from {} to {}".format(min_order, max_order))
tribs_attrs_gdf = tribs_attrs_gdf[tribs_attrs_gdf.streamorde >= min_order_to_keep]
len(tribs_attrs_gdf)
# Before, we had 85,393 river reaches. We filtered out all those with an order less than 4, leaving us with orders 4
# to 8, or 13,195 reaches. This is a pretty big reduction! Let's see the results...
#
# ## Make our final map
# Here we'll use geopandas' built-in plotting methods, which rely on matplotlib, to make a nice
# map that we can print or embed in a document.
#
# When we plot the rivers on a map, I find it looks nice if big rivers have a heavier line and small streams a
# lighter one. We can approximate this by setting the line width proportional to the "stream order," one of the
# attributes that we added above. Little headwater streams have order = 1, and the numbers increase as you move
# downstream. For big watersheds, you will want to remove smaller streams or your map will be too "busy."
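# One simple mapping (the one used in the code below) is width = (order - 0.5) ** 0.5, which grows
# sub-linearly, so big rivers stand out without tiny streams vanishing. Plain Python, just to
# illustrate the values it produces:

```python
# Line widths produced by (order - 0.5) ** 0.5 for stream orders 1 through 8
widths = {order: round((order - 0.5) ** 0.5, 2) for order in range(1, 9)}
print(widths)
# {1: 0.71, 2: 1.22, 3: 1.58, 4: 1.87, 5: 2.12, 6: 2.35, 7: 2.55, 8: 2.74}
```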
#
# We'll also reproject our data to Web Mercator so we can use an existing basemap from the web.
# Add a field for the river width
tribs_attrs_gdf['linewidth'] = (tribs_attrs_gdf['streamorde'] - 0.5) ** 0.5
# Reproject to Web Mercator so we can show a nice basemap
watershed_webmercator = watershed_gdf.to_crs(epsg=3857)
# Plot the watershed boundary, and save the "axes" object that is returned
# so we can plot other layers on the same plot
ax = watershed_webmercator.plot(alpha=1, facecolor="none", edgecolor='red', zorder=1, linewidth=2, figsize=(7, 8))
# Use the contextily library to add a basemap
# Default is Stamen terrain, which is nice, but there are tons of different options:
# https://contextily.readthedocs.io/en/latest/providers_deepdive.html
contextily.add_basemap(ax)
# Plot the rivers
tribs_attrs_webmercator = tribs_attrs_gdf.to_crs(epsg=3857)
tribs_attrs_webmercator.plot(ax=ax, linewidth=tribs_attrs_gdf['linewidth'], edgecolor="blue", zorder=2)
# Plot the outlet point
point_webmercator = point_gdf.to_crs(epsg=3857)
point_webmercator.plot(ax=ax, linewidth=2, edgecolor="purple", facecolor="none", markersize=40, alpha=1, zorder=3)
plt.show()
# On some of the maps I made, it looked like some of the streams cross the watershed boundary, which is a big no-no.
# It may be due to some errors or inconsistencies in the underlying data. But it's hard to tell from the static
# image. I can't figure out how to get plt.show() to work in a Jupyter notebook, but we can do something even better.
#
# # Map #2: Interactive map
# Make a Leaflet map with Folium. This is a map you can open in a web browser, like Google
# Maps, and it lets you pan and zoom to explore the watershed.
# We can use the explore() method of Geopandas to put the features on a folium map
# This returns a folium map object that we can reuse to add other layers.
folium_map = watershed_gdf.explore(color="red", popup=False, style_kwds={'fill': False})
# Add a marker to show the outlet
point_gdf.explore(m=folium_map, style_kwds=dict(stroke=True, weight=3, color='purple', radius=5))
# Add the rivers to the map. The style function varies the line width (or weight) by stream order
tribs_attrs_gdf.explore(
    m=folium_map,
    style_kwds=dict(
        color='blue',
        style_function=lambda x: {"weight": x["properties"]['linewidth']}
    )
)
# In an IDE like Pycharm, you need to save the map to an html file.
print("Exporting interactive map. Open map.html in a browser!")
folium_map.save("map.html")
# # Save your results
# If you're happy with the results, and want to save them for later, GeoPandas makes it easy to save the results in a
# bunch of different formats, like GeoJSON or shapefiles.
print("Exporting watershed and rivers")
# Export the watershed boundary and rivers as GeoJSON files
watershed_gdf.to_file('watershed.geojson', driver='GeoJSON')
# I could not get the dataframe with the joined attributes to write, because two of the columns were of type
# 'categorical,' and GeoPandas threw an error trying to handle them. Here I'm just deleting those columns with drop(),
# but presumably one could also change the data type to a string or integer, and that would probably fix the
# problem too.
tribs_attrs_gdf.drop(['reachcode'], axis=1, inplace=True)
tribs_attrs_gdf.drop(['wbareatype'], axis=1, inplace=True)
# Try saving to a shapefile this time.
tribs_attrs_gdf.to_file('rivers.shp')
# # Conclusion
# In this script, we saw how to:
#
# * use the new USGS NLDI API to delineate a watershed for anywhere in the continental United States
# * load the river reaches in the watershed
# * attach "value added attributes" to learn more about those rivers
# * create attractive plots and an interactive map
# * save the results in a format that can be shared or used in other software
#
# **Happy watershed delineation!**