@barronh
Last active April 9, 2025 17:40
Monitor Coverage


County and Population Coverage of Monitors

The incomplete spatial coverage of monitors is a good motivation for data fusion. Monitors cover less than 30% of US counties, but more like 70% of the population. This is because dense populations often coincide with sources of pollution and, therefore, are better monitored. The precise coverage of monitoring data depends on the period of interest. For example, ozone monitors are often not deployed during winter.

According to this GAO report[1], in 2019 "2,120 out of 3,142 counties had no ambient air quality monitor associated with the monitoring system." They are quoting EPA. The quote has always confused me… It doesn't say O3 or PM25 or PM10 or NO2 or CO or SO2 -- just ambient air quality monitor. So that is 1,022 counties with a monitor of some type.

This gist saves a quick method to verify my approach (code attached). I ran it for all pollutants for 2019 and found monitors reporting data in 1,056 counties. That's 33.6% of counties in 2019 (31.9% in 2023), but those counties hold 80.9% of the population (using 2022 ACS county populations). Fast forward to 2023 monitors: it's 1,023 counties covering 79.9% of the population. If you slice it by pollutant…
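The county arithmetic above is easy to sanity check (the 3,142 county total comes from the GAO quote; the 1,056 count is from my 2019 run):

```python
# counties with at least one monitor of some type, implied by the GAO quote
total_counties = 3142
no_monitor = 2120
print(total_counties - no_monitor)

# percent of counties with any reporting monitor in the 2019 run
print(round(1056 / total_counties * 100, 1))
```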

• PM25: 766 counties (23.2%) and 241.4M people (72.2%)
• Ozone: 765 counties (23.1%) and 242.0M people (72.4%)

[1] https://www.gao.gov/assets/gao-21-38.pdf

__doc__ = """
Quick Check for specific Period of AirNow
=========================================
This script shows how to get an estimate of monitor coverage using census
population data and EPA's Remote Sensing Information Gateway and AirNow.
To summarize:
1. Query a time period for Ozone and PM measurements.
2. Take unique locations that reported for any hour.
3. Spatially intersect with county boundaries.
4. Sum population with ozone, pm, both or neither measurements.
Expected output
---------------
If run as-is, you should get the following output.
| | Pop [M] | County% | Pop% |
|:---------------|----------:|----------:|-------:|
| Ozone and PM25 | 210.8 | 15.3 | 62.8 |
| Ozone Only | 26.7 | 7.3 | 8 |
| PM25 Only | 24.9 | 6.1 | 7.4 |
| Neither | 73.1 | 71.3 | 21.8 |
You can modify the dates (bdate, edate) for the period of interest to get similar output.
Requirements
------------
Requires Python3, numpy, pandas, requests, geopandas, and pyrsig.
On most systems, these can be satisfied with the command:
python -m pip install numpy pandas requests geopandas pyrsig
Requires US Census Tiger Line County Shapefile as a zip file. This can be
downloaded at https://www2.census.gov/geo/tiger/TIGER2024/COUNTY/tl_2024_us_county.zip
"""
__version__ = '1.0.0'
import numpy as np
import pandas as pd
import pyrsig
import requests
import geopandas as gpd
bdate = '2024-07-01T00Z'
edate = '2024-07-07T00Z'
api = pyrsig.RsigApi()
# Download Ozone and PM data for time period
pmdf = api.to_dataframe('airnow.pm25', bdate=bdate, edate=edate)
o3df = api.to_dataframe('airnow.ozone', bdate=bdate, edate=edate)
cntydf = gpd.read_file('tl_2024_us_county.zip')
# Get county total population estimate B01001_001E for all counties as a dataframe
r = requests.get('https://api.census.gov/data/2023/acs/acs5?get=B01001_001E,GEO_ID&for=county')
j = r.json()
df = pd.DataFrame.from_records(j[1:], columns=j[0])
df['POP'] = df['B01001_001E'].astype('int64')  # census API returns values as strings
# Add population to county geodataframe
cntydf = pd.merge(cntydf, df[['GEO_ID', 'POP']], left_on='GEOIDFQ', right_on='GEO_ID', how='left')
# Create unique site locations for O3 and PM25
o3gdf = o3df.groupby(['LATITUDE(deg)', 'LONGITUDE(deg)', 'SITE_NAME'], as_index=False).first()
o3gdf = gpd.GeoDataFrame(o3gdf, geometry=gpd.points_from_xy(o3gdf['LONGITUDE(deg)'], o3gdf['LATITUDE(deg)']), crs=4326).to_crs(cntydf.crs)
pmgdf = pmdf.groupby(['LATITUDE(deg)', 'LONGITUDE(deg)', 'SITE_NAME'], as_index=False).first()
pmgdf = gpd.GeoDataFrame(pmgdf, geometry=gpd.points_from_xy(pmgdf['LONGITUDE(deg)'], pmgdf['LATITUDE(deg)']), crs=4326).to_crs(cntydf.crs)
# Identify which counties have ozone, pm, or both
cntydf['haso3'] = cntydf.geometry.intersects(o3gdf.geometry.union_all())
cntydf['haspm'] = cntydf.geometry.intersects(pmgdf.geometry.union_all())
cntydf['hasboth'] = cntydf['haso3'] & cntydf['haspm']
# Store as a categorical variable
cntydf['MON'] = 'Neither'
cntydf.loc[cntydf.haso3, 'MON'] = 'Ozone Only'
cntydf.loc[cntydf.haspm, 'MON'] = 'PM25 Only'
cntydf.loc[cntydf.hasboth, 'MON'] = 'Ozone and PM25'
# Create a national summary with millions of people, percent of people, and percent of county.
popdf = cntydf.groupby('MON')[['POP']].agg(POP=('POP', 'sum'), CNT=('POP', 'count'))
popdf['County%'] = popdf['CNT'] / popdf['CNT'].sum() * 100
popdf['Pop%'] = popdf['POP'] / popdf['POP'].sum() * 100
popdf['Pop [M]'] = popdf['POP'] / 1e6
popdf = popdf.loc[['Ozone and PM25', 'Ozone Only', 'PM25 Only', 'Neither']]
popdf.index.name = None
# Print Output as a Markdown
print(popdf.round(1)[['Pop [M]', 'County%', 'Pop%']].to_markdown())
# save results to disk
popdf.to_csv(f'censuscounties_{bdate}_{edate}.csv')
__doc__ = """
Gridded Population Distance From Monitor
========================================
This script shows how to get an estimate of monitor coverage using gridded
census population data and EPA's Air Quality System (AQS) pre-generated files.
To summarize:
1. Load 1km gridded population counts from a GeoTIFF.
2. Build unique AQS ozone monitor locations for a year.
3. Find the nearest monitor to each populated grid cell.
4. Sum population within distance thresholds of a monitor.
Expected output
---------------
If run as-is, you should get the following output.
1km, 1.98 millions, 0.65%
10km, 99.71 millions, 32.51%
25km, 216.59 millions, 70.62%
50km, 267.88 millions, 87.35%
100km, 295.39 millions, 96.32%
150km, 302.94 millions, 98.78%
200km, 305.26 millions, 99.54%
You can modify the year (2023 to 2024) to get similar output.
Requirements
------------
Requires Python3, pandas, geopandas, xarray, and rioxarray.
On most systems, these can be satisfied with the command:
python -m pip install pandas geopandas xarray rioxarray
Gridded Population of the World[1,2]
[1] Center for International Earth Science Information Network - CIESIN -
Columbia University. 2017. U.S. Census Grids (Summary File 1), 2010.
Palisades, New York: NASA Socioeconomic Data and Applications Center
(SEDAC). https://doi.org/10.7927/H40Z716C. Accessed October 11, 2024.
[2] Download as geotiff. To open with xarray, requires xarray and rioxarray.
"""
__version__ = '1.0.0'
import pandas as pd
import geopandas as gpd
import xarray as xr
# 2010 us population as a 1km (0.008333 degree) downloaded from CIESIN
# open, mask missing values, remove the band dimension, fill masked
# values with 0, and rename data as pop
f = xr.open_dataset(
    'geotiff/uspop10.tif', engine='rasterio'
).where(lambda x: x > -3.4e+38).sel(band=1).fillna(0).rename_vars(band_data='pop')
popdf = f['pop'].to_dataframe().query('pop > 0')['pop'].reset_index()
gpopdf = gpd.GeoDataFrame(
    popdf[['pop']], geometry=gpd.points_from_xy(popdf['x'], popdf['y']), crs=4326)
# Open AQS observations of ozone (44201) for a year
aqsdf = pd.read_csv(
    'https://aqs.epa.gov/aqsweb/airdata/daily_44201_2023.zip'
)
# Create a dataframe of individual sites
sitedf = aqsdf.groupby(['State Code', 'County Code', 'Site Num']).mean(numeric_only=True)
# Upgrade dataframe to a geodataframe
gsitedf = gpd.GeoDataFrame(
    sitedf[['Arithmetic Mean', '1st Max Value']],
    geometry=gpd.points_from_xy(sitedf['Longitude'], sitedf['Latitude']),
    crs=4326
)
# Find the nearest monitor to the population centroid.
nsitedf = gpd.sjoin_nearest(gpopdf, gsitedf, distance_col='dist')
# Because the dataset is nominally 1km, the average grid spacing in degrees
# (0.008333) is used as the degree equivalent of 1km
onekm = float(f['x'].diff('x').mean())
ntot = nsitedf['pop'].sum()
for m in [1, 10, 25, 50, 100, 150, 200]:
    n = nsitedf.query(f'dist < {m * onekm}')['pop'].sum()
    print(f'{m:3d}km, {n / 1e6:6.2f} millions, {n / ntot:6.2%}')
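One caveat on the degree-based thresholds above: in EPSG:4326 a degree of longitude shrinks with latitude, so the `dist` column is anisotropic. A minimal sketch of a metric alternative, using synthetic made-up points and assuming EPSG:5070 (NAD83 / Conus Albers, meter units) is acceptable for CONUS, reprojects before the nearest join:

```python
import geopandas as gpd
from shapely.geometry import Point

# two hypothetical population cells and one monitor (made-up coordinates)
pop = gpd.GeoDataFrame(
    {'pop': [100.0, 200.0]},
    geometry=[Point(-97.0, 35.0), Point(-96.5, 35.0)], crs=4326)
site = gpd.GeoDataFrame({'id': [1]}, geometry=[Point(-97.0, 35.0)], crs=4326)
# after reprojecting to a meter-based CRS, distance_col is in meters
near = gpd.sjoin_nearest(pop.to_crs(5070), site.to_crs(5070), distance_col='dist_m')
print((near['dist_m'] / 1e3).round(1).tolist())  # km to nearest monitor
```

For the grid-scale distances in this script the degree approximation is close enough, but the projected variant avoids the latitude dependence entirely.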