Getting GSOD
This is a discursive recipe for turning the Global Summary of the Day weather dataset into a useful PostgreSQL database on a Unix system. It's aimed at data nerds with no prior knowledge of GSOD. For simplicity, it only describes one way of doing things, but the canny reader will see many possible variations (for example, using staged temporary files rather than a pipeline to do data conversion) and substitutions (mysql for postgres, curl for wget, perl for python, …).
Depending on how you do things, you’ll need something like 50 free gigabytes of disk space to work in, and the final database will use about 20 gigabytes.
At the end you'll have more than 120 million daily weather observations from around the world in efficiently queryable form, plus the ability to correlate them with some metadata about the stations that made them. This gives you, for example, the back end to support a world map that shows you a graph of weather averages from a point near where it's clicked.
GSOD is 1.2e8 (and counting) observations. We'll receive them as rows of ASCII, each with a weather station's WMO/Datsav3 code (6 decimal digits, the most significant of which correlate with geolocation) and/or its WBAN code (5 decimal digits), plus a date. This (WMO, WBAN, day) tuple, hereinafter WWD, will serve as our primary key, with some complexities to be explored later.
Then there are the payload fields. By default, fields are means of the named value – temperature means mean temperature, etc. (ftp://ftp.ncdc.noaa.gov/pub/data/gsod/readme.txt):
- Temperature and a temperature count. Count fields tell you how many hourly observations went into averages, maxes, and mins. (All records have a non-null average temperature and a temperature count of at least 3.)
- An average dewpoint and its count.
- Measured and sea-level-equivalent air pressure, both with counts. (Somewhat redundant with each other.)
- Visibility and its count.
- Wind speed and its count.
- Max sustained wind speed.
- Max gust (instantaneous wind) speed.
- Max temperature, plus a flag for whether this number is merely the highest of the hourly observations (= false) or the actual highest of the continuous day (= true).
- Min temperature and the analogous flag.
- Amount of precipitation as water (so a cm of snow would count as about a mm here), plus a flag character (A through I) encoding how the precipitation was measured over the day.
- Snowfall.
- Booleans for fog, rain, snow, hail, thunderstorms, and tornados.
- Observation counts. (Again, a count is the number of hourly observations that went into generating a min, max, or mean – e.g., if you only measure temperature at 11:00, noon, and 13:00 today, your temperature count for the day is 3.)

A station is identified by a WMO/Datsav3 number, a WBAN number, or both. Stations have basic metadata like latitude and longitude.
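In the raw files, the six weather-type booleans arrive packed into a single six-digit 0/1 field (FRSHTT in the readme). A minimal decoding sketch (the function name is mine, not from the dataset's tooling):

```python
def decode_frshtt(frshtt):
    """Unpack GSOD's FRSHTT field: six '0'/'1' digits meaning fog, rain,
    snow, hail, thunder, and tornado, in that order."""
    names = ('fog', 'rain', 'snow', 'hail', 'thunder', 'tornado')
    return {name: digit == '1' for name, digit in zip(names, frshtt)}
```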
Some caveats:
- GSOD is not pure public domain; there are restrictions on its use. See WMO Resolution 40.
- Playing with this dataset doesn't make you a climate scientist – any more than downloading the human genome gives you standing as a biologist, or looking at high-res Hubble images turns you into an astronomer. As a layperson (like me), you can use GSOD to answer many entertaining and intriguing questions about local weather over time, but it can't make you a peer-reviewed expert. So please don't make extravagant claims when you find, say, a temperature trendline that points up or down – it doesn't prove anything by itself. One reason is that:
- The dataset is imperfect in several ways. Individual stations have big gaps in their records, many stations don't record all fields (e.g. barometric pressure), and stations are distributed in a geographically unrepresentative way – high-GDP land is oversampled, and so on.
And that's just on the hardware side. As you'll see, while this dataset embodies a huge amount of mostly very careful and productive effort, and many people deserve our gratitude for making it and publishing it for free, it shows no trace of influence from anyone who could, say, define "first normal form". Data people, perhaps, but not programmers as we know them. Some fun features:
(A) Sometimes "no data" is represented as a zero value; sometimes it's represented by a field being filled with 9s (9999.9 and the like); sometimes it's represented by a separate count field being zero. Out-of-band nulls do not exist in the GSODniverse.
(B) A station is identified by a WMO/Datsav3 number inclusive-or a WBAN number. I'm using (WMO, WBAN, date) -- with the first two nullable -- as the primary key, but you might prefer something different. (E.g., preserving the nulls as ^9*$ integers to make some queries simpler, or deriving your own scalar station IDs with Gödel encoding or whatever makes you happy.)
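The sentinel-to-null mapping can be sketched as follows (a hypothetical helper, assuming the data files write a missing WMO as 999999 and a missing WBAN as 99999):

```python
def station_key(wmo, wban):
    """Turn GSOD's in-band 'missing' station codes (all nines) into Nones,
    giving the nullable (wmo, wban) half of the (wmo, wban, day) key."""
    return (None if wmo == '999999' else int(wmo),
            None if wban == '99999' else int(wban))
```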
(C) The observations are not unique on station and day. As of 2011-11-27, 466 (WMO, WBAN, date) tuples match 2 observations apiece. Going through these and deciding which ones to discard was ... a bit of a slog.
ASCII data format
The records are at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/$year/ ($year = 1929...) in two forms:
- Gzip files, one per station with reports that year, named $wmo-$wban-$year.op.gz. For a recent year, there are about 10k of these.
- A tar file, gsod_$year.tar, which aggregates all the gzip files. For a recent year, it's about 100 megabytes.
I’m not an FTP wizard, but apparently even mget is subject to a handshake per file – which, for a 50 ms ping, means a best-case of more than 8 minutes of wasted time per 10k files, which adds up to several hours of overhead for the whole dataset. So let’s get the tar versions.
$ wget --no-verbose -cm -A tar 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/'
--no-verbose is the happy medium between silence and listing every file it doesn't get.
-c is continue. Shouldn't matter, but if you have to cancel this, you'll want it.
-m is mirror, which implies various useful flags for this kind of recursive download.
-A tar means only accept files ending in .tar, so it's a very selective mirroring.
My download took about 4.5 hours wall time – a 225 kB/s average or so, with what looked like a hard throughput cap. I'm sure it depends heavily on load, this year's federal climatology funding, etc.
Now we have a bunch of tarfiles containing many gz files each. The obvious route is a big tar xfv */*.tar and then a gunzip */*.gz, but that involves a bunch of redundant disk i/o. Instead, let's untar and ungzip piecewise in memory and write the data contiguously to stdout. (We can write to a single stream easily without losing information because each line has all the data we'll use as a primary key. The lines don't have to be in any order, or refer to the filenames they come out of, or anything like that; each one is complete as it is.)
My untar_gsod script essentially acts as if the GSOD dataset were all in a single file, albeit with header lines and blank lines. It runs at about 50 megs/second when CPU-bound on the fairly average machine I tested it on – around 70% CPU, with pv and python at about 10% each.
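In sketch form, the untar-and-ungzip step looks something like this (a minimal reconstruction using Python's tarfile and gzip modules, not the actual untar_gsod source):

```python
import gzip
import sys
import tarfile

def cat_gsod_tars(paths, out):
    """Stream every .op.gz member of each tar, decompress it in memory,
    and write the contents contiguously to `out`."""
    for path in paths:
        with tarfile.open(path) as tar:
            for member in tar:
                if not member.name.endswith('.op.gz'):
                    continue
                f = tar.extractfile(member)
                out.write(gzip.decompress(f.read()))

if __name__ == '__main__':
    # e.g.  python untar_gsod.py */*.tar | pv > observations
    cat_gsod_tars(sys.argv[1:], sys.stdout.buffer)
```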
Now there are two things we want to do with the data:
- Convert it from the various parochial units used (knots, miles, Fahrenheit) to SI units.
- Put it in a format that postgres's copy command recognizes. (We might as well use the defaults: tab-separated columns, null as \N.)
We do these at the same time, with gsod-to-tab-delimited-si.py.
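The script itself isn't reproduced here, but the idea can be sketched (hypothetical helper names; assumes each .op data line splits into the 22 whitespace-separated fields in the readme's order, handles only a few of them for illustration, and hard-codes the documented missing-value sentinels):

```python
KNOT = 0.514444  # m/s per knot

def fahrenheit_to_c(f):
    return (f - 32) * 5 / 9

def null_unless(token, sentinel):
    """GSOD writes 'no data' in-band; map the sentinel to None."""
    return None if token == sentinel else float(token)

def convert_line(line):
    """Turn one .op data line into a tab-separated, SI-unit row
    (illustrative subset: key fields, mean temperature, mean wind speed)."""
    t = line.split()
    wmo, wban, yearmoda = t[0], t[1], t[2]
    temp = null_unless(t[3], '9999.9')    # mean temperature, deg F
    wdsp = null_unless(t[13], '999.9')    # mean wind speed, knots
    row = [
        wmo, wban,
        '%s-%s-%s' % (yearmoda[:4], yearmoda[4:6], yearmoda[6:]),
        None if temp is None else '%.1f' % fahrenheit_to_c(temp),
        None if wdsp is None else '%.1f' % (wdsp * KNOT),
    ]
    return '\t'.join(r'\N' if v is None else v for v in row)
```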
gsod-to-tab-delimited-si.py works at about 6–8 megabytes/second; pv's summary for my full run was 19.9GB 0:40:30 [8.38MB/s].
(This does not prove or disprove anthropogenic global warming – TDNPODAGW.)
Station list

Remaining steps, in outline: createdb; schema; create index; copy; the station list; duplicates; sanity checks; other indexes; sample queries; joins.
Sanity checks:
select count(*) from obs;
select max(temperature) from obs;
select min(pressure), max(pressure) from obs;
select (select count(*) from obs where tornado = true)::float / (select count(*) from obs);
select avg(temperature) from obs where extract(year from day) = 1980;
select avg(temperature) from obs where extract(year from day) = 2010;
select count(*) from obs where temperature_count > 24;
select count(*) from obs where visibility > 10 and fog = true;
select count(*) from obs where max_temperature < min_temperature;
select count(*) from obs where max_temperature - min_temperature > 50;
select count(*) from obs where (max_temperature < avg_temperature) or (min_temperature > avg_temperature);
select count(*) from obs where min_temperature > 0 and snow = true;
select count(*) from obs where min_temperature > 20 and snow = true;
select count(*) from obs where precipitation > 1 and (rain = false and snow = false and hail = false);
select count(*) from obs where extract(doy from day) = 1;
select count(*) from obs where extract(doy from day) = 183;
select count(*) from obs where extract(doy from day) = 366;
select count(*) from obs where extract(month from day) = 3 and extract(day from day) = 29;
$ createdb gsod
$ psql -d gsod -f gsod-schema.sql
$ psql -d gsod
During the copy, postgres uses about 50% CPU.

gsod=# copy obs from '/Users/chas/Desktop/tabbed';
COPY 120453392
LAX = 722950

 23532
(1 row)
create index day_idx on obs(day);
Turns out there are 466 duplicate records! On one hand, it's kind of amazing that there are any at all, because it implies that no one has stored this in a database with a proper primary key before. On the other hand, given that the mistake is possible at all, it's kind of amazing that it appears only about once per 260,000 records. And almost all the duplicates are clustered -- they appear to be specific (groups of) stations having trouble at specific times, not uniform errors. The weirdest of these: 24 different Australian stations all recorded duplicates on 1952-03-01. I think it might be something to do with the leap day.
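The duplicate hunt itself can be sketched like this (a hypothetical helper, assuming the tab-separated file from earlier with the key in the first three columns; holding ~1.2e8 keys in a Python dict is memory-hungry, so a sort | uniq -d over the key columns is a reasonable alternative):

```python
from collections import Counter

def duplicate_wwds(lines):
    """Return the (wmo, wban, day) keys that appear more than once,
    with their multiplicities."""
    counts = Counter(tuple(line.split('\t')[:3]) for line in lines)
    return {key: n for key, n in counts.items() if n > 1}
```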