@celoyd, December 13, 2016
Getting GSOD

0. Introduction

00: What’s in this guide?

This is a discursive recipe for turning the Global Summary of the Day weather dataset into a useful PostgreSQL database on a Unix system. It's aimed at data nerds with no prior knowledge of GSOD. For simplicity, it only describes one way of doing things, but the canny reader will see many possible variations (for example, using staged temporary files rather than a pipeline to do data conversion) and substitutions (mysql for postgres, curl for wget, perl for python, …).

Depending on how you do things, you’ll need something like 50 free gigabytes of disk space to work in, and the final database will use about 20 gigabytes.

At the end you'll have more than 120 million daily weather observations from around the world in efficiently queryable form, plus the ability to correlate them with some metadata about the stations that made them. That's enough back end to support, for example, a world map that graphs weather averages for the station nearest wherever you click.

01: What’s in GSOD?

GSOD is 1.2e8 (and counting) observations. We'll receive them as rows of ASCII, each with a weather station's WMO/Datsav3 code (6 decimal digits, the most significant of which correlate with geolocation) and/or its WBAN code (5 decimal digits), plus a date. This (WMO, WBAN, day) tuple, hereinafter WWD, will serve as our primary key, with some complexities to be explored later.

Then there are the payload fields. By default, fields are means of the named value – temperature means mean temperature, etc. (ftp://ftp.ncdc.noaa.gov/pub/data/gsod/readme.txt):

  • Temperature and a temperature count. Count fields tell you how many hourly observations went into averages, maxes, and mins. (Every record has a non-null mean temperature and a temperature count of at least 3.)

  • An average dewpoint and its count.

  • Measured (station-level) and sea-level equivalent air pressure, both with counts.

  • Visibility and its count.

  • Wind speed and its count.

  • Max sustained wind speed.

  • Max gust (instantaneous wind) speed.

  • Max temperature, plus a flag for whether this number is merely the highest of the hourly readings (= false), or the actual highest of the continuous day (= true).

  • Min temperature and the analogous flag.

  • Amount of precipitation as water (so a cm of snow would count as about a mm here), plus a flag character (A through I) encoding how the precipitation was measured over the day.

  • Snow depth.

  • Booleans for fog, rain, snow, hail, thunderstorms, and tornados.

  • (To restate the counts: a count is the number of hourly observations that went into generating a min, max, or mean – e.g., if you only measure temperature at 11:00, noon, and 13:00 today, your temperature count for the day is 3.)

Each observation names its station by a WMO code, a WBAN code, or both. Stations have basic metadata like latitude and longitude.

1. Caveats

  1. GSOD is not pure public domain; there are restrictions on its use. See WMO Resolution 40.

  2. Playing with this dataset doesn’t make you a climate scientist – any more than downloading the human genome gives you standing as a biologist, or looking at high-res Hubble images turns you into an astronomer. As a layperson (like me), you'll find GSOD can entertain and intrigue you by answering many questions about local weather over time, but it can’t make you a peer-reviewed expert. So please don’t make extravagant claims when you find, say, a temperature trendline that points up or down – it doesn't prove anything by itself. One reason is that:

  3. The dataset is imperfect in several ways. Individual stations have big gaps in their records, many stations don't record all fields (e.g. barometric pressure), and stations are distributed in a geographically unrepresentative way: high-GDP land is oversampled, and so on.

And that's just on the hardware side. As you'll see, while this dataset embodies a huge amount of mostly very careful and productive effort, and many people deserve our gratitude for making it and publishing it for free, it shows no trace of influence from anyone who could, say, define "first normal form". Data people, perhaps, but not programmers as we know them. Some fun features:

(A) Sometimes “no data” is represented as a zero value; sometimes it’s represented by a field filled with 9s; sometimes it’s represented by a separate count field being zero. Out-of-band nulls do not exist in the GSODniverse.

(B) A station is identified by a WMO/Datsav3 number inclusive-or a WBAN number. I'm using (WMO, WBAN, date) – with the first two nullable – as the primary key, but you might prefer something different. (E.g., preserving the nulls as ^9*$ integers to make some queries simpler, or deriving your own scalar station IDs – one cheap packing is sketched just after this list – with Gödel encoding or whatever makes you happy.)

(C) The observations are not unique on station and day. As of 2011-11-27, 466 (WMO, WBAN, date) tuples match 2 observations apiece. Going through these and deciding which ones to discard was ... a bit of a slog.
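On (B)'s cheap scalar IDs, a sketch for illustration only – these helpers are hypothetical, and nothing else in this guide depends on them. WMO/Datsav3 codes are 6 decimal digits and WBAN codes are 5, so plain decimal concatenation is unambiguous:

# Hypothetical helpers; not part of the pipeline below.
def station_id(wmo, wban):
    # Fall back to the dataset's own all-9s sentinels for missing codes.
    if wmo is None:
        wmo = 999999
    if wban is None:
        wban = 99999
    return wmo * 100000 + wban

def station_codes(sid):
    # Inverse: split back into a (wmo, wban) pair.
    return divmod(sid, 100000)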

2. Getting the data


The records are at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/$year/ ($year = 1929...) in two forms:

  1. Gzip files, one per station with reports that year, named $wmo-$wban-$year.op.gz. For a recent year, there are about 10k of these.

  2. A tar file, gsod_$year.tar, which aggregates all the gzip files. For a recent year, it's about 100 megabytes.

I’m not an FTP wizard, but apparently even mget is subject to a handshake per file – which, for a 50 ms ping, means a best-case of more than 8 minutes of wasted time per 10k files, which adds up to several hours of overhead for the whole dataset. So let’s get the tar versions.

$ wget --no-verbose -cm -A tar 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/'

--no-verbose is the happy medium between silence and listing every file it doesn't get.

-c is continue. Shouldn't matter, but if you have to cancel this, you'll want it. -m is mirror, which implies various useful flags for this kind of recursive download. -A tar means only accept files ending in .tar, so it's a very selective mirroring.

My download took about 4.5 hours wall time – a 225 kB/s average or so, with what looked like a hard cap on transfer speed. I’m sure it depends heavily on load, this year’s federal climatology funding, etc.

3. Preparing the data

Now we have a bunch of tarfiles containing many gz files each. The obvious route is a big tar xfv */*.tar and then a gunzip */*.gz, but that involves a bunch of redundant disk i/o. Instead, let's untar and ungzip piecewise in memory and write the data contiguously to stdout. (We can write to a single stream easily without losing information because each line has all the data we'll use as a primary key. The lines don't have to be in any order, or refer to the filenames they come out of, or anything like that; each one is complete as it is.)
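Here's a minimal sketch of that filter, in the same Python 2 as the converter script at the bottom of this gist. The untar-gsod.py name and everything in it are my stand-in, not a canonical part of the pipeline:

#!/usr/bin/env python
# untar-gsod.py (hypothetical stand-in): stream the contents of every
# member of each gsod_$year.tar named on the command line, gunzipped
# in memory, to stdout. No intermediate files touch the disk.
import sys
import tarfile
import zlib

for path in sys.argv[1:]:
    tar = tarfile.open(path)  # the .tar itself is not compressed
    for member in tar:
        if not member.isfile():
            continue
        # Each member is a $wmo-$wban-$year.op.gz; 16 + MAX_WBITS tells
        # zlib to expect a gzip wrapper rather than a raw deflate stream.
        blob = tar.extractfile(member).read()
        sys.stdout.write(zlib.decompress(blob, 16 + zlib.MAX_WBITS))
    tar.close()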

This untar-everything filter essentially acts as if the GSOD dataset were all in a single file, albeit one with header lines and blank lines mixed in. It runs at about 50 megs/second when CPU-bound on the fairly average machine I tested it on – about 70% CPU for the untar itself, with pv at 10% and python about the same.

Now there are two things we want to do with the data:

  1. Convert it from the various parochial units used (knots, miles, Fahrenheit) to SI units.

  2. Put it in a format that postgres's copy command recognizes. (We might as well use the defaults: tab-separated columns, null as \N.)

We do these at the same time, with gsod-to-tab-delimited-si.py.
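Strung together – with untar-gsod.py standing in for the untar step sketched above, and the glob assuming wget's default mirror layout – the whole conversion is one pipeline:

$ python untar-gsod.py ftp.ncdc.noaa.gov/pub/data/gsod/*/gsod_*.tar | pv | python gsod-to-tab-delimited-si.py > tabbed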

gsod-to-tab-delimited-si.py is the slow stage, at 6–8 megabytes/second; pv clocked my full run as 19.9GB 0:40:30 [8.38MB/s].

(This does not prove or disprove anthropogenic global warming.)


4. Importing the data

createdb; load the schema; copy the data in; create indexes; import the station list. (The commands and the schema are collected in the notes below.)

5. Cleaning the data

Remove the duplicate observations, then run sanity checks. (The dedup log and a pile of sanity-check queries are in the notes below.)

6. Working with the data

Other indexes; sample queries; joins.

Sanity checks:

select count(*) from obs;
select max(temperature) from obs;
select min(stationpressure), max(stationpressure) from obs;
select min(sealevelpressure), max(sealevelpressure) from obs;
select (select count(*) from obs where tornado = true)::float / (select count(*) from obs);
select avg(temperature) from obs where extract(year from day) = 1980;
select avg(temperature) from obs where extract(year from day) = 2010;
select count(*) from obs where temperature_count > 24;
select count(*) from obs where visibility > 10 and fog = true;
select count(*) from obs where maxtemperature < mintemperature;
select count(*) from obs where maxtemperature - mintemperature > 50;
select count(*) from obs where (maxtemperature < temperature) or (mintemperature > temperature);
select count(*) from obs where mintemperature > 0 and snow = true;
select count(*) from obs where mintemperature > 20 and snow = true;
select count(*) from obs where precipitation > 1 and (rain = false and snow = false and hail = false);
select count(*) from obs where extract(doy from day) = 1;
select count(*) from obs where extract(doy from day) = 183;
select count(*) from obs where extract(doy from day) = 366;
select count(*) from obs where extract(month from day) = 3 and extract(day from day) = 29;

$ createdb gsod
$ psql -d gsod -f gsod-schema.sql
$ psql -d gsod

gsod=# copy obs from '/Users/chas/Desktop/tabbed';
COPY 120453392

(postgres sits at about 50% CPU during the copy.)
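(A note on permissions: copy ... from '/path' is read by the postgres server process, so the server needs filesystem access to that path, and stock postgres restricts server-side copy to superusers. If either bites, psql's client-side twin – \copy obs from 'tabbed' – does the same job with your own permissions.)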

(LAX = WMO 722950, handy for sample queries.)

gsod=# select count(*) from obs where tornado = true;
 count
-------
 23532
(1 row)

create index day_idx on obs(day);

//

  1. Turns out there are 466 duplicate records! On one hand, it's kind of amazing that there are any at all, because it implies that no one's stored this in a database with a proper primary key before. On the other hand, it's kind of amazing that, given it's possible to allow the mistake at all, it only appears about once per 260,000 records. And almost all the duplicates are clustered – they appear to be specific (groups of) stations having trouble at specific times, not uniform errors. The weirdest of these is that 24 different Australian stations all recorded duplicates around 1952-03-01. I think it might be something to do with the leap day.
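An aside: if you're content to keep an arbitrary one of each pair – which the hand-curated log below mostly is not – a generic ctid-based dedup is short. This is a sketch of mine, not something the log uses; is not distinct from keeps the nullable wmo/wban comparable:

delete from obs a
  using obs b
  where a.ctid > b.ctid
    and a.wmo is not distinct from b.wmo
    and a.wban is not distinct from b.wban
    and a.day = b.day;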
-- gsod-schema.sql
-- No primary key yet: (wmo, wban, day) only becomes unique after dedup.
create table obs (
wmo int,
wban int,
day date,
temperature float,
temperature_count int,
dewpoint float,
dewpoint_count int,
sealevelpressure float,
sealevelpressure_count int,
stationpressure float,
stationpressure_count int,
visibility float,
visibility_count int,
windspeed float,
windspeed_count int,
maxwindspeed float,
gustspeed float,
maxtemperature float,
maxtemperature_special bool,
mintemperature float,
mintemperature_special bool,
precipitation float,
precipitation_note varchar(1),
snowdepth float,
fog bool,
rain bool,
snow bool,
hail bool,
thunder bool,
tornado bool
);
create index temperature_idx on obs(temperature);
create index wmo_idx on obs(wmo);
create index wban_idx on obs(wban);
create index day_idx on obs(day);
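The dedup log below keeps colliding with a unique index named obs_pkey; my reconstruction of the finishing move, once the duplicates are gone, is something like the following. (With wmo and wban nullable, a literal primary key constraint won't take, and rows with null station codes slip past the uniqueness check, since nulls never compare equal.)

create unique index obs_pkey on obs (wmo, wban, day);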
The dedup log – what I ran, duplicate by duplicate:
941200 99999 19520301 86.3 6 74.9 6 9999.9 0 1004.1 6 999.9 0 6.5 6 8.0 999.9 89.1* 83.7* 0.00I 999.9 000000
select * from obs where wmo = 941200 and wban = 99999 and day = '1952-03-01';
I RAN THIS:
delete from obs where wmo = 941200 and wban = 99999 and day = '1952-03-01' and temperature_count = 6;
//
duplicates for WBAN 88703 and day = '1964-12-31'
I RAN THIS:
delete from obs where wban = 88703 and day = '1964-12-31' and temperature = -4.16666666667;
//
722520 | 12920 | 1965-03-10 | 23 | 8 | 12.2777777778 | 8 | 101.24 | 8 | 99.44 | 8 | 22.3698816 | 8 | 14.4456 | 8 | 37.04 | | 31.1111111111 | f | 16.1111111111 | f | 0 | I | | f | f | f | f | f | f
I RAN THIS:
delete from obs where wban = 12920 and day = '1965-03-10' and temperature_count = 4;
//
ERROR: could not create unique index "obs_pkey"
DETAIL: Key (wmo, wban, day)=(722535, 12909, 1973-12-01) is duplicated.
I RAN THIS:
delete from obs where wmo = 722535 and day = '1973-12-01' and temperature_count = 4;
//
I RAN THIS:
delete from obs where wmo = 945100 and day = '1952-03-01' and temperature_count = 6;
//
DETAIL: Key (wmo, wban, day)=(722520, 12920, 1965-03-11) is duplicated.
I RAN THIS:
delete from obs where wban = 12920 and day = '1965-03-11' and temperature_count = 16;
//
I RAN THIS:
delete from obs where wmo = 722535 and day = '1973-12-02' and temperature_count = 17;
//
I RAN THIS:
delete from obs where wmo = 723600 and day = '1977-08-29' and temperature_count = 5;
//
725110-14751-1985.op is quite fucked
725111-14751-1985.op is less fucked
I RAN THIS:
delete from obs where wmo = 725111 and day >= '1985-01-01' and day <= '1985-12-31' and (precipitation_note = 'I' or (precipitation is null and precipitation_note is null));
AND THIS:
delete from obs where wban = 14751 and day = '1985-06-03' and precipitation is null;
PLUS ANYWAY THIS:
delete from obs where wmo = 725111 and day >= '1985-01-01' and day <= '1985-12-31' and precipitation_note = ' ';
select distinct day from obs where wmo = 725111; -- 11940 rows
(raw line count for this station's files = 11940 too)
//
delete from obs where wmo = 722915 and day = '1989-11-15' and temperature_count = 10;
//
delete from obs where day = '1999-05-28' and wmo = '722540' and temperature_count = 9;
//
722630 | 23034 | 1952-03-01
These are very similar and both out of sync with the neighboring days. Kept the one that seemed to fit /slightly/ better, but I consider the remaining one /highly/ suspect.
I RAN THIS:
delete from obs where day = '1952-03-01' and wmo = '722630' and temperature = 8.33333333333;
//
I RAN THIS:
delete from obs where day = '1973-12-03' and wmo = '722535' and temperature_count = 20;
//
delete from obs where day = '1977-08-28' and wmo = '723600' and temperature_count = 5;
//
delete from obs where wmo = 722540 and day = '1999-05-27' and temperature_count = 20;
//
delete from obs where wmo = 722650 and day = '1952-03-01' and temperature_count = 23;
//
delete from obs where wmo = 722535 and day = '1973-12-05' and temperature_count = 7;
//
delete from obs where wmo = 723600 and day = '1977-08-30' and temperature_count = 5;
//
delete from obs where wmo = 722540 and day = '1999-10-13' and maxtemperature_special = false;
//
delete from obs where wmo = 942030 and day = '1952-03-01' and temperature_count = 6;
//
wmo = 722535 and day = '1973-12-07' appear to be precise duplicates! (ctid, the row's physical address, is the only handle left on one of two byte-identical rows.)
722535 | 12909 | 1973-12-07 | 11 | 24 | -1.72222222222 | 24 | 103.18 | 8 | | 0 | 31.382208 | 24 | 6.2968 | 24 | 16.4828 | | 15.5 | f | 6.11111111111 | f | 0 | I | | f | f | f | f | f | f
I RAN THIS:
delete from obs where ctid = (select ctid from obs where wmo = 722535 and day = '1973-12-07' limit 1);
//
delete from obs where wmo = 722540 and day = '1999-10-09' and maxtemperature_special = false;
//
delete from obs where wmo = 722535 and day = '1973-12-06' and temperature_count = 19;
//
delete from obs where wmo = 942120 and day = '1952-03-01' and maxtemperature_special = false;
//
delete from obs where wban = 23051 and day = '1977-08-31' and temperature_count = 5;
//
delete from obs where wmo = 722535 and day = '1973-12-08' and temperature_count = 23;
//
delete from obs where wmo = 722540 and day = '1999-05-24' and stationpressure is not null;
//
delete from obs where wmo = 948210 and day = '1952-03-01' and temperature_count = 6;
//
delete from obs where wmo = 943000 and day = '1952-03-01' and temperature_count = 6;
//
Hell if I know. Let's flip a coin:
delete from obs where ctid = (select ctid from obs where wmo = 722535 and day = '1973-12-09' limit 1);
//
delete from obs where wmo = 722535 and day >= '1973-12-01' and day <= '1973-12-31' and temperature_count != 24;
//
delete from obs where wmo = 722540 and day >= '1999-12-01' and day <= '1999-12-31';
AND BUT SO ALSO AND AS WELL:
delete from obs where wmo = 722540 and day >= '1999-01-01' and day <= '1999-12-31';
//
select count(*) from obs group by wmo, wban, day having count(*) > 1;
//
gsod=# select wmo, wban, day from obs group by wmo, wban, day having count(*) > 1;
wmo | wban | day
--------+-------+------------
943120 | 99999 | 1952-03-01 --
943350 | 99999 | 1952-03-01 --
943740 | 99999 | 1952-03-01 --
944030 | 99999 | 1952-03-01 --
944300 | 99999 | 1952-03-01
944760 | 99999 | 1952-03-01
946370 | 99999 | 1952-03-01
946460 | 99999 | 1952-03-01
946530 | 99999 | 1952-03-01
946590 | 99999 | 1952-03-01
946930 | 99999 | 1952-03-01
947760 | 99999 | 1952-03-01
947910 | 99999 | 1952-03-01
948650 | 99999 | 1952-03-01
949070 | 99999 | 1952-03-01
949100 | 99999 | 1952-03-01
949260 | 99999 | 1952-02-29
949680 | 99999 | 1952-03-01
(18 rows)
//
delete from obs where wmo = 943120 and day = '1952-03-01' and temperature_count = 6;
delete from obs where wmo = 943350 and day = '1952-03-01' and maxtemperature_special = false;
delete from obs where wmo = 943740 and day = '1952-03-01' and temperature_count = 7;
delete from obs where wmo = 944030 and day = '1952-03-01' and temperature_count = 6;
#!/usr/bin/env python
# gsod-to-tab-delimited-si.py
'''
Turn GSOD data into tab-delimited data in SI units with nulls as '\N',
suitable for a postgres import via COPY.
See ftp://ftp.ncdc.noaa.gov/pub/data/gsod/readme.txt for the input format,
e.g. if you're wondering what's up with all the hardcoded 9* values.
To prepare completely raw GSOD data, remove:
1. Blank lines.
2. Lines beginning "STN---".
Also, all data lines and no non-data lines should contain "1" and/or "0".
All imperial to SI conversions are definitional, i.e. as precise as the
underlying representations allow. For example, a knot is defined as 1.852
km per hour; that's not just close. Note that we don't do type conversions
where not necessary for unit conversions (e.g., we don't bother making the
station numbers integers instead of strings); postgres will do that anyway.
'''
from sys import stdin
from string import digits

null = '\\N'  # the default null marker for postgres's copy
def check_wmo(n):
    if n == '999999':  # all-9s means no WMO/Datsav3 code
        return null
    return n

def check_wban(n):
    if n == '99999':  # all-9s means no WBAN code
        return null
    return n

def f_to_c(t):
    t = float(t)
    if t > 9999:  # 9999.9 is the missing-data sentinel
        return null
    return (t - 32) * 5.0 / 9.0

def kt_to_kph(s):
    s = float(s)
    if s > 999:  # 999.9 sentinel
        return null
    return s * 1.852

def millibars_to_kPa(p):
    p = float(p)
    if p > 9999:  # 9999.9 sentinel
        return null
    return p / 10.0

def mi_to_km(d):
    d = float(d)
    if d > 999:  # 999.9 sentinel
        return null
    # We assume an international (not survey) mile -_-
    return d * 1.609344

def in_to_cm(l, invalid=99):
    l = float(l)
    if l > int(invalid):  # 99.99 (precipitation) or 999.9 (snow) sentinel
        return null
    return l * 2.54

def special(f):
    # '*' marks a max/min merely derived from the hourly readings,
    # rather than from the full continuous day.
    if f == '*':
        return 'false'
    else:
        return 'true'

def tf(bit):
    if bit == '1':
        return 'true'
    else:
        return 'false'
for line in stdin:
    if line[0] not in digits:
        # Header line or trailing blank line:
        continue
    wmo = check_wmo(line[0:6])  # technically, WMO/DATSAV3
    wban = check_wban(line[7:12])
    day = line[14:22]
    temperature = f_to_c(line[24:30])
    temperature_count = line[31:33]
    dewpoint = f_to_c(line[35:41])
    dewpoint_count = line[42:44]
    sealevelpressure = millibars_to_kPa(line[46:52])
    sealevelpressure_count = line[53:55]
    stationpressure = millibars_to_kPa(line[57:63])
    stationpressure_count = line[64:66]
    visibility = mi_to_km(line[68:73])
    visibility_count = line[74:76]
    windspeed = kt_to_kph(line[78:83])
    windspeed_count = line[84:86]
    maxwindspeed = kt_to_kph(line[88:93])
    gustspeed = kt_to_kph(line[95:100])
    maxtemperature = f_to_c(line[102:108])
    maxtemperature_special = special(line[108:109])
    mintemperature = f_to_c(line[110:116])
    mintemperature_special = special(line[116:117])
    precipitation = in_to_cm(line[118:123])
    precipitation_note = line[123:124]
    snowdepth = in_to_cm(line[125:130], invalid=999.9)
    fog = tf(line[132:133])
    rain = tf(line[133:134])
    snow = tf(line[134:135])
    hail = tf(line[135:136])
    thunder = tf(line[136:137])
    tornado = tf(line[137:138])
    # Weirdly, this is actually pretty fast:
    print '\t'.join(map(str, (wmo, wban, day, temperature, temperature_count, dewpoint, dewpoint_count, sealevelpressure, sealevelpressure_count, stationpressure, stationpressure_count, visibility, visibility_count, windspeed, windspeed_count, maxwindspeed, gustspeed, maxtemperature, maxtemperature_special, mintemperature, mintemperature_special, precipitation, precipitation_note, snowdepth, fog, rain, snow, hail, thunder, tornado)))