Skip to content

Instantly share code, notes, and snippets.

@jbrown123
Last active June 29, 2024 18:44
Show Gist options
  • Save jbrown123/d2fa72e211fafaaee8a74fd964ca9b59 to your computer and use it in GitHub Desktop.
Save jbrown123/d2fa72e211fafaaee8a74fd964ca9b59 to your computer and use it in GitHub Desktop.
SQLite database for USHCN data

Database to process USHCN data

The U.S. Historical Climatology Network USHCN data are used to quantify national and regional-scale temperature changes in the contiguous United States (CONUS). The dataset provides adjustments for systematic, non-climatic changes that bias temperature trends of monthly temperature records of long-term COOP stations.

USHCN defines these elements as follows:

  • tmax = monthly mean maximum temperature
  • tmin = monthly mean minimum temperature
  • tavg = average monthly temperature (tmax+tmin)/2
  • prcp = total monthly precipitation

Unfortunately the data is stored in fixed record format which makes it difficult to process and use easily. The attached SQL file is a set of tables and views for SQLite that allows direct access to the USHCN data. At the very bottom of the SQL file are several commands, commented out, that show how to import data and export it in a usable form for further processing in Excel or other systems.

There are tables that hold the fixed format data from the USHCN. In addition, there are views that understand the data format and how to separate it into fields. The following tables are defined:

  • tavg_fixed
  • tmin_fixed
  • tmax_fixed
  • prcp_fixed

For each of these tables there is a corresponding view that splits the data out into one record per month. These views are named the same as the table, but without the "_fixed" on the end. So the temperature average (tavg) veiw is named "tavg" and so on. The views have the following fields:

  • tavg(station,year,month,value,celsius,fahrenheit,dmflag,qcflag,dsflag)
  • tmin(station,year,month,value,celsius,fahrenheit,dmflag,qcflag,dsflag)
  • tmax(station,year,month,value,celsius,fahrenheit,dmflag,qcflag,dsflag)
  • prcp(station,year,month,value,mm,inches,dmflag,qcflag,dsflag)

In each case the 'value' field is the original value from the USHCN file. In the case of temperature, this is an integer value in degrees celsius times 100. In the case of precipitation, it's in millimeters times 10. The views adjust these stored values into usable units and present them in other fields (celsius, fahrenheit, mm, inches). Note that in all cases a value of -9999 indicates a missing value. This sentinal value is preserved across the conversions meaning that the celsius, fahrenheit, mm, and inches values will still be -9999 for missing values.

In addition to the main data files there are the following support tables

  • stations_fixed
  • dmflag
  • qcflag
  • dsflag

The USHCN reports use a "station ID" which can be translated to a real location via the stations view / stations_fixed table. Each value stored includes 3 flags, DM, QC, and DS. Explainations of these are represented in the associated 'flag' table.

Finally there is a 'combined' view that aggregates all the information by station into a single record. Note that this is a complex join and so it takes significant work for SQLite to process this.

  • combined(station,lat,lon,elevation,state,name,comp1,comp2,comp3,UTCoffset,year,month,min_c,min_f,min_dm,min_qc,min_ds,max_c,max_f,max_dm,max_qc,max_ds,avg_c,avg_f,avg_dm,avg_qc,avg_ds,mm,inches,prcp_dm,prcp_qc,prcp_ds)

  • station: the station identifier

  • lat: the station's latitude

  • lon: the station's longitude

  • state: the state the station is located in

  • comp1 - comp3: previous stations combined into this one, see USHCN documentation

  • UTCoffset: the offset from UTC where this station is located

  • year: the year for these readings

  • month: the month for these readings

  • min_c: tmin (monthly mean minimum temperature) in celsius

  • min_f: tmin in fahrenheit

  • min_dm, min_qc, min_ds: flags for the minimum temperature

  • max_, avg_: same as mininimum above but for maximum and average temperature

  • mm: precipitation in millimeters

  • inches: precipitation in inches

  • prcp_dm, prcp_qc, prcp_ds: flags for the precipitation value

In order to use this database you will need to download data from USHCN. Data will be loaded into the following tables as indicated. USHCN has 3 formats of the same data. The RAW format is exactly as it was reported by the station. The TOB format has been corrected for time of observation errors. And finally the FLs.52j data has been fully adjusted based on NOAAs algorithim as described in detail in their documentation (see their readme for more information).

File Table
ushcn-v2.5-stations.txt stations_fixed
ushcn.prcp.latest.[format].tar.gz prcp_fixed
ushcn.tavg.latest.[format].tar.gz tavg_fixed
ushcn.tmax.latest.[format].tar.gz tmax_fixed
ushcn.tmin.latest.[format].tar.gz tmin_fixed

The station text file can be imported into the SQLite database directly (.import ushcn-v2.5-stations.txt stations_fixed). For the other data, the downloaded file needs to be unzipped. Instructions for doing this based on your operating system are included in section 1.3 of the USHCN readme. Once unzipped you will have one file per station per element. You can then import each station file into the appropriate table shown above (e.g. .import USH00011084.raw.tmin tmin_fixed) . If you plan to use the combined view it is strongly encouraged to only import station data of interest as there are no indexes used in the database, therefore the more data you have the longer the processing will take.

Hopefully others will find this helpful.

CREATE TABLE IF NOT EXISTS stations_fixed (fixed char);
DROP VIEW stations;
CREATE VIEW stations AS
SELECT
TRIM(SUBSTR(fixed, 1, 11)) AS station,
TRIM(SUBSTR(fixed, 13, 8)) AS lat,
TRIM(SUBSTR(fixed, 22, 9)) AS lon,
TRIM(SUBSTR(fixed, 33, 5)) AS elevation,
TRIM(SUBSTR(fixed, 39, 2)) AS state,
TRIM(SUBSTR(fixed, 42, 30)) AS name,
TRIM(SUBSTR(fixed, 73, 6)) AS comp1,
TRIM(SUBSTR(fixed, 80, 6)) AS comp2,
TRIM(SUBSTR(fixed, 87, 6)) AS comp3,
TRIM(SUBSTR(fixed, 94, 2)) AS UTCoffset
FROM stations_fixed;
CREATE TABLE IF NOT EXISTS tavg_fixed (fixed char);
DROP VIEW tavg;
CREATE VIEW tavg AS
WITH RECURSIVE cte(station, year, month, value, rest) AS
(
SELECT TRIM(SUBSTR(fixed,1,11)),
TRIM(SUBSTR(fixed,13,4)),
1,
SUBSTR(fixed,17,9),
SUBSTR(fixed,17+9)
FROM tavg_fixed
UNION ALL
SELECT
station,
year,
month+1,
SUBSTR(rest,1,9),
SUBSTR(rest,10)
FROM cte
WHERE month < 12
)
SELECT
station,
year+0 as year,
month+0 as month,
TRIM(SUBSTR(value,1,6)) AS value, /* stored in celsius * 100 */
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE TRIM(SUBSTR(value,1,6))/100.0
END celsius,
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE ROUND((TRIM(SUBSTR(value,1,6))/100.0)*9.0/5.0, 4)+32
END fahrenheit,
TRIM(SUBSTR(value,7,1)) AS dmflag,
TRIM(SUBSTR(value,8,1)) AS qcflag,
TRIM(SUBSTR(value,9,1)) AS dsflag
FROM cte;
CREATE TABLE IF NOT EXISTS tmin_fixed (fixed char);
DROP VIEW tmin;
CREATE VIEW tmin AS
WITH RECURSIVE cte(station, year, month, value, rest) AS
(
SELECT TRIM(SUBSTR(fixed,1,11)),
TRIM(SUBSTR(fixed,13,4)),
1,
SUBSTR(fixed,17,9),
SUBSTR(fixed,17+9)
FROM tmin_fixed
UNION ALL
SELECT
station,
year,
month+1,
SUBSTR(rest,1,9),
SUBSTR(rest,10)
FROM cte
WHERE month < 12
)
SELECT
station,
year+0 as year,
month+0 as month,
TRIM(SUBSTR(value,1,6)) AS value, /* stored in celsius * 100 */
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE TRIM(SUBSTR(value,1,6))/100.0
END celsius,
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE ROUND((TRIM(SUBSTR(value,1,6))/100.0)*9.0/5.0, 4)+32
END fahrenheit,
TRIM(SUBSTR(value,7,1)) AS dmflag,
TRIM(SUBSTR(value,8,1)) AS qcflag,
TRIM(SUBSTR(value,9,1)) AS dsflag
FROM cte;
CREATE TABLE IF NOT EXISTS tmax_fixed (fixed char);
DROP VIEW tmax;
CREATE VIEW tmax AS
WITH RECURSIVE cte(station, year, month, value, rest) AS
(
SELECT TRIM(SUBSTR(fixed,1,11)),
TRIM(SUBSTR(fixed,13,4)),
1,
SUBSTR(fixed,17,9),
SUBSTR(fixed,17+9)
FROM tmax_fixed
UNION ALL
SELECT
station,
year,
month+1,
SUBSTR(rest,1,9),
SUBSTR(rest,10)
FROM cte
WHERE month < 12
)
SELECT
station,
year+0 as year,
month+0 as month,
TRIM(SUBSTR(value,1,6)) AS value, /* stored in celsius * 100 */
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE TRIM(SUBSTR(value,1,6))/100.0
END celsius,
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE ROUND((TRIM(SUBSTR(value,1,6))/100.0)*9.0/5.0, 4)+32
END fahrenheit,
TRIM(SUBSTR(value,7,1)) AS dmflag,
TRIM(SUBSTR(value,8,1)) AS qcflag,
TRIM(SUBSTR(value,9,1)) AS dsflag
FROM cte;
CREATE TABLE IF NOT EXISTS prcp_fixed (fixed char);
DROP VIEW prcp;
CREATE VIEW prcp AS
WITH RECURSIVE cte(station, year, month, value, rest) AS
(
SELECT TRIM(SUBSTR(fixed,1,11)),
TRIM(SUBSTR(fixed,13,4)),
1,
SUBSTR(fixed,17,9),
SUBSTR(fixed,17+9)
FROM prcp_fixed
UNION ALL
SELECT
station,
year,
month+1,
SUBSTR(rest,1,9),
SUBSTR(rest,10)
FROM cte
WHERE month < 12
)
SELECT
station,
year+0 as year,
month+0 as month,
TRIM(SUBSTR(value,1,6)) AS value, /* stored in mm * 10 */
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE TRIM(SUBSTR(value,1,6))/10.0
END mm,
CASE TRIM(SUBSTR(value,1,6))
WHEN '-9999' THEN -9999
ELSE ROUND(TRIM(SUBSTR(value,1,6))/254.0, 4) /* divide by 10 then divide by 25.4 */
END inches,
TRIM(SUBSTR(value,7,1)) AS dmflag,
TRIM(SUBSTR(value,8,1)) AS qcflag,
TRIM(SUBSTR(value,9,1)) AS dsflag
FROM cte;
DROP VIEW combined;
CREATE VIEW combined AS
SELECT
s.*, mn.year as year, mn.month as month,
mn.celsius as min_c, mn.fahrenheit as min_f, mn.dmflag as min_dm, mn.qcflag as min_qc, mn.dsflag as min_ds,
mx.celsius as max_c, mx.fahrenheit as max_f, mx.dmflag as max_dm, mx.qcflag as max_qc, mx.dsflag as max_ds,
av.celsius as avg_c, av.fahrenheit as avg_f, av.dmflag as avg_dm, av.qcflag as avg_qc, av.dsflag as avg_ds,
p.mm as mm, p.inches as inches, p.dmflag as prcp_dm, p.qcflag as prcp_qc, p.dsflag as prcp_ds
FROM tavg av
join tmin mn on av.station=mn.station and av.year=mn.year and av.month=mn.month
join tmax mx on av.station=mx.station and av.year=mx.year and av.month=mx.month
join prcp p on av.station=p.station and av.year=p.year and av.month=p.month
join stations s on av.station=s.station;
CREATE TABLE IF NOT EXISTS DMFLAG (dmflag char, note char);
DELETE FROM DMFLAG;
INSERT INTO DMFLAG (dmflag, note)
VALUES
('a', '1 day missing in calculation of monthly mean temperature'),
('b', '2 days missing in calculation of monthly mean temperature'),
('c', '3 days missing in calculation of monthly mean temperature'),
('d', '4 days missing in calculation of monthly mean temperature'),
('e', '5 days missing in calculation of monthly mean temperature'),
('f', '6 days missing in calculation of monthly mean temperature'),
('g', '7 days missing in calculation of monthly mean temperature'),
('h', '8 days missing in calculation of monthly mean temperature'),
('i', '9 days missing in calculation of monthly mean temperature'),
('E', 'The value is estimated using values from surrounding stations because a monthly value could not be computed from daily data; or, the pairwise homogenization algorithm removed the value because of too many apparent inhomogeneities occuring close together in time.');
CREATE TABLE IF NOT EXISTS QCFLAG (qcflag char, note char);
DELETE FROM QCFLAG;
INSERT INTO QCFLAG (qcflag, note)
VALUES
('D', 'Monthly value is part of an annual series of values that are exactly the same (e.g. duplicated) within another year in the stations record.'),
('I', 'Flag is set when TMIN > TMAX for a given month.'),
('L', 'Monthly value is isolated in time within the station record, and this is defined by having no immediate non- missing values 18 months on either side of the value.'),
('M', 'Manually flagged as erroneous'),
('O', 'Monthly value that is >= 5 bi-weight standard deviations from the bi-weight mean. '),
('S', 'Monthly value has failed spatial consistency check.'),
('W', 'Monthly value is duplicated from the previous month.'),
('A', 'Alternative method of adjustment used.');
CREATE TABLE IF NOT EXISTS DSFLAG (dsflag char, note char);
DELETE FROM DSFLAG;
INSERT INTO DSFLAG (dsflag, note)
VALUES
('1', 'NCDC Tape Deck 3220, Summary of the Month Element Digital File'),
('2', 'Means Book - Smithsonian Institute, C.A. Schott (1876, 1881 thru 1931)'),
('3', 'Manuscript - Original Records, National Climatic Data Center '),
('4', 'Climatological Data (CD), monthly NCDC publication '),
('5', 'Climate Record Book, as described in History of Climatological Record Books, U.S. Department of Commerce, Weather Bureau, USGPO (1960)'),
('6', 'Bulletin W - Summary of the Climatological Data for the United States (by section), F.H. Bigelow, U.S. Weather Bureau (1912); and, Bulletin W - Summary of the Climatological Data for the United States, 2nd Ed.'),
('7', 'Local Climatological Data (LCD), monthly NCDC publication'),
('8', 'State Climatologists, various sources'),
('B', 'Professor Raymond Bradley - Refer to Climatic Fluctuations of the Western United States During the Period of Instrumental Records, Bradley, et. al., Contribution No. 42, Dept. of Geography and Geology, University of Massachusetts (1982)'),
('D', 'Dr. Henry Diaz, a compilation of data from Bulletin W, LCD, and NCDC Tape Deck 3220 (1983)'),
('G', 'Professor John Griffiths - primarily from Climatological Data');
-- .import ushcn-v2.5-stations.txt stations_fixed
-- .mode csv
-- .once climate.csv
-- select station, year, month, min_f, avg_f, max_f, inches, name from combined where year>=1900 and avg_f!=-9999 order by year,month,station;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment