etoyoda/bufrtemptypes.md

## bufrtemptypes.md

      
    Raw
  

              bufrtemptypes.md
            
          
    Precision-based heuristic algorithm to identify types of BUFR TEMP messages

TOYODA Eizi
Introduction

Here I show an algorithm to identify types of BUFR TEMPs to tell whether the message is native or reformatted, and what part(s) are used for reformatting.  An implementation is found here.  Both are just a quick hack that worked with bulletins of small number of days.  NO WARRANTY.
Design Principles

Avoid relying on external database

The algorithm does not look at Climatology or station catalogue (WMO Pub.9 Volume A).  They are not necessary at this stage of processing.  I was thinking of realtime processing of bulletins in mind, though I'm not in the position to guarranty any kinds of operational applicability.
Actual application of data processing will need such external database for quality control, but that's another story.
Avoid relying on metadata inside the message

The more input fields, the more error.  There are so many (too many) metadata attached to the data, to assist the interpretation, parsing, or sorting of data.  But every fields are not free from errors.
BUFR TEMPs uses a flag field 008001 (vertical sounding significance) or 008024 (extended v. s. significance).  But they are sometimes incorrect.  A data assimilation gulu said "just ignore it."  So I tried to avoid it: so far it seems working for classification purpose.
Algorithm

Parsing

Decodes a BUFR message into subsets.  I used Canadian Met Center's libecbufr.  Properly I should develop a C application linking to the library, but I processed the output of "bufr_decoder -dump" command using Ruby to save time of my development.  This is a quick hack.
Postprocessed data structure is a Hash that contains Hashes corresponding vertical levels.  A data element has key of String of six-digit BUFR descriptor, which belongs to either top-level or a vertical level hash.  A vertical level is a Hash indexed by a number.  Unfortunately the level number is not always in the order of appearance, to avoid collision by nested or repeated replication.  It would be much better to convert BUFR descriptor to Ruby Integer, and the level is indexed by some unique String --- current one is misdesign by quick hack.
@dcd = {
  "001001" => 47,  # station block
  "001002" => 646, # station index number
  ...
  48 => {
    "007004" => 50000,  # pressure [hPa]
    "012001" => 268.15, # temperature [K]
    ...
  }
}

Identification of station

Method BufrSubset#diag_stid.

If there is station index number (001001 and 001002) with non-missing values, the five-digit number identifies the station;
otherwise, if the station name or callsign (001011) begins with uppercase letter followed by printable ASCII characters, i.e. maches to regex /^[A-Z][\x20-\x7E]*$/, the string is used as the identifier; and
otherwise, as a failsafe that should not happen (but sometimes it does), latitude and longitude (005001 and 006001) are used like "26N085W" as the identifier.

Preparation of pressure levels

Method BufrSubset#diag_pval.

All levels containing non-missing value of pressure as coordinate (007004) are collected.
RJTD special: when all following three conditions met, fix the pressure value by adding add 1000 hPa

reported pressure is less than 5 hPa
fixed pressure of previous level is more than 1000 hPa
originating center is Tokyo (center 034)
(this algorithm depends on Ruby 1.9 or later...)


Number of pressure levels, maximum pressure, and minimum pressure are stored as np, pmax, and pmin respectively.

Preparation of height levels

Method BufrSubset#diag_zval.
Levels conatining non-missing value of height as coordinate (007003) are collected.  Like pressure, values nz, zmax, and zmin are stored.  No special hack is done so far, since I didn't look at the value carefully.
Proper soudning data does not include data in height level.  But I had to test this because BUFRs reformatted from FM 32 PILOT are circulating among GTS with headers of TEMP.
Identification of code form

Method BufrSubset#diag_part1 identifies one of three cases - Native TEMP BUFR, Reformatted TEMP BUFR, and PILOT BUFR in height coordinate.  Reformatted TEMP BUFR is known from its limited precision due to structure of FM 35 TEMP.

For each pressure levels (RJTD hack applied as above) test:

Temperature:

if high precision temperature (012101 in 0.01 K) is reported the last digit is collected.  Reformatted TEMP BUFR has temperature ending with 0 or 5.  If all pressure levels have such values, the entire subset is considered having temperature in limited precision.
temperature in 012001 (in 0.1 K) is always considered limited precision (oddness of the last digit is limited due to FM 35 TEMP limitation, but the form depends on reformatting)


Geopotential height (007003 in 1 m): for only pressue 500 hPa or less,

if the last digit is 0, count it as limited precision
otherwise count it as high precision
exception: if originating center is 028, pressure < 200 hPa, and z < 1000 m, the geopotential height is clearly error and ignored (that happens unfortunately).


Wind direction (011001 in 1 degree true):

if the last digit is 0 or 5, count it as limited precision
otherwise count it as high precision
exception: if wind direction > 360, that is clear error and not counted


Motion of radiosonde: if all of elapsed time 004086, latitude displacement 005015, and longitude displacement 006015 have non-missing and non-zero values, the level is counted as reporting motion of radiosonde.  Non-zero test is needed because of reformatted TEMP BUFR from DEMS has all zero in these fields.


Code form identification

the report (subset) is marked PILOT BUFR when zmin and zmax are non-missing number
otherwise, the report is marked native BUFR when any one of following holds:

temperature has high precision
at least one level with high precision geopotential height is found
at least two levels with high precision wind is found (some apparently reformatted BUFRs have one level with wind direction of non-zero last digit. I don't know why)


otherwise, the report is marked reformatted BUFR TEMP if any one of following holds:

temperature has limited precision
at least one level with limited precision geopotential height is found
at least one level with limited precision wind is found


otherwise the report is marked empty and rejected from statistics


Part identification of PILOT BUFR

ZH is returned if zmin >= 15000 m (Part B or perhaps A)
otherwise ZL is returned if zmax < 19000 m (Part D or perhaps C)
otherwise ZLH is returned (Part A/B or C/D combined)


Part identification of native BUFR

NH is printed if pmax < 100 hPa
otherwise NL is returned if pmin >= 100 hPa
otherwise NLH is returned
The B/C 25 regulation requires NL and NLH styles of reports; both are actually seen.


Part identification of reformatted BUFR

Clear counters ns (standard level), nx (maximum wind), nt (temperature significant level), nw (wind significant level), and no (other levels typically surface and tropause)
For each pressure level (fixed as above):

ns is incremented if the pressure is standard pressure (excluding 100 hPa and including US standard 7, 5, 3, 2, 1 hPa) and any height (at least one of 010009, 010008, or 010003) is found (missing value included)
otherwise nx is incremented if vertical shear (011061 or 011062) is found (missing value included)
hereafter wind direction is descriptor 011001, but the value > 360 is considered missing (that actually happens in RPMM messages)
otherwise nt is incremented if temperature (0120101 or 012101) is non-missing and wind direction is missing
otherwise nw is incremented if wind direction is non-missing and temperature is missing
otherwise no is incremented


If ns > 0 and nx <= 3 and no <= 4 and nt + nw < 7

TA is returned if the minimum standard pressure found above >= 100 hPa
otherwise TC is returned if the maximum standard pressure found above < 100 hPa
otherwise TAC is returned
The threashold 7 of nt+nw is just empirical and maybe too many: I didn't look into detailed reason, but normal Part B/D reports have more than 10 significant levels.


Otherwise if ns <= 3

TB is returned if pmin >= 100 hPa
otherwise TD is returned if pmax <= 100 hPa
otherwsie TBD is returned
Many Part B or D messages include a few standard pressure especially 1000 hPa and 100 hPa.  The 100 hPa level is sometimes duplicated in B-D combined message, that's why 100 hPa is excluded from above-described counting of standard levels.  Really ugly trick.


Otherwise

TAB is returned if pmin >= 100 hPa
otherwise TCD is returned if pmax <= 100 hPa
otherwsie TABCD is returned


The report type is prefixed plus sign "+" if motion of sonde is reported at more than half of high-precision wind levels.

So far we found this flag only for native BUFR.