@TomNicholas
Last active June 1, 2026 21:52
GOES-16 ingestion

GOES-16 ABI-L2-MCMIPF data inconsistencies

A running log of every real-world anomaly we've hit while ingesting the s3://noaa-goes16/ABI-L2-MCMIPF/ archive into Icechunk via VirtualiZarr.

Each entry: what the file/day looks like, how it surfaced, how we handle it, and (when known) the most likely upstream cause.


1. Compression codec changes mid-archive (the "codec change marker")

Boundary file: s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc (2023-04-19 15:00 UTC, DOY 109). This is the most annoying inconsistency in the archive because it cannot be papered over inside a single virtual dataset — see "Why it matters" below.

At this scan, NOAA added a Shuffle (byte-shuffle) filter to the codec pipeline used for the per-channel CMI_C{NN} and DQF_C{NN} variables. Verified by probing one file on each side of the boundary:

| | Pre-change (≤ 14:50 UTC, 2023-04-19) | Post-change (≥ 15:00 UTC) |
| --- | --- | --- |
| CMI_C* codecs | (BytesCodec(little), Zlib(level=1)) | (BytesCodec(little), Shuffle(elementsize=2), Zlib(level=1)) |
| DQF_C* codecs | (BytesCodec(endian=None), Zlib(level=1)) | (BytesCodec(endian=None), Shuffle(elementsize=1), Zlib(level=1)) |

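Everything else (dtype, fill_value, chunk_shape, BytesCodec endianness) is identical on both sides. Shuffle improves the compression ratio of multi-byte integer data such as the 2-byte CMI_C* values, so this was almost certainly a deliberate algorithm/encoding update on NOAA's end. For the 1-byte DQF_C* values, Shuffle(elementsize=1) leaves the bytes unchanged, but it still counts as a different codec pipeline.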
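To illustrate why a byte-shuffle step helps Zlib on 2-byte data, here is a minimal stdlib-only sketch. The sample values are synthetic (a noisy low byte under a constant high byte), and byte_shuffle / byte_unshuffle are hypothetical helpers written for this note, not the filter implementation NOAA or zarr uses.

```python
import random
import struct
import zlib


def byte_shuffle(raw: bytes, elementsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[k::elementsize] for k in range(elementsize))


def byte_unshuffle(shuffled: bytes, elementsize: int) -> bytes:
    """Invert byte_shuffle by re-interleaving the byte planes."""
    n = len(shuffled) // elementsize
    planes = [shuffled[k * n:(k + 1) * n] for k in range(elementsize)]
    return bytes(b for element in zip(*planes) for b in element)


# Synthetic little-endian int16 samples (an assumption, not real imagery):
# the low byte is noise, the high byte is constant.
rng = random.Random(0)
values = [0x0300 + rng.randrange(256) for _ in range(20_000)]
raw = struct.pack(f"<{len(values)}h", *values)

plain_size = len(zlib.compress(raw, 1))
shuffled_size = len(zlib.compress(byte_shuffle(raw, 2), 1))
print("zlib only:", plain_size, "bytes; shuffle + zlib:", shuffled_size, "bytes")
```

With elementsize=1 the shuffle returns its input unchanged, which matches the DQF_C* case: same bytes on disk, different declared pipeline.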

Why it matters: virtualizarr wraps each variable of each source file in a ManifestArray, and concatenating ManifestArrays into a single virtual zarr array requires that every contributing chunk share an identical codec pipeline, because a zarr array records exactly one codec pipeline in its metadata. The codec change therefore prevents creating one continuous virtual zarr array across the entire GOES-16 archive. Any single virtual dataset must live entirely within one codec regime.
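The "one codec regime per virtual dataset" constraint can be sketched as splitting a time-ordered file list into runs that share a pipeline. This is only an illustration: pipelines are written as plain tuples of strings copied from the CMI_C* row above, and split_into_codec_regimes is a hypothetical helper, not part of virtualizarr.

```python
from itertools import groupby

# Codec pipelines as plain tuples of strings (representation is an assumption).
OLD = ("BytesCodec(little)", "Zlib(level=1)")
NEW = ("BytesCodec(little)", "Shuffle(elementsize=2)", "Zlib(level=1)")


def split_into_codec_regimes(files):
    """Group a time-ordered list of (name, codecs) into runs sharing one pipeline."""
    return [
        (codecs, [name for name, _ in run])
        for codecs, run in groupby(files, key=lambda item: item[1])
    ]


# Scan times on 2023-04-19 around the boundary.
files = [("14:40", OLD), ("14:50", OLD), ("15:00", NEW), ("15:10", NEW)]
regimes = split_into_codec_regimes(files)
for codecs, names in regimes:
    print(len(codecs), "codecs:", names)
```

Each run could back its own virtual array; concatenating across two runs is what the codec change rules out.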

Handling:

  • mcmipf_archive.CODEC_CHANGE_MARKER names the first post-boundary file.
  • iter_archive_by_day / iter_archive default to start=CODEC_CHANGE_MARKER, so iterating the archive yields only the new-codec era.
  • EXPECTED_CODECS_BY_PREFIX (in mcmipf_pipeline) codifies the post-boundary codec pipeline. The validator (validate_raw's _check_codecs) flags pre-boundary files as a codec mismatch, so the boundary is enforced at ingest time as well.
  • The icechunk group used for ingest is named ABI-L2-MCMIPF/post-2023-04-19 so a sibling group could be used later for the older-codec data without collision.
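A rough sketch of how a start marker can restrict iteration to the new-codec era. It assumes the marker can be reduced to the 14-digit scan-start stamp of the boundary file; CODEC_CHANGE_START, scan_start_token and in_new_codec_era are illustrative names, not the actual mcmipf_archive API.

```python
import re

# Scan-start stamp of the boundary file quoted above (2023, DOY 109, 15:00:20.5).
CODEC_CHANGE_START = "20231091500205"

_SCAN_START = re.compile(r"_s(\d{14})_")


def scan_start_token(url: str) -> str:
    """Return the 14-digit YYYYDDDHHMMSST scan-start stamp from a granule name."""
    match = _SCAN_START.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


def in_new_codec_era(url: str) -> bool:
    """Fixed-width digit strings compare chronologically, so no date parsing is needed."""
    return scan_start_token(url) >= CODEC_CHANGE_START


urls = [
    # Hypothetical pre-boundary name, made up for illustration.
    "OR_ABI-L2-MCMIPF-M6_G16_s20231091450205_e20231091459514_c20231091500007.nc",
    # The boundary file named above.
    "OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc",
]
new_era = [u for u in urls if in_new_codec_era(u)]
print(new_era)
```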

2. Day with one channel entirely missing from a single granule

Where: 2023-06-14 (DOY 165), hour 18:20 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2023/165/18/OR_ABI-L2-MCMIPF-M6_G16_s20231651820208_e20231651829527_c20231651832574.nc

This single 18:20 file is missing all C04 variables (CMI_C04, DQF_C04, and every C04 stat / band_id / band_wavelength). The other 143 files of the day have the full 16-channel set.

Likely cause: a transient channel-4 outage during MCMIPF product generation; NOAA still emitted the file but omitted C04.

Handling: fill_missing_channels back-fills the missing channel with a fill-only ManifestArray (CMI/DQF) plus NaN scalars (stats), so the file still produces a complete cleaned dataset with C04 readable as NaN.
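As a minimal sketch of the detection half of this step, the following finds which of the 16 channels lack a per-channel variable and back-fills scalar stats with NaN. The helper names are hypothetical and the real fill_missing_channels works on ManifestArrays, which are not modelled here.

```python
import math

CHANNELS = [f"C{n:02d}" for n in range(1, 17)]  # C01 .. C16


def missing_channels(var_names, prefix):
    """Return the channel ids that have no `<prefix>_Cnn` variable."""
    present = set(var_names)
    return [ch for ch in CHANNELS if f"{prefix}_{ch}" not in present]


def backfill_scalars(stats, expected_names):
    """Return the expected scalars, using NaN for every one that is absent."""
    filled = {name: math.nan for name in expected_names}
    filled.update({k: v for k, v in stats.items() if k in filled})
    return filled


# A granule with every channel except C04, like the 18:20 file above.
var_names = [f"{p}_{ch}" for p in ("CMI", "DQF") for ch in CHANNELS if ch != "C04"]
print(missing_channels(var_names, "CMI"), missing_channels(var_names, "DQF"))

stats = backfill_scalars(
    {"min_reflectance_factor_C03": 0.1},
    ["min_reflectance_factor_C03", "min_reflectance_factor_C04"],
)
print(stats)
```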


3. Aborted-scan placeholders with sentinel t values

Where: scattered through the archive; confirmed on 2023-05-03 (DOY 123).
Example: s3://noaa-goes16/ABI-L2-MCMIPF/2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc

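The _s and _e tokens in the filename are identical (scan start and end match to the tenth of a second), and the file's internal t coordinate decodes to ~2000-01-01T11:43:21, just before the J2000 epoch (2000-01-01T12:00:00) that t is counted from. That is 999 s before the epoch, which would be consistent with a -999 fill value rather than a real timestamp. The image content is otherwise present but useless, so this is clearly a placeholder for an aborted scan.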

Likely cause: ABI scheduler started a scan, aborted, and still wrote a stub product record.

Handling:

  1. Filename-level: mcmipf_archive.list_day_files drops any file where the _s token equals the _e token (_is_aborted_scan).
  2. Pipeline-level: validate_raw raises if the file's internal t is before 2017-01-01 (catches any sentinel that slips past #1).
  3. Pipeline-level: validate_raw also raises if abs(t - filename_scan_start) > 1 hour (catches mis-stamped files).
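The three checks above can be sketched with the standard library alone. The function names mirror the ones mentioned in the list, but the bodies below are assumptions written for illustration, not the actual mcmipf_archive / validate_raw code.

```python
import re
from datetime import datetime, timedelta, timezone

_TOKENS = re.compile(r"_s(\d{14})_e(\d{14})_c(\d{14})\.nc$")
EARLIEST_VALID_T = datetime(2017, 1, 1, tzinfo=timezone.utc)


def parse_stamp(token: str) -> datetime:
    """Parse a YYYYDDDHHMMSST stamp (day of year, tenths of a second) as UTC."""
    base = datetime.strptime(token[:13], "%Y%j%H%M%S").replace(tzinfo=timezone.utc)
    return base + timedelta(milliseconds=100 * int(token[13]))


def is_aborted_scan(url: str) -> bool:
    """Check 1: the scan-start and scan-end tokens are identical."""
    match = _TOKENS.search(url)
    if match is None:
        raise ValueError(f"unrecognised granule name: {url!r}")
    return match.group(1) == match.group(2)


def check_t(t: datetime, url: str) -> None:
    """Checks 2 and 3: reject sentinel times and times far from the filename stamp."""
    match = _TOKENS.search(url)
    if match is None:
        raise ValueError(f"unrecognised granule name: {url!r}")
    if t < EARLIEST_VALID_T:
        raise ValueError(f"t={t.isoformat()} predates 2017-01-01 (sentinel?)")
    if abs(t - parse_stamp(match.group(1))) > timedelta(hours=1):
        raise ValueError(f"t={t.isoformat()} is over 1 hour from the filename scan start")


aborted = ("s3://noaa-goes16/ABI-L2-MCMIPF/2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_"
           "s20231231055224_e20231231055224_c20231231102341.nc")
print(is_aborted_scan(aborted))
```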

4. Non-M6 scan-mode files mixed with operational M6

Where: scattered through the archive — confirmed on 2023-05-03 (DOY 123). Examples (both legitimate scans, not aborted):

  • OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_e20231231055049_…nc
  • OR_ABI-L2-MCMIPF-M4_G16_s20231231605205_e20231231610031_…nc

GOES-16 ABI has multiple operational modes:

  • M6 — standard 10-minute flex (full disk every 10 min).
  • M4 — continuous 5-minute full disk (used briefly during special-event coverage e.g. major hurricanes).
  • M3 — legacy 15-minute schedule, mostly pre-2019.

Lex-sorting full URLs interleaves modes incorrectly: M4 < M6 causes an M4 file at 10:50 to sort before every M6 file in the hour-10 prefix block.

Handling:

  • list_day_files sorts by the embedded sYYYYDDDHHMMSST token (a parse_scan_start_to_datetime-friendly stamp), not the full URL. M4 and M6 scans interleave correctly into chronological order.
  • We do not filter M4 scans out: they're real radiometry, same schema and codecs as M6 (verified). Only aborted placeholders are filtered (#3).
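The ordering pitfall is easy to reproduce. In this sketch the two M6 names are hypothetical, all names are trimmed after the scan-start token, and scan_start_key is an illustrative helper rather than the real list_day_files logic.

```python
import re

_SCAN_START = re.compile(r"_s(\d{14})")


def scan_start_key(url: str) -> str:
    """Sort key: the fixed-width YYYYDDDHHMMSST scan-start stamp."""
    match = _SCAN_START.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


urls = [
    "2023/123/10/OR_ABI-L2-MCMIPF-M6_G16_s20231231000205",  # hypothetical M6, 10:00
    "2023/123/10/OR_ABI-L2-MCMIPF-M6_G16_s20231231040205",  # hypothetical M6, 10:40
    "2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_s20231231050229",  # M4 scan at 10:50
]

lex_order = sorted(urls)                        # M4 jumps ahead of both M6 files
chrono_order = sorted(urls, key=scan_start_key)  # true scan-start order

print([scan_start_key(u) for u in lex_order])
print([scan_start_key(u) for u in chrono_order])
```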

5. Day with one channel-set entirely missing from a single granule (DQF)

Where: 2024-01-18 (DOY 018), hour 03:50 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/018/03/OR_ABI-L2-MCMIPF-M6_G16_s20240180350207_e20240180359521_c20240180400008.nc

This file has no DQF variables at all (none of DQF_C01..DQF_C16). At first glance it appeared to carry CMI imagery for all 16 channels, but the local check described below found only 10 CMI channels present. It tripped an earlier version of fill_missing_channels with "No DQF_* variable present — cannot infer ManifestArray template", because that version used an existing per-prefix variable as the metadata template for fill placeholders.

Likely cause: similar to #2 but more aggressive — a quality-flag generator outage that left every DQF channel out of the product.

Handling: fill_missing_channels synthesizes fill-only ManifestArrays for every missing channel directly from hard-coded CMI_METADATA / DQF_METADATA constants. This works regardless of how many channels are missing — including the zero-channels-present case — because no source template is required. The time axis stays continuous; readers see the missing channels as all-NaN for that t.

Confirmed working on the specific file above (downloaded locally as no_dqf.nc); 10 CMI channels are present and pass through unchanged while the 6 missing CMI + all 16 DQF channels are synthesized.
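The template-free approach can be sketched as follows. Everything here is a stand-in: ArrayMetadata, FillOnlyPlaceholder and synthesize_missing are hypothetical, the grid and chunk shapes and dtypes are assumed values, and the choice of which 10 CMI channels are present is arbitrary. Only the codec tuples are copied from the post-boundary column in section 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayMetadata:
    dtype: str
    shape: tuple
    chunk_shape: tuple
    codecs: tuple


GRID = (5424, 5424)   # assumed full-disk grid
CHUNKS = (226, 226)   # assumed chunk shape

# Hypothetical stand-ins for CMI_METADATA / DQF_METADATA.
METADATA = {
    "CMI": ArrayMetadata("int16", GRID, CHUNKS,
                         ("BytesCodec(little)", "Shuffle(elementsize=2)", "Zlib(level=1)")),
    "DQF": ArrayMetadata("int8", GRID, CHUNKS,
                         ("BytesCodec(endian=None)", "Shuffle(elementsize=1)", "Zlib(level=1)")),
}


@dataclass(frozen=True)
class FillOnlyPlaceholder:
    name: str
    metadata: ArrayMetadata
    chunk_refs: tuple = ()  # no chunk references, so every read yields the fill value


def synthesize_missing(var_names):
    """Build a placeholder for every absent CMI_Cnn / DQF_Cnn, with no template needed."""
    present = set(var_names)
    return {
        f"{prefix}_C{n:02d}": FillOnlyPlaceholder(f"{prefix}_C{n:02d}", meta)
        for prefix, meta in METADATA.items()
        for n in range(1, 17)
        if f"{prefix}_C{n:02d}" not in present
    }


# Ten CMI channels present and no DQF at all (channel choice is arbitrary).
present = [f"CMI_C{n:02d}" for n in range(1, 11)]
placeholders = synthesize_missing(present)
print(len(placeholders), "placeholders synthesized")
```

Because the metadata comes from constants, the same call also covers a file with no per-channel variables at all.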


6. Per-channel stats NaN for one or two channels in an otherwise-complete file

Where: 2024-01-26 (DOY 026), hour 19:40 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/026/19/OR_ABI-L2-MCMIPF-M6_G16_s20240261940204_e20240261949523_c20240261952567.nc

All 16 CMI and DQF channels are present, but every per-channel stat for C04 (reflectance) and C14 (brightness temperature) is NaN:

  • min/max/mean/std_dev_reflectance_factor_C04 = NaN
  • min/max/mean/std_dev_brightness_temperature_C14 = NaN

These are float scalars, so NaN is a legal storage value; it just signals that the algorithm couldn't compute a meaningful statistic for those channels in that scan (no valid pixels passed the QC threshold).

Likely cause: a single-scan outage on those two channels — the imagery is still there, but no pixels survived the algorithm's "good or conditionally usable" filter, so the aggregate stats are undefined.

Handling: make_mcmipf_schema marks every per-channel outlier_pixel_count and stat variable as nullable=True, so a one-off NaN doesn't fail validate_raw. (The cleaned schema already had nullable=True for the stacked stats, see #2's NaN-half-empty layout.)
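A small sketch of what nullable=True means for scalar validation. ScalarField and nan_violations are hypothetical names, the required field is chosen only for illustration, and the real make_mcmipf_schema output is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarField:
    name: str
    nullable: bool = False


def nan_violations(values, schema):
    """Return the names of non-nullable fields whose value is NaN."""
    return [
        field.name
        for field in schema
        if not field.nullable and math.isnan(values[field.name])
    ]


schema = [
    ScalarField("mean_reflectance_factor_C04", nullable=True),
    ScalarField("mean_brightness_temperature_C14", nullable=True),
    ScalarField("nominal_satellite_height"),  # illustrative required scalar
]
values = {
    "mean_reflectance_factor_C04": math.nan,
    "mean_brightness_temperature_C14": math.nan,
    "nominal_satellite_height": 35786.0,
}
print(nan_violations(values, schema))
```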


7. Same file listed under two hour-prefix directories

Where: 2024, DOY 272 / 273, and scattered other days
Example: each of the following lists the same underlying object:

s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc

For some single-digit hours, NOAA's bucket has entries under both the non-zero-padded prefix (.../272/0/) and the zero-padded prefix (.../272/00/). obstore.list returns each underlying file twice, once under each prefix. Scoped check on a single 15-day window (2024-09-25..10-09) found 105 such duplicates spread across days 271–273 and a few others.

The two URLs share the exact same _s (scan-start) and _c (creation) tokens — same file, two listing entries. This trips the per-batch is_unique check in ingest_all_days (monotonic=True, unique=False).

Handling: list_day_files deduplicates URLs by scan-start token after sorting. The first occurrence (by lex sort of the full URL after the scan-start-token sort) is kept; subsequent duplicates are dropped.
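The dedup rule can be sketched in a few lines. dedupe_by_scan_start is a hypothetical helper that follows the description above (sort by scan-start token, break ties by full URL, keep the first), applied to the two example URLs exactly as quoted.

```python
import re

_SCAN_START = re.compile(r"_s(\d{14})_")


def _token(url: str) -> str:
    match = _SCAN_START.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


def dedupe_by_scan_start(urls):
    """Sort by (scan-start token, full URL) and keep the first URL per token."""
    kept, seen = [], set()
    for url in sorted(urls, key=lambda u: (_token(u), u)):
        if _token(url) not in seen:
            seen.add(_token(url))
            kept.append(url)
    return kept


listing = [
    "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc",
    "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc",
]
unique = dedupe_by_scan_start(listing)
print(unique)
```

In this sketch the non-zero-padded entry survives, because "/" sorts before "0" in the tie-break on the full URL.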


Audit notes / known patterns

  • The aborted-scan placeholder pattern (#3) is the only one that yields a literal sentinel t value. All other anomalies preserve a believable t.
  • The "missing N channels of variable X" pattern (#2, #5) can affect any per-channel variable independently. With hard-coded CMI_METADATA / DQF_METADATA we can synthesize placeholders for zero-template cases too.
  • Lex-sorting filenames across mode tokens (#4) is the only ordering pitfall we've found at the filename level. Within a given mode, lex sort of the embedded _s token matches scan-start order.
  • The codec change (#1) is the only inconsistency that cannot be papered over inside a single virtual dataset — it forces a hard boundary between pre- and post-2023-04-19 data.