@TomNicholas
Created June 2, 2026 14:34
GOES-16 virtual store

GOES-16 ABI-L2-MCMIPF data inconsistencies

A running log of every real-world anomaly we've hit while ingesting the s3://noaa-goes16/ABI-L2-MCMIPF/ archive into Icechunk via VirtualiZarr.

Each entry: what the file/day looks like, how it surfaced, how we handle it, and (when known) the most likely upstream cause.


1. Compression codec changes mid-archive (the "codec change marker")

Boundary file: s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc (2023-04-19 15:00 UTC, DOY 109). This is the most annoying inconsistency in the archive because it cannot be papered over inside a single virtual dataset — see "Why it matters" below.

At this scan, NOAA added a Shuffle (byte-shuffle) filter to the codec pipeline used for the per-channel CMI_C{NN} and DQF_C{NN} variables. Verified by probing one file on each side of the boundary:

| Variable | Pre-change (≤ 14:50 UTC, 2023-04-19) | Post-change (≥ 15:00 UTC) |
| --- | --- | --- |
| CMI_C* codecs | (BytesCodec(little), Zlib(level=1)) | (BytesCodec(little), Shuffle(elementsize=2), Zlib(level=1)) |
| DQF_C* codecs | (BytesCodec(endian=None), Zlib(level=1)) | (BytesCodec(endian=None), Shuffle(elementsize=1), Zlib(level=1)) |

Everything else (dtype, fill_value, chunk_shape, BytesCodec endianness) is identical on both sides. Shuffle improves compression ratio on multi-byte integer data, so this was almost certainly a deliberate algorithm/encoding update on NOAA's end.

Why it matters: VirtualiZarr represents each variable in each source file as a ManifestArray, and concatenating ManifestArrays into a single virtual zarr array requires that every contributing array share an identical codec pipeline. The codec change therefore prevents creating one continuous virtual zarr array across the entire GOES-16 archive. Any single virtual dataset must live entirely within one codec regime.

Handling:

  • mcmipf_archive.CODEC_CHANGE_MARKER names the first post-boundary file.
  • iter_archive_by_day / iter_archive default to start=CODEC_CHANGE_MARKER, so iterating the archive yields only the new-codec era.
  • EXPECTED_CODECS_BY_PREFIX (in mcmipf_pipeline) codifies the post-boundary codec pipeline. The validator (validate_raw's _check_codecs) flags pre-boundary files as a codec mismatch, so the boundary is enforced at ingest time as well.
  • The icechunk group used for ingest is named ABI-L2-MCMIPF/post-2023-04-19 so a sibling group could be used later for the older-codec data without collision.
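
A minimal sketch of the kind of comparison a check like _check_codecs could make, assuming a codec pipeline can be modelled as a tuple of descriptor strings. The function name find_codec_mismatches and the dict layout are hypothetical and not the pipeline's real API; the expected pipelines are copied from the table above.

```python
# Hypothetical sketch: codec pipelines are modelled as tuples of strings.
EXPECTED_CODECS_BY_PREFIX = {
    "CMI_C": ("BytesCodec(little)", "Shuffle(elementsize=2)", "Zlib(level=1)"),
    "DQF_C": ("BytesCodec(endian=None)", "Shuffle(elementsize=1)", "Zlib(level=1)"),
}


def find_codec_mismatches(file_codecs):
    """Return {variable: (expected, actual)} for every variable whose codec
    pipeline differs from the post-boundary expectation for its prefix."""
    mismatches = {}
    for name, actual in file_codecs.items():
        for prefix, expected in EXPECTED_CODECS_BY_PREFIX.items():
            if name.startswith(prefix) and tuple(actual) != expected:
                mismatches[name] = (expected, tuple(actual))
    return mismatches


# One variable per prefix, as seen on each side of the boundary.
pre_boundary = {
    "CMI_C01": ("BytesCodec(little)", "Zlib(level=1)"),
    "DQF_C01": ("BytesCodec(endian=None)", "Zlib(level=1)"),
}
post_boundary = {
    "CMI_C01": EXPECTED_CODECS_BY_PREFIX["CMI_C"],
    "DQF_C01": EXPECTED_CODECS_BY_PREFIX["DQF_C"],
}

print(find_codec_mismatches(pre_boundary))   # both variables reported
print(find_codec_mismatches(post_boundary))  # empty dict
```

Comparing whole tuples rather than individual codecs keeps the check sensitive to ordering as well as membership, which matters because the decode order of a pipeline is part of its identity.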

2. Day with one channel entirely missing from a single granule

Where: 2023-06-14 (DOY 165), 18:20 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2023/165/18/OR_ABI-L2-MCMIPF-M6_G16_s20231651820208_e20231651829527_c20231651832574.nc

This single 18:20 file is missing all C04 variables (CMI_C04, DQF_C04, and every C04 stat / band_id / band_wavelength). The other 143 files of the day have the full 16-channel set.

Likely cause: a transient channel-4 outage during MCMIPF product generation; NOAA still emitted the file but omitted C04.

Handling: fill_missing_channels back-fills the missing channel with a fill-only ManifestArray (CMI/DQF) plus NaN scalars (stats), so the file still produces a complete cleaned dataset with C04 readable as NaN.
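
Before anything can be back-filled, the missing channels have to be detected. A minimal sketch of that detection step, assuming variables follow the `<prefix>_Cnn` naming used in this document; the helper name missing_channels is hypothetical, and the real fill_missing_channels goes on to build ManifestArray placeholders, which this sketch does not attempt.

```python
import re

ALL_CHANNELS = [f"C{n:02d}" for n in range(1, 17)]  # C01..C16


def missing_channels(variable_names, prefix):
    """Return the channels (e.g. 'C04') that have no `<prefix>_Cnn` variable."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(C\d{{2}})$")
    present = {m.group(1) for name in variable_names if (m := pattern.match(name))}
    return [ch for ch in ALL_CHANNELS if ch not in present]


# Variable names for a granule like the 18:20 file: everything except C04.
names = [f"{p}_{ch}" for p in ("CMI", "DQF") for ch in ALL_CHANNELS if ch != "C04"]

print(missing_channels(names, "CMI"))
print(missing_channels(names, "DQF"))
```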


3. Aborted-scan placeholders with sentinel t values

Where: scattered through the archive; confirmed on 2023-05-03 (DOY 123).
Example: s3://noaa-goes16/ABI-L2-MCMIPF/2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc

The _s and _e tokens in the filename are identical (scan start and end match to the last digit), and the file's internal t coordinate is a sentinel value next to the J2000 epoch (~2000-01-01T11:43:21). The image content is otherwise present but useless: this is clearly a placeholder for an aborted scan.

Likely cause: ABI scheduler started a scan, aborted, and still wrote a stub product record.

Handling:

  1. Filename-level: mcmipf_archive.list_day_files drops any file where the _s token equals the _e token (_is_aborted_scan).
  2. Pipeline-level: validate_raw raises if the file's internal t is before 2017-01-01 (catches any sentinel that slips past #1).
  3. Pipeline-level: validate_raw also raises if abs(t - filename_scan_start) > 1 hour (catches mis-stamped files).
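
The three checks above can be sketched with the standard library alone. This is an illustration under assumptions, not the actual _is_aborted_scan or validate_raw code: the names parse_stamp, is_aborted_scan, and check_t are hypothetical, and t is assumed to be a timezone-aware datetime.

```python
import re
from datetime import datetime, timedelta, timezone

_TOKENS = re.compile(r"_s(\d{14})_e(\d{14})_c(\d{14})")


def parse_stamp(token):
    """Parse YYYYDDDHHMMSST (year, day-of-year, time, tenths of a second)."""
    base = datetime.strptime(token[:13], "%Y%j%H%M%S").replace(tzinfo=timezone.utc)
    return base + timedelta(milliseconds=100 * int(token[13]))


def is_aborted_scan(url):
    """True when the _s and _e tokens are identical (zero-length scan)."""
    m = _TOKENS.search(url)
    return m is not None and m.group(1) == m.group(2)


def check_t(t, url, earliest=datetime(2017, 1, 1, tzinfo=timezone.utc)):
    """Raise if t looks like a sentinel or is > 1 hour from the filename's scan start."""
    if t < earliest:
        raise ValueError(f"t={t} predates {earliest}: likely a sentinel")
    start = parse_stamp(_TOKENS.search(url).group(1))
    if abs(t - start) > timedelta(hours=1):
        raise ValueError(f"t={t} is more than 1 hour from scan start {start}")
```

Running the filename filter first is cheap because it needs no file access; the two t checks act as a backstop for anything the filename does not reveal.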

4. Non-M6 scan-mode files mixed with operational M6

Where: scattered through the archive — confirmed on 2023-05-03 (DOY 123). Examples (both legitimate scans, not aborted):

  • OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_e20231231055049_…nc
  • OR_ABI-L2-MCMIPF-M4_G16_s20231231605205_e20231231610031_…nc

GOES-16 ABI has multiple operational modes:

  • M6 — standard 10-minute flex (full disk every 10 min).
  • M4 — continuous 5-minute full disk (used briefly during special-event coverage e.g. major hurricanes).
  • M3 — legacy 15-minute schedule, mostly pre-2019.

Lex-sorting full URLs interleaves modes incorrectly: M4 < M6 causes an M4 file at 10:50 to sort before every M6 file in the hour-10 prefix block.

Handling:

  • list_day_files sorts by the embedded sYYYYDDDHHMMSST token (a parse_scan_start_to_datetime-friendly stamp), not the full URL. M4 and M6 scans interleave correctly into chronological order.
  • We do not filter M4 scans out: they're real radiometry, same schema and codecs as M6 (verified). Only aborted placeholders are filtered (#3).
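
A small sketch of why the sort key matters. The helper names are hypothetical, and the filenames are built with placeholder _e and _c tokens because only the mode and _s token affect ordering; the M6 start times are made up for illustration, while the M4 one is the 10:50 scan cited above.

```python
import re

_SCAN_START = re.compile(r"_s(\d{14})_")


def scan_start_token(url):
    """Extract the 14-digit scan-start stamp (without the leading 's')."""
    return _SCAN_START.search(url).group(1)


def fake_name(mode, start):
    # Placeholder _e/_c tokens: only the mode and the _s token matter here.
    return f"OR_ABI-L2-MCMIPF-{mode}_G16_s{start}_eEEEEEEEEEEEEEE_cCCCCCCCCCCCCCC.nc"


m6_early = fake_name("M6", "20231231000205")  # illustrative M6 scan at 10:00
m6_late = fake_name("M6", "20231231040205")   # illustrative M6 scan at 10:40
m4 = fake_name("M4", "20231231050229")        # M4 scan at 10:50

urls = [m6_late, m4, m6_early]

lexical = sorted(urls)                            # M4 jumps to the front
chronological = sorted(urls, key=scan_start_token)  # M4 lands last, as it should
```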

5. Day with one channel-set entirely missing from a single granule (DQF)

Where: 2024-01-18 (DOY 018), 03:50 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/018/03/OR_ABI-L2-MCMIPF-M6_G16_s20240180350207_e20240180359521_c20240180400008.nc

On first inspection this file appeared to have CMI imagery for all 16 channels but no DQF variables at all (none of DQF_C01..DQF_C16). It tripped fill_missing_channels with "No DQF_* variable present — cannot infer ManifestArray template", because at the time the function used an existing per-prefix variable as the metadata template for fill placeholders.

Likely cause: similar to #2 but more aggressive — a quality-flag generator outage that left every DQF channel out of the product.

Handling: fill_missing_channels synthesizes fill-only ManifestArrays for every missing channel directly from hard-coded CMI_METADATA / DQF_METADATA constants. This works regardless of how many channels are missing — including the zero-channels-present case — because no source template is required. The time axis stays continuous; readers see the missing channels as all-NaN for that t.

Confirmed working on the specific file above (downloaded locally as no_dqf.nc); 10 CMI channels are present and pass through unchanged while the 6 missing CMI + all 16 DQF channels are synthesized.


6. Per-channel stats NaN for one or two channels in an otherwise-complete file

Where: 2024-01-26 (DOY 026), 19:40 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/026/19/OR_ABI-L2-MCMIPF-M6_G16_s20240261940204_e20240261949523_c20240261952567.nc

All 16 CMI and DQF channels are present, but every per-channel stat for C04 (reflectance) and C14 (brightness temperature) is NaN:

  • min/max/mean/std_dev_reflectance_factor_C04 = NaN
  • min/max/mean/std_dev_brightness_temperature_C14 = NaN

These are float scalars, so NaN is a legal storage value; it just signals that the algorithm couldn't compute a meaningful statistic for those channels in that scan (no valid pixels passed the QC threshold).

Likely cause: a single-scan outage on those two channels — the imagery is still there, but no pixels survived the algorithm's "good or conditionally usable" filter, so the aggregate stats are undefined.

Handling: make_mcmipf_schema marks every per-channel outlier_pixel_count and stat variable as nullable=True, so a one-off NaN doesn't fail validate_raw. (The cleaned schema already had nullable=True for the stacked stats, see #2's NaN-half-empty layout.)


12. Corrupted compressed chunk(s) in upstream NetCDFs

Where: at least one chunk somewhere in the archive — first instance found:

  • variable: CMI_C13
  • timestep: t = 2023-10-13T14:25:06.402946 (t-index 25422 on the v1 main-branch ingest of post-2023-04-19)
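  • presumed source: s3://noaa-goes16/ABI-L2-MCMIPF/2023/286/14/OR_ABI-L2-MCMIPF-M6_G16_s20232861425*_*.nc (glob derived from t; if t is the scan midpoint, as in GOES-R products, the file's actual _s token would be a few minutes earlier)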
  • offending chunk: the chunk covering the NYC pixel (iy=740, ix=2751) — i.e. chunk coordinate (3, 12) at 226×226 chunking.
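
The pixel-to-chunk mapping quoted above is plain integer division. A one-function sketch, with the hypothetical name chunk_coordinate and the 226×226 chunk shape taken from the text:

```python
def chunk_coordinate(iy, ix, chunk_shape=(226, 226)):
    """Map a pixel index to the (row, col) index of the chunk containing it."""
    return iy // chunk_shape[0], ix // chunk_shape[1]


nyc_chunk = chunk_coordinate(740, 2751)
print(nyc_chunk)  # (3, 12): rows 678..903, columns 2712..2937
```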

Symptom: reading the chunk raises Error -3 while decompressing data: incorrect header check from zlib. The compressed bytes themselves are malformed at the zlib-header level.

Why it matters: not our bug — virtualizarr stores a virtual reference to the source file's byte range; the icechunk store never decompresses on write. The corruption lives in NOAA's upstream file on s3://noaa-goes16. Other chunks in the same variable / same file decode fine; only this one is bad.

Status: not yet fixed — accepting it for now. The bad chunk errors at read time but doesn't trip the ingest. If a sampled verify happens to touch the bad chunk it'll error out of the _run_batch_verifications retry path, but at the current verification sampling rate the probability of hitting it is small.

Things we'd want to do later:

  • Drive a sweep across the archive to find all such bad chunks (not just the one we stumbled into via a NYC-pixel time-series plot).
  • Decide on a policy: silently fall back to fill_value on read, log a per-chunk warning, or maintain a manifest of "known bad chunks" that the verifier and pipeline both consult. None of these fix the upstream bytes, but they let consumers handle the gap cleanly.
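
A minimal sketch of how the "log a warning and fall back" and "known bad chunks" policies could combine, assuming the caller substitutes fill_value whenever None comes back. Everything here is hypothetical (the function name, the key layout, the logger name), and it only covers the zlib stage: a real read path would also have to undo the Shuffle filter.

```python
import logging
import zlib

logger = logging.getLogger("mcmipf.read")

# Hypothetical manifest of known-bad chunks: (variable, t-index, chunk coordinate).
KNOWN_BAD_CHUNKS = {("CMI_C13", 25422, (3, 12))}


def decompress_or_none(raw, key):
    """Return the decompressed bytes, or None if the chunk is undecodable.

    A None result tells the caller to substitute fill_value for the chunk.
    """
    if key in KNOWN_BAD_CHUNKS:
        logger.warning("skipping known-bad chunk %s", key)
        return None
    try:
        return zlib.decompress(raw)
    except zlib.error as exc:
        # Malformed compressed bytes, e.g. a bad zlib header.
        logger.warning("corrupt chunk %s: %s", key, exc)
        return None
```

Consulting the manifest before decompressing means a known-bad chunk never costs a fetch-and-fail round trip, while the except branch still catches bad chunks nobody has catalogued yet.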

11. DQF valid_range expanded mid-archive

Where: 2019-DOY-099 11:10 UTC (s20190991110224) is the first observed instance. The actual boundary lies somewhere between 2017-DOY-304 (source still [0, 3]) and that file, likely at the point where NOAA added a new DQF flag value (4). Affects every DQF_C{NN}.

Example: validator output comparing 2019 source vs an icechunk store seeded from a 2017 file (FIRST_OPERATIONAL_FILE):

[attr] DQF_C01.attrs['valid_range']: icechunk=[0, 3] source=array([0, 4], dtype=int8)
...
[attr] DQF_C16.attrs['valid_range']: icechunk=[0, 3] source=array([0, 4], dtype=int8)

Unlike issue #10 (which was about NOAA adding a missing attr), here NOAA widened an existing attr's range. The drift can also happen inside a single era — 2017-DOY-304 source has [0, 3] and 2019 source has [0, 4] but both live in our pre-2023-04-19 group. So neither setdefault nor "harvest from first file" can satisfy every source file in the era simultaneously.

Why it matters: the valid_range attr documents which int values a consumer might legitimately encounter. If the icechunk-stored range is narrower than the actual data range, consumers might silently mishandle flag values they think are illegal. If it's wider, consumers just see a slightly more inclusive range than older NOAA files claimed — benign.

Handling:

  • mcmipf_pipeline._CANONICAL_VAR_ATTR_OVERRIDES carries the canonical (wider/modern) value: valid_range=[0, 4] for every DQF_C{NN}. finalize_per_channel writes this unconditionally (override mode), regardless of what source has. So icechunk always stores [0, 4].
  • maintenance/backfill_canonical_attrs.py handles both modes — applies setdefault attrs (issue #10) AND override attrs (this issue) to existing committed groups.
  • The verifier's _compare_attrs special-cases valid_range: instead of equality, it accepts the icechunk value as a superset of source's. So icechunk=[0, 4] matches source=[0, 3] (early-era source, narrower) AND source=[0, 4] (modern source, equal). The reverse — icechunk narrower than source — still fails (would misrepresent the data).
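
The superset rule used for valid_range can be sketched as an interval-containment test. The function name is hypothetical, and the real _compare_attrs may handle numpy arrays and dtypes that this sketch ignores.

```python
def valid_range_is_superset(stored, source):
    """True when the stored [lo, hi] range fully contains the source range."""
    stored_lo, stored_hi = stored
    source_lo, source_hi = source
    return stored_lo <= source_lo and source_hi <= stored_hi


print(valid_range_is_superset([0, 4], [0, 3]))  # True: early-era source, narrower
print(valid_range_is_superset([0, 4], [0, 4]))  # True: modern source, equal
print(valid_range_is_superset([0, 3], [0, 4]))  # False: stored range too narrow
```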

10. NOAA added canonical CF attrs over time

Where: 2019-DOY-036 23:45 UTC (s20190362345319) is the first instance we hit, but the pattern almost certainly affects every variable NOAA improved between 2017 and present. Specifically observed: x_image_bounds.units and y_image_bounds.units were absent in early-era files and added later (value: "rad").

Example: validator output comparing the 2019 file against icechunk:

[attr] x_image_bounds.attrs['units']: present in source only
[attr] y_image_bounds.attrs['units']: present in source only

The icechunk-stored variant lacks the attr because our pipeline harvests attrs from whichever file was first in the batch that created the schema — for pre-2023-04-19 that's a 2017 file from the start of the era, which didn't carry the units attr yet.

Why it matters: pure metadata-richness gap, not a data defect. Consumers reading the icechunk store get less-complete CF metadata than NOAA's current files would provide. Strict verification flags this as a hard "present-in-source-only" failure.

Handling:

  • mcmipf_pipeline._CANONICAL_VAR_ATTRS carries a small table of CF attrs we know are missing from early-era files. finalize_per_channel applies these via setdefault so source values are preserved when present, and the canonical value is added when absent. Currently: units='rad' on x_image_bounds and y_image_bounds.
  • For groups already committed before the canonical-attrs step was added, maintenance/backfill_canonical_attrs.py opens a writable session and patches the missing attrs in place, without re-ingesting data.
  • The verifier stays strict on "attr present in source only" — that's the signal that surfaces NEW metadata drift we haven't yet added to _CANONICAL_VAR_ATTRS. Add more entries as new instances are found.
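
The difference between the setdefault mode used here and the override mode used for valid_range (issue #11) can be shown on plain dicts. This is a sketch only: the table contents come from the text, but the function name apply_canonical_attrs and the single-variable override table are assumptions made for illustration.

```python
# Applied with setdefault: source values win when present (issue #10).
CANONICAL_VAR_ATTRS = {
    "x_image_bounds": {"units": "rad"},
    "y_image_bounds": {"units": "rad"},
}
# Applied unconditionally: the canonical value always wins (issue #11).
CANONICAL_VAR_ATTR_OVERRIDES = {
    "DQF_C01": {"valid_range": [0, 4]},
}


def apply_canonical_attrs(var_name, attrs):
    """Return a copy of attrs with canonical defaults and overrides applied."""
    out = dict(attrs)
    for key, value in CANONICAL_VAR_ATTRS.get(var_name, {}).items():
        out.setdefault(key, value)  # only fills a gap
    out.update(CANONICAL_VAR_ATTR_OVERRIDES.get(var_name, {}))  # always replaces
    return out


early = apply_canonical_attrs("x_image_bounds", {})
modern = apply_canonical_attrs("x_image_bounds", {"units": "rad", "axis": "X"})
dqf = apply_canonical_attrs("DQF_C01", {"valid_range": [0, 3]})
```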

9. ABI channel wavelength reported with different precision across the archive

Where: 2017-DOY-160 11:45 UTC (s20171601145352) and likely the whole pre-operational / early-pre-op era. The metadata reporting precision was tightened sometime between 2017 and 2023.

Example: validator output comparing icechunk-stored values (harvested from a 2023 operational reference file) against a 2017 source file:

| Channel | icechunk (operational, 2023) | source (early-pre-op, 2017) |
| --- | --- | --- |
| C07 | 3.9 μm | 3.89 μm |
| C08 | 6.185 μm | 6.17 μm |
| C09 | 6.95 μm | 6.93 μm |
| C11 | 8.5 μm | 8.44 μm |
| C13 | 10.35 μm | 10.33 μm |
| C14 | 11.2 μm | 11.19 μm |
| C15 | 12.3 μm | 12.27 μm |
| C16 | 13.3 μm | 13.27 μm |

The instrument's spectral filters didn't change — only NOAA's documentation precision did. The icechunk-stored values match the GOES-R ABI published nominal central wavelengths (3.9, 6.185, 6.95, 8.5, 10.35, 11.2, 12.3, 13.3 μm), which is the canonical reference for downstream consumers.

Why it matters: the differences are 0.01–0.06 μm absolute and at most about 0.7% relative (C11: 0.06 / 8.5). That is nowhere near a real per-channel spectral shift, but well above np.allclose's default 1e-5 relative tolerance. Without an explicit tolerance the verifier flags every pre-op file as failing on the wavelength of each channel listed above.

Handling: mcmipf_verifier._CONSOLIDATED_COORDS now carries per-coord rtol / atol. The wavelength entry uses rtol=0.01, atol=0.1 (1% relative + 0.1 μm absolute) — still catches any real wavelength corruption but accepts NOAA's metadata-precision drift. The band entry keeps exact-equality (it's an integer 1..16).
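
To see the effect of the tolerances, the sketch below applies the closeness test documented for numpy.isclose, |a - b| <= atol + rtol * |b|, to the wavelength pairs from the table. It is written in plain Python so it runs without numpy; the helper names are hypothetical and this is not the verifier's code.

```python
ICECHUNK_UM = {"C07": 3.9, "C08": 6.185, "C09": 6.95, "C11": 8.5,
               "C13": 10.35, "C14": 11.2, "C15": 12.3, "C16": 13.3}
SOURCE_2017_UM = {"C07": 3.89, "C08": 6.17, "C09": 6.93, "C11": 8.44,
                  "C13": 10.33, "C14": 11.19, "C15": 12.27, "C16": 13.27}


def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Same test as numpy.isclose: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


def all_close(stored, source, **tol):
    """True when every channel in `stored` is close to its `source` value."""
    return all(is_close(stored[ch], source[ch], **tol) for ch in stored)


strict = all_close(ICECHUNK_UM, SOURCE_2017_UM)                        # default tolerances
relaxed = all_close(ICECHUNK_UM, SOURCE_2017_UM, rtol=0.01, atol=0.1)  # verifier tolerances
max_abs_diff = max(abs(ICECHUNK_UM[ch] - SOURCE_2017_UM[ch]) for ch in ICECHUNK_UM)
```

With these inputs the default tolerances reject the pair and the relaxed ones accept it, while a genuinely wrong value (say 9.61 μm stored against 8.5 μm) would still fail the relaxed test.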


8. Scalar telemetry vars NaN or entirely missing in early-2017 files

Where: at least 2017-DOY-086 22:30 UTC (s20170862230380); likely scattered across early-2017 files generally.
Example: the validate_raw pandera schema rejected the file above with:

SCHEMA → SERIES_CONTAINS_NULLS:
  column: 'percent_uncorrectable_GRB_errors'
  check: 'nullable'
  error: 'non-nullable DataArray contains null values'

percent_uncorrectable_GRB_errors was present in the file as a single NaN. The same class of anomaly is possible (though not yet observed) for the other scalar / 1-D telemetry vars: percent_uncorrectable_L0_errors, geospatial_lat_lon_extent, nominal_satellite_*, the algorithm/dynamic containers, time_bounds, x_image_bounds, y_image_bounds. We've also preemptively accommodated the case where any of these is missing entirely from a source file (not yet seen but plausible).

Why it matters: a single bad-data scan in a batch would fail validate_raw and abort the entire batch. Even with the schema check relaxed, an entirely-missing variable would then cause the cross-file xr.concat in open_virtual_mfdataset to fail — default data_vars="all" requires every variable to be present in every input.

Handling:

  • Both schemas (make_mcmipf_schema, make_cleaned_mcmipf_schema) mark every entry in _EXPECTED_SHARED_VARS as nullable=True, required=False so NaN values pass and absence doesn't fail the per-file check.
  • A new pipeline step fill_missing_scalar_metadata synthesizes a NaN-/NaT-/0-filled placeholder for any expected scalar that's absent from the source, so all files in a batch carry the same variable set and the concat can proceed.
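
The idea behind the fill step can be sketched on plain dicts: after filling, every file in a batch exposes the same variable names, which is what the concat needs. The names below are hypothetical stand-ins (the real step works on virtual datasets, and the real expected-variable table is _EXPECTED_SHARED_VARS); only two of the telemetry variables are modelled.

```python
import math

# Hypothetical subset of expected scalar variables and their placeholder values.
EXPECTED_SCALAR_DEFAULTS = {
    "percent_uncorrectable_GRB_errors": math.nan,
    "percent_uncorrectable_L0_errors": math.nan,
}


def fill_missing_scalars(variables):
    """Return a copy of `variables` with a placeholder for each absent scalar."""
    filled = dict(variables)
    for name, placeholder in EXPECTED_SCALAR_DEFAULTS.items():
        filled.setdefault(name, placeholder)  # existing values are left untouched
    return filled


complete = {"percent_uncorrectable_GRB_errors": 0.0,
            "percent_uncorrectable_L0_errors": 0.0}
partial = {"percent_uncorrectable_L0_errors": 0.0}  # GRB variable absent

batch = [fill_missing_scalars(f) for f in (complete, partial)]
same_variable_set = batch[0].keys() == batch[1].keys()
```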

7. Same file listed under two hour-prefix directories

Where: 2024-DOY-272 / 273 and scattered other days.
Example: both of the following URLs list the same underlying object:

s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc

For some single-digit hours, NOAA's bucket has entries under both the non-zero-padded prefix (.../272/0/) and the zero-padded prefix (.../272/00/). obstore.list returns each underlying file twice, once under each prefix. Scoped check on a single 15-day window (2024-09-25..10-09) found 105 such duplicates spread across days 271–273 and a few others.

The two URLs share the exact same _s (scan-start) and _c (creation) tokens — same file, two listing entries. This trips the per-batch is_unique check in ingest_all_days (monotonic=True, unique=False).

Handling: list_day_files sorts URLs by scan-start token, breaking ties by lex order of the full URL, then deduplicates by scan-start token: the first URL for each token is kept and subsequent duplicates are dropped.
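
A minimal sketch of that sort-then-dedupe step, using the two example URLs above with their elided _e tokens left as they are. The function name sort_and_dedupe is hypothetical and this is not the actual list_day_files implementation.

```python
import re

_SCAN_START = re.compile(r"_s(\d{14})_")


def sort_and_dedupe(urls):
    """Sort by (scan-start token, full URL) and keep the first URL per token."""
    keyed = sorted((_SCAN_START.search(u).group(1), u) for u in urls)
    seen = set()
    kept = []
    for token, url in keyed:
        if token not in seen:
            seen.add(token)
            kept.append(url)
    return kept


padded = ("s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/"
          "OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc")
unpadded = ("s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/"
            "OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc")

kept = sort_and_dedupe([padded, unpadded])
```

In this sketch the non-zero-padded URL survives, because "/" sorts before "0" in the tie-break. Which of the two is kept does not matter for the data, since both listings refer to the same object.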


Audit notes / known patterns

  • The aborted-scan placeholder pattern (#3) is the only one that yields a literal sentinel t value. All other anomalies preserve a believable t.
  • The "missing N channels of variable X" pattern (#2, #5) can affect any per-channel variable independently. With hard-coded CMI_METADATA / DQF_METADATA we can synthesize placeholders for zero-template cases too.
  • Lex-sorting filenames across mode tokens (#4) is the only ordering pitfall we've found at the filename level. Within a given mode, lex sort of the embedded _s token matches scan-start order.
  • The codec change (#1) is the only inconsistency that cannot be papered over inside a single virtual dataset — it forces a hard boundary between pre- and post-2023-04-19 data.