A running log of every real-world anomaly we've hit while ingesting the
s3://noaa-goes16/ABI-L2-MCMIPF/ archive into Icechunk via VirtualiZarr.
Each entry: what the file/day looks like, how it surfaced, how we handle it, and (when known) the most likely upstream cause.
Boundary file: s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc
(2023-04-19 15:00 UTC, DOY 109). This is the most annoying inconsistency
in the archive because it cannot be papered over inside a single virtual
dataset — see "Why it matters" below.
At this scan, NOAA enabled the Shuffle byte-shuffle filter in the
codec pipeline used for the per-channel CMI_C{NN} and DQF_C{NN}
variables. Verified by probing one file on each side of the boundary:
| | Pre-change (≤ 14:50 UTC, 2023-04-19) | Post-change (≥ 15:00 UTC) |
|---|---|---|
| CMI_C* codecs | (BytesCodec(little), Zlib(level=1)) | (BytesCodec(little), Shuffle(elementsize=2), Zlib(level=1)) |
| DQF_C* codecs | (BytesCodec(endian=None), Zlib(level=1)) | (BytesCodec(endian=None), Shuffle(elementsize=1), Zlib(level=1)) |
Everything else (dtype, fill_value, chunk_shape, BytesCodec endianness)
is identical on both sides. Shuffle improves compression ratio on multi-byte
integer data, so this was almost certainly a deliberate algorithm/encoding
update on NOAA's end.
Why it matters: VirtualiZarr produces one ManifestArray per variable per
source file, and concatenating ManifestArrays into a single virtual zarr array
requires that every contributing array share an identical codec pipeline.
The codec change therefore prevents creating one continuous virtual zarr
array across the entire GOES-16 archive. Any single virtual dataset must
live entirely within one codec regime.
Handling:
- mcmipf_archive.CODEC_CHANGE_MARKER names the first post-boundary file.
- iter_archive_by_day / iter_archive default to start=CODEC_CHANGE_MARKER, so iterating the archive yields only the new-codec era.
- EXPECTED_CODECS_BY_PREFIX (in mcmipf_pipeline) codifies the post-boundary codec pipeline. The validator (validate_raw's _check_codecs) flags pre-boundary files as a codec mismatch, so the boundary is enforced at ingest time as well.
- The icechunk group used for ingest is named ABI-L2-MCMIPF/post-2023-04-19 so a sibling group could be used later for the older-codec data without collision.
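As a rough illustration of how a file can be assigned to a codec regime from its name alone, the sketch below compares the scan-start token against the boundary file's token. The helper names (scan_start_token, codec_regime) and the constant are hypothetical and are not the contents of mcmipf_archive; only the boundary filename comes from this entry.

```python
import re

# Scan-start token (sYYYYDDDHHMMSSt) of the boundary file named in this entry.
CODEC_CHANGE_SCAN_START = "20231091500205"

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def scan_start_token(url: str) -> str:
    """Extract the 14-digit scan-start token from a GOES filename or URL."""
    match = _SCAN_START_RE.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


def codec_regime(url: str) -> str:
    """Return "post" for files at or after the boundary scan, else "pre"."""
    # Tokens are fixed-width digit strings, so string order is chronological.
    return "post" if scan_start_token(url) >= CODEC_CHANGE_SCAN_START else "pre"


boundary = (
    "s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/"
    "OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc"
)
print(codec_regime(boundary))  # post
```

Classifying by filename is cheap, but it only restates the boundary. A codec check on the opened file is still what catches a file that disagrees with its regime.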
Where: 2023-06-14 (DOY 165), hour 18:20 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2023/165/18/OR_ABI-L2-MCMIPF-M6_G16_s20231651820208_e20231651829527_c20231651832574.nc
This single 18:20 file is missing all C04 variables (CMI_C04, DQF_C04,
and every C04 stat / band_id / band_wavelength). The other 143 files of
the day have the full 16-channel set.
Likely cause: a transient channel-4 outage during MCMIPF product generation; NOAA still emitted the file but omitted C04.
Handling: fill_missing_channels back-fills the missing channel with a
fill-only ManifestArray (CMI/DQF) plus NaN scalars (stats), so the file
still produces a complete cleaned dataset with C04 readable as NaN.
Where: scattered through the archive — confirmed on 2023-05-03 (DOY 123).
Example: s3://noaa-goes16/ABI-L2-MCMIPF/2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc
The _s and _e tokens in the filename are identical (same to-the-second
scan start and end), and the file's internal t coordinate is set to the
J2000 epoch (~2000-01-01T11:43:21). The image content is otherwise present
but useless — clearly a placeholder for an aborted scan.
Likely cause: ABI scheduler started a scan, aborted, and still wrote a stub product record.
Handling:
1. Filename-level: mcmipf_archive.list_day_files drops any file where the _s token equals the _e token (_is_aborted_scan).
2. Pipeline-level: validate_raw raises if the file's internal t is before 2017-01-01 (catches any sentinel that slips past step 1).
3. Pipeline-level: validate_raw also raises if abs(t - filename_scan_start) > 1 hour (catches mis-stamped files).
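The three checks above can be sketched with the standard library alone. This is a minimal illustration, not the actual _is_aborted_scan or validate_raw code: the function names, the regex, and the error messages are assumptions, and the trailing digit of the token is treated as tenths of a second.

```python
import re
from datetime import datetime, timedelta, timezone

_TOKENS_RE = re.compile(r"_s(\d{14})_e(\d{14})_")
EARLIEST_VALID_T = datetime(2017, 1, 1, tzinfo=timezone.utc)


def is_aborted_scan(url: str) -> bool:
    """True when the scan-start and scan-end tokens are identical."""
    match = _TOKENS_RE.search(url)
    return match is not None and match.group(1) == match.group(2)


def parse_scan_start(url: str) -> datetime:
    """Parse sYYYYDDDHHMMSSt; the final digit is tenths of a second."""
    match = _TOKENS_RE.search(url)
    if match is None:
        raise ValueError(f"no scan tokens in {url!r}")
    token = match.group(1)
    whole = datetime.strptime(token[:13], "%Y%j%H%M%S").replace(tzinfo=timezone.utc)
    return whole + timedelta(milliseconds=100 * int(token[13]))


def check_internal_time(t: datetime, url: str) -> None:
    """Raise if the file's internal t is a sentinel or disagrees with the name."""
    if t < EARLIEST_VALID_T:
        raise ValueError(f"internal t {t.isoformat()} predates 2017-01-01")
    if abs(t - parse_scan_start(url)) > timedelta(hours=1):
        raise ValueError(f"internal t {t.isoformat()} is > 1 hour from scan start")


stub = "OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc"
print(is_aborted_scan(stub))  # True
```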
Where: scattered through the archive; confirmed on 2023-05-03 (DOY 123).
Examples (both legitimate scans, not aborted):
- OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_e20231231055049_…nc
- OR_ABI-L2-MCMIPF-M4_G16_s20231231605205_e20231231610031_…nc
GOES-16 ABI has multiple operational modes:
- M6 — standard 10-minute flex (full disk every 10 min).
- M4 — continuous 5-minute full disk (used briefly during special-event coverage e.g. major hurricanes).
- M3 — legacy 15-minute schedule, mostly pre-2019.
Lex-sorting full URLs interleaves modes incorrectly: M4 < M6 causes an M4
file at 10:50 to sort before every M6 file in the hour-10 prefix block.
Handling:
- list_day_files sorts by the embedded sYYYYDDDHHMMSST token (a parse_scan_start_to_datetime-friendly stamp), not the full URL. M4 and M6 scans interleave correctly into chronological order.
- We do not filter M4 scans out: they're real radiometry, same schema and codecs as M6 (verified). Only aborted placeholders are filtered (#3).
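The difference between the two orderings is easy to reproduce. The sketch below uses hypothetical, shortened filenames (the eX and cX tokens are placeholders, not real values) and a hypothetical scan_start_key helper; it is meant to show the pitfall, not to mirror list_day_files.

```python
import re

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def scan_start_key(url: str) -> str:
    """Sort key: the fixed-width scan-start token, which orders chronologically."""
    match = _SCAN_START_RE.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


# Hypothetical, shortened names: only the mode and _s tokens matter here.
names = [
    "OR_ABI-L2-MCMIPF-M6_G16_s20231231000205_eX_cX.nc",  # M6, 10:00
    "OR_ABI-L2-MCMIPF-M6_G16_s20231231010205_eX_cX.nc",  # M6, 10:10
    "OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_eX_cX.nc",  # M4, 10:50
]

lexical = sorted(names)                            # M4 file jumps to the front
chronological = sorted(names, key=scan_start_key)  # M4 file stays last

print("-M4_" in lexical[0], "-M4_" in chronological[-1])  # True True
```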
Where: 2024-01-18 (DOY 018), hour 03:50 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/018/03/OR_ABI-L2-MCMIPF-M6_G16_s20240180350207_e20240180359521_c20240180400008.nc
This file appears to have CMI imagery for all 16 channels but no DQF
variables at all (none of DQF_C01..DQF_C16). It originally tripped the
fill_missing_channels error "No DQF_* variable present — cannot infer
ManifestArray template", because at the time the function used an existing
per-prefix variable as the metadata template for fill placeholders.
Likely cause: similar to #2 but more aggressive — a quality-flag generator outage that left every DQF channel out of the product.
Handling: fill_missing_channels synthesizes fill-only ManifestArrays
for every missing channel directly from hard-coded CMI_METADATA /
DQF_METADATA constants. This works regardless of how many channels
are missing — including the zero-channels-present case — because no source
template is required. The time axis stays continuous; readers see the
missing channels as all-NaN for that t.
Confirmed working on the specific file above (downloaded locally as
no_dqf.nc); 10 CMI channels are present and pass through unchanged
while the 6 missing CMI + all 16 DQF channels are synthesized.
Where: 2024-01-26 (DOY 026), hour 19:40 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/026/19/OR_ABI-L2-MCMIPF-M6_G16_s20240261940204_e20240261949523_c20240261952567.nc
All 16 CMI and DQF channels are present, but every per-channel stat for C04 (reflectance) and C14 (brightness temperature) is NaN:
- min/max/mean/std_dev_reflectance_factor_C04 = NaN
- min/max/mean/std_dev_brightness_temperature_C14 = NaN
These are float scalars, so NaN is a legal storage value; it just signals that the algorithm couldn't compute a meaningful statistic for those channels in that scan (no valid pixels passed the QC threshold).
Likely cause: a single-scan outage on those two channels — the imagery is still there, but no pixels survived the algorithm's "good or conditionally usable" filter, so the aggregate stats are undefined.
Handling: make_mcmipf_schema marks every per-channel outlier_pixel_count
and stat variable as nullable=True, so a one-off NaN doesn't fail
validate_raw. (The cleaned schema already had nullable=True for the
stacked stats, see #2's NaN-half-empty layout.)
Where: at least one chunk somewhere in the archive — first instance found:
- variable: CMI_C13
- timestep: t = 2023-10-13T14:25:06.402946 (t-index 25422 on the v1 main-branch ingest of post-2023-04-19)
- presumed source: s3://noaa-goes16/ABI-L2-MCMIPF/2023/286/14/OR_ABI-L2-MCMIPF-M6_G16_s20232861425*_*.nc
- offending chunk: the chunk covering the NYC pixel (iy=740, ix=2751), i.e. chunk coordinate (3, 12) at 226×226 chunking.
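For reference, the chunk coordinate follows from integer division of the pixel index by the chunk size. The helper names below are illustrative only.

```python
def chunk_coord(iy: int, ix: int, chunk_shape=(226, 226)) -> tuple:
    """Chunk-grid coordinate of the chunk containing pixel (iy, ix)."""
    return (iy // chunk_shape[0], ix // chunk_shape[1])


def chunk_pixel_bounds(coord: tuple, chunk_shape=(226, 226)) -> tuple:
    """Half-open (start, stop) pixel ranges covered by a chunk, per axis."""
    cy, cx = coord
    return (
        (cy * chunk_shape[0], (cy + 1) * chunk_shape[0]),
        (cx * chunk_shape[1], (cx + 1) * chunk_shape[1]),
    )


coord = chunk_coord(740, 2751)
print(coord)                      # (3, 12)
print(chunk_pixel_bounds(coord))  # ((678, 904), (2712, 2938))
```

So any read that touches rows 678–903 and columns 2712–2937 of CMI_C13 at that timestep hits the bad chunk, not only the NYC pixel.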
Symptom: reading the chunk raises
Error -3 while decompressing data: incorrect header check from zlib.
The compressed bytes themselves are malformed at the zlib-header level.
Why it matters: not our bug — virtualizarr stores a virtual reference
to the source file's byte range; the icechunk store never decompresses on
write. The corruption lives in NOAA's upstream file on
s3://noaa-goes16. Other chunks in the same variable / same file
decode fine; only this one is bad.
Status: not yet fixed — accepting it for now. The bad chunk
errors at read time but doesn't trip the ingest. If a sampled verify
happens to touch the bad chunk it'll error out of the
_run_batch_verifications retry path, but at the current verification
sampling rate the probability of hitting it is small.
Things we'd want to do later:
- Drive a sweep across the archive to find all such bad chunks (not just the one we stumbled into via a NYC-pixel time-series plot).
- Decide on a policy: silently fall back to fill_value on read, log a per-chunk warning, or maintain a manifest of "known bad chunks" that the verifier and pipeline both consult. None of these fix the upstream bytes, but they let consumers handle the gap cleanly.
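A sweep could start from something like the sketch below: given the raw compressed bytes of each chunk (fetched by byte range from the source file), test only the zlib layer and record which chunks fail. This is an assumption-laden outline: the function names and the key layout are hypothetical, fetching the bytes is left out, and it assumes zlib is the outermost codec so a Shuffle step does not need to be undone to detect header corruption.

```python
import zlib


def zlib_chunk_is_readable(raw: bytes) -> bool:
    """True if the bytes decompress as a zlib stream, False on zlib.error."""
    try:
        zlib.decompress(raw)
    except zlib.error:
        return False
    return True


def find_bad_chunks(chunks: dict) -> list:
    """Return the sorted keys of every chunk whose bytes fail to decompress.

    `chunks` maps a hypothetical (variable, t_index, chunk_coord) key to the
    raw compressed bytes of that chunk.
    """
    return sorted(key for key, raw in chunks.items() if not zlib_chunk_is_readable(raw))


good = zlib.compress(bytes(64), 1)
corrupt = b"\xff\xff" + good[2:]  # clobber the two-byte zlib header

sample = {
    ("CMI_C13", 0, (3, 11)): good,
    ("CMI_C13", 0, (3, 12)): corrupt,
}
print(find_bad_chunks(sample))  # [('CMI_C13', 0, (3, 12))]
```

The resulting list is the natural seed for a "known bad chunks" manifest, whichever read-time policy is chosen.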
Where: 2019-DOY-099 11:10 UTC (s20190991110224) is the first observed
instance, but the boundary is somewhere between 2017 and 2019 — likely when
NOAA added a new DQF flag value (4). Affects every DQF_C{NN}.
Example: validator output comparing 2019 source vs an icechunk store seeded from a 2017 file (FIRST_OPERATIONAL_FILE):
[attr] DQF_C01.attrs['valid_range']: icechunk=[0, 3] source=array([0, 4], dtype=int8)
...
[attr] DQF_C16.attrs['valid_range']: icechunk=[0, 3] source=array([0, 4], dtype=int8)
Unlike issue #10 (which was about NOAA adding a missing attr), here NOAA
widened an existing attr's range. The drift can also happen inside a
single era — 2017-DOY-304 source has [0, 3] and 2019 source has
[0, 4] but both live in our pre-2023-04-19 group. So neither
setdefault nor "harvest from first file" can satisfy every source file
in the era simultaneously.
Why it matters: the valid_range attr documents which int values a
consumer might legitimately encounter. If the icechunk-stored range is
narrower than the actual data range, consumers might silently mishandle
flag values they think are illegal. If it's wider, consumers just see a
slightly more inclusive range than older NOAA files claimed — benign.
Handling:
- mcmipf_pipeline._CANONICAL_VAR_ATTR_OVERRIDES carries the canonical (wider/modern) value: valid_range=[0, 4] for every DQF_C{NN}.
- finalize_per_channel writes this unconditionally (override mode), regardless of what source has. So icechunk always stores [0, 4].
- maintenance/backfill_canonical_attrs.py handles both modes: it applies setdefault attrs (issue #10) AND override attrs (this issue) to existing committed groups.
- The verifier's _compare_attrs special-cases valid_range: instead of equality, it accepts the icechunk value as a superset of source's. So icechunk=[0, 4] matches source=[0, 3] (early-era source, narrower) AND source=[0, 4] (modern source, equal). The reverse (icechunk narrower than source) still fails, because it would misrepresent the data.
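The superset rule amounts to an interval-containment test. A minimal sketch, with a hypothetical function name and no claim to match _compare_attrs internally:

```python
def valid_range_is_superset(stored, source) -> bool:
    """True if the stored [lo, hi] range fully contains the source [lo, hi] range."""
    stored_lo, stored_hi = (int(v) for v in stored)
    source_lo, source_hi = (int(v) for v in source)
    return stored_lo <= source_lo and stored_hi >= source_hi


print(valid_range_is_superset([0, 4], [0, 3]))  # True  (early-era source, narrower)
print(valid_range_is_superset([0, 4], [0, 4]))  # True  (modern source, equal)
print(valid_range_is_superset([0, 3], [0, 4]))  # False (stored range too narrow)
```

Casting through int keeps the comparison independent of whether the source value arrives as a list or as an int8 array.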
Where: 2019-DOY-036 23:45 UTC (s20190362345319) is the first instance we
hit, but the pattern almost certainly affects every variable NOAA improved
between 2017 and present. Specifically observed: x_image_bounds.units and
y_image_bounds.units were absent in early-era files and added later (value:
"rad").
Example: validator output comparing the 2019 file against icechunk:
[attr] x_image_bounds.attrs['units']: present in source only
[attr] y_image_bounds.attrs['units']: present in source only
The icechunk-stored variant lacks the attr because our pipeline harvests
attrs from whichever file was first in the batch that created the schema —
for pre-2023-04-19 that's a 2017 file from the start of the era, which
didn't carry the units attr yet.
Why it matters: pure metadata-richness gap, not a data defect. Consumers reading the icechunk store get less-complete CF metadata than NOAA's current files would provide. Strict verification flags this as a hard "present-in-source-only" failure.
Handling:
- mcmipf_pipeline._CANONICAL_VAR_ATTRS carries a small table of CF attrs we know are missing from early-era files.
- finalize_per_channel applies these via setdefault so source values are preserved when present, and the canonical value is added when absent. Currently: units='rad' on x_image_bounds and y_image_bounds.
- For groups already committed before the canonical-attrs step was added, maintenance/backfill_canonical_attrs.py opens a writable session and patches the missing attrs in place, without re-ingesting data.
- The verifier stays strict on "attr present in source only": that's the signal that surfaces NEW metadata drift we haven't yet added to _CANONICAL_VAR_ATTRS. Add more entries as new instances are found.
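The two attr modes described in this log (setdefault for missing attrs, override for valid_range) differ only in whether a source value wins. The sketch below shows both on plain dicts; the table names and apply_canonical_attrs are hypothetical stand-ins, and only the attr values themselves come from these entries.

```python
# Attrs added only when the source lacks them (source value wins if present).
CANONICAL_SETDEFAULT = {
    "x_image_bounds": {"units": "rad"},
    "y_image_bounds": {"units": "rad"},
}

# Attrs written unconditionally (canonical value always wins).
CANONICAL_OVERRIDE = {
    f"DQF_C{n:02d}": {"valid_range": [0, 4]} for n in range(1, 17)
}


def apply_canonical_attrs(var_name: str, attrs: dict) -> dict:
    """Return a copy of attrs with setdefault and override entries applied."""
    out = dict(attrs)
    for key, value in CANONICAL_SETDEFAULT.get(var_name, {}).items():
        out.setdefault(key, value)
    out.update(CANONICAL_OVERRIDE.get(var_name, {}))
    return out


print(apply_canonical_attrs("x_image_bounds", {}))                 # {'units': 'rad'}
print(apply_canonical_attrs("DQF_C01", {"valid_range": [0, 3]}))   # {'valid_range': [0, 4]}
```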
Where: 2017-DOY-160 11:45 UTC (s20171601145352) and likely the whole
pre-operational / early-pre-op era. The metadata reporting precision was tightened
sometime between 2017 and 2023.
Example: validator output comparing icechunk-stored values (harvested from a 2023 operational reference file) against a 2017 source file:
| channel | icechunk (operational, 2023) | source (early-pre-op, 2017) |
|---|---|---|
| C07 | 3.9 μm | 3.89 μm |
| C08 | 6.185 μm | 6.17 μm |
| C09 | 6.95 μm | 6.93 μm |
| C11 | 8.5 μm | 8.44 μm |
| C13 | 10.35 μm | 10.33 μm |
| C14 | 11.2 μm | 11.19 μm |
| C15 | 12.3 μm | 12.27 μm |
| C16 | 13.3 μm | 13.27 μm |
The instrument's spectral filters didn't change — only NOAA's documentation precision did. The icechunk-stored values match the GOES-R ABI published nominal central wavelengths (3.9, 6.185, 6.95, 8.5, 10.35, 11.2, 12.3, 13.3 μm), which is the canonical reference for downstream consumers.
Why it matters: differences are 0.01–0.06 μm absolute, ≤ 0.7% relative —
nowhere near a real per-channel spectral shift, but well above
np.allclose's default 1e-5 relative tolerance. Without explicit
tolerance the verifier flags every pre-op file as failing on every brightness
channel's wavelength.
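Handling: mcmipf_verifier._CONSOLIDATED_COORDS now carries per-coord
rtol / atol. The wavelength entry uses rtol=0.01, atol=0.1
(1% relative + 0.1 μm absolute). Assuming the np.allclose rule
|a - b| ≤ atol + rtol·|b|, that allows roughly 0.14–0.23 μm of slack across
the affected 3.9–13.3 μm channels: enough to absorb NOAA's
metadata-precision drift (at most 0.06 μm) while still failing on a
wavelength that is off by a few tenths of a micron or more. The band entry
keeps exact equality (it's an integer 1..16).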
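The effect of the two tolerance settings can be checked against the table above without NumPy. The sketch re-implements the documented np.isclose inequality in plain Python; which side the verifier treats as the reference value is an assumption here (the source file), and within_tolerance is a hypothetical name.

```python
# (stored, source) wavelength pairs in micrometres, copied from the table above.
WAVELENGTH_PAIRS = {
    "C07": (3.9, 3.89),
    "C08": (6.185, 6.17),
    "C09": (6.95, 6.93),
    "C11": (8.5, 8.44),
    "C13": (10.35, 10.33),
    "C14": (11.2, 11.19),
    "C15": (12.3, 12.27),
    "C16": (13.3, 13.27),
}


def within_tolerance(stored: float, source: float, rtol: float, atol: float) -> bool:
    """Same inequality as np.isclose: |stored - source| <= atol + rtol * |source|."""
    return abs(stored - source) <= atol + rtol * abs(source)


default_failures = [
    ch for ch, (a, b) in WAVELENGTH_PAIRS.items()
    if not within_tolerance(a, b, rtol=1e-5, atol=1e-8)
]
relaxed_failures = [
    ch for ch, (a, b) in WAVELENGTH_PAIRS.items()
    if not within_tolerance(a, b, rtol=0.01, atol=0.1)
]
print(len(default_failures), len(relaxed_failures))  # 8 0
```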
Where: at least 2017-DOY-086 22:30 UTC (s20170862230380); likely
scattered across early-2017 files generally.
Example: the validate_raw pandera schema rejected the file above with:
SCHEMA → SERIES_CONTAINS_NULLS:
column: 'percent_uncorrectable_GRB_errors'
check: 'nullable'
error: 'non-nullable DataArray contains null values'
percent_uncorrectable_GRB_errors was present in the file as a single
NaN. The same class of anomaly is possible (though not yet observed) for
the other scalar / 1-D telemetry vars: percent_uncorrectable_L0_errors,
geospatial_lat_lon_extent, nominal_satellite_*, the algorithm/dynamic
containers, time_bounds, x_image_bounds, y_image_bounds. We've also
preemptively accommodated the case where any of these is missing
entirely from a source file (not yet seen but plausible).
Why it matters: a single bad-data scan in a batch would fail
validate_raw and abort the entire batch. Even with the schema check
relaxed, an entirely-missing variable would then cause the cross-file
xr.concat in open_virtual_mfdataset to fail — default
data_vars="all" requires every variable to be present in every input.
Handling:
- Both schemas (make_mcmipf_schema, make_cleaned_mcmipf_schema) mark every entry in _EXPECTED_SHARED_VARS as nullable=True, required=False so NaN values pass and absence doesn't fail the per-file check.
- A new pipeline step fill_missing_scalar_metadata synthesizes a NaN-/NaT-/0-filled placeholder for any expected scalar that's absent from the source, so all files in a batch carry the same variable set and the concat can proceed.
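The idea behind the placeholder step can be shown on plain dicts standing in for per-file variable sets. This is only a conceptual sketch: the real step operates on virtual datasets, the PLACEHOLDERS table and fill_missing_scalars are hypothetical, and the choice of NaN for these two variables is an assumption.

```python
import math

# Hypothetical placeholder table; variable names are from this entry,
# the fill values are assumed for illustration.
PLACEHOLDERS = {
    "percent_uncorrectable_GRB_errors": math.nan,
    "percent_uncorrectable_L0_errors": math.nan,
}


def fill_missing_scalars(file_vars: dict) -> dict:
    """Return a copy of file_vars with a placeholder for each absent expected scalar."""
    out = dict(file_vars)
    for name, placeholder in PLACEHOLDERS.items():
        out.setdefault(name, placeholder)
    return out


batch = [
    {"percent_uncorrectable_GRB_errors": 0.0, "percent_uncorrectable_L0_errors": 0.0},
    {"percent_uncorrectable_L0_errors": 0.0},  # GRB variable missing entirely
]
filled = [fill_missing_scalars(f) for f in batch]

# Every file now exposes the same variable names, so a concat can line them up.
print(all(set(f) == set(filled[0]) for f in filled))  # True
```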
Where: 2024-DOY-272 / 273 and scattered other days.
Example: each of the following lists the same underlying object:
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
For some single-digit hours, NOAA's bucket has entries under both the
non-zero-padded prefix (.../272/0/) and the zero-padded prefix
(.../272/00/). obstore.list returns each underlying file twice, once
under each prefix. Scoped check on a single 15-day window (2024-09-25..10-09)
found 105 such duplicates spread across days 271–273 and a few others.
The two URLs share the exact same _s (scan-start) and _c (creation)
tokens — same file, two listing entries. This trips the per-batch
is_unique check in ingest_all_days (monotonic=True, unique=False).
Handling: list_day_files deduplicates URLs by scan-start token after
sorting. The first occurrence (by lex sort of the full URL after the
scan-start-token sort) is kept; subsequent duplicates are dropped.
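A sort-then-dedupe pass matching that description might look like the sketch below. It is not the list_day_files implementation: the function name is hypothetical, and the third URL in the demo is a made-up neighbour added only to show that distinct scans survive. The first two URLs are the ones quoted above, elisions included.

```python
import re

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def sort_and_dedupe(urls: list) -> list:
    """Sort by (scan-start token, full URL) and keep the first URL per token."""
    def key(url: str) -> tuple:
        match = _SCAN_START_RE.search(url)
        if match is None:
            raise ValueError(f"no scan-start token in {url!r}")
        return (match.group(1), url)

    kept, seen = [], set()
    for url in sorted(urls, key=key):
        token = key(url)[0]
        if token not in seen:
            seen.add(token)
            kept.append(url)
    return kept


listing = [
    "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc",
    "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc",
    # Hypothetical next scan, included only to show non-duplicates are kept.
    "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720010207_e..._c....nc",
]
print(len(sort_and_dedupe(listing)))  # 2
```

Under this tie-break the non-zero-padded URL is the one kept, because "/" sorts before "0" in ASCII.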
- The aborted-scan placeholder pattern (#3) is the only one that yields a literal sentinel t value. All other anomalies preserve a believable t.
- The "missing N channels of variable X" pattern (#2, #5) can affect any per-channel variable independently. With hard-coded CMI_METADATA / DQF_METADATA we can synthesize placeholders for zero-template cases too.
- Lex-sorting filenames across mode tokens (#4) is the only ordering pitfall we've found at the filename level. Within a given mode, lex sort of the embedded _s token matches scan-start order.
- The codec change (#1) is the only inconsistency that cannot be papered over inside a single virtual dataset; it forces a hard boundary between pre- and post-2023-04-19 data.