A running log of every real-world anomaly we've hit while ingesting the
s3://noaa-goes16/ABI-L2-MCMIPF/ archive into Icechunk via VirtualiZarr.
Each entry: what the file/day looks like, how it surfaced, how we handle it, and (when known) the most likely upstream cause.
Boundary file: s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc
(2023-04-19 15:00 UTC, DOY 109). This is the most annoying inconsistency
in the archive because it cannot be papered over inside a single virtual
dataset — see "Why it matters" below.
At this scan, NOAA enabled the Shuffle byte-shuffle filter in the
codec pipeline used for the per-channel CMI_C{NN} and DQF_C{NN}
variables. Verified by probing one file on each side of the boundary:
| | Pre-change (≤ 14:50 UTC, 2023-04-19) | Post-change (≥ 15:00 UTC) |
|---|---|---|
| CMI_C* codecs | (BytesCodec(little), Zlib(level=1)) | (BytesCodec(little), Shuffle(elementsize=2), Zlib(level=1)) |
| DQF_C* codecs | (BytesCodec(endian=None), Zlib(level=1)) | (BytesCodec(endian=None), Shuffle(elementsize=1), Zlib(level=1)) |
Everything else (dtype, fill_value, chunk_shape, BytesCodec endianness)
is identical on both sides. Shuffle improves compression ratio on multi-byte
integer data, so this was almost certainly a deliberate algorithm/encoding
update on NOAA's end.
Why it matters: VirtualiZarr produces a ManifestArray per source-file
chunk, and concatenating ManifestArrays into a single virtual zarr array
requires that every contributing chunk share an identical codec pipeline.
The codec change therefore prevents creating one continuous virtual zarr
array across the entire GOES-16 archive. Any single virtual dataset must
live entirely within one codec regime.
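A minimal sketch of that constraint, written without calling VirtualiZarr: each file's codec pipeline is represented as a tuple of strings copied from the table above, and a chronologically ordered file list is split into consecutive runs that share one pipeline. The `split_by_codec_regime` helper and the scan labels are hypothetical, for illustration only.

```python
from itertools import groupby

# Codec pipelines as reported for CMI_C* on each side of the boundary.
PRE = ("BytesCodec(little)", "Zlib(level=1)")
POST = ("BytesCodec(little)", "Shuffle(elementsize=2)", "Zlib(level=1)")


def split_by_codec_regime(files):
    """Split (label, codecs) pairs into consecutive runs sharing one codec pipeline."""
    return [
        (codecs, [label for label, _ in run])
        for codecs, run in groupby(files, key=lambda item: item[1])
    ]


# Hypothetical scan labels straddling the 2023-04-19 15:00 UTC boundary.
files = [("14:40", PRE), ("14:50", PRE), ("15:00", POST), ("15:10", POST)]
regimes = split_by_codec_regime(files)
```

Each run in `regimes` could back one virtual array; a list that yields more than one run cannot be concatenated into a single array.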
Handling:
- mcmipf_archive.CODEC_CHANGE_MARKER names the first post-boundary file.
- iter_archive_by_day / iter_archive default to start=CODEC_CHANGE_MARKER, so iterating the archive yields only the new-codec era.
- EXPECTED_CODECS_BY_PREFIX (in mcmipf_pipeline) codifies the post-boundary codec pipeline. The validator (validate_raw's _check_codecs) flags pre-boundary files as a codec mismatch, so the boundary is enforced at ingest time as well.
- The icechunk group used for ingest is named ABI-L2-MCMIPF/post-2023-04-19 so a sibling group could be used later for the older-codec data without collision.
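One way the start-marker default could work is sketched below, assuming the marker is compared via the scan-start token embedded in each filename. The names `CODEC_CHANGE_SCAN_START`, `scan_start_token`, and `is_post_boundary` are hypothetical and are not the actual mcmipf_archive API.

```python
import re

# Scan-start token of the boundary file named in this entry.
CODEC_CHANGE_SCAN_START = "20231091500205"

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def scan_start_token(url):
    """Return the 14-digit sYYYYDDDHHMMSST token from an MCMIPF filename."""
    match = _SCAN_START_RE.search(url)
    if match is None:
        raise ValueError(f"no scan-start token in {url!r}")
    return match.group(1)


def is_post_boundary(url, start=CODEC_CHANGE_SCAN_START):
    """True when the file's scan start is at or after the codec change."""
    # Tokens are fixed-width digit strings, so string comparison is chronological.
    return scan_start_token(url) >= start


boundary_file = (
    "s3://noaa-goes16/ABI-L2-MCMIPF/2023/109/15/"
    "OR_ABI-L2-MCMIPF-M6_G16_s20231091500205_e20231091509514_c20231091510007.nc"
)
print(is_post_boundary(boundary_file))
```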
Where: 2023-06-14 (DOY 165), hour 18:20 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2023/165/18/OR_ABI-L2-MCMIPF-M6_G16_s20231651820208_e20231651829527_c20231651832574.nc
This single 18:20 file is missing all C04 variables (CMI_C04, DQF_C04,
and every C04 stat / band_id / band_wavelength). The other 143 files of
the day have the full 16-channel set.
Likely cause: a transient channel-4 outage during MCMIPF product generation; NOAA still emitted the file but omitted C04.
Handling: fill_missing_channels back-fills the missing channel with a
fill-only ManifestArray (CMI/DQF) plus NaN scalars (stats), so the file
still produces a complete cleaned dataset with C04 readable as NaN.
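The detection step that precedes the back-fill can be sketched as follows. This is not the real fill_missing_channels; `missing_channels` is a hypothetical helper that only compares the variable names present in a file against the expected C01..C16 set.

```python
CHANNELS = [f"C{n:02d}" for n in range(1, 17)]  # C01..C16


def missing_channels(present_vars, prefix):
    """Return channel ids whose '<prefix>_<channel>' variable is absent."""
    present = set(present_vars)
    return [ch for ch in CHANNELS if f"{prefix}_{ch}" not in present]


# Simulate the 2023-06-14 18:20 file: every CMI/DQF channel except C04.
present = [f"{p}_{ch}" for p in ("CMI", "DQF") for ch in CHANNELS if ch != "C04"]
print(missing_channels(present, "CMI"), missing_channels(present, "DQF"))
```

The same helper returns all 16 channels when a prefix is entirely absent, which is the case a template-free back-fill has to handle.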
Where: scattered through the archive — confirmed on 2023-05-03 (DOY 123).
Example: s3://noaa-goes16/ABI-L2-MCMIPF/2023/123/10/OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc
The _s and _e tokens in the filename are identical (same to-the-second
scan start and end), and the file's internal t coordinate sits at a
sentinel just before the J2000 epoch (~2000-01-01T11:43:21, which is 999 s
before 2000-01-01T12:00:00). The image content is otherwise present
but useless; this is clearly a placeholder for an aborted scan.
Likely cause: ABI scheduler started a scan, aborted, and still wrote a stub product record.
Handling:
- Filename-level: mcmipf_archive.list_day_files drops any file where the _s token equals the _e token (_is_aborted_scan).
- Pipeline-level: validate_raw raises if the file's internal t is before 2017-01-01 (catches any sentinel that slips past the filename-level filter).
- Pipeline-level: validate_raw also raises if abs(t - filename_scan_start) > 1 hour (catches mis-stamped files).
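The three checks above can be sketched in plain Python. The function names here (`parse_scan_token`, `is_aborted_scan`, `check_time`) are hypothetical stand-ins rather than the actual mcmipf_archive / validate_raw code, and the sketch assumes the trailing digit of each token is tenths of a second.

```python
import re
from datetime import datetime, timedelta, timezone

_TOKENS_RE = re.compile(r"_s(\d{14})_e(\d{14})_")


def parse_scan_token(token):
    """Parse YYYYDDDHHMMSST (trailing digit = tenths of a second) as UTC."""
    base = datetime.strptime(token[:13], "%Y%j%H%M%S").replace(tzinfo=timezone.utc)
    return base + timedelta(milliseconds=100 * int(token[13]))


def is_aborted_scan(url):
    """True when the _s and _e tokens are identical."""
    start, end = _TOKENS_RE.search(url).groups()
    return start == end


def check_time(t, url, earliest=datetime(2017, 1, 1, tzinfo=timezone.utc)):
    """Raise if t is a sentinel or is over an hour from the filename's scan start."""
    if t < earliest:
        raise ValueError(f"t={t.isoformat()} predates {earliest.date()}")
    scan_start = parse_scan_token(_TOKENS_RE.search(url).group(1))
    if abs(t - scan_start) > timedelta(hours=1):
        raise ValueError(f"t={t.isoformat()} is over 1 h from {scan_start.isoformat()}")


stub = "OR_ABI-L2-MCMIPF-M4_G16_s20231231055224_e20231231055224_c20231231102341.nc"
print(is_aborted_scan(stub))
```

Keeping the filename filter and the two time checks separate means a stub whose tokens happen to differ is still caught by the sentinel check.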
Where: scattered through the archive — confirmed on 2023-05-03 (DOY 123). Examples (both legitimate scans, not aborted):
- OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_e20231231055049_…nc
- OR_ABI-L2-MCMIPF-M4_G16_s20231231605205_e20231231610031_…nc
GOES-16 ABI has multiple operational modes:
- M6 — standard 10-minute flex (full disk every 10 min).
- M4 — continuous 5-minute full disk (used briefly during special-event coverage e.g. major hurricanes).
- M3 — legacy 15-minute schedule, mostly pre-2019.
Lex-sorting full URLs interleaves modes incorrectly: M4 < M6 causes an M4
file at 10:50 to sort before every M6 file in the hour-10 prefix block.
Handling:
- list_day_files sorts by the embedded sYYYYDDDHHMMSST token (a parse_scan_start_to_datetime-friendly stamp), not the full URL. M4 and M6 scans interleave correctly into chronological order.
- We do not filter M4 scans out: they're real radiometry, same schema and codecs as M6 (verified). Only aborted placeholders are filtered (#3).
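A small demonstration of the ordering pitfall and the token-based fix. The M4 name reuses the scan-start token from the first example above; the two M6 names and the shortened `_eX_cX` suffixes are hypothetical placeholders, and `scan_start_key` is an illustrative helper rather than the actual list_day_files code.

```python
import re

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def scan_start_key(url):
    """Sort key: the fixed-width scan-start token, which orders chronologically."""
    return _SCAN_START_RE.search(url).group(1)


# Three scans in hour 10 of DOY 123: M6 at 10:00 and 10:40, M4 at 10:50.
names = [
    "OR_ABI-L2-MCMIPF-M6_G16_s20231231000205_eX_cX.nc",
    "OR_ABI-L2-MCMIPF-M4_G16_s20231231050229_eX_cX.nc",
    "OR_ABI-L2-MCMIPF-M6_G16_s20231231040205_eX_cX.nc",
]

lexical = sorted(names)                            # M4 file jumps to the front
chronological = sorted(names, key=scan_start_key)  # M4 file lands last, at 10:50
```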
Where: 2024-01-18 (DOY 018), hour 03:50 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/018/03/OR_ABI-L2-MCMIPF-M6_G16_s20240180350207_e20240180359521_c20240180400008.nc
This file appears to have CMI imagery for all 16 channels but no DQF
variables at all (none of DQF_C01..DQF_C16). It originally tripped
fill_missing_channels' "No DQF_* variable present — cannot infer
ManifestArray template" error, because the function at that point used an
existing per-prefix variable as the metadata template for fill placeholders.
Likely cause: similar to #2 but more aggressive — a quality-flag generator outage that left every DQF channel out of the product.
Handling: fill_missing_channels synthesizes fill-only ManifestArrays
for every missing channel directly from hard-coded CMI_METADATA /
DQF_METADATA constants. This works regardless of how many channels
are missing — including the zero-channels-present case — because no source
template is required. The time axis stays continuous; readers see the
missing channels as all-NaN for that t.
Confirmed working on the specific file above (downloaded locally as
no_dqf.nc); 10 CMI channels are present and pass through unchanged
while the 6 missing CMI + all 16 DQF channels are synthesized.
Where: 2024-01-26 (DOY 026), hour 19:40 UTC
File: s3://noaa-goes16/ABI-L2-MCMIPF/2024/026/19/OR_ABI-L2-MCMIPF-M6_G16_s20240261940204_e20240261949523_c20240261952567.nc
All 16 CMI and DQF channels are present, but every per-channel stat for C04 (reflectance) and C14 (brightness temperature) is NaN:
- min/max/mean/std_dev_reflectance_factor_C04 = NaN
- min/max/mean/std_dev_brightness_temperature_C14 = NaN
These are float scalars, so NaN is a legal storage value; it just signals that the algorithm couldn't compute a meaningful statistic for those channels in that scan (no valid pixels passed the QC threshold).
Likely cause: a single-scan outage on those two channels — the imagery is still there, but no pixels survived the algorithm's "good or conditionally usable" filter, so the aggregate stats are undefined.
Handling: make_mcmipf_schema marks every per-channel outlier_pixel_count
and stat variable as nullable=True, so a one-off NaN doesn't fail
validate_raw. (The cleaned schema already had nullable=True for the
stacked stats, see #2's NaN-half-empty layout.)
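The effect of the nullable flag can be illustrated with a stand-alone check. This is a sketch of the idea only: `check_stat` is hypothetical and does not reproduce whatever schema library make_mcmipf_schema is built on.

```python
import math


def check_stat(name, value, nullable=True):
    """Accept a finite float, or NaN only when the field is nullable."""
    if math.isnan(value):
        if not nullable:
            raise ValueError(f"{name} is NaN but the field is not nullable")
        return
    if math.isinf(value):
        raise ValueError(f"{name} is infinite")


# With nullable=True, the NaN stat from the 2024-01-26 19:40 file passes.
check_stat("mean_reflectance_factor_C04", float("nan"), nullable=True)
```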
Where: 2024-09-28 / 2024-09-29 (DOY 272 / 273) and scattered other days
Example: each of the following lists the same underlying object:
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc
For some single-digit hours, NOAA's bucket has entries under both the
non-zero-padded prefix (.../272/0/) and the zero-padded prefix
(.../272/00/). obstore.list returns each underlying file twice, once
under each prefix. Scoped check on a single 15-day window (2024-09-25..10-09)
found 105 such duplicates spread across days 271–273 and a few others.
The two URLs share the exact same _s (scan-start) and _c (creation)
tokens — same file, two listing entries. This trips the per-batch
is_unique check in ingest_all_days (monotonic=True, unique=False).
Handling: list_day_files deduplicates URLs by scan-start token after
sorting. The first occurrence (by lex sort of the full URL after the
scan-start-token sort) is kept; subsequent duplicates are dropped.
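A minimal sketch of the sort-then-deduplicate step, using the two example URLs exactly as listed above (including their elided `_e...` tokens). `sort_and_dedupe` is a hypothetical helper, not the actual list_day_files implementation.

```python
import re

_SCAN_START_RE = re.compile(r"_s(\d{14})_")


def sort_and_dedupe(urls):
    """Sort by (scan-start token, full URL) and keep the first URL per token."""
    ordered = sorted(urls, key=lambda u: (_SCAN_START_RE.search(u).group(1), u))
    seen = set()
    unique = []
    for url in ordered:
        token = _SCAN_START_RE.search(url).group(1)
        if token not in seen:
            seen.add(token)
            unique.append(url)
    return unique


padded = "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/00/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc"
unpadded = "s3://noaa-goes16/ABI-L2-MCMIPF/2024/272/0/OR_ABI-L2-MCMIPF-M6_G16_s20242720000207_e..._c20242720010000.nc"
kept = sort_and_dedupe([padded, unpadded])
```

With a plain lexicographic tie-break, the non-zero-padded `.../272/0/` URL is the one kept, because `/` precedes `0` in ASCII.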
- The aborted-scan placeholder pattern (#3) is the only one that yields a literal sentinel t value. All other anomalies preserve a believable t.
- The "missing N channels of variable X" pattern (#2, #5) can affect any per-channel variable independently. With hard-coded CMI_METADATA / DQF_METADATA we can synthesize placeholders for zero-template cases too.
- Lex-sorting filenames across mode tokens (#4) is the only ordering pitfall we've found at the filename level. Within a given mode, lex sort of the embedded _s token matches scan-start order.
- The codec change (#1) is the only inconsistency that cannot be papered over inside a single virtual dataset; it forces a hard boundary between pre- and post-2023-04-19 data.