- Prod Pinot servers rolled at ~04:40–04:56 UTC and came back without hundreds of realtime segments (`ServerSegmentMissing`), driving ~1k query timeouts in brokers and the web error-rate alert.
- Logs show every server stuck in a loop failing to delete `doc_metadata_REALTIME` consumer folders, so segment reloads from deep store never finish and brokers only receive error blocks.
- EBS volumes underneath the StatefulSet are healthy (`aws ec2 describe-volume-status`), so the regression stems from the rollout (either startup cleanup or filesystem permissions) rather than the storage layer.
- Approximately 481 `doc_metadata_REALTIME` segments are unavailable, so queries reading that table time out or throw NPEs on the broker, impacting prod ingestion and the web surface.
- Ingest was manually paused, but serving remains degraded until the consumer directories are cleaned and segments reloaded.
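
A minimal sketch of how an unavailable-segment count like the one above could be derived, assuming the table's ideal state and external view have already been fetched (for example from the Pinot controller) as maps of segment name to per-server state. The function name, the sample segment and server names, and the choice to treat `ONLINE` and `CONSUMING` as serving states are illustrative assumptions, not the tooling used during this incident.

```python
from typing import Dict, List

# Maps segment name -> {server instance -> state}.
SegmentMap = Dict[str, Dict[str, str]]

# Assumption: a replica in either of these states can answer queries.
SERVING_STATES = {"ONLINE", "CONSUMING"}


def unavailable_segments(ideal_state: SegmentMap, external_view: SegmentMap) -> List[str]:
    """Return segments the ideal state expects to be served but that no
    server currently reports in a serving state."""
    missing = []
    for segment, expected_replicas in ideal_state.items():
        # Skip segments that are not supposed to be served at all.
        if not any(s in SERVING_STATES for s in expected_replicas.values()):
            continue
        actual_replicas = external_view.get(segment, {})
        if not any(s in SERVING_STATES for s in actual_replicas.values()):
            missing.append(segment)
    return sorted(missing)


# Hypothetical sample data; real maps would come from the cluster.
ideal = {
    "doc_metadata__0__10__sample": {"Server_a": "ONLINE", "Server_b": "ONLINE"},
    "doc_metadata__1__11__sample": {"Server_a": "ONLINE", "Server_b": "ONLINE"},
    "doc_metadata__2__12__sample": {"Server_a": "CONSUMING"},
}
view = {
    "doc_metadata__0__10__sample": {"Server_a": "ERROR", "Server_b": "ONLINE"},
    "doc_metadata__1__11__sample": {"Server_a": "ERROR", "Server_b": "OFFLINE"},
    # Segment 2 is absent from the external view entirely.
}

missing = unavailable_segments(ideal, view)
print(f"{len(missing)} unavailable segment(s): {missing}")
```

A segment with at least one serving replica is counted as available here, so this measures query-visible loss rather than lost redundancy; counting degraded replicas would need a per-replica comparison instead.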