Tal Perry labeleddata
labeleddata / pinot_issue_body.md
Created October 17, 2025 09:29
Pinot prod missing segments issue writeup

Summary

  • Prod Pinot servers were rolled at ~04:40–04:56 UTC and came back without hundreds of realtime segments (ServerSegmentMissing), causing ~1k query timeouts on the brokers and triggering the web error-rate alert.
  • Logs show every server stuck in a loop failing to delete doc_metadata_REALTIME consumer folders, so segment reloads from deep store never finish and brokers only receive error blocks.
  • EBS volumes underneath the StatefulSet are healthy (aws ec2 describe-volume-status), so the regression stems from the rollout (either startup cleanup or filesystem permissions) rather than the storage layer.
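
The volume health check mentioned above can be repeated across every volume in the StatefulSet with a small script. This is a minimal sketch, assuming the output of `aws ec2 describe-volume-status` has been saved as JSON. The helper name `unhealthy_volumes` and the sample payload (including its volume IDs and status values) are illustrative, not taken from the incident data.

```python
import json


def unhealthy_volumes(status_doc: dict) -> list[str]:
    """Return IDs of volumes that are not reported 'ok' or that carry events/actions."""
    flagged = []
    for entry in status_doc.get("VolumeStatuses", []):
        status = entry.get("VolumeStatus", {}).get("Status")
        # A volume counts as healthy only if its overall status is "ok" and
        # AWS lists no scheduled events or required actions for it.
        if status != "ok" or entry.get("Events") or entry.get("Actions"):
            flagged.append(entry.get("VolumeId", "<unknown>"))
    return flagged


# Hypothetical payload in the shape of the describe-volume-status response.
sample = json.loads("""
{
  "VolumeStatuses": [
    {"Actions": [], "Events": [], "VolumeId": "vol-example-a",
     "VolumeStatus": {"Status": "ok", "Details": []}},
    {"Actions": [], "Events": [], "VolumeId": "vol-example-b",
     "VolumeStatus": {"Status": "impaired", "Details": []}}
  ]
}
""")

print(unhealthy_volumes(sample))
```

An empty list for the real volumes would be consistent with ruling out the storage layer. Any flagged ID would need a closer look before attributing the regression to the rollout.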

Impact

  • Approximately 481 doc_metadata_REALTIME segments are unavailable, so queries reading that table time out or throw NPEs on the broker, impacting prod ingestion and the web surface.
  • Ingest was manually paused, but serving remains degraded until the consumer directories are cleaned and segments reloaded.

Timeline

@labeleddata
labeleddata / aws_volume_status_vol-09e0e99a564a8d00f.json
Created October 17, 2025 09:28
Pinot prod restart evidence 2025-10-17
{
"VolumeStatuses": [
{
"Actions": [],
"AvailabilityZone": "us-west-2a",
"Events": [],
"VolumeId": "vol-09e0e99a564a8d00f",
"VolumeStatus": {
"Details": [
{