Tal Perry labeleddata

## pinot_issue_body.md

      
Created: October 17, 2025 09:29

Pinot prod missing segments issue writeup

Summary


Prod Pinot servers were rolled between ~04:40 and 04:56 UTC and came back without hundreds of realtime segments (ServerSegmentMissing), causing ~1k query timeouts on the brokers and triggering the web error-rate alert.
Logs show every server stuck in a loop failing to delete doc_metadata_REALTIME consumer folders, so segment reloads from deep store never finish and brokers only receive error blocks.
EBS volumes underneath the StatefulSet are healthy (aws ec2 describe-volume-status), so the regression stems from the rollout (either startup cleanup or filesystem permissions) rather than the storage layer.
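
The volume health check above can be scripted so it is repeatable across every volume in the StatefulSet. The following is a minimal sketch, assuming the JSON comes from `aws ec2 describe-volume-status` and has the usual `VolumeStatuses[].VolumeStatus.Status` shape. The function name `unhealthy_volumes`, the sample payload, and the volume IDs in it are hypothetical and are not output captured during this incident.

```python
import json


def unhealthy_volumes(describe_output: str) -> list[str]:
    """Return IDs of volumes whose status is not "ok" or that have pending events.

    `describe_output` is the raw JSON text printed by
    `aws ec2 describe-volume-status` (assumed shape, see lead-in).
    """
    payload = json.loads(describe_output)
    flagged = []
    for entry in payload.get("VolumeStatuses", []):
        status = entry.get("VolumeStatus", {}).get("Status")
        has_events = bool(entry.get("Events"))
        if status != "ok" or has_events:
            flagged.append(entry.get("VolumeId", "<unknown>"))
    return flagged


# Hypothetical sample payload for illustration only (not real incident data).
SAMPLE = json.dumps(
    {
        "VolumeStatuses": [
            {
                "VolumeId": "vol-aaaa",
                "Events": [],
                "VolumeStatus": {"Status": "ok", "Details": []},
            },
            {
                "VolumeId": "vol-bbbb",
                "Events": [],
                "VolumeStatus": {"Status": "impaired", "Details": []},
            },
        ]
    }
)

if __name__ == "__main__":
    print(unhealthy_volumes(SAMPLE))
```

An empty result for every volume backing the StatefulSet would support the conclusion that the storage layer is not at fault. It does not rule out filesystem-level problems such as directory permissions.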

Impact


Approximately 481 doc_metadata_REALTIME segments are unavailable, so queries reading that table time out or throw NPEs on the broker, impacting prod ingestion and the web surface.
Ingestion was manually paused, but serving remains degraded until the consumer directories are cleaned and the segments are reloaded.

Timeline


## aws_volume_status_vol-09e0e99a564a8d00f.json
{
    "VolumeStatuses": [
        {
            "Actions": [],
            "AvailabilityZone": "us-west-2a",
            "Events": [],
            "VolumeId": "vol-09e0e99a564a8d00f",
            "VolumeStatus": {
                "Details": [
                    {