@unnitallman
Last active April 2, 2025 09:32
Incident Report

Summary

NeetoCal’s PostgreSQL addon experienced a performance degradation after exceeding its disk I/O limits. The root cause was unusually heavy auto-vacuum activity combined with WAL (Write-Ahead Log) file buildup. The immediate impact was mitigated, and several follow-up actions are in place to prevent recurrence.


Timeline & Analysis

  1. Initial Symptoms
    NeetoCal, hosted on NeetoDeploy, began showing performance issues. We received Rack::Timeout errors, and analysis via New Relic pointed to slow PostgreSQL responses. CPU and memory usage were within normal limits.

  2. Recent Deployment
Earlier in the day, we had deployed a patch to the NeetoCal PostgreSQL addon aimed at fixing memory leaks during PgBackRest backups. Although the patch had been running successfully in NeetoChat and NeetoForm for two days, we initially suspected it might be the cause of the slowdown and rolled it back after placing NeetoCal into maintenance mode.

  3. Unexpected Recovery Delay
After the rollback, the database did not restart as expected; it entered recovery mode. While this can happen when WAL files are being replayed, the delay was unusual. After waiting five minutes, the addon still refused to start.

  4. Backup Restore Consideration
    A new addon was provisioned using the latest backup. While it also started slowly, it became ready for use. At this point, we faced a trade-off: wait indefinitely for the live addon to recover, or switch to the backup with potential data loss (a few minutes’ worth). As we leaned toward the latter, the live addon eventually came online.

  5. Persistent Slowness & Investigation
    Despite recovery, performance remained poor. Further investigation revealed that disk usage had surged, particularly in the PGDATA/pg_wal directory, where WAL files accumulate. These files are normally archived after a checkpoint, but archiving was not occurring, leading to disk bloat.
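When WAL files pile up in pg_wal, a quick first check is whether the archiver is keeping up. The queries below are a sketch of this kind of check, not a transcript of the exact commands we ran; `pg_ls_waldir()` requires PostgreSQL 10+ and superuser or `pg_monitor` privileges:

```sql
-- Count and total size of WAL segments still sitting in pg_wal
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();

-- Is archiving succeeding? A growing failed_count or a stale
-- last_archived_time indicates the archiver has stalled.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal
FROM pg_stat_archiver;
```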

  6. Manual Intervention
    To safely reduce the WAL size, we manually issued the PostgreSQL CHECKPOINT command. This triggered WAL archiving and reduced disk usage. Once the WAL cleanup cycle completed, database performance returned to normal and NeetoCal was brought back online.
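The intervention can be sketched as follows. `CHECKPOINT` forces an immediate checkpoint, making completed WAL segments eligible for archiving and recycling; it requires superuser privileges and briefly increases I/O, so it should be issued deliberately:

```sql
-- Force an immediate checkpoint so completed WAL segments
-- can be archived and recycled.
CHECKPOINT;

-- Re-check pg_wal afterwards to confirm the backlog is draining
-- (PostgreSQL 10+, superuser or pg_monitor)
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```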

  7. Root Cause Discovery
    The underlying issue persisted: WAL files continued to grow rapidly, and PostgreSQL performance remained at risk. We analyzed logs and identified extensive auto-vacuum activity overlapping with the initial slowdown. Specifically, the google_calendar_events table had an auto-vacuum process that ran for 55 minutes—highly abnormal. Other tables also exhibited unusually long auto-vacuum durations.
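Long auto-vacuum runs like the one on google_calendar_events can be surfaced two ways: from the cumulative statistics views, and by logging every auto-vacuum run with its duration. A sketch (the table name comes from this report; the `LIMIT` is illustrative):

```sql
-- Per-table auto-vacuum history and dead-tuple pressure
SELECT relname,
       last_autovacuum,
       autovacuum_count,
       n_dead_tup,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Log every auto-vacuum run with its duration (0 = log all runs);
-- a configuration reload is enough for this setting to take effect.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```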

  8. WAL Bloat from Auto-Vacuum
    Auto-vacuum generates WAL files to maintain transactional integrity. In the case of google_calendar_events, one auto-vacuum run generated ~1.42 GB of WAL data. Frequent writes to this and similar tables (by background jobs every 30 minutes) and the presence of multiple indexes contributed to high disk I/O.
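One way to quantify WAL generation (such as the ~1.42 GB attributed to a single auto-vacuum of google_calendar_events) is to snapshot the current WAL insert position before and after the interval of interest and take the byte difference. A sketch using built-in functions (PostgreSQL 10+; the LSN literal below is a placeholder, not a value from this incident):

```sql
-- Snapshot the WAL insert position before the interval of interest
SELECT pg_current_wal_lsn() AS lsn_before;

-- ... let the auto-vacuum (or any workload) run ...

-- Bytes of WAL written since the saved position
-- (replace '0/16B3748' with the saved lsn_before value)
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/16B3748')
       ) AS wal_generated;
```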

  9. Disk I/O Bottleneck Identified
    Disk I/O metrics confirmed that we were consistently hitting the maximum throughput allowed by our GP2 disk type. This bottleneck explained both the PostgreSQL slowness and the WAL backlog.


Action Items

  1. Table & Index Optimization

    • The NeetoCal team has been notified about inefficient and unused indexes on heavily written tables like google_calendar_events.
    • The team will also review write patterns to these tables to reduce unnecessary load as usage grows.
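Unused indexes on write-heavy tables are pure overhead: every INSERT and UPDATE still maintains them and generates WAL for them. A sketch of the kind of query used to find candidates (idx_scan counts are cumulative since the last statistics reset, and indexes backing primary-key or unique constraints will show zero scans too, so verify each candidate before dropping):

```sql
-- Indexes that have never been used by a scan, largest first
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```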
  2. Auto-Vacuum Tuning

    • Evaluate options to schedule auto-vacuum during low-traffic periods, either globally or at the table level.
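PostgreSQL has no built-in way to schedule auto-vacuum by time of day; what can be tuned, globally or per table, is how early it triggers and how aggressively it runs. A sketch of per-table tuning for the table named in this report (the values are illustrative, not recommendations):

```sql
-- Trigger vacuum earlier (at 2% dead tuples instead of the 20%
-- default) so each run is smaller and generates less WAL at once,
-- and add a cost delay (in ms) so it competes less with
-- foreground traffic.
ALTER TABLE google_calendar_events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_cost_delay   = 10
);
```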
  3. Disk Upgrade

    • NeetoCal’s PG addon is currently on a GP2 disk, which has limited IOPS. All new addons are already using GP3, which supports higher IOPS and can be scaled without downtime.
    • We plan to migrate NeetoCal to GP3 this Sunday, which will require a ~10-minute downtime.
  4. Monitoring

    • Until the disk upgrade and optimizations are in place, we will manually monitor NeetoCal’s PostgreSQL performance and intervene as needed.