NeetoCal’s PostgreSQL addon experienced performance degradation after exceeding its disk I/O limits. The root cause was unusually heavy auto-vacuum activity combined with WAL (Write-Ahead Logging) file buildup. The immediate issue was mitigated, and several follow-up actions are in place to prevent recurrence.
Initial Symptoms
NeetoCal, hosted on NeetoDeploy, began showing performance issues. We received `Rack::Timeout` errors, and analysis via New Relic pointed to slow PostgreSQL responses. CPU and memory usage were within normal limits.
Recent Deployment
Earlier in the day, we had deployed a patch to the NeetoCal PostgreSQL addon aimed at fixing memory leaks during pgBackRest backups. Although the patch had been running successfully in NeetoChat and NeetoForm for two days, we initially suspected it as the cause of the slowdown and rolled it back after placing NeetoCal into maintenance mode.
Unexpected Recovery Delay
After the rollback, the database did not restart as expected; it entered recovery mode. While this can happen when WAL files are being replayed, the delay was unusual. Even after waiting five minutes, the addon still refused to start.
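While a server is replaying WAL, PostgreSQL reports itself as in recovery. Once the server accepts connections, the state can be confirmed with a single query (during crash recovery it may not accept connections at all, in which case the server log is the only signal):

```sql
-- Returns true while the server is still replaying WAL.
SELECT pg_is_in_recovery();
```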
Backup Restore Consideration
A new addon was provisioned from the latest backup. Although it too started slowly, it eventually became ready for use. At this point we faced a trade-off: wait indefinitely for the live addon to recover, or switch to the backup and accept potential data loss (a few minutes’ worth). As we leaned toward the latter, the live addon finally came online.
Persistent Slowness & Investigation
Despite the recovery, performance remained poor. Further investigation revealed that disk usage had surged, particularly in the `PGDATA/pg_wal` directory, where WAL files accumulate. These files are normally archived and recycled after a checkpoint, but archiving was not happening, leading to disk bloat.
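Two queries are useful for spotting this kind of WAL backlog (both require superuser or `pg_monitor` privileges; shown here as a general sketch, not the exact commands we ran):

```sql
-- Total size of WAL segments currently sitting in pg_wal.
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();

-- Is the archiver keeping up? A rising failed_count or a stale
-- last_archived_wal indicates archiving has stalled.
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_failed_wal
FROM pg_stat_archiver;
```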
Manual Intervention
To safely reduce the WAL size, we manually issued the PostgreSQL `CHECKPOINT` command. This triggered WAL archiving and reduced disk usage. Once the WAL cleanup cycle completed, database performance returned to normal and NeetoCal was brought back online.
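The intervention itself is a single statement; re-checking `pg_wal` afterwards confirms whether old segments were recycled (a sketch, assuming superuser access):

```sql
-- Force an immediate checkpoint so completed WAL segments
-- can be archived and recycled.
CHECKPOINT;

-- Verify that pg_wal has shrunk.
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```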
Root Cause Discovery
The underlying issue persisted: WAL files continued to grow rapidly, and PostgreSQL performance remained at risk. We analyzed logs and identified extensive auto-vacuum activity overlapping with the initial slowdown. Specifically, the `google_calendar_events` table had an auto-vacuum process that ran for 55 minutes, which is highly abnormal. Other tables also exhibited unusually long auto-vacuum durations.
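Auto-vacuum history can also be inspected without log access via `pg_stat_user_tables` (long-running runs additionally appear in the logs when `log_autovacuum_min_duration` is set). A diagnostic query along these lines:

```sql
-- When was each table last auto-vacuumed, and how much
-- dead-tuple churn does it carry?
SELECT relname,
       last_autovacuum,
       autovacuum_count,
       n_dead_tup,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST
LIMIT 20;
```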
WAL Bloat from Auto-Vacuum
Auto-vacuum generates WAL to maintain transactional integrity. In the case of `google_calendar_events`, one auto-vacuum run generated ~1.42 GB of WAL data. Frequent writes to this and similar tables (by background jobs every 30 minutes) and the presence of multiple indexes contributed to high disk I/O.
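One way to quantify WAL generated over an interval is to diff LSNs captured before and after it. The `'0/0'` below is a placeholder for whatever `pg_current_wal_lsn()` returned at the start of the window, not a real value from this incident:

```sql
-- Capture the current WAL position before the window of interest.
SELECT pg_current_wal_lsn();

-- Later, compute the bytes of WAL written since that position
-- ('0/0' is a placeholder for the LSN captured above).
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'::pg_lsn)
       ) AS wal_generated;
```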
Disk I/O Bottleneck Identified
Disk I/O metrics confirmed that we were consistently hitting the maximum throughput allowed by our GP2 disk type. This bottleneck explained both the PostgreSQL slowness and the WAL backlog.
Table & Index Optimization
- The NeetoCal team has been notified about inefficient and unused indexes on heavily written tables like `google_calendar_events`.
- The team will also review write patterns to these tables to reduce unnecessary load as usage grows.
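Candidate indexes for removal can be surfaced from statistics. A common starting point is below; treat the results with care, since stats reset on restart and some rarely scanned indexes back constraints:

```sql
-- Indexes never scanned since statistics were last reset,
-- excluding unique/constraint-backing indexes.
SELECT s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```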
Auto-Vacuum Tuning
- Evaluate options to schedule auto-vacuum during low-traffic periods, either globally or at the table level.
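PostgreSQL has no built-in scheduler for auto-vacuum, but per-table storage parameters can make it trigger earlier, so each run processes fewer dead tuples and generates less WAL. The values below are illustrative, not tuned recommendations:

```sql
-- Vacuum google_calendar_events after ~5% of rows are dead
-- instead of the default 20%, so each run does less work.
ALTER TABLE google_calendar_events
  SET (autovacuum_vacuum_scale_factor = 0.05,
       autovacuum_vacuum_threshold = 1000);
```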
Disk Upgrade
- NeetoCal’s PG addon is currently on a GP2 disk, which has limited IOPS and throughput. All new addons already use GP3, which supports higher IOPS and can be scaled without downtime.
- We plan to migrate NeetoCal to GP3 this Sunday, which will require ~10 minutes of downtime.
Monitoring
- Until the disk upgrade and optimizations are in place, we will manually monitor NeetoCal’s PostgreSQL performance and intervene as needed.