Support Documentation For Log Check Issue

The commands listed in this document use BOSH CLI v2+.

Background

The current Crunchy PostgreSQL for PCF tile release includes high availability features: in the event of a problem on the primary PostgreSQL server, our configuration automatically fences the primary and promotes a replica to primary status. Our recent release, tile version v04.090513.001, added statistics gathering as part of our health check process.

The high availability functions are based on the status of each server in Consul. On each of the PostgreSQL servers there is a script at /var/vcap/store/service/healthcheck.sh; every service (haproxy, pgbackrest, postgresql, etc.) has one in the same location. The Consul agent on each VM runs that script: if it exits with status code 0 the service is in a passing state, an exit code of 1 is a warning state, and 2 or above is marked critical. Our HA configuration is such that if a PostgreSQL server is marked critical in Consul, our Crunchy Cluster Manager (CCM) fences it, looks for the next server that is a replica, and promotes it to primary. Our built-in health checks are designed so that only the primary server can exit with code 2; the others exit with 0 or 1.
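The exit-code convention above can be sketched as a small shell function. This is a minimal illustration, not the tile's actual healthcheck.sh; the function name and arguments are hypothetical.

```shell
# health_status echoes the Consul state and returns the matching exit code.
# $1 = role ("primary" or "replica")
# $2 = 0 if PostgreSQL is healthy, non-zero otherwise (hypothetical probe result)
health_status() {
  local role="$1" pg_ok="$2"
  if [ "$pg_ok" -ne 0 ]; then
    if [ "$role" = "primary" ]; then
      echo "critical"   # exit 2: Consul marks the node critical; CCM fences and fails over
      return 2
    fi
    echo "warning"      # exit 1: replicas never report critical
    return 1
  fi
  echo "passing"        # exit 0: passing state in Consul
  return 0
}
```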

Known Issue

As part of the health check script, we also gather some statistics (memory status, CPU status, etc.) and publish them to Consul. Part of that information gathering is a log check that scans the PostgreSQL logs for PANIC|ERROR|FATAL|WARN messages. When a customer application generates enough transactions, the logs grow large enough (or are still being processed in memory) that the health check exceeds Consul's timeout waiting for a response, and the check is killed. The check likely receives an exit code of 137 (128 + 9 for kill -9), though the scenario in which this happens makes it difficult to determine the exact code. Since anything 2 or above is marked critical, our CCM kicked in and executed a failover. Part of the reason this was difficult to catch is that the health check output reports the status of the postgresql server, the pgbackrest server, and the replica servers; it does not capture the output of the statistics generation other than a true value when it completes successfully. We ultimately found the issue by comparing, word for word, the response message from a critical event against that of a passing event.
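The log check at the heart of the issue amounts to something like this sketch. The function name and the assumption that logs live in a single directory of *.log files are illustrative; the tile's actual script may differ.

```shell
# log_check counts PANIC/ERROR/FATAL/WARN lines across the PostgreSQL logs
# in the given directory. On a busy cluster the logs can grow large enough
# that this scan outlives Consul's script-check timeout, so Consul kills it.
log_check() {
  cat "$1"/*.log | grep -cE 'PANIC|ERROR|FATAL|WARN'
}
```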

As a result, customer service instances would experience frequent failovers of the primary PostgreSQL server and frequent interruptions to their applications. If a customer reports frequent dropped connections in their application, or difficulty running bosh tasks against the PostgreSQL VMs, this issue is a likely culprit.

Resolution

To fix this issue until the next release, the customer needs to remove the statistics_push call from the health check script. This fix is applied on a per-cluster basis, so it will need to be repeated for each cluster that exhibits the issue.

  1. The customer will first need to determine the PostgreSQL servers that exist in the cluster.
  • bosh -e $ENV -d $SERVICE_INSTANCE vms | grep 'postgresql/'
  • An example:
$ bosh -e vbox -d service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18 vms | grep 'postgresql/'
postgresql/8b702651-42b0-47f8-9152-f883dc305c38         running z2      10.244.10.3     bd38401d-2d7f-47c1-627b-c77349899a8a    crunchy-small   false
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603         running z1      10.244.9.7      3423495e-4567-45ff-599b-3d0be8333e51    crunchy-small   false
  2. Next, for each of the PostgreSQL servers, run a sed command to remove the statistics_push call.
  • bosh -e $ENV -d $SERVICE_INSTANCE ssh $POSTGRESQL_SERVER -c "sudo -u vcap sed -i '/statistics_push/d' /var/vcap/store/service/healthcheck.sh"
  • An example:
bosh -e vbox -d service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18 ssh postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603 -c "sudo -u vcap sed -i '/statistics_push/d' /var/vcap/store/service/healthcheck.sh"
Using environment '192.168.50.6' as client 'admin'

Using deployment 'service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18'

Task 594. Done
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | Unauthorized use is strictly prohibited. All access and activity
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | is subject to logging and monitoring.
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | Connection to 10.244.9.7 closed.

Succeeded
  3. The healthcheck.sh script is invoked fresh on each check, so no services on the instance need to be restarted.
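The effect of the sed command in step 2 can be demonstrated locally: given a script that calls statistics_push, `sed -i '/statistics_push/d'` deletes every line mentioning it and leaves the rest untouched. The file path and its contents below are illustrative stand-ins, not the tile's real script.

```shell
# Stand-in health check script; contents are illustrative only.
cat > /tmp/healthcheck_demo.sh <<'EOF'
check_postgresql
statistics_push
check_replicas
EOF

# The same in-place deletion the resolution step runs on each PostgreSQL VM.
sed -i '/statistics_push/d' /tmp/healthcheck_demo.sh
cat /tmp/healthcheck_demo.sh
```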

Code Fix

We currently have the fix in our staging environment, targeted to ship with our next releases, v04.090513.003 and v04.100400.003. The fix removes the statistics gathering from the health check and moves it into an independent cron job.
