Skip to content

Instantly share code, notes, and snippets.

@stevendanna
Created May 13, 2015 12:18
Show Gist options
  • Save stevendanna/4311c07e8f331a0a9624 to your computer and use it in GitHub Desktop.
Save stevendanna/4311c07e8f331a0a9624 to your computer and use it in GitHub Desktop.
Chef Server 12.0.8 Double Transaction Rate

Supposed Cause

Chef Server 12.0.8+ contains a new version of our underlying postgresql client library and a new version of the sqerl library which provideds an abstraction over that. These updates included the following change:

https://github.com/chef/sqerl/commit/11d08f363fa61958f5e46d2560447931f88ae3ee

which results in the following query being sent before each query that we send to sqerl:

set statement_timeout=5000

If this query is also counted as a transaction, that would account for the doubling in the transaction rate seen when upgrading to 12.0.8.

Proposed Fix

Since the statement_timeout is (by default) a per-session configurable, we can set it once at initialze. Thus, we can move the timeout statement into the initialization of the sqerl worker. The following PR does that:

chef/sqerl#92

Testing the Fix

To test the fix, I generated simple load against a standalone Chef Server built off of master. I then measured transactions on a 1 second interval for 5 minutes using a query very similar to what munin uses. On master, without the proposed fix, we observe the following transaction rate data:

       V1
 Min.   :  0.0
 1st Qu.:480.0
 Median :512.0
 Mean   :511.1
 3rd Qu.:549.0
 Max.   :771.0

I then stopped the load generator and erchef. I rebuilt oc_erchef with a new version of sqerl with the above fix, restarted the load generator, and restarted the monitoring application. With the change, we observe the following transaction rate data:

       V1
 Min.   :  0.0
 1st Qu.:230.0
 Median :252.0
 Mean   :249.1
 3rd Qu.:271.5
 Max.   :395.0

Load generation script

puts "Starting up"

threads = []
1.upto(4) do |i|
  threads << Thread.new do
    loop do
      puts "Request from thread #{i}"
      api.get("/nodes")
    end
  end
end
threads.each {|t| t.join}

started as: knife exec -u usera -k usera.pem test.rb -s 'https://localhost/organizations/orga' after creating the appropriate user and organization.

Monitoring Script

QUERY="SELECT sum(pg_stat_get_db_xact_commit(oid)) FROM pg_database WHERE datname='opscode_chef'"

# all of our calls are queries only, so rollbacks are very unlikely.  Munin, however does measure them
# by adding something like the following:
#UNION ALL
#SELECT 'rollback',sum(pg_stat_get_db_xact_rollback(oid)) FROM pg_database;
#

get_commit_count() {
 su opscode-pgsql -c "/opt/opscode/embedded/bin/psql opscode_chef -t -c \"$QUERY\"" | head -1 | awk '{print $1}'
}

OLD_COUNT=$(get_commit_count)
while true; do
   NEW_COUNT=$(get_commit_count)
   echo $(echo "$NEW_COUNT - $OLD_COUNT" | bc)
   OLD_COUNT=$NEW_COUNT
   sleep 1
done

started with: bash monitor.sh 2>/dev/null >tx.log. manually stopped with an interrupt after 5 minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment