exploring disk contention on a box

Sometimes choosing the right AWS resource makes all the difference. Here's a story of how I used iotop and iostat to build evidence for the need to choose an ebs-optimized disk to solve a problem.

problem description

A redis box was acting up. Here's what I'd experience:

  • slow login
  • failing redis backups (ERR Background save already in progress); a quick check for this is sketched right after the list
  • general sluggishness
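
To confirm the backup symptom, redis can be asked directly whether a background save is still running and how the last one ended. This is just a sketch of one way to check; the INFO fields are standard redis output, but this exact check wasn't part of the original notes:

# check whether a background save is in progress and how the last one ended
redis-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'
# unix timestamp of the last successful save
redis-cli LASTSAVE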

I suspected the disk, and iotop confirmed it. But I needed more evidence, so I recorded the info over time using iostat. This utility reports similar information in a tabular format that's easier to capture. To make the output readable, I grep only the lines containing iowait, plus the line of values that follows each one:

iostat 1 | grep iowait -A 1
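
To actually keep that record over time, rather than just watching the terminal, the same pipeline can be extended to timestamp each sample and append it to a file. This is a sketch, not the exact command used at the time; the log file name and the buffering flags are my additions:

# log timestamped iowait samples; stdbuf and --line-buffered keep the pipes from buffering
stdbuf -oL iostat 1 | grep --line-buffered -A 1 iowait | while read -r line; do
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $line"
done >> iowait.log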

Watching it live, the numbers came in waves like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.27    0.00    0.73   17.32    0.01   79.66
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   49.25    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00   48.76    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    1.00   49.00    0.00   49.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00   48.76    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.50    0.00   50.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   68.50    0.00   30.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   98.50    0.50    0.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.00   99.00    0.00    0.00

Notice how the iowait column would spike to 99% and stay there for a while. With another window open, I tried typing during those stretches and got no response. So I started another ec2 instance, this time with an ebs-optimized disk and higher (500) iops. This solved the problem. Notice that in the output below, iowait stays at zero while idle stays high.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.50    0.00    0.00   98.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.51    0.00    0.00   99.49
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    1.00    0.00    0.00   98.01
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.51    0.00    0.00   97.99
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50    0.00    0.50   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.49    0.00    0.00   98.01
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.51    0.00    0.00   99.49
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50    0.00    0.00   99.00

newrelic told us the same story.

before ebs-optimized disks (newrelic graph)

after ebs-optimized disks (newrelic graph)

moral of the story

Sometimes the baseline disk that comes with an instance is just fine. Other times you need something more robust. Since redis saves to disk so regularly, we needed a faster link between the instance and its storage. But rather than just spinning up the new disk and calling it done, we measured our results.
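
For reference, a replacement like the one described above could be launched with the AWS CLI roughly like this. It's only a sketch: the AMI, instance type, device name, and volume size are placeholders, not values from the original setup.

# hedged sketch: launch an ebs-optimized instance with a provisioned-iops (io1) data volume
# ami-12345678, m3.xlarge, /dev/xvdf, and the 100 GB size are placeholders
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type m3.xlarge \
    --ebs-optimized \
    --block-device-mappings '[{"DeviceName":"/dev/xvdf","Ebs":{"VolumeType":"io1","Iops":500,"VolumeSize":100}}]'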
