exploring disk contention on a box

Sometimes choosing the right AWS resource makes all the difference. Here's a story of how I used iotop and iostat to build evidence for the need to choose an ebs-optimized disk to solve a problem.

problem description

A redis box was acting up. Here's what I'd experience:

  • slow login
  • failing redis backups (ERR Background save already in progress); a quick check for this is sketched right after the list
  • general sluggishness
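
To confirm the backup symptom, redis can be asked directly whether a background save is still running and how the last one ended. This is just a sketch of one way to check; the INFO fields are standard redis output, but this exact check wasn't part of the original notes:

# check whether a background save is in progress and how the last one ended
redis-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'
# unix timestamp of the last successful save
redis-cli LASTSAVE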

I suspected the disk, and iotop confirmed it. But I needed more evidence, so I recorded the info over time using iostat. This utility reports similar information in a tabular format that's easier to capture. To make the output readable, I grep only the lines containing iowait, plus the line of values that follows each one:

iostat 1 | grep iowait -A 1
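
To actually keep that record over time, rather than just watching the terminal, the same pipeline can be extended to timestamp each sample and append it to a file. This is a sketch, not the exact command used at the time; the log file name and the buffering flags are my additions:

# log timestamped iowait samples; stdbuf and --line-buffered keep the pipes from buffering
stdbuf -oL iostat 1 | grep --line-buffered -A 1 iowait | while read -r line; do
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $line"
done >> iowait.log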

Watching it live, the numbers came in waves like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.27    0.00    0.73   17.32    0.01   79.66
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   49.25    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00   48.76    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    1.00   49.00    0.00   49.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.25    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00   48.76    0.00   49.75
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50   49.50    0.00   50.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   68.50    0.00   30.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50   98.50    0.50    0.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.00   99.00    0.00    0.00

Notice how the iowait column would spike to 99% and stay there for a while. With another window open, I tried typing during those stretches and got no response. So I started another ec2 instance, this time with an ebs-optimized disk and higher (500) iops. This solved the problem. Notice that in the output below, iowait stays at zero while idle stays high.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.50    0.00    0.00   98.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.51    0.00    0.00   99.49
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    1.00    0.00    0.00   98.01
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.51    0.00    0.00   97.99
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50    0.00    0.50   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.49    0.00    0.00   98.01
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.51    0.00    0.00   99.49
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.00    0.00    0.00   98.50
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.50    0.00    0.00   99.00

newrelic told us the same story.

before ebs-optimized disks (newrelic graph)

after ebs-optimized disks (newrelic graph)

moral of the story

Sometimes the baseline disk that comes with an instance is just fine. Other times you need something more robust. Since redis saves to disk so regularly, we needed a faster link between the instance and its storage. But rather than just spinning up the new disk and calling it done, we measured our results.
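
For reference, a replacement like the one described above could be launched with the AWS CLI roughly like this. It's only a sketch: the AMI, instance type, device name, and volume size are placeholders, not values from the original setup.

# hedged sketch: launch an ebs-optimized instance with a provisioned-iops (io1) data volume
# ami-12345678, m3.xlarge, /dev/xvdf, and the 100 GB size are placeholders
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type m3.xlarge \
    --ebs-optimized \
    --block-device-mappings '[{"DeviceName":"/dev/xvdf","Ebs":{"VolumeType":"io1","Iops":500,"VolumeSize":100}}]'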
