What exactly is "iowait"?

To summarize it in one sentence, 'iowait' is the percentage
of time the CPU is idle AND there is at least one I/O
in progress.

Each CPU can be in one of four states: user, sys, idle, or
iowait. Performance tools such as vmstat, iostat, and sar
print these four states as percentages. The sar tool can
print the states on a per-CPU basis (the -P flag), but most
other tools print the average values across all the CPUs.
Since these are percentages, the four state values should
add up to 100%.

The tools compute these statistics from counters that the
kernel updates periodically. On AIX, these CPU state counters
are incremented at every clock interrupt, and clock interrupts
occur at 10 millisecond intervals.

When the clock interrupt occurs on a CPU, the kernel
checks the CPU to see if it is idle or not. If it's not
idle, the kernel then determines if the instruction being
executed at that point is in user space or in kernel space.
If user, then it increments the 'user' counter by one. If
the instruction is in kernel space, then the 'sys' counter
is incremented by one.

If the CPU is idle, the kernel then determines if there is
at least one I/O currently in progress to either a local disk
or a remotely mounted disk (NFS) which had been initiated
from that CPU. If there is, then the 'iowait' counter is
incremented by one. If there is no I/O in progress that was
initiated from that CPU, the 'idle' counter is incremented
by one.
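
In rough C, the per-tick decision described above looks like
this (a simplified sketch, not actual AIX kernel source; the
'busy', 'in_user_mode', and 'io_outstanding' flags stand in
for the state the kernel actually inspects):

    #include <stdio.h>

    struct cpu_ticks { long user, sys, idle, iowait; };

    /* One clock tick's worth of accounting for one CPU. */
    static void clock_tick(struct cpu_ticks *c, int busy,
                           int in_user_mode, int io_outstanding)
    {
        if (busy) {
            if (in_user_mode)
                c->user++;     /* interrupted user code   */
            else
                c->sys++;      /* interrupted kernel code */
        } else if (io_outstanding) {
            c->iowait++;       /* idle, but an I/O initiated from
                                  this CPU is still in flight    */
        } else {
            c->idle++;         /* truly idle */
        }
    }

    int main(void)
    {
        struct cpu_ticks c = {0, 0, 0, 0};
        clock_tick(&c, 1, 1, 0);  /* user code running      -> user   */
        clock_tick(&c, 1, 0, 0);  /* kernel code running    -> sys    */
        clock_tick(&c, 0, 0, 1);  /* idle with I/O pending  -> iowait */
        clock_tick(&c, 0, 0, 0);  /* idle, no I/O           -> idle   */
        printf("user=%ld sys=%ld iowait=%ld idle=%ld\n",
               c.user, c.sys, c.iowait, c.idle);
        return 0;
    }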

When a performance tool such as vmstat is invoked, it reads
the current values of these four counters. Then it sleeps
for the number of seconds the user specified as the interval
time and reads the counters again. vmstat then subtracts the
previous values from the current values to get the delta for
this sampling period. Since vmstat knows that the counters
are incremented at each clock tick (every 10ms), it divides
the delta value of each counter by the number of clock ticks
in the sampling period. For example, running 'vmstat 2' makes
vmstat sample the counters every 2 seconds. Since the clock
ticks at 10ms intervals, there are 100 ticks per second, or
200 ticks per 2-second vmstat interval. The delta value of
each counter is divided by the total ticks in the interval
and multiplied by 100 to get the percentage value for that
interval.
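
A minimal sketch of that arithmetic in C, assuming the
100-ticks-per-second AIX clock and made-up counter values
(neither the structure nor the numbers come from any real
tool):

    #include <stdio.h>

    #define HZ 100  /* AIX clock ticks per second (one every 10ms) */

    struct ticks { long user, sys, idle, iowait; };

    /* Percentages for one sampling period: the delta of each
     * counter over the total ticks in the period. */
    static void report(struct ticks prev, struct ticks curr, int secs)
    {
        double total = (double)HZ * secs;  /* 200 ticks for 'vmstat 2' */
        printf("us %.0f%%  sy %.0f%%  id %.0f%%  wa %.0f%%\n",
               (curr.user   - prev.user)   * 100 / total,
               (curr.sys    - prev.sys)    * 100 / total,
               (curr.idle   - prev.idle)   * 100 / total,
               (curr.iowait - prev.iowait) * 100 / total);
    }

    int main(void)
    {
        /* Two made-up counter reads taken 2 seconds apart. */
        struct ticks prev = {1000, 400, 2000, 600};
        struct ticks curr = {1100, 420, 2050, 630};  /* +100 +20 +50 +30 */
        report(prev, curr, 2);  /* prints: us 50%  sy 10%  id 25%  wa 15% */
        return 0;
    }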

iowait can in some cases be an indicator of a limiting factor
to transaction throughput, whereas in other cases, iowait may
be completely meaningless.

Some examples here will help to explain this. The first
example is one where high iowait is a direct cause of a
performance issue.

Example 1:
Let's say that a program needs to perform transactions on
behalf of a batch job. For each transaction, the program
performs some computation which takes 10 milliseconds and then
does a synchronous write of the results to disk. Since the file
it is writing to was opened synchronously, the write does not
return until the I/O has made it all the way to the disk. Let's
say the disk subsystem does not have a cache and that each
physical write I/O takes 20ms. This means that the program
completes a transaction every 30ms. Over a period of 1 second
(1000ms), the program can do 33 transactions (33 tps). If this
program is the only one running on a 1-CPU system, then the CPU
is busy 1/3 of the time and waiting on I/O the rest of the
time - so roughly 33% CPU busy and 67% iowait.

If the I/O subsystem is improved (let's say a disk cache is
added) such that a write I/O takes only 1ms, then it takes 11ms
to complete a transaction, and the program can now do around
90-91 transactions a second. Here the iowait time would be
around 9%. Notice that a lower iowait time directly affects
the throughput of the program.
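
The arithmetic of this example fits in a few lines of C (a toy
model of the transaction loop, not a measurement; txn_model is
a made-up name, not a real API):

    #include <stdio.h>

    /* Each transaction is 'compute_ms' of CPU work followed by
     * 'io_ms' of synchronous disk wait. */
    static void txn_model(double compute_ms, double io_ms)
    {
        double txn = compute_ms + io_ms;
        printf("%.0fms compute + %.0fms I/O: %.1f tps, "
               "%.0f%% busy, %.0f%% iowait\n",
               compute_ms, io_ms, 1000.0 / txn,
               100.0 * compute_ms / txn, 100.0 * io_ms / txn);
    }

    int main(void)
    {
        txn_model(10, 20);  /* no cache:   33.3 tps, 33% busy, 67% iowait */
        txn_model(10, 1);   /* with cache: 90.9 tps, 91% busy,  9% iowait */
        return 0;
    }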

Example 2:
Let's say that there is one program running on the system -
let's assume that this is the 'dd' program, and it is reading
from the disk 4KB at a time. Let's say that the subroutine in
'dd' is called main() and it invokes read() to do a read. Both
main() and read() are user-space subroutines. read() is a
libc.a subroutine which will then invoke the kread() system
call, at which point it enters kernel space. kread() will then
initiate a physical I/O to the device, and the 'dd' program is
put to sleep until the physical I/O completes. The time to
execute the code in main, read, and kread is very small -
probably around 50 microseconds at most. The time it takes for
the disk to complete the I/O request will probably be around
2-20 milliseconds, depending on how far the disk arm had to
seek. This means that when the clock interrupt occurs, the
chances are that the 'dd' program is asleep and that the I/O is
in progress. Therefore, the 'iowait' counter is incremented. If
the I/O completes in 2 milliseconds, then the 'dd' program runs
again to do another read. But since 50 microseconds is so small
compared to 2ms (2000 microseconds), the chances are that when
the clock interrupt occurs, the CPU will again be idle with an
I/O in progress. So again, 'iowait' is incremented. If
'sar -P <cpunumber>' is run to show the utilization of this
CPU, it will most likely show 97-98% iowait. If each I/O takes
20ms, then the iowait would be 99-100%. Even though the I/O
wait is extremely high in either case, the throughput is 10
times better in one case than in the other.
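
A minimal stand-in for the 'dd' read path described here, in C
(the file name is a placeholder; any large file or readable
device would do):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;
        /* "bigfile" is a placeholder; dd would read a disk device. */
        int fd = open("bigfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        /* Each read() enters the kernel (kread on AIX), starts a
         * physical I/O, and sleeps until it completes: ~50us of
         * CPU per loop vs. 2-20ms asleep waiting on the disk. */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            ;
        close(fd);
        return 0;
    }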

Example 3:
Let's say that there are two programs running on a CPU. One is
a 'dd' program reading from the disk. The other is a program
that does no I/O but spends 100% of its time doing
computational work. Now assume that there is a problem with the
I/O subsystem and that physical I/Os are taking over a second
to complete. Whenever the 'dd' program is asleep waiting for
its I/Os to complete, the other program is able to run on that
CPU. When the clock interrupt occurs, there will always be a
program running in either user mode or system mode. Therefore,
the %idle and %iowait values will be 0. Even though iowait is
now 0, that does not mean there is no I/O problem, because
there obviously is one if physical I/Os are taking over a
second to complete.
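
A toy tick-sampling loop makes the masking effect concrete (it
reuses the simplified accounting rules from the earlier sketch;
the always-true flags model this particular two-program
workload):

    #include <stdio.h>

    int main(void)
    {
        long user = 0, iowait = 0, idle = 0;
        /* 100 ticks = 1 second. dd's slow I/O is outstanding at
         * every tick, but the compute-bound program is always
         * runnable, so the CPU is never idle when the clock
         * interrupt fires. */
        for (int tick = 0; tick < 100; tick++) {
            int cpu_busy       = 1;  /* the hog always has work     */
            int io_in_progress = 1;  /* dd is always waiting on I/O */
            if (cpu_busy)
                user++;              /* busy wins: no iowait charged */
            else if (io_in_progress)
                iowait++;
            else
                idle++;
        }
        printf("usr %ld%%  wio %ld%%  idle %ld%%\n", user, iowait, idle);
        /* prints: usr 100%  wio 0%  idle 0% */
        return 0;
    }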

Example 4:
Let's say that there is a 4-CPU system where six programs are
running. Let's assume that four of the programs spend 70% of
their time waiting on physical read I/Os and 30% of their time
actually using CPU time. Since these four programs do have to
enter kernel space to execute the kread system calls, they will
spend a percentage of their time in the kernel; let's assume
that 25% of the time is in user mode and 5% of the time is in
kernel mode. Let's also assume that the other two programs
spend 100% of their time in user code doing computations and no
I/O, so that two CPUs will always be 100% busy. Since the other
four programs are busy only 30% of the time, they can share the
CPUs that are not busy.

If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
for 10 intervals, then we'd expect to see this for each
interval:

          cpu  %usr  %sys  %wio  %idle
           0    50    10    40     0
           1    50    10    40     0
           2   100     0     0     0
           3   100     0     0     0
           -    75     5    20     0

Notice that the average CPU utilization is 75% user, 5% sys,
and 20% iowait. The values one sees with 'vmstat' or 'iostat'
or most other tools are these averages across all CPUs.

Now let's say we take this exact same workload (the same six
programs with the same behavior) to another machine that has
6 CPUs (same CPU speeds and same I/O subsystem). Now each
program can run on its own CPU. Therefore, the CPU usage
breakdown would be as follows:

          cpu  %usr  %sys  %wio  %idle
           0    25     5    70     0
           1    25     5    70     0
           2    25     5    70     0
           3    25     5    70     0
           4   100     0     0     0
           5   100     0     0     0
           -    50     3    47     0

So now the average CPU utilization is 50% user, 3% sys, and
47% iowait. Notice that the same workload on the other machine
shows more than double the iowait value.
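
The '-' rows in both tables are just the arithmetic mean of the
per-CPU columns, which a few lines of C can verify (the numbers
are copied from the tables above):

    #include <stdio.h>

    /* Average the %usr/%sys/%wio columns across 'ncpu' CPUs. */
    static void average(const char *label, const int rows[][3], int ncpu)
    {
        double usr = 0, sys = 0, wio = 0;
        for (int i = 0; i < ncpu; i++) {
            usr += rows[i][0]; sys += rows[i][1]; wio += rows[i][2];
        }
        printf("%s: %%usr %.0f  %%sys %.0f  %%wio %.0f\n",
               label, usr / ncpu, sys / ncpu, wio / ncpu);
    }

    int main(void)
    {
        const int four[4][3] = {{50,10,40}, {50,10,40},
                                {100,0,0},  {100,0,0}};
        const int six[6][3]  = {{25,5,70}, {25,5,70}, {25,5,70},
                                {25,5,70}, {100,0,0}, {100,0,0}};
        average("4-CPU machine", four, 4);  /* 75 / 5 / 20 */
        average("6-CPU machine", six, 6);   /* 50 / 3 / 47 */
        return 0;
    }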

Conclusion:
The iowait statistic may or may not be a useful indicator of
I/O performance - but it does tell us that the system can
handle more computational work. Just because a CPU is in the
iowait state does not mean that it cannot run other threads;
that is, iowait is simply a form of idle time.