What exactly is "iowait"?

To summarize it in one sentence, 'iowait' is the percentage
of time the CPU is idle AND there is at least one I/O
in progress.

Each CPU can be in one of four states: user, sys, idle, or
iowait. Performance tools such as vmstat, iostat, and sar
print these four states as percentages. The sar tool can
print the states on a per-CPU basis (the -P flag), but most
other tools print the average values across all the CPUs.
Since these are percentages, the four state values should
add up to 100%.

The tools compute these statistics from counters that the
kernel updates periodically. On AIX, these CPU state counters
are incremented at every clock interrupt, and clock interrupts
occur at 10 millisecond intervals.

When the clock interrupt occurs on a CPU, the kernel
checks the CPU to see if it is idle or not. If it's not
idle, the kernel then determines if the instruction being
executed at that point is in user space or in kernel space.
If user, then it increments the 'user' counter by one. If
the instruction is in kernel space, then the 'sys' counter
is incremented by one.

If the CPU is idle, the kernel then determines if there is
at least one I/O currently in progress to either a local disk
or a remotely mounted disk (NFS) which had been initiated
from that CPU. If there is, then the 'iowait' counter is
incremented by one. If there is no I/O in progress that was
initiated from that CPU, the 'idle' counter is incremented
by one.
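
In rough C, the per-tick decision described above looks like
this (a simplified sketch, not actual AIX kernel source; the
'busy', 'in_user_mode', and 'io_outstanding' flags stand in
for the state the kernel actually inspects):

    #include <stdio.h>

    struct cpu_ticks { long user, sys, idle, iowait; };

    /* One clock tick's worth of accounting for one CPU. */
    static void clock_tick(struct cpu_ticks *c, int busy,
                           int in_user_mode, int io_outstanding)
    {
        if (busy) {
            if (in_user_mode)
                c->user++;     /* interrupted user code   */
            else
                c->sys++;      /* interrupted kernel code */
        } else if (io_outstanding) {
            c->iowait++;       /* idle, but an I/O initiated from
                                  this CPU is still in flight    */
        } else {
            c->idle++;         /* truly idle */
        }
    }

    int main(void)
    {
        struct cpu_ticks c = {0, 0, 0, 0};
        clock_tick(&c, 1, 1, 0);  /* user code running      -> user   */
        clock_tick(&c, 1, 0, 0);  /* kernel code running    -> sys    */
        clock_tick(&c, 0, 0, 1);  /* idle with I/O pending  -> iowait */
        clock_tick(&c, 0, 0, 0);  /* idle, no I/O           -> idle   */
        printf("user=%ld sys=%ld iowait=%ld idle=%ld\n",
               c.user, c.sys, c.iowait, c.idle);
        return 0;
    }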

When a performance tool such as vmstat is invoked, it reads
the current values of these four counters. Then it sleeps
for the number of seconds the user specified as the interval
time and reads the counters again. vmstat then subtracts the
previous values from the current values to get the delta for
this sampling period. Since vmstat knows that the counters
are incremented at each clock tick (every 10ms), it divides
the delta value of each counter by the number of clock ticks
in the sampling period. For example, running 'vmstat 2' makes
vmstat sample the counters every 2 seconds. Since the clock
ticks at 10ms intervals, there are 100 ticks per second, or
200 ticks per 2-second vmstat interval. The delta value of
each counter is divided by the total ticks in the interval
and multiplied by 100 to get the percentage value for that
interval.
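
A minimal sketch of that arithmetic in C, assuming the
100-ticks-per-second AIX clock and made-up counter values
(neither the structure nor the numbers come from any real
tool):

    #include <stdio.h>

    #define HZ 100  /* AIX clock ticks per second (one every 10ms) */

    struct ticks { long user, sys, idle, iowait; };

    /* Percentages for one sampling period: the delta of each
     * counter over the total ticks in the period. */
    static void report(struct ticks prev, struct ticks curr, int secs)
    {
        double total = (double)HZ * secs;  /* 200 ticks for 'vmstat 2' */
        printf("us %.0f%%  sy %.0f%%  id %.0f%%  wa %.0f%%\n",
               (curr.user   - prev.user)   * 100 / total,
               (curr.sys    - prev.sys)    * 100 / total,
               (curr.idle   - prev.idle)   * 100 / total,
               (curr.iowait - prev.iowait) * 100 / total);
    }

    int main(void)
    {
        /* Two made-up counter reads taken 2 seconds apart. */
        struct ticks prev = {1000, 400, 2000, 600};
        struct ticks curr = {1100, 420, 2050, 630};  /* +100 +20 +50 +30 */
        report(prev, curr, 2);  /* prints: us 50%  sy 10%  id 25%  wa 15% */
        return 0;
    }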

iowait can in some cases be an indicator of a limiting factor
to transaction throughput, whereas in other cases, iowait may
be completely meaningless.

Some examples here will help to explain this. The first
example is one where high iowait is a direct cause of a
performance issue.

Example 1:
Let's say that a program needs to perform transactions on
behalf of a batch job. For each transaction, the program
performs some computation which takes 10 milliseconds and then
does a synchronous write of the results to disk. Since the file
it is writing to was opened synchronously, the write does not
return until the I/O has made it all the way to the disk. Let's
say the disk subsystem does not have a cache and that each
physical write I/O takes 20ms. This means that the program
completes a transaction every 30ms. Over a period of 1 second
(1000ms), the program can do 33 transactions (33 tps). If this
program is the only one running on a 1-CPU system, then the CPU
is busy 1/3 of the time and waiting on I/O the rest of the
time - so roughly 33% CPU busy and 67% iowait.

If the I/O subsystem is improved (let's say a disk cache is
added) such that a write I/O takes only 1ms, then it takes 11ms
to complete a transaction, and the program can now do around
90-91 transactions a second. Here the iowait time would be
around 9%. Notice that a lower iowait time directly affects
the throughput of the program.
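
The arithmetic of this example fits in a few lines of C (a toy
model of the transaction loop, not a measurement; txn_model is
a made-up name, not a real API):

    #include <stdio.h>

    /* Each transaction is 'compute_ms' of CPU work followed by
     * 'io_ms' of synchronous disk wait. */
    static void txn_model(double compute_ms, double io_ms)
    {
        double txn = compute_ms + io_ms;
        printf("%.0fms compute + %.0fms I/O: %.1f tps, "
               "%.0f%% busy, %.0f%% iowait\n",
               compute_ms, io_ms, 1000.0 / txn,
               100.0 * compute_ms / txn, 100.0 * io_ms / txn);
    }

    int main(void)
    {
        txn_model(10, 20);  /* no cache:   33.3 tps, 33% busy, 67% iowait */
        txn_model(10, 1);   /* with cache: 90.9 tps, 91% busy,  9% iowait */
        return 0;
    }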

Example 2:
Let's say that there is one program running on the system -
let's assume that this is the 'dd' program, and it is reading
from the disk 4KB at a time. Let's say that the subroutine in
'dd' is called main() and it invokes read() to do a read. Both
main() and read() are user-space subroutines. read() is a
libc.a subroutine which will then invoke the kread() system
call, at which point it enters kernel space. kread() will then
initiate a physical I/O to the device, and the 'dd' program is
put to sleep until the physical I/O completes. The time to
execute the code in main, read, and kread is very small -
probably around 50 microseconds at most. The time it takes for
the disk to complete the I/O request will probably be around
2-20 milliseconds, depending on how far the disk arm had to
seek. This means that when the clock interrupt occurs, the
chances are that the 'dd' program is asleep and that the I/O is
in progress. Therefore, the 'iowait' counter is incremented. If
the I/O completes in 2 milliseconds, then the 'dd' program runs
again to do another read. But since 50 microseconds is so small
compared to 2ms (2000 microseconds), the chances are that when
the clock interrupt occurs, the CPU will again be idle with an
I/O in progress. So again, 'iowait' is incremented. If
'sar -P <cpunumber>' is run to show the utilization of this
CPU, it will most likely show 97-98% iowait. If each I/O takes
20ms, then the iowait would be 99-100%. Even though the I/O
wait is extremely high in either case, the throughput is 10
times better in one case than in the other.
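
A minimal stand-in for the 'dd' read path described here, in C
(the file name is a placeholder; any large file or readable
device would do):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;
        /* "bigfile" is a placeholder; dd would read a disk device. */
        int fd = open("bigfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        /* Each read() enters the kernel (kread on AIX), starts a
         * physical I/O, and sleeps until it completes: ~50us of
         * CPU per loop vs. 2-20ms asleep waiting on the disk. */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            ;
        close(fd);
        return 0;
    }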

Example 3:
Let's say that there are two programs running on a CPU. One is
a 'dd' program reading from the disk. The other is a program
that does no I/O but spends 100% of its time doing
computational work. Now assume that there is a problem with the
I/O subsystem and that physical I/Os are taking over a second
to complete. Whenever the 'dd' program is asleep waiting for
its I/Os to complete, the other program is able to run on that
CPU. When the clock interrupt occurs, there will always be a
program running in either user mode or system mode. Therefore,
the %idle and %iowait values will be 0. Even though iowait is
now 0, that does not mean there is no I/O problem, because
there obviously is one if physical I/Os are taking over a
second to complete.
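
A toy tick-sampling loop makes the masking effect concrete (it
reuses the simplified accounting rules from the earlier sketch;
the always-true flags model this particular two-program
workload):

    #include <stdio.h>

    int main(void)
    {
        long user = 0, iowait = 0, idle = 0;
        /* 100 ticks = 1 second. dd's slow I/O is outstanding at
         * every tick, but the compute-bound program is always
         * runnable, so the CPU is never idle when the clock
         * interrupt fires. */
        for (int tick = 0; tick < 100; tick++) {
            int cpu_busy       = 1;  /* the hog always has work     */
            int io_in_progress = 1;  /* dd is always waiting on I/O */
            if (cpu_busy)
                user++;              /* busy wins: no iowait charged */
            else if (io_in_progress)
                iowait++;
            else
                idle++;
        }
        printf("usr %ld%%  wio %ld%%  idle %ld%%\n", user, iowait, idle);
        /* prints: usr 100%  wio 0%  idle 0% */
        return 0;
    }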

Example 4:
Let's say that there is a 4-CPU system where six programs are
running. Let's assume that four of the programs spend 70% of
their time waiting on physical read I/Os and 30% of their time
actually using CPU time. Since these four programs do have to
enter kernel space to execute the kread system calls, they will
spend a percentage of their time in the kernel; let's assume
that 25% of the time is in user mode and 5% of the time is in
kernel mode. Let's also assume that the other two programs
spend 100% of their time in user code doing computations and no
I/O, so that two CPUs will always be 100% busy. Since the other
four programs are busy only 30% of the time, they can share the
CPUs that are not busy.

If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
for 10 intervals, then we'd expect to see this for each
interval:

          cpu  %usr  %sys  %wio  %idle
           0    50    10    40     0
           1    50    10    40     0
           2   100     0     0     0
           3   100     0     0     0
           -    75     5    20     0

Notice that the average CPU utilization is 75% user, 5% sys,
and 20% iowait. The values one sees with 'vmstat' or 'iostat'
or most other tools are these averages across all CPUs.

Now let's say we take this exact same workload (the same six
programs with the same behavior) to another machine that has
6 CPUs (same CPU speeds and same I/O subsystem). Now each
program can run on its own CPU. Therefore, the CPU usage
breakdown would be as follows:

          cpu  %usr  %sys  %wio  %idle
           0    25     5    70     0
           1    25     5    70     0
           2    25     5    70     0
           3    25     5    70     0
           4   100     0     0     0
           5   100     0     0     0
           -    50     3    47     0

So now the average CPU utilization is 50% user, 3% sys, and
47% iowait. Notice that the same workload on the other machine
shows more than double the iowait value.
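
The '-' rows in both tables are just the arithmetic mean of the
per-CPU columns, which a few lines of C can verify (the numbers
are copied from the tables above):

    #include <stdio.h>

    /* Average the %usr/%sys/%wio columns across 'ncpu' CPUs. */
    static void average(const char *label, const int rows[][3], int ncpu)
    {
        double usr = 0, sys = 0, wio = 0;
        for (int i = 0; i < ncpu; i++) {
            usr += rows[i][0]; sys += rows[i][1]; wio += rows[i][2];
        }
        printf("%s: %%usr %.0f  %%sys %.0f  %%wio %.0f\n",
               label, usr / ncpu, sys / ncpu, wio / ncpu);
    }

    int main(void)
    {
        const int four[4][3] = {{50,10,40}, {50,10,40},
                                {100,0,0},  {100,0,0}};
        const int six[6][3]  = {{25,5,70}, {25,5,70}, {25,5,70},
                                {25,5,70}, {100,0,0}, {100,0,0}};
        average("4-CPU machine", four, 4);  /* 75 / 5 / 20 */
        average("6-CPU machine", six, 6);   /* 50 / 3 / 47 */
        return 0;
    }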

Conclusion:
The iowait statistic may or may not be a useful indicator of
I/O performance - but it does tell us that the system can
handle more computational work. Just because a CPU is in the
iowait state does not mean that it cannot run other threads;
that is, iowait is simply a form of idle time.