Created March 10, 2017 08:10
Gist: avitevet/3b43e3ff91980787b2c45a9120241ee1
How to perform accurate & high-resolution performance measurement in "recent" Windows on "recent" x86 processors
#include <Windows.h>
#include <iostream>
// Shows how to perform accurate performance measurements on modern Windows in application-level code
// on x86 processors.
//
// Mostly from https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx.
// QueryPerformanceCounter is a Windows abstraction over a hardware tick counter such as the
// processor time-stamp counter (TSC), which starts at zero when the processor is reset and
// increments monotonically each cycle.
//
// Performance monitoring on Intel processors is a challenge because the behavior of the internal
// processor counters has changed over the years. Also, processor frequency used to be fixed,
// but with the introduction of SpeedStep technology the processor frequency may change at
// essentially any time. This problem is compounded by multi-core processors, where each core
// has its own clock.
//
// If the measurement is restricted to "modern" x86 processors (roughly Pentium 4 and newer),
// the time-stamp counter increments at a constant rate regardless of any adjustments the
// processor may make for power/performance reasons (i.e., due to SpeedStep). From the Intel
// IA-32 Architectures SW Developer's System Programming Manual, vol. 3B, ch. 17:
//
// "... the time-stamp counter increments at a constant rate. That rate may be set by the
// maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
// which the processor is booted. The maximum resolved frequency may differ from the processor base
// frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same
// as the frequency in the brand string.
// The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration
// of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
// changes frequency. This is the architectural behavior moving forward."
//
// So, to do the fast & loose measurement, you could use the RDTSC instruction via assembly or
// an intrinsic, though Intel processors have restrictions and/or configurations for this, like
// whether it's allowed to call this with application-level privilege or what the processor
// should do when this is called while a VM is running.
//
// It's safest to simply use the abstraction that Windows provides.
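//
// For comparison, here is a sketch (kept as a comment so this file builds unchanged) of how
// the raw counter could be read with the __rdtsc compiler intrinsic, which MSVC declares in
// <intrin.h>. The _mm_lfence() is one common way to keep earlier instructions from being
// reordered past the counter read; the caveats above (privilege level, VM behavior) still apply.
//
//   #include <intrin.h>
//
//   unsigned long long ReadTscSerialized() {
//       _mm_lfence();      // order the read relative to preceding instructions
//       return __rdtsc();  // read the 64-bit time-stamp counter
//   }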
int main() {
  LARGE_INTEGER StartTick, EndTick, ElapsedTicks;
  QueryPerformanceCounter(&StartTick);
  // Activity to be timed. Typically a short operation, like a single division or a single
  // sqrt, would be done many times. The number of ticks for a single operation
  // would then be found by dividing ElapsedTicks by the number of iterations.
  // A longer-running operation, like a large matrix multiplication, could be performed
  // only once.
  //
  // A note on calling a single operation multiple times using a loop: a loop gets
  // translated into instructions for the body plus a test and conditional jump (cjump).
  // Modern Intel processors attempt to increase performance by predicting which
  // instruction will execute next after a cjump; a misprediction is one of the more
  // costly things a processor can do. So, to avoid the effect of branch
  // mispredictions on accurate timing, either use the compiler's loop-unrolling
  // feature (and check the disassembly) or manually unroll the loop, even using
  // copy/paste if necessary. Or, use a LOT of loop iterations, so the cost of the
  // final misprediction is amortized over the total cost of the operations.
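  //
  // For example, a 4x manually unrolled version of the loop below would look like this
  // (comment sketch only; the iteration count is assumed divisible by 4), paying one
  // cjump per four operations instead of one per operation:
  //
  //   for (int i = 0; i < 1000000; i += 4) {
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //   }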
  // volatile prevents the optimizer from deleting the otherwise-unobserved loop entirely.
  volatile int numStuff = 0;
  for (int i = 0; i < 1000000; ++i) {
    numStuff = 1 - numStuff;
  }
  QueryPerformanceCounter(&EndTick);
  ElapsedTicks.QuadPart = EndTick.QuadPart - StartTick.QuadPart;
  std::cout << "Number of ticks: " << ElapsedTicks.QuadPart << std::endl;
  // The number of ticks can be converted to seconds by dividing by the counter frequency.
  // Though, in terms of comparing different algorithms/modules/etc., it may be just as useful
  // to compare tick counts as total time, because for a particular microarchitecture,
  // the vast majority of instructions in the vast majority of cases consume a consistent,
  // deterministic number of cycles. Comparing cycles allows comparison of algorithms across
  // the same processor models with different frequencies (i7 3.5GHz vs. i7 3.8GHz), which can
  // be useful if the test is being performed on lower-frequency/lower-cost versions of the
  // production processors. On the other hand, time is a more valuable measurement when
  // comparing the cost/benefit of upgrading to a new processor generation.
  LARGE_INTEGER Frequency;
  QueryPerformanceFrequency(&Frequency);
  //
  // We now have the elapsed number of ticks, along with the
  // number of ticks-per-second. We use these values
  // to convert to the number of elapsed microseconds.
  // To guard against loss-of-precision, we convert
  // to microseconds *before* dividing by ticks-per-second.
  //
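  // As a small worked example of why the order matters (the frequency value is assumed for
  // illustration; on many recent Windows systems QueryPerformanceFrequency reports
  // 10,000,000, i.e. 0.1 us per tick): with ElapsedTicks = 35 and Frequency = 10,000,000,
  // dividing first gives 35 / 10,000,000 = 0 in integer math, so 0 * 1,000,000 = 0 us;
  // multiplying first gives 35 * 1,000,000 / 10,000,000 = 3 us (truncated from 3.5).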
  ElapsedTicks.QuadPart *= 1000000;
  ElapsedTicks.QuadPart /= Frequency.QuadPart;
  std::cout << "Frequency: " << Frequency.QuadPart << std::endl;
  std::cout << "Microseconds: " << ElapsedTicks.QuadPart << std::endl;
}