How to perform accurate & high-resolution performance measurement in "recent" Windows on "recent" x86 processors
#include <Windows.h>
#include <iostream>
// Shows how to perform accurate performance measurements on modern Windows in application-level code
// on x86 processors.
// Mostly from
// QueryPerformanceCounter is an abstraction over the processor's time-stamp counter (TSC), which
// starts at zero when the processor is reset and increments monotonically each cycle.
// Performance monitoring on Intel processors is a challenge because the behavior of the internal
// processor counters has changed over the years. Also, processor frequency used to be fixed,
// but with the introduction of SpeedStep technology the processor frequency may change at
// essentially any time. This problem is compounded by multi-core processors that each have
// their own clocks.
// If the measurement is restricted to "modern"
// x86 processors, like basically Pentium 4 and newer, the time-stamp counter increments at a constant
// rate regardless of any adjustments the processor may make for power/performance reasons (aka due to
// SpeedStep). From the Intel IA32 Architectures SW Developer System Programming Manual, vol 3B, ch 17:
// "... the time-stamp counter increments at a constant rate. That rate may be set by the
// maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
// which the processor is booted. The maximum resolved frequency may differ from the processor base
// frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same
// as the frequency in the brand string.
// The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration
// of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
// changes frequency. This is the architectural behavior moving forward."
// So, for a fast & loose measurement, you could use the RDTSC instruction via assembly or an
// intrinsic, though Intel processors have restrictions and/or configurations around this, like
// whether the instruction may be executed at application-level privilege or what the processor
// should do when it is executed while a VM is running.
// It's safest to simply use the abstraction that Windows provides.
int main() {
LARGE_INTEGER StartTick, EndTick, ElapsedTicks, Frequency;
// The performance-counter frequency is fixed at system boot, so it only needs to be queried once.
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartTick);
// Activity to be timed. Typically a short operation, like a single division or a single
// sqrt, would be done many times. The number of ticks for a single operation
// would be found by dividing ElapsedTicks by the number of iterations of the operation.
// A longer running operation, like a large matrix multiplication, could be performed
// only once.
// A note on calling a single operation multiple times using a loop: a loop gets
// translated into a set of instructions for the body, a jump, and a test + conditional jump (cjump).
// Modern Intel processors attempt to increase performance by predicting which
// instruction will execute next after a cjump - a misprediction is one of the more
// costly things that a processor can do. So, to avoid the effect of branch
// mispredictions in accurate timing, either use the compiler's loop unrolling
// feature (and check the disassembly) or manually unroll the loop, even using
// copy/paste if necessary. Or, use a LOT of loop iterations - the cost of the
// final misprediction will be amortized over the many iterations of the operation.
volatile int numStuff = 0; // volatile keeps the compiler from optimizing the loop away
for (int i = 0; i < 1000000; ++i) {
numStuff = 1 - numStuff;
}
QueryPerformanceCounter(&EndTick);
ElapsedTicks.QuadPart = EndTick.QuadPart - StartTick.QuadPart;
std::cout << "Number of ticks: " << ElapsedTicks.QuadPart << std::endl;
// Number of ticks can be converted to seconds by dividing by time-stamp counter frequency. Though,
// in terms of comparing different algorithms/modules/etc., it may be just as useful
// to compare tick counts as it is to compare total time, because for a particular
// microarchitecture, most instructions consume a consistent, deterministic
// number of cycles in most cases. Comparing
// cycles allows comparison of algorithms across the same processor models with different
// frequencies (i7 3.5GHz vs. i7 3.8GHz), which can be useful if the test is being performed
// on lower-frequency/lower-cost versions of the production processors. On the other hand,
// time is a more valuable measurement when comparing the cost/benefit of upgrading
// to a new processor generation.
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second. We use these values
// to convert to the number of elapsed microseconds.
// To guard against loss-of-precision, we convert
// to microseconds *before* dividing by ticks-per-second.
ElapsedTicks.QuadPart *= 1000000;
ElapsedTicks.QuadPart /= Frequency.QuadPart;
std::cout << "Frequency: " << Frequency.QuadPart << std::endl;
std::cout << "Microseconds: " << ElapsedTicks.QuadPart << std::endl;
return 0;
}