Created March 10, 2017 08:10
Gist: avitevet/3b43e3ff91980787b2c45a9120241ee1
How to perform accurate & high-resolution performance measurement in "recent" Windows on "recent" x86 processors
#include <Windows.h>
#include <iostream>
// Shows how to perform accurate performance measurements on modern Windows in application-level code
// on x86 processors.
//
// Mostly from https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx.
// QueryPerformanceCounter is a Windows abstraction over a hardware tick counter such as the
// processor time-stamp counter (TSC), which starts at zero when the processor is reset and
// increments monotonically each cycle.
//
// Performance monitoring on Intel processors is a challenge because the behavior of the internal
// processor counters has changed over the years. Also, processor frequency used to be fixed,
// but with the introduction of SpeedStep technology the processor frequency may change at
// essentially any time. This problem is compounded by multi-core processors, where each core
// has its own clock.
//
// If the measurement is restricted to "modern" x86 processors (roughly Pentium 4 and newer),
// the time-stamp counter increments at a constant rate regardless of any adjustments the
// processor may make for power/performance reasons (i.e., due to SpeedStep). From the Intel
// IA-32 Architectures SW Developer's System Programming Manual, vol. 3B, ch. 17:
//
// "... the time-stamp counter increments at a constant rate. That rate may be set by the
// maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
// which the processor is booted. The maximum resolved frequency may differ from the processor base
// frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same
// as the frequency in the brand string.
// The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration
// of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
// changes frequency. This is the architectural behavior moving forward."
//
// So, to do the fast & loose measurement, you could use the RDTSC instruction via assembly or
// an intrinsic, though Intel processors have restrictions and/or configurations for this, like
// whether it's allowed to call this with application-level privilege or what the processor
// should do when this is called while a VM is running.
//
// It's safest to simply use the abstraction that Windows provides.
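//
// For comparison, here is a sketch (kept as a comment so this file builds unchanged) of how
// the raw counter could be read with the __rdtsc compiler intrinsic, which MSVC declares in
// <intrin.h>. The _mm_lfence() is one common way to keep earlier instructions from being
// reordered past the counter read; the caveats above (privilege level, VM behavior) still apply.
//
//   #include <intrin.h>
//
//   unsigned long long ReadTscSerialized() {
//       _mm_lfence();      // order the read relative to preceding instructions
//       return __rdtsc();  // read the 64-bit time-stamp counter
//   }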
int main() {
  LARGE_INTEGER StartTick, EndTick, ElapsedTicks;
  QueryPerformanceCounter(&StartTick);
  // Activity to be timed. Typically a short operation, like a single division or a single
  // sqrt, would be done many times. The number of ticks for a single operation
  // would then be found by dividing ElapsedTicks by the number of iterations.
  // A longer-running operation, like a large matrix multiplication, could be performed
  // only once.
  //
  // A note on calling a single operation multiple times using a loop: a loop gets
  // translated into instructions for the body plus a test and conditional jump (cjump).
  // Modern Intel processors attempt to increase performance by predicting which
  // instruction will execute next after a cjump; a misprediction is one of the more
  // costly things a processor can do. So, to avoid the effect of branch
  // mispredictions on accurate timing, either use the compiler's loop-unrolling
  // feature (and check the disassembly) or manually unroll the loop, even using
  // copy/paste if necessary. Or, use a LOT of loop iterations, so the cost of the
  // final misprediction is amortized over the total cost of the operations.
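  //
  // For example, a 4x manually unrolled version of the loop below would look like this
  // (comment sketch only; the iteration count is assumed divisible by 4), paying one
  // cjump per four operations instead of one per operation:
  //
  //   for (int i = 0; i < 1000000; i += 4) {
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //   }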
  // volatile prevents the optimizer from deleting the otherwise-unobserved loop entirely.
  volatile int numStuff = 0;
  for (int i = 0; i < 1000000; ++i) {
    numStuff = 1 - numStuff;
  }
  QueryPerformanceCounter(&EndTick);
  ElapsedTicks.QuadPart = EndTick.QuadPart - StartTick.QuadPart;
  std::cout << "Number of ticks: " << ElapsedTicks.QuadPart << std::endl;
  // The number of ticks can be converted to seconds by dividing by the counter frequency.
  // Though, in terms of comparing different algorithms/modules/etc., it may be just as useful
  // to compare tick counts as total time, because for a particular microarchitecture,
  // the vast majority of instructions in the vast majority of cases consume a consistent,
  // deterministic number of cycles. Comparing cycles allows comparison of algorithms across
  // the same processor models with different frequencies (i7 3.5GHz vs. i7 3.8GHz), which can
  // be useful if the test is being performed on lower-frequency/lower-cost versions of the
  // production processors. On the other hand, time is a more valuable measurement when
  // comparing the cost/benefit of upgrading to a new processor generation.
  LARGE_INTEGER Frequency;
  QueryPerformanceFrequency(&Frequency);
  //
  // We now have the elapsed number of ticks, along with the
  // number of ticks-per-second. We use these values
  // to convert to the number of elapsed microseconds.
  // To guard against loss-of-precision, we convert
  // to microseconds *before* dividing by ticks-per-second.
  //
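  // As a small worked example of why the order matters (the frequency value is assumed for
  // illustration; on many recent Windows systems QueryPerformanceFrequency reports
  // 10,000,000, i.e. 0.1 us per tick): with ElapsedTicks = 35 and Frequency = 10,000,000,
  // dividing first gives 35 / 10,000,000 = 0 in integer math, so 0 * 1,000,000 = 0 us;
  // multiplying first gives 35 * 1,000,000 / 10,000,000 = 3 us (truncated from 3.5).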
  ElapsedTicks.QuadPart *= 1000000;
  ElapsedTicks.QuadPart /= Frequency.QuadPart;
  std::cout << "Frequency: " << Frequency.QuadPart << std::endl;
  std::cout << "Microseconds: " << ElapsedTicks.QuadPart << std::endl;
}