How to perform accurate performance measurements on OS X
#include <mach/mach_time.h>

#include <cstdint>   // uint64_t
#include <iostream>
// Shows how to perform accurate performance measurements on modern OS X in application-level code
// on x86 processors.
//
// Performance monitoring on Intel processors is a challenge because the behavior of the internal
// processor counters has changed over the years. Also, processor frequency used to be fixed,
// but with the introduction of SpeedStep technology the processor frequency may change at
// essentially any time. This problem is compounded by multi-core processors that each have
// their own clocks.
//
// If the measurement is restricted to "modern" x86 processors - roughly Pentium 4 and newer - the
// time-stamp counter increments at a constant rate regardless of any adjustments the processor may
// make for power/performance reasons (e.g. due to SpeedStep). From the Intel IA-32 Architectures
// Software Developer's Manual, vol 3B, ch 17:
//
// "... the time-stamp counter increments at a constant rate. That rate may be set by the
// maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
// which the processor is booted. The maximum resolved frequency may differ from the processor base
// frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same
// as the frequency in the brand string.
// The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration
// of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
// changes frequency. This is the architectural behavior moving forward."
//
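// Whether a given processor actually advertises the invariant TSC behavior described above can be
// queried via CPUID leaf 0x80000007 (EDX bit 8). A minimal sketch, assuming an x86 target and a
// GCC/Clang-style toolchain that provides <cpuid.h>; the helper name is ours, not part of any API:
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

static bool HasInvariantTSC() {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) return false;
  return (edx >> 8) & 1;  // EDX bit 8 advertises the invariant TSC
}
#endif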
// So, to do the fast & loose measurement, you could use the RDTSC instruction via assembly or an
// intrinsic, though Intel processors have restrictions and/or configurations for this, like whether
// the instruction may be executed at application-level privilege, or what the processor should do
// when it is executed while a VM is running.
//
// It's safest to simply use the abstraction that OS X provides.
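
// For reference, a minimal sketch of the fast & loose RDTSC approach, assuming an x86 target and a
// GCC/Clang-style toolchain that provides the __rdtsc() intrinsic in <x86intrin.h> (MSVC exposes it
// via <intrin.h> instead). ReadTSC is our name for the helper, not an OS X or Intel API:
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

static inline uint64_t ReadTSC() {
  return __rdtsc();  // raw time-stamp counter ticks, subject to the caveats above
}
#endif
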
int main() {
  uint64_t StartTick, EndTick, ElapsedTicks;

  StartTick = mach_absolute_time();
  // Activity to be timed. Typically a short operation, like a single division or a single
  // sqrt, would be done many times. The number of ticks for a single operation
  // would be found by dividing ElapsedTicks by the number of iterations of the operation.
  // A longer-running operation, like a large matrix multiplication, could be performed
  // only once.
  //
  // A note on calling a single operation multiple times using a loop: a loop gets
  // translated into the instructions for the body plus loop-control instructions,
  // typically a test and a conditional jump (cjump). Modern Intel processors attempt
  // to increase performance by predicting which instruction will execute next after a
  // cjump - a misprediction is one of the more costly things a processor can do. So,
  // to avoid the effect of branch mispredictions on accurate timing, either use the
  // compiler's loop-unrolling feature (and check the disassembly) or manually unroll
  // the loop (see the sketch after the loop below), even using copy/paste if necessary.
  // Or, use a LOT of loop iterations - the cost of the final misprediction will be
  // amortized over the cost of the operation.
  int numStuff = 0;
  for (int i = 0; i < 1000000; ++i) {
    numStuff = 1 - numStuff;
  }
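  // A manually unrolled variant of the loop above might look like the following
  // (a sketch; the unroll factor of 4 is arbitrary, and the loop bound must be a
  // multiple of the factor):
  //
  //   for (int i = 0; i < 1000000; i += 4) {
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //   }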
  EndTick = mach_absolute_time();
  ElapsedTicks = EndTick - StartTick;
  // Number of ticks can be converted to seconds by dividing by the time-stamp counter
  // frequency. Though, for comparing different algorithms/modules/etc., it may be just
  // as useful to compare tick counts as total time, because for a particular
  // microarchitecture, the vast majority of instructions in the vast majority of cases
  // consume a consistent, deterministic number of cycles. Comparing cycles allows
  // comparison of algorithms across processors of the same model at different
  // frequencies (i7 3.5GHz vs. i7 3.8GHz), which can be useful if the test is being
  // performed on lower-frequency/lower-cost versions of the production processors. On
  // the other hand, time is a more valuable measurement when comparing the cost/benefit
  // of upgrading to a new processor generation.
  // Convert to milliseconds. mach_absolute_time() returns ticks in Mach absolute time
  // units, not nanoseconds; multiplying the tick count by the timebase ratio
  // (numer / denom) converts it to nanoseconds, so also divide by one million to get
  // milliseconds.
  mach_timebase_info_data_t s_timebase_info;
  mach_timebase_info(&s_timebase_info);

  static const uint64_t kOneMillion = 1000 * 1000;
  uint64_t milliseconds = (ElapsedTicks * s_timebase_info.numer) / (kOneMillion * s_timebase_info.denom);
std::cout << "Number of ticks: " << ElapsedTicks << std::endl;
std::cout << "Milliseconds: " << milliseconds << std::endl;
}
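
// To build and run (assuming clang++ on OS X; the file name "measure.cpp" and the flags are
// illustrative):
//
//   clang++ -std=c++11 -O2 measure.cpp -o measure
//   ./measure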