How to perform accurate performance measurements on OS X
#include <mach/mach_time.h>

#include <cstdint>   // uint64_t
#include <iostream>
// Shows how to perform accurate performance measurements on modern OS X in application-level code
// on x86 processors.
//
// Performance monitoring on Intel processors is a challenge because the behavior of the internal
// processor counters has changed over the years. Also, processor frequency used to be fixed,
// but with the introduction of SpeedStep technology the processor frequency may change at
// essentially any time. This problem is compounded by multi-core processors that each have
// their own clocks.
//
// If the measurement is restricted to "modern" x86 processors - roughly Pentium 4 and newer - the
// time-stamp counter increments at a constant rate regardless of any adjustments the processor may
// make for power/performance reasons (e.g. due to SpeedStep). From the Intel IA-32 Architectures
// Software Developer's Manual, vol 3B, ch 17:
//
// "... the time-stamp counter increments at a constant rate. That rate may be set by the
// maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
// which the processor is booted. The maximum resolved frequency may differ from the processor base
// frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same
// as the frequency in the brand string.
// The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration
// of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
// changes frequency. This is the architectural behavior moving forward."
//
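// Whether a given processor actually advertises the invariant TSC behavior described above can be
// queried via CPUID leaf 0x80000007 (EDX bit 8). A minimal sketch, assuming an x86 target and a
// GCC/Clang-style toolchain that provides <cpuid.h>; the helper name is ours, not part of any API:
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

static bool HasInvariantTSC() {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) return false;
  return (edx >> 8) & 1;  // EDX bit 8 advertises the invariant TSC
}
#endif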
// So, to do the fast & loose measurement, you could use the RDTSC instruction via assembly or an
// intrinsic, though Intel processors have restrictions and/or configurations for this, like whether
// the instruction may be executed at application-level privilege, or what the processor should do
// when it is executed while a VM is running.
//
// It's safest to simply use the abstraction that OS X provides.
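
// For reference, a minimal sketch of the fast & loose RDTSC approach, assuming an x86 target and a
// GCC/Clang-style toolchain that provides the __rdtsc() intrinsic in <x86intrin.h> (MSVC exposes it
// via <intrin.h> instead). ReadTSC is our name for the helper, not an OS X or Intel API:
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

static inline uint64_t ReadTSC() {
  return __rdtsc();  // raw time-stamp counter ticks, subject to the caveats above
}
#endif
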
int main() {
  uint64_t StartTick, EndTick, ElapsedTicks;

  StartTick = mach_absolute_time();
  // Activity to be timed. Typically a short operation, like a single division or a single
  // sqrt, would be done many times. The number of ticks for a single operation
  // would be found by dividing ElapsedTicks by the number of iterations of the operation.
  // A longer-running operation, like a large matrix multiplication, could be performed
  // only once.
  //
  // A note on calling a single operation multiple times using a loop: a loop gets
  // translated into the instructions for the body plus loop-control instructions,
  // typically a test and a conditional jump (cjump). Modern Intel processors attempt
  // to increase performance by predicting which instruction will execute next after a
  // cjump - a misprediction is one of the more costly things a processor can do. So,
  // to avoid the effect of branch mispredictions on accurate timing, either use the
  // compiler's loop-unrolling feature (and check the disassembly) or manually unroll
  // the loop (see the sketch after the loop below), even using copy/paste if necessary.
  // Or, use a LOT of loop iterations - the cost of the final misprediction will be
  // amortized over the cost of the operation.
  int numStuff = 0;
  for (int i = 0; i < 1000000; ++i) {
    numStuff = 1 - numStuff;
  }
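  // A manually unrolled variant of the loop above might look like the following
  // (a sketch; the unroll factor of 4 is arbitrary, and the loop bound must be a
  // multiple of the factor):
  //
  //   for (int i = 0; i < 1000000; i += 4) {
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //     numStuff = 1 - numStuff;
  //   }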
  EndTick = mach_absolute_time();
  ElapsedTicks = EndTick - StartTick;
  // Number of ticks can be converted to seconds by dividing by the time-stamp counter
  // frequency. Though, for comparing different algorithms/modules/etc., it may be just
  // as useful to compare tick counts as total time, because for a particular
  // microarchitecture, the vast majority of instructions in the vast majority of cases
  // consume a consistent, deterministic number of cycles. Comparing cycles allows
  // comparison of algorithms across processors of the same model at different
  // frequencies (i7 3.5GHz vs. i7 3.8GHz), which can be useful if the test is being
  // performed on lower-frequency/lower-cost versions of the production processors. On
  // the other hand, time is a more valuable measurement when comparing the cost/benefit
  // of upgrading to a new processor generation.
  // Convert to milliseconds. mach_absolute_time() returns ticks in Mach absolute time
  // units, not nanoseconds; multiplying the tick count by the timebase ratio
  // (numer / denom) converts it to nanoseconds, so also divide by one million to get
  // milliseconds.
  mach_timebase_info_data_t s_timebase_info;
  mach_timebase_info(&s_timebase_info);

  static const uint64_t kOneMillion = 1000 * 1000;
  uint64_t milliseconds = (ElapsedTicks * s_timebase_info.numer) / (kOneMillion * s_timebase_info.denom);
std::cout << "Number of ticks: " << ElapsedTicks << std::endl;
std::cout << "Milliseconds: " << milliseconds << std::endl;
}
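
// To build and run (assuming clang++ on OS X; the file name "measure.cpp" and the flags are
// illustrative):
//
//   clang++ -std=c++11 -O2 measure.cpp -o measure
//   ./measure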