Skip to content

Instantly share code, notes, and snippets.

@mrange
Last active October 3, 2016 09:37
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mrange/49e6010f42761c6a22ba1ca989940dfb to your computer and use it in GitHub Desktop.
Save mrange/49e6010f42761c6a22ba1ca989940dfb to your computer and use it in GitHub Desktop.
Measure and Verify your performance assumptions

This example is written with F# in mind but the ideas are applicable in all environments

The first rule when optimizing for performance is to not to rely assumption; always Measure and Verify your assumptions.

As we are not writing machine code directly it is hard to predict how the compiler and JIT:er transform your program to machine code. That's why we need to Measure the execution time to see that we get the performance improvement we expect and Verify that the actual program doesn't contain any hidden overhead.

Verification is the quite simple process that involves reverse engineering the executable using for example tools like ILSpy. The JIT:er complicates Verification in that seeing the actual machine code is tricky but doable. However, usually examining the IL-code gives the big gains.

The harder problem is Measuring; harder because it's tricky to setup realistic situations that allows to measure improvements in code. Still Measuring is invaluable.

Analyzing simple F# functions

Let's examine some simple F# functions that accumulates all integers in 1..n written in various different ways. As the range is a simple Arithmetic Series the result can be computed directly but for the purpose of this example we iterate over the range.

First we define some useful functions for measuring the time a function takes:

    // now () returns current time in milliseconds since start
    let now : unit -> int64 =
      let sw = System.Diagnostics.Stopwatch ()
      sw.Start ()
      fun () -> sw.ElapsedMilliseconds

    // time estimates the time 'action' repeated a number of times
    let time repeat action : int64*'T =
      let v = action ()  // Warm-up and compute value

      let b = now ()
      for i = 1 to repeat do
        action () |> ignore
      let e = now ()

      e - b, v

time runs an action repeatedly we need to run the tests for a few hundred milliseconds to reduce variance.

Then we define a few functions that accumulates all integers in 1..n in different ways.

    // Accumulates all integers 1..n using 'List'
    let accumulateUsingList n =
      List.init (n + 1) id
      |> List.sum

    // Accumulates all integers 1..n using 'Seq'
    let accumulateUsingSeq n =
      Seq.init (n + 1) id
      |> Seq.sum

    // Accumulates all integers 1..n using 'for-expression'
    let accumulateUsingFor n =
      let mutable sum = 0
      for i = 1 to n do
        sum <- sum + i
      sum

    // Accumulates all integers 1..n using 'foreach-expression' over range
    let accumulateUsingForEach n =
      let mutable sum = 0
      for i in 1..n do
        sum <- sum + i
      sum

    // Accumulates all integers 1..n using 'foreach-expression' over list range
    let accumulateUsingForEachOverList n =
      let mutable sum = 0
      for i in [1..n] do
        sum <- sum + i
      sum

    // Accumulates every second integer 1..n using 'foreach-expression' over range
    let accumulateUsingForEachStep2 n =
      let mutable sum = 0
      for i in 1..2..n do
        sum <- sum + i
      sum

    // Accumulates all 64 bit integers 1..n using 'foreach-expression' over range
    let accumulateUsingForEach64 n =
      let mutable sum = 0L
      for i in 1L..int64 n do
        sum <- sum + i
      sum |> int

    // Accumulates all integers n..1 using 'for-expression' in reverse order
    let accumulateUsingReverseFor n =
      let mutable sum = 0
      for i = n downto 1 do
        sum <- sum + i
      sum

    // Accumulates all 64 integers n..1 using 'tail-recursion' in reverse order
    let accumulateUsingReverseTailRecursion n =
      let rec loop sum i =
        if i > 0 then
          loop (sum + i) (i - 1)
        else
          sum
      loop 0 n

We assume the result to be the same (except for one of the functions that uses increment of 2) but is there difference in performance. To Measure this the following function is defined:

    let testRun (path : string) =
      use testResult = new System.IO.StreamWriter (path)
      let write   (l : string)  = testResult.WriteLine l
      let writef  fmt           = FSharp.Core.Printf.kprintf write fmt

      write "Name\tTotal\tOuter\tInner\tCC0\tCC1\tCC2\tTime\tResult"

      // total is the total number of iterations being executed
      let total   = 10000000
      // outers let us variate the relation between the inner and outer loop
      //  this is often useful when the algorithm allocates different amount of memory
      //  depending on the input size. This can affect cache locality
      let outers  = [| 1000; 10000; 100000 |]
      for outer in outers do
        let inner = total / outer

        // multiplier is used to increase resolution of certain tests that are significantly
        //  faster than the slower ones

        let testCases =
          [|
        //   Name of test                         multiplier    action
            "List"                              , 1           , accumulateUsingList
            "Seq"                               , 1           , accumulateUsingSeq
            "for-expression"                    , 100         , accumulateUsingFor
            "foreach-expression"                , 100         , accumulateUsingForEach
            "foreach-expression over List"      , 1           , accumulateUsingForEachOverList
            "foreach-expression increment of 2" , 1           , accumulateUsingForEachStep2
            "foreach-expression over 64 bit"    , 1           , accumulateUsingForEach64
            "reverse for-expression"            , 100         , accumulateUsingReverseFor
            "reverse tail-recursion"            , 100         , accumulateUsingReverseTailRecursion
          |]
        for name, multiplier, a in testCases do
          System.GC.Collect (2, System.GCCollectionMode.Forced, true)
          let cc g = System.GC.CollectionCount g

          printfn "Accumulate using %s with outer=%d and inner=%d ..." name outer inner

          // Collect collection counters before test run
          let pcc0, pcc1, pcc2 = cc 0, cc 1, cc 2

          let ms, result       = time (outer*multiplier) (fun () -> a inner)
          let ms               = (float ms / float multiplier)

          // Collect collection counters after test run
          let acc0, acc1, acc2 = cc 0, cc 1, cc 2
          let cc0, cc1, cc2    = acc0 - pcc0, acc1 - pcc1, acc1 - pcc1
          printfn "  ... took: %f ms, GC collection count %d,%d,%d and produced %A" ms cc0 cc1 cc2 result

          writef "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%f\t%d" name total outer inner cc0 cc1 cc2 ms result

The test result while running on .NET 4.5.2 x64:

Test results on .NET 4.5.2 x64

We see dramatic difference and some of the results are unexpectedly bad.

Let's look at the bad cases:

List

    // Accumulates all integers 1..n using 'List'
    let accumulateUsingList n =
      List.init (n + 1) id
      |> List.sum

What happens here is a full list containing all integers 1..n is created and reduced using a sum. This should be more expensive than just iterating and accumulating over the range, it seems about ~42x slower than the for loop.

In addition, we can see that the GC ran about 100x during the test run because the code allocated a lot of objects. This also costs CPU.

Seq

    // Accumulates all integers 1..n using 'Seq'
    let accumulateUsingSeq n =
      Seq.init (n + 1) id
      |> Seq.sum

The Seq version doesn't allocate a full List so it's a bit suprising that this ~270x slower than the for loop. In addition, we see that the GC has executed 661x.

Seq is inefficient when the amount of work per item is very small (in this case aggregating two integers).

The point is not to never use Seq. The point is to Measure.

foreach-expression over List

    // Accumulates all integers 1..n using 'foreach-expression' over range
    let accumulateUsingForEach n =
      let mutable sum = 0
      for i in 1..n do
        sum <- sum + i
      sum

    // Accumulates all integers 1..n using 'foreach-expression' over list range
    let accumulateUsingForEachOverList n =
      let mutable sum = 0
      for i in [1..n] do
        sum <- sum + i
      sum

The difference between these two function is very subtle but the performance difference is not, roughly ~76x. Why? Let's reverse engineer the bad code:

    public static int accumulateUsingForEach(int n)
    {
      int sum = 0;
      int i = 1;
      if (n >= i)
      {
        do
        {
          sum += i;
          i++;
        }
        while (i != n + 1);
      }
      return sum;
    }

    public static int accumulateUsingForEachOverList(int n)
    {
      int sum = 0;
      FSharpList<int> fSharpList = SeqModule.ToList<int>(Operators.CreateSequence<int>(Operators.OperatorIntrinsics.RangeInt32(1, 1, n)));
      for (FSharpList<int> tailOrNull = fSharpList.TailOrNull; tailOrNull != null; tailOrNull = fSharpList.TailOrNull)
      {
        int i = fSharpList.HeadOrDefault;
        sum += i;
        fSharpList = tailOrNull;
      }
      return sum;
    }

accumulateUsingForEach is implemented as an efficient while loop but for i in [1..n] is converted into:

    FSharpList<int> fSharpList =
      SeqModule.ToList<int>(
        Operators.CreateSequence<int>(
          Operators.OperatorIntrinsics.RangeInt32(1, 1, n)));

This means first we create a Seq over 1..n and finally calls ToList.

Expensive.

foreach-expression increment of 2

    // Accumulates all integers 1..n using 'foreach-expression' over range
    let accumulateUsingForEach n =
      let mutable sum = 0
      for i in 1..n do
        sum <- sum + i
      sum

    // Accumulates every second integer 1..n using 'foreach-expression' over range
    let accumulateUsingForEachStep2 n =
      let mutable sum = 0
      for i in 1..2..n do
        sum <- sum + i
      sum

Once again the difference between these two functions are subtle but the performance difference is brutal: ~25x

Once again let's run ILSpy:

    public static int accumulateUsingForEachStep2(int n)
    {
      int sum = 0;
      IEnumerable<int> enumerable = Operators.OperatorIntrinsics.RangeInt32(1, 2, n);
      foreach (int i in enumerable)
      {
        sum += i;
      }
      return sum;
    }

A Seq is created over 1..2..n and then we iterate over Seq using the enumerator.

We were expecting F# to create something like this:

    public static int accumulateUsingForEachStep2(int n)
    {
      int sum = 0;
      for (int i = 1; i < n; i += 2)
      {
        sum += i;
      }
      return sum;
    }

However, F# compiler only supports efficient for loops over int32 ranges that increment by one. For all other cases it falls back on Operators.OperatorIntrinsics.RangeInt32. Which will explain the next suprising result

foreach-expression over 64 bit

    // Accumulates all 64 bit integers 1..n using 'foreach-expression' over range
    let accumulateUsingForEach64 n =
      let mutable sum = 0L
      for i in 1L..int64 n do
        sum <- sum + i
      sum |> int

This performs ~47x slower than the for loop, the only difference is that we iterate over 64 bit integers. ILSpy shows us why:

    public static int accumulateUsingForEach64(int n)
    {
      long sum = 0L;
      IEnumerable<long> enumerable = Operators.OperatorIntrinsics.RangeInt64(1L, 1L, (long)n);
      foreach (long i in enumerable)
      {
        sum += i;
      }
      return (int)sum;
    }

F# only supports efficient for loops for int32 numbers it has to use the fallback Operators.OperatorIntrinsics.RangeInt64.

The other cases performs roughly similar:

Test results on .NET 4.5.2 x64

The reason the performance degrades for larger test runs is that the overhead of invoking the action is growing as we doing less and less work in action.

Looping towards 0 can sometimes give performance benefits as it might save a CPU register but in this case the CPU has registers to spare so it doesn't seem to make a difference.

Conclusion

Measuring is important because otherwise we might think all these alternatives are equivalent but some alternatives are ~270x slower than others.

The Verification step involves reverse engineering the executable helps us explain why we did or did not get performance we expected. In addition, Verification can help us predict performance in the cases it's too difficult to do a proper Measurement.

It's hard to predict performance there always Measure, always Verify your performance assumptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment