Skip to content

Instantly share code, notes, and snippets.

@mrange
Last active October 3, 2016 09:54
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mrange/adc7a330632de20c236829f4bd2d89bc to your computer and use it in GitHub Desktop.
Save mrange/adc7a330632de20c236829f4bd2d89bc to your computer and use it in GitHub Desktop.
Comparison of different data pipelines in F#

In F# there are many options for creating data pipelines, for example: List, Seq and Array.

What data pipeline is preferable from memory usage and performance perspective?

In order to answer this we'll compare performance and memory usage using different pipelines.

Data Pipeline

In order to measure the overhead, we will use a data pipeline with low cpu-cost per items processed:

    let seqTest n =
      Seq.init (n + 1) id
      |> Seq.map    int64
      |> Seq.filter (fun v -> v % 2L = 0L)
      |> Seq.map    ((+) 1L)
      |> Seq.sum

We will create equivalent pipelines for all alternatives and compare them.

We will variate the size of n but let the total number of work be the same.

Data Pipeline Alternatives

We will compare the following alternatives:

  1. Imperative code
  2. Array (Non-lazy)
  3. List (Non-lazy)
  4. LINQ (Lazy pull stream)
  5. Seq (Lazy pull stream)
  6. Nessos (Lazy pull/push stream)
  7. PullStream (Simplistic pull stream)
  8. PushStream (Simplistic push stream)

Although not a data pipeline we will compare against Imperative code since that most closely match how the CPU executes code. That should be that fastest possible way to compute the result allowing us to measure the performance overhead of data pipelines.

Array and List compute a full Array/List in each step so we expect memory overhead.

LINQ and Seq are both based around IEnumerable<'T> which is lazy pull stream (pull means that the consumer stream is pulling data out of the producer stream). We therefore expect the performance and memory usage to be identical.

Nessos is a high-performance stream library that supports both push & pull (like Java Stream).

PullStream and PushStream are simplistic implementations of Pull & Push streams.

Performance Results from running on: F# 4.0 - .NET 4.6.1 - x64

Performance Results from running on: F# 4.0 - .NET 4.6.1 - x64

The bars show the elapsed time, lower is better. The total amount of useful work is the same for all tests so the results are comparable. This also means that few runs implies larger datasets.

As usual when Measuring one see interesting results.

  1. List performance poor is compared to other alternatives for large data sets. This can be because of GC or poor cache locality.
  2. Array performance better than expected.
  3. LINQ performs better than Seq, this is unexpected because both are based around IEnumerable<'T>. However, Seq internally is based around a generic impementation for all algorithms while LINQ uses specialized algorithms.
  4. Push performs better than Pull. This is expected since the push data pipeline has fewer checks
  5. The simplistic Push data pipelines performs comparable to Nessos. However, Nessos supports pull and parallelism.
  6. For small data pipelines the performance of Nessos degrades possible because pipelines setup overhead.
  7. As expected the Imperative code performed the best

GC Collection count from running on: F# 4.0 - .NET 4.6.1 - x64

GC Collection count from running on: F# 4.0 - .NET 4.6.1 - x64

The bars shows the total number of GC collection counts during the test, lower is better. This is a measurement of how many objects are created by the data pipeline.

As usual when Measuring one see interesting results.

  1. List is expectedly creating more objects than Array because a List is essentially a single linked list of nodes. An array is a continous memory area.
  2. Looking at the underlying numbers both List & Array forces 2 generation collections. These kind of collection are expensive.
  3. Seq is triggering a surprising amount of collections. It's surprisingly even worse than List in this regard.
  4. LINQ, Nessos, Push and Pull triggers no collections for few runs. However, objects are allocated so the GC eventually will have to run.
  5. As expected since the Imperative code allocate no objects no GC collections were triggered.

Conclusion

All data pipelines do the same amount of useful work in all test cases but we see significant differences in performance and memory usage between the different pipelines.

In addition, we notice that the overhead of data pipelines differ depending on the size of data processed. For example, for small sizes Array is performing quite well.

One should keep in mind the amount of work performed in each step in the pipeline is very small in order to measure the overhead. In "real" situations the overhead of Seq might not matter because the actual work is more time consuming.

Of more concern is the memory usage differences. GC isn't free and it is beneficial for long running applications to keep GC pressure down.

For F# developers concerned about performance and memory usage it's recommended to check out Nessos Streams.

If you need top-notch performance strategically placed Imperative code is worth considering.

Finally, when it comes to performance don't make assumptions. Measure and Verify.

Full source code:

    module PushStream =
      type Receiver<'T> = 'T -> bool
      type Stream<'T>   = Receiver<'T> -> unit

      let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
        fun r -> s (fun v -> if f v then r v else true)

      let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
        fun r -> s (fun v -> r (m v))

      let inline range b e : Stream<int> =
        fun r ->
          let rec loop i = if i <= e && r i then loop (i + 1)
          loop b

      let inline sum (s : Stream<'T>) : 'T =
        let mutable state = LanguagePrimitives.GenericZero<'T>
        s (fun v -> state <- state + v; true)
        state

    module PullStream =

      [<Struct>]
      [<NoComparison>]
      [<NoEqualityAttribute>]
      type Maybe<'T>(v : 'T, hasValue : bool) =
        member    x.Value        = v
        member    x.HasValue     = hasValue
        override  x.ToString ()  =
          if hasValue then
            sprintf "Just %A" v
          else
            "Nothing"

      let Nothing<'T>     = Maybe<'T> (Unchecked.defaultof<'T>, false)
      let inline Just v   = Maybe<'T> (v, true)

      type Iterator<'T> = unit -> Maybe<'T>
      type Stream<'T>   = unit -> Iterator<'T>

      let filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
        fun () ->
          let i = s ()
          let rec pop () =
            let mv = i ()
            if mv.HasValue then
              let v = mv.Value
              if f v then Just v else pop ()
            else
              Nothing
          pop

      let map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
        fun () ->
          let i = s ()
          let pop () =
            let mv = i ()
            if mv.HasValue then
              Just (m mv.Value)
            else
              Nothing
          pop

      let range b e : Stream<int> =
        fun () ->
          let mutable i = b
          fun () ->
            if i <= e then
              let p = i
              i <- i + 1
              Just p
            else
              Nothing

      let inline sum (s : Stream<'T>) : 'T =
        let i = s ()
        let rec loop state =
          let mv = i ()
          if mv.HasValue then
            loop (state + mv.Value)
          else
            state
        loop LanguagePrimitives.GenericZero<'T>

    module PerfTest =

      open System.Linq
    #if USE_NESSOS
      open Nessos.Streams
    #endif

      let now =
        let sw = System.Diagnostics.Stopwatch ()
        sw.Start ()
        fun () -> sw.ElapsedMilliseconds

      let time n a =
        let inline cc i       = System.GC.CollectionCount i

        let v                 = a ()

        System.GC.Collect (2, System.GCCollectionMode.Forced, true)

        let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
        let b                 = now ()

        for i in 1..n do
          a () |> ignore

        let e = now ()
        let ecc0, ecc1, ecc2  = cc 0, cc 1, cc 2

        v, (e - b), ecc0 - bcc0, ecc1 - bcc1, ecc2 - bcc2

      let arrayTest n =
        Array.init (n + 1) id
        |> Array.map    int64
        |> Array.filter (fun v -> v % 2L = 0L)
        |> Array.map    ((+) 1L)
        |> Array.sum

      let imperativeTest n =
        let rec loop s i =
          if i >= 0L then
            if i % 2L = 0L then
              loop (s + i + 1L) (i - 1L)
            else
              loop s (i - 1L)
          else
            s
        loop 0L (int64 n)

      let linqTest n =
        (((Enumerable.Range(0, n + 1)).Select int64).Where(fun v -> v % 2L = 0L)).Select((+) 1L).Sum()

      let listTest n =
        List.init (n + 1) id
        |> List.map     int64
        |> List.filter  (fun v -> v % 2L = 0L)
        |> List.map     ((+) 1L)
        |> List.sum

    #if USE_NESSOS
      let nessosTest n =
        Stream.initInfinite id
        |> Stream.take    (n + 1)
        |> Stream.map     int64
        |> Stream.filter  (fun v -> v % 2L = 0L)
        |> Stream.map     ((+) 1L)
        |> Stream.sum
    #endif

      let pullTest n =
        PullStream.range 0 n
        |> PullStream.map     int64
        |> PullStream.filter  (fun v -> v % 2L = 0L)
        |> PullStream.map     ((+) 1L)
        |> PullStream.sum

      let pushTest n =
        PushStream.range 0 n
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> v % 2L = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

      let seqTest n =
        Seq.init (n + 1) id
        |> Seq.map    int64
        |> Seq.filter (fun v -> v % 2L = 0L)
        |> Seq.map    ((+) 1L)
        |> Seq.sum

      let perfTest (path : string) =
        let testCases =
          [|
            "array"       , arrayTest       
            "imperative"  , imperativeTest  
            "linq"        , linqTest        
            "list"        , listTest        
            "seq"         , seqTest         
    #if USE_NESSOS
            "nessos"      , nessosTest      
    #endif
            "pull"        , pullTest        
            "push"        , pushTest        
          |]
        use out                   = new System.IO.StreamWriter (path)
        let write (msg : string)  = out.WriteLine msg
        let writef fmt            = FSharp.Core.Printf.kprintf write fmt

        write "Name\tTotal\tOuter\tInner\tElapsed\tCC0\tCC1\tCC2\tResult"

        let total   = 10000000
        let outers = [| 10; 1000; 1000000 |]
        for outer in outers do
          let inner = total / outer
          for name, a in testCases do
            printfn "Running %s with total=%d, outer=%d, inner=%d ..." name total outer inner
            let v, ms, cc0, cc1, cc2 = time outer (fun () -> a inner)
            printfn "  ... %d ms, cc0=%d, cc1=%d, cc2=%d, result=%A" ms cc0 cc1 cc2 v
            writef "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d" name total outer inner ms cc0 cc1 cc2 v

    [<EntryPoint>]
    let main argv =
      System.Environment.CurrentDirectory <- System.AppDomain.CurrentDomain.BaseDirectory
      PerfTest.perfTest "perf.tsv"
      0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment