@mrange
Last active June 11, 2020 21:21
On the cost of being lazy

Full source code can be found here

Changelog

  1. 2017-01-04 - New performance test: Paul Westcott (@manofstick) provided his implementation of Lazy semantics (called Lazzzy), which I included in the graphs. This implementation is intended to be equivalent to Lazy<'T> in .NET with regards to functionality.
  2. 2017-01-17 - New performance test: Paul Westcott (@manofstick) is working on a PR for .NET Core. I wanted to include performance numbers for the updated Lazy<'T> class.

Previously I have looked at performance of different persistent hash maps and data pipelines in .NET. Now I wonder:

What is the cost of being lazy?

In F# we can define a lazy value like this:

let x = lazy expensiveCalculation ()  // Delay the expensive computation
let y = x.Value                       // Forces the computation and returns the cached result
let z = x.Value                       // Returns the cached result

What is the CPU cost of this delayed computation and how does lazy compare with alternatives?

To test it I have created a simple algorithm that lets me control how often the lazy value is recreated, so that we can see how the relative cost of being lazy depends on that ratio. The delayed computation is very cheap (it returns an integer) since we want to measure the cost of being lazy, not the cost of the computation.

let inline delay i              = lazy i
let inline value (l : Lazy<_>)  = l.Value


let createTestCases count ratio =
  let rec simpleLoop l r s i =
    if i < count then
      let r = r + ratio
      if r >= 1. then
        let r = r - 1.
        let l = delay i // Reset the lazy value to force recomputation
        simpleLoop l r (s + value l) (i + 1)
      else
        simpleLoop l r (s + value l) (i + 1) // Use the cached value
    else
      s

  let simple () =
    simpleLoop (delay 0) 0. 0 0

  simple

I vary the ratio from 0% (only uses the cached value) to 100% (performs the computation every time) in steps of 20%.

These are the alternatives I selected:

  1. no lazy - The control; the computation is done every time. As the computation is extremely cheap we expect this to be the fastest, and we will use it as the baseline against which to compare the cost of being lazy.
  2. lazy - Uses F# lazy
  3. Lazy (Execution & Publication) - F# internally uses .NET Lazy<'T> with Execution & Publication protection so we expect the performance to be identical.
  4. Lazy (Publication) - Same as above, but here we only use publication protection. That means the computation might be carried out by several threads but only one of the results is cached. There are subtleties I haven't dug into yet regarding whether all threads will see the same instance or not.
  5. Lazy (None) - No thread safety, if multiple threads are dereferencing the lazy value the behavior is undefined.
  6. Lazzzy (Execution & Publication) - Paul Westcott (@manofstick) provided his implementation of Lazy semantics (called Lazzzy).
  7. Lazzzy (Publication) - -"-
  8. Lazzzy (None) - -"-
  9. NewLazy (Execution & Publication) - Paul Westcott (@manofstick) is working on a PR for .NET Core. I wanted to include performance numbers for the updated Lazy<'T> class.
  10. NewLazy (Publication) - -"-
  11. NewLazy (None) - -"-
  12. Flag (Trivial) - A lazy implementation using a simple unprotected flag to decide whether a value is computed or not. Doesn't cache exceptions.
  13. Flag (Compact) - A lazy implementation letting the value act as both value and flag. Doesn't cache exceptions.
  14. Flag (Exception aware) - A lazy implementation using a simple unprotected flag to decide whether a value is computed or not. Does cache exceptions.
  15. Flag (Protected) - A lazy implementation that uses a simple publication protection scheme. Unfortunately so simple that different threads might see different instances. Doesn't cache exceptions.
  16. Flag (Full protection) - A lazy implementation that uses Monitor to achieve Execution & Publication protection. Doesn't cache exceptions.
  17. Flag (Full protection w. DC) - A lazy implementation that uses Monitor to achieve Execution & Publication protection and Double Check pattern to reduce cost of reading a cached value. Doesn't cache exceptions.
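
To make the flag-based variants concrete, here is a minimal Python sketch of the Flag (Full protection w. DC) idea: a flag plus a lock for Execution & Publication protection, with an unsynchronized fast path once the value is published (the "double check"). This is an illustrative analogue, not the F# benchmark code; note that in .NET the unsynchronized read would additionally need volatile/memory-barrier care, which CPython's GIL papers over.

```python
import threading

class DoubleCheckedLazy:
    """Illustrative analogue of 'Flag (Full protection w. DC)': a flag plus a
    lock for Execution & Publication protection, with an unsynchronized fast
    path once the value is published. Does not cache exceptions."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._has_value = False
        self._value = None

    @property
    def value(self):
        # First check: cheap read outside the lock (the "double check").
        if self._has_value:
            return self._value
        with self._lock:
            # Second check: another thread may have created the value while
            # we were waiting for the lock.
            if not self._has_value:
                self._value = self._factory()
                self._has_value = True  # publish only after the value is set
                self._factory = None    # let the closure be collected
            return self._value

calls = []
lazy_int = DoubleCheckedLazy(lambda: calls.append(1) or 42)
assert lazy_int.value == 42  # forces the computation
assert lazy_int.value == 42  # cached; factory not invoked again
assert calls == [1]
```

The fast path is what makes the double-check pattern attractive: a dereference of an already-computed value skips the lock entirely, which is why Flag (Full protection w. DC) reads cheaply at a 0% ratio while plain Flag (Full protection) pays the lock on every access.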

Performance in Milliseconds - F# 4, .NET 4.6.2, x64

As expected no lazy is the fastest overall (because the computation is very cheap).

At 0% all lazy implementations except Flag (Compact) and Flag (Full protection) have similar performance. Flag (Compact) does a type check each time it is dereferenced; this is the cost we see. Flag (Full protection) does a Monitor.Enter/Monitor.Exit when it's dereferenced, which is costly compared to the other alternatives.

For all lazy implementations except Flag (Full protection) the overhead seems to be roughly linear to the ratio.

At 100% we see that lazy incurs a 50x performance overhead. It is also somewhat surprising that lazy does worse than Lazy (Execution & Publication), as F# uses the same implementation under the hood; it turns out that FSharp.Core allocates an extra function object when we use lazy.

Because of the forgiving memory model of x86 it turns out that the cost of Flag (Trivial) and Flag (Protected) is almost identical. This might not be true on ARM or PowerPC, which have weaker memory models.

Paul Westcott's (@manofstick) Lazzzy implementation does significantly better than Lazy<'T>, and according to Paul, Lazzzy should replicate the full functionality of Lazy<'T>, which my simpler flag-based schemes don't.

In all these cases there is no contention on the locks, so the performance will look different in a concurrent environment with heavy contention. What is the cost of locks under contention? That is a topic for another blog post.

Collection Count - F# 4, .NET 4.6.2, x64

Collection count gives an indication of the number of objects allocated by the lazy implementations. Lower is better.

For all lazy implementations the overhead seems to be roughly linear to the ratio.

All flag implementations except Flag (Compact) incur the same overhead; Flag (Compact) ironically has a higher overhead because it needs to box the integer value.

Reflections

Is it a problem that the overhead of lazy at 100% is large? Isn't it so that in most realistic cases the ratio will be closer to 0% than 100%? In at least one case the opposite is true: consider Seq.upto. It turns out Seq.upto uses Lazy<_> to cache the current value (or exception). Normally the current value is accessed just once and then the next value is computed. That means in this case the ratio is closer to 100%, which incurs a large overhead when the computation is cheap. During my comparison of data pipelines in .NET this turned out to be the source of most of the overhead of Seq over other pipelines.
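
The 100%-ratio situation can be illustrated with a small Python sketch (an analogue, not the F# benchmark): when every element allocates and forces a fresh lazy wrapper, a trivial per-element computation pays the full wrapper cost each time.

```python
import timeit

class SimpleLazy:
    """Minimal 'Flag (Trivial)' analogue: an unprotected flag, no exception
    caching, no thread safety."""
    __slots__ = ("_factory", "_has_value", "_value")

    def __init__(self, factory):
        self._factory, self._has_value, self._value = factory, False, None

    @property
    def value(self):
        if not self._has_value:
            self._value, self._has_value = self._factory(), True
        return self._value

def direct(n):
    # Control: do the trivially cheap computation every time.
    s = 0
    for i in range(n):
        s += i
    return s

def ratio_100(n):
    # 100% ratio: a fresh lazy object is allocated and forced per element,
    # mirroring the Seq.upto access pattern described above.
    s = 0
    for i in range(n):
        s += SimpleLazy(lambda i=i: i).value
    return s

n = 100_000
assert direct(n) == ratio_100(n) == sum(range(n))
print("direct   :", timeit.timeit(lambda: direct(n), number=10))
print("ratio=1.0:", timeit.timeit(lambda: ratio_100(n), number=10))
```

Both loops compute the same sum; the difference in the reported timings is purely the allocation and flag-checking overhead of the wrapper.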

Personally, I am sceptical of mutually exclusive regions of code (like Monitor.Enter/Monitor.Exit). One of my reasons for this is the dreaded priority inversion. On Windows and non-real-time Linux this isn't much of a problem in general, but when working with real-time systems priority inversion is a real problem. In addition, the real-time distributions of Linux I had the pleasure of working with had severe problems in the futex implementation, which meant that a futex could lock out a different process (!) if we were unlucky. By avoiding mutually exclusive regions we avoid these problems.

On the other hand, implementing thread-safe code without mutually exclusive regions is really hard, as it requires a whole different understanding of how the compiler, JIT, CPU, cache and memory system rewrite your code in order to speed it up. Basically, reads are scheduled earlier and writes are delayed; reads and writes that are deemed unnecessary may be eliminated. These rewrites work brilliantly in a non-concurrent environment. In a concurrent environment the rewrites make it impossible to look at the code and deduce what happens, because the code you see is not the program that is actually executed. Because of this, even the simple double-checked locking pattern is quite hard to implement correctly. Mutually exclusive regions restore sanity but, as mentioned above, have their own problems.

Finally, know that being lazy has costs. If the computation is cheap it might be better to perform the computation repeatedly instead of caching the result.

Hope this was interesting to you,

Mårten


giuliohome commented Dec 31, 2016

Excuse me, I can't follow the explanation and I fail to understand the conclusion: how can it be better (= faster) to do the computation repeatedly instead of caching the result? What is the simple reason why this would happen? Is the computation so extremely cheap that it shouldn't be called a computation at all?

By the way (only a bit off topic, sorry), there are some other interesting considerations (from this blog) about using Lazy objects to ensure that a generator function is called only once instead of semaphores and locks (and about the overhead of keeping SemaphoreSlim objects around to prevent cache misses).


mrange commented Dec 31, 2016

@giuliohome thanks for your feedback.

IMO computations range from the very, very cheap like id (which for integers ends up as mov eax, edx, about as cheap as it can get) to the very expensive like protein folding (https://folding.stanford.edu/).

In some sense lazy has to do computations in order to figure out whether the value is cached or not by checking a flag, but creating lazy objects also requires computations to initialize and allocate the objects. There's also a delayed computation in that the GC needs to perform computations to find out whether an object can be collected or not. In addition, memory barriers and mutexes require computations to synchronize caches and schedule thread execution.

So if your computation is very cheap, like id or (+) 1, it can well be better to execute it every time than to use a cached result that involves checking flags, creating objects, GC time, memory barriers and so on.

Thinking about .NET Lazy<'T>, I think this class was designed to cache expensive computations (>10,000 cycles). In those cases the cost of being lazy is small compared to doing the work over and over again. However, if we as developers are not aware that being lazy costs something, we might use it in places where being lazy hurts us, especially when it is as easy as lazy 0.

Lastly, I ballpark the cost of creating a lazy object at ~300 cycles. That is roughly 300x more expensive than incrementing an integer, or about as much as a full cache miss. Dereferencing a lazy object ballparks at ~10 cycles, somewhat comparable to a function call.


giuliohome commented Jan 1, 2017

Ok. Thank you very much for the very detailed answer! Perfect and very clear!
Let me just add a few more lines.

  1. I admit that I used the word "computation" too vaguely (yet this is its generic meaning; just think about a typical F# "computation expression" involving not only CPU but also network activities, like web requests, etc.). I really meant something that requires a DB access but has a sufficiently constant result - a "business" computation instead of a CPU calculation.
  2. Anyway, I understand that when you say "as much as a full cache miss" you're speaking about accessing RAM, not about an I/O operation to fetch data from a DB (correct?). In the DB case we can save dozens of milliseconds per operation (so seconds over many repetitions), while 300 cycles (I'm not sure how you derived this ballpark number, see my P.S.; it's apparently a lot, even though you're right that Lazy is thread safe etc.) on a 2 GHz CPU take only 300/2 = 150 nano (10^-9) seconds (correct?).
  3. Finally, the fact that the lazy value is stored allows one to start an initial background pre-fetch as soon as possible, so that the lazy object creation is no longer on the critical path and the main execution is (almost) not delayed by its duration.

P.S. Some other ballpark numbers to compare, for example from http://stackoverflow.com/a/27803969


mrange commented Jan 2, 2017

I think Lazy<'T> in .NET was designed for caching DB results. For that it works fine.

The risk is that if we think Lazy<T> is free we end up using it in situations where 150 ns is a long time. Let's say you are iterating over 1,000,000 objects and for each object you create a Lazy object that you force. That is 150 ms. If the computation you force takes roughly 0.3 ns (like id), you now have an overhead of roughly 500x. Why would anyone do something like that, you wonder?

Well, that is exactly what Seq.upto does. It is used to generate ranges, making Seq.upto slow and forcing the GC to run several times over the full computation.

So I guess my point is: we should use Lazy<T> to cache expensive operations, not trivial ones.
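
As a back-of-envelope sketch of the arithmetic in this comment (the cost figures are the ballpark assumptions quoted above, not measurements; 150/0.3 works out to roughly 500x):

```python
# Assumed ballparks from the discussion above (not measurements):
# ~150 ns to create + force one Lazy (~300 cycles at 2 GHz),
# ~0.3 ns for a trivial computation like id.
objects      = 1_000_000
lazy_cost_ns = 150.0
compute_ns   = 0.3

total_lazy_ms = objects * lazy_cost_ns / 1_000_000  # ns -> ms
overhead      = lazy_cost_ns / compute_ns

assert total_lazy_ms == 150.0   # ~150 ms spent purely on being lazy
assert round(overhead) == 500   # roughly 500x the computation itself
```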


manofstick commented Jan 2, 2017

Hi @mrange!

Ages ago I played a bit with lazy as documented here on stackoverflow.

Anyway, I have updated the code to F# land here (and fixed it a bit, I think [hope!]); maybe you could give this a burl!

type Lazzzy<'T>(create:unit->'T) =
    let sync = obj ()
    let mutable counter = 0
    let mutable getter  = Unchecked.defaultof<unit->'T>
    [<VolatileField>]
    let mutable value   = Unchecked.defaultof<'T>
    do
        getter <- fun () ->
            lock sync (fun () ->
                counter <- counter + 1
                if counter = 1 then
                    value <- create ()
                Thread.MemoryBarrier ()
                getter <- fun () -> value
                value)

    member __.Value = getter ()

(assuming my code is correct, then this should have the same functionality as LazyThreadSafetyMode.ExecutionAndPublication)


manofstick commented Jan 3, 2017

Let's see how much I can get wrong when I try to provide the whole of the System.Lazy functionality :-)

[<AbstractClass>]
type private LazyItem<'T>(isInitializer:bool, isValueCreated:bool) =
    member __.IsInitializer  = isInitializer
    member __.IsValueCreated = isValueCreated
    abstract member Value : 'T

[<AbstractClass>]
type private LazyInitializer<'T>() =
    inherit LazyItem<'T>(true, false)

type private LazyValue<'T>(t:'T) =
    inherit LazyItem<'T>(false, true)
    override __.Value = t

type private LazyException<'T>(e:System.Exception) =
    inherit LazyItem<'T>(false, false)
    override __.Value = raise e

type private LazyHelper<'T>() =
    static let createInstanceFunc =
        let construct () =
            try
                System.Activator.CreateInstance typeof<'T> :?> 'T
            with
            | :? System.MissingMethodException ->
                let nicerException =
#if MORE_REAL
                    System.MissingMemberException(Environment.GetResourceString("Lazy_CreateValue_NoParameterlessCtorForT"))
#else
                    System.MissingMemberException "Lazy_CreateValue_NoParameterlessCtorForT"
#endif                    
                raise nicerException

        System.Func<'T> construct

    static member createInstance = createInstanceFunc

module LazyHelper =
    let getMode isThreadSafe =
        if isThreadSafe then
            System.Threading.LazyThreadSafetyMode.ExecutionAndPublication
        else
            System.Threading.LazyThreadSafetyMode.None

type Lazzzy<'T> private (valueFactory:System.Func<'T>, mode:System.Threading.LazyThreadSafetyMode, viaConstructor) =
    [<VolatileField>]
    let mutable lazyItem : LazyItem<'T> = Unchecked.defaultof<LazyItem<'T>>

    let nonThreadSafeCreate () =
        try
            let value = valueFactory.Invoke ()
            lazyItem <- LazyValue value // single reference write is atomic
            value
        with
        | ex when not viaConstructor ->
            lazyItem <- LazyException ex // single reference write is atomic
            reraise ()

    let none () =
        { new LazyInitializer<'T>() with
            member __.Value = nonThreadSafeCreate () }

    let executionAndPublication () =
        { new LazyInitializer<'T>() with
            member self.Value =
                // internal class, safe to lock on
                lock self (fun () -> 
                    match lazyItem.IsInitializer  with 
                    | false -> lazyItem.Value
                    | true  -> nonThreadSafeCreate () )}

    let publicationOnly () =
        { new LazyInitializer<'T>() with
            member self.Value =
                let value = LazyValue (valueFactory.Invoke ())
                System.Threading.Interlocked.CompareExchange (&lazyItem, value, self) |> ignore
                lazyItem.Value }

    do
        match mode with
        | System.Threading.LazyThreadSafetyMode.ExecutionAndPublication -> lazyItem <- executionAndPublication ()
        | System.Threading.LazyThreadSafetyMode.None                    -> lazyItem <- none                    ()
        | System.Threading.LazyThreadSafetyMode.PublicationOnly         -> lazyItem <- publicationOnly         ()
        | _ -> failwith "unknown System.Threading.LazyThreadSafetyMode"

    new ()                          = Lazzzy<'T> (LazyHelper.createInstance, System.Threading.LazyThreadSafetyMode.ExecutionAndPublication, true)
    new (isThreadSafe)              = Lazzzy<'T> (LazyHelper.createInstance, LazyHelper.getMode isThreadSafe,                               true)
    new (valueFactory)              = Lazzzy<'T> (valueFactory,              System.Threading.LazyThreadSafetyMode.ExecutionAndPublication, false)
    new (valueFactory,isThreadSafe) = Lazzzy<'T> (valueFactory,              LazyHelper.getMode isThreadSafe,                               false)
    new (valueFactory, mode)        = Lazzzy<'T> (valueFactory,              mode,                                                          false)
    new (mode)                      = Lazzzy<'T> (LazyHelper.createInstance, mode,                                                          true)
    
    member __.Value          = lazyItem.Value
    member __.IsValueCreated = lazyItem.IsValueCreated


manofstick commented Jan 4, 2017

@mrange

Well, I've already wasted too much time on this, but I think I have (without exhaustive testing...) replicated the System.Lazy functionality, i.e. with regards to thread safety, exception handling, modes and constructors.

It doesn't come in quite as fast as your versions, but each of those covers only a facet of the functionality. Mine runs in ~half the time (or better) of the original implementation, and I think it actually reads very well.

Anyway, have a look, if you care! Time for me to think of something else!

(argh! ... can't ... not ... look ... :-) just modified publicationOnly to use a CAS operation)


mrange commented Jan 4, 2017

Hi @manofstick. I will update my charts to include your lazy implementation as well.


mrange commented Jan 4, 2017

Hi @manofstick. I updated the charts now


manofstick commented Jan 4, 2017

@mrange

Hmmm... Do you think it is worth creating a PR to update System.Lazy with my implementation? (I have slightly modified it from the version you took by just consolidating the duplicated code)

If you do, could you sanity-check my F# implementation? Effectively executionAndPublication is a case of double-checked locking, and I think this implementation is sound...


mrange commented Jan 5, 2017

@manofstick

I think it is worth a try, although if I were maintaining System.Lazy<'T> I would be "scared" of making changes to it. The reason is of course that it's damn hard to test away bugs in code intended for concurrency. A new implementation might be a boon to a lot of developers, but just one sad customer losing a lot of money because of a subtle bug is scary to maintainers.

But I think it's worth trying. I can check your code later, but I'm travelling today.


manofstick commented Jan 5, 2017

Here is the C# implementation; I have changed it slightly to move the factory Func into the initializer objects. This way, when they are replaced by the value (or exception) they become available for garbage collection, which is probably (!) important.

internal interface ILazyItem<T>
{
    T Value { get; }
    bool IsValueCreated { get; }
}

public class Lazy<T>
    : ILazyItem<T>
{
    private volatile ILazyItem<T> _implementation;

    private T _value; // not used in PublicationOnly mode

    // ---------------------------------------------------------------------------------------
    // public surface
    // ---------------------------------------------------------------------------------------

    public T Value             { get { return _implementation.Value; } }
    public bool IsValueCreated { get { return _implementation.IsValueCreated; } }

    public Lazy(Func<T> valueFactory, LazyThreadSafetyMode mode)
    {
        _implementation = CreateInitializerFromMode(mode, this, valueFactory);
    }

    public Lazy()                                        : this(CreateInstance.Factory, LazyThreadSafetyMode.ExecutionAndPublication) {}
    public Lazy(bool isThreadSafe)                       : this(CreateInstance.Factory, GetMode(isThreadSafe))  {}
    public Lazy(Func<T> valueFactory)                    : this(valueFactory,           LazyThreadSafetyMode.ExecutionAndPublication) {}
    public Lazy(Func<T> valueFactory, bool isThreadSafe) : this(valueFactory,           GetMode(isThreadSafe)) {}
    public Lazy(LazyThreadSafetyMode mode)               : this(CreateInstance.Factory, mode) {}

    // ---------------------------------------------------------------------------------------
    // constructor helpers
    // ---------------------------------------------------------------------------------------

    private static class CreateInstance
    {
        private static T construct()
        {
            try
            {
                return (T)(Activator.CreateInstance(typeof(T)));
            }
            catch (MissingMethodException)
            {
                throw new MissingMemberException("Lazy_CreateValue_NoParameterlessCtorForT");
            }
        }

        public readonly static Func<T> Factory = construct;
    }

    private static LazyThreadSafetyMode GetMode(bool isThreadSafe)
    {
        return isThreadSafe ? LazyThreadSafetyMode.ExecutionAndPublication : LazyThreadSafetyMode.None;
    }

    private static ILazyItem<T> CreateInitializerFromMode(LazyThreadSafetyMode mode, Lazy<T> owner, Func<T> valueFactory)
    {
        switch (mode)
        {
            case LazyThreadSafetyMode.None:                    return new None(owner, valueFactory);
            case LazyThreadSafetyMode.PublicationOnly:         return new PublicationOnly(owner, valueFactory);
            case LazyThreadSafetyMode.ExecutionAndPublication: return new ExecutionAndPublication(owner, valueFactory);

            default:
                throw new Exception("unknown System.Threading.LazyThreadSafetyMode");
        }
    }

    // ---------------------------------------------------------------------------------------
    // ILazyItem<T> implementation for the created value
    // ---------------------------------------------------------------------------------------

    bool ILazyItem<T>.IsValueCreated => true;
    T ILazyItem<T>.Value { get { return _value; } }

    private T CreateValue(Func<T> factory, bool forceMemoryBarrier)
    {
        try
        {
            _value = factory();

            if (forceMemoryBarrier)
                Thread.MemoryBarrier();

            _implementation = this;

            return _value;
        }
        catch (Exception exception) when (!ReferenceEquals(CreateInstance.Factory, factory))
        {
            _implementation = new LazyException(exception);
            throw;
        }
    }

    private T CreateValuePublicationOnly(Func<T> factory, PublicationOnly comparand)
    {
        var value = factory();
        var lazyValue = new LazyValue(value);

        // atomic interlocked assignment, to volatile field ensures that _implementation is always valid
        // and that only first assignment is used
        Interlocked.CompareExchange(ref _implementation, lazyValue, comparand);

        // we use the _implementation.Value because we need to return the first value created
        return _implementation.Value;
    }

    private T CreateValueExecutionAndPublication(Func<T> factory, ExecutionAndPublication sync)
    {
        lock (sync) // we're safe to lock on "this" as object is an private object used by Lazy
        {
            // it's possible for multiple calls to have piled up behind the lock, so we need to check
            // to see if the ExecutionAndPublication object is still the current implementation.
            return ReferenceEquals(_implementation, sync) ? CreateValue(factory, true) : Value;
        }
    }

    // ---------------------------------------------------------------------------------------
    // ILazyItem<T> implementations
    // ---------------------------------------------------------------------------------------

    private sealed class LazyValue : ILazyItem<T>
    {
        private readonly T _value;

        internal LazyValue(T value)
        {
            _value = value;
        }

        public bool IsValueCreated => true;

        public T Value => _value;
    }

    private sealed class LazyException : ILazyItem<T>
    {
        private readonly Exception _exception;

        internal LazyException(Exception exception)
        {
            _exception = exception;
        }

        public bool IsValueCreated => false;

        public T Value
        {
            get { throw _exception; }
        }
    }

    abstract private class LazyInitializer : ILazyItem<T>
    {
        protected Lazy<T> Owner { get; }
        protected Func<T> Factory { get; }

        internal LazyInitializer(Lazy<T> owner, Func<T> factory)
        {
            Owner = owner;
            Factory = factory;
        }

        public bool IsValueCreated => false;
        abstract public T Value { get; }
    }

    private sealed class None : LazyInitializer
    {
        internal None(Lazy<T> owner, Func<T> factory) : base(owner, factory) { }

        public override T Value
        {
            get { return Owner.CreateValue(Factory, false); }
        }
    }

    private sealed class ExecutionAndPublication : LazyInitializer
    {
        internal ExecutionAndPublication(Lazy<T> owner, Func<T> factory) : base(owner, factory) { }

        public override T Value
        {
            get { return Owner.CreateValueExecutionAndPublication(Factory, this); }
        }
    }

    private sealed class PublicationOnly : LazyInitializer
    {
        internal PublicationOnly(Lazy<T> owner, Func<T> factory) : base(owner, factory) { }

        public override T Value
        {
            get { return Owner.CreateValuePublicationOnly(Factory, this); }
        }
    }

}

@manofstick

And dotnet/coreclr#8963 has been merged, hooray! ;-)


molinch commented Jan 16, 2018

Let's say we have a class which is used very frequently and holds many properties that are small computations or aggregations - cheap things, but they are called MANY times, and you don't know upfront what may or may not be used.
Furthermore there are MANY instances of that class; they are cached, though, so it's not like we recreate them forever.

Would you go for such an implementation:

private readonly Lazy<string> aggregatedStuff;

ctor()
{
    aggregatedStuff = new Lazy<string>(
        () => /* some concatenation logic, or even more, but still rather cheap */,
        LazyThreadSafetyMode.PublicationOnly);
}

public string AggregatedStuff => aggregatedStuff.Value;

We use LazyThreadSafetyMode.PublicationOnly since our computation is cheap and side-effect free.
Or would you go with a simple boolean flag?
Or even no lazy at all? But then we perform MANY cheap computations, since we create MANY instances of that class.
