@jjxtra
Last active September 20, 2020 22:11
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace Polly.Utilities
{
    /// <summary>
    /// Defines operations for locks used by Polly policies.
    /// </summary>
    public interface ILockProviderAsync
    {
        /// <summary>
        /// Waits to acquire the lock.
        /// </summary>
        /// <param name="key">A string key being used by the execution.</param>
        /// <param name="context">The Polly execution context consuming this lock.</param>
        /// <param name="cancellationToken">A cancellation token to cancel waiting to acquire the lock.</param>
        /// <throws>OperationCanceledException, if the passed <paramref name="cancellationToken"/> is signaled before the lock is acquired.</throws>
        /// <throws>InvalidOperationException, invalid lock state</throws>
        ValueTask<IDisposable> AcquireLockAsync(string key, Context context, CancellationToken cancellationToken);
    }

    /// <summary>
    /// Defines operations for locks used by Polly policies.
    /// </summary>
    public interface ILockProvider
    {
        /// <summary>
        /// Waits to acquire the lock.
        /// </summary>
        /// <param name="key">A string key being used by the execution.</param>
        /// <param name="context">The Polly execution context consuming this lock.</param>
        /// <throws>InvalidOperationException, invalid lock state</throws>
        IDisposable AcquireLock(string key, Context context);
    }

    /// <summary>
    /// Lock provider that locks on a key per process. The locking mechanism is designed to be able
    /// to be requested and released on different threads if needed.
    /// </summary>
    public class ProcessLockProviderAsync : ILockProviderAsync
    {
        // TODO: Pass via IOC or other method instead of hard-coding static
        internal static readonly int[] keyLocks = new int[1024];

        private class ProcessLockAsync : IDisposable
        {
            private uint hash;
            private bool gotLock;

            [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
            public void Dispose()
            {
                if (gotLock)
                {
                    gotLock = false;
                    ProcessLockProviderAsync.keyLocks[hash] = 0;
                }
                // else we do not care; it can be disposed in an error case and we simply ignore that the key locks were not touched
            }

            [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
            public async ValueTask<IDisposable> AcquireLockAsync(string key, Context context, CancellationToken cancellationToken)
            {
                // Monitor.Enter and Monitor.Exit are tied to a specific thread and are
                // slower than this spin lock, which does not care about threads and will execute very
                // quickly, regardless of lock contention:
                // https://stackoverflow.com/questions/11001760/monitor-enter-and-monitor-exit-in-different-threads
                // Get a hash based on the key and use it to lock on a specific int in the array. The array is designed
                // to be small enough to not use very much memory, but large enough to avoid collisions.
                // Even if there is a collision, it will be resolved very quickly.
                hash = (uint)key.GetHashCode() % (uint)ProcessLockProviderAsync.keyLocks.Length;

                // To get the lock, we must change the int at the hash index from 0 to 1. If the value is
                // already 1, we don't get the lock; the return value must be 0 (the original value of the int).
                // It is very unlikely to have any contention here, but if so, the spin cycle should be very short.
                // The second argument (1) is the value to store if the current value equals the third argument (0).
                while (!cancellationToken.IsCancellationRequested &&
                    Interlocked.CompareExchange(ref ProcessLockProviderAsync.keyLocks[hash], 1, 0) == 1)
                {
                    // give up the remainder of the time slice; we want to get back and retry the lock very quickly
                    await Task.Yield();
                }
                if (cancellationToken.IsCancellationRequested)
                {
                    throw new OperationCanceledException(cancellationToken);
                }
                gotLock = true;
                return this;
            }
        }

        [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
        public ValueTask<IDisposable> AcquireLockAsync(string key, Context context, CancellationToken cancellationToken)
        {
            return new ProcessLockAsync().AcquireLockAsync(key, context, cancellationToken);
        }
    }

    /// <summary>
    /// Lock provider that locks on a key per process. The locking mechanism is designed to be able
    /// to be requested and released on different threads if needed.
    /// </summary>
    public class ProcessLockProvider : ILockProvider
    {
        private class ProcessLock : IDisposable
        {
            private uint hash;
            private bool gotLock;

            [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
            public void Dispose()
            {
                if (gotLock)
                {
                    gotLock = false;
                    ProcessLockProviderAsync.keyLocks[hash] = 0;
                }
                // If the constructor threw, we never get into Dispose because the object was never created; if the
                // constructor succeeds, gotLock will always be true. We still use the gotLock bool in case of multiple dispose calls.
            }

            [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
            public ProcessLock(string key, Context context)
            {
                hash = (uint)key.GetHashCode() % (uint)ProcessLockProviderAsync.keyLocks.Length;
                while (Interlocked.CompareExchange(ref ProcessLockProviderAsync.keyLocks[hash], 1, 0) == 1)
                {
                    // NOTE: as discussed in the comments below, calling Task.Yield() synchronously like this
                    // does not actually yield; Thread.Yield() would, once .NET Standard 1.x support is dropped.
                    Task.Yield().GetAwaiter().GetResult();
                }
                gotLock = true;
            }
        }

        /// <inheritdoc />
        public IDisposable AcquireLock(string key, Context context)
        {
            return new ProcessLock(key, context);
        }
    }
}
@reisenberger commented Jan 17, 2020

Very nice to see! Some thoughts:

  • The async version of the interface could return ValueTask<IDisposable>. For the common case where an async policy may be using a synchronous lock (and acquiring it synchronously without traversing any await), this will save a Task allocation per execution. For the async case consuming an async lock (eg from Redis), the cost will not be much more (certainly compared to going out over the wire to Redis).
  • ProcessLockReleaser can be a struct, to save an allocation per execution.

We are probably moving the implementation to synchronous, but if it were remaining asynchronous:

  • Task.Delay(1).Wait() would be sync-blocking-on-async within an async code path. That could risk deadlocks if the policy/lock were consumed in a context with a main/UI thread.
  • Task.Yield() on its own would not (afaiu) release the thread: we would probably want await Task.Yield();. (EDIT: checked the source of Task.Yield() to confirm this)

Finally, it looks like there is the potential for a race condition in Dispose(): for two threads (thread 1 and thread 2) to race through Dispose() simultaneously and both call provider.ReleaseLock(hash). Although apparently benign, it opens up a potential further race where a third thread (thread 3) could acquire the same lock between thread 1 calling provider.ReleaseLock(hash); and thread 2 calling provider.ReleaseLock(hash); - meaning thread 2 would dispose a lock it didn't own. Then the work governed by thread 3 wouldn't be locked, and thread 3 would probably throw ObjectDisposedException when it tried to dispose. All of this would require badly-formed consuming code to happen, of course. The potential race could be eliminated by having Dispose() use Interlocked.CompareExchange(...) to ensure a given instance of ProcessLockReleaser only ever calls provider.ReleaseLock(hash); once.
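To illustrate the last point, a minimal sketch of a once-only release (the released field is mine, not in the gist; the rest follows the keyLocks[]/hash shape of the code above) could look like this inside the releaser:

private int released; // 0 = this releaser still owns its slot, 1 = slot already released

public void Dispose()
{
    // Only the first caller can move released from 0 to 1, so concurrent or repeated
    // Dispose() calls cannot free a slot this instance no longer owns.
    if (Interlocked.CompareExchange(ref released, 1, 0) == 0)
    {
        Interlocked.Exchange(ref ProcessLockProviderAsync.keyLocks[hash], 0);
    }
}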

@jjxtra (Author) commented Jan 17, 2020

Thanks so much for the great feedback! I've (I think) fixed the race condition but let me know if you still think that's an issue.

I was hoping to use a thread sleep, but it looks like .NET Standard 1.1 is still supported, so that's out, I think. ValueTask is also not available in .NET Standard 1.1. But if you want to drop .NET Standard 1.1 (I'm in favor), we could do a thread sleep + ValueTask for the async version. Thoughts?

@jjxtra (Author) commented Jan 17, 2020

Added sync version...

@jjxtra (Author) commented Jan 22, 2020

Unable to use ValueTask unless we drop .NET Standard 1.1 support in the library, which I think could be useful. Doing so would also let us use Thread.Yield and Thread.Sleep instead of Task.Yield and Task.Delay for the sync version.

@reisenberger

@jjxtra Again, thanks for this!

I noted this well-known problem reading the code last week (hit this many times) but inexplicably omitted to comment:

  • Thread.Sleep(1) and Task.Delay(1) are widely known not to be accurate on many Windows systems, and typically balloon out to ~15.6ms (1/64-second); this is easy to verify with a quick Stopwatch loop (see the sketch after this list).
  • I was also concerned 1ms may be order-of-magnitude too large for waiting to acquire a lock.
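A quick, self-contained way to check the first point on any given machine (not from the gist; just a sanity test):

using System;
using System.Diagnostics;
using System.Threading;

class SleepResolutionCheck
{
    static void Main()
    {
        const int iterations = 50;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            Thread.Sleep(1); // ask for 1 ms...
        }
        sw.Stop();
        // ...but on many Windows machines the average lands near the ~15.6 ms timer tick.
        Console.WriteLine($"Average Thread.Sleep(1): {sw.Elapsed.TotalMilliseconds / iterations:F2} ms");
    }
}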

Still: to make decisions evidence-based, I constructed a benchmark:

Sidenotes on benchmark:

  • The overall durations in the benchmark are not of interest: most of the benchmark time goes in orchestrating the parallelism. Differences in duration between different modes are the salient point.
  • NoWork benchmark = baseline: how much benchmark time goes in orchestrating the parallelism.
  • NoExtraLock benchmark tells us how ConcurrentDictionary<,>.GetOrAdd() itself performs under load/contention, with no extra locking around it (neither single-lock or keyed-lock).
  • ExtraLock mode with LockPerKey = true [or] false compares keyed-lock against single-lock. The keyed-lock strategy is not exactly your algorithm, but a "perfect" keyed-lock with no collisions (which would give faster benchmarks anyway). The work benchmarked otherwise stays close to the prototype DuplicateCollapserPolicy.
  • A switch DifferentKeys = true [or] false was to highlight any difference between many threads/tasks contending the same key, versus contending different keys (GetOrAdd will do more Adding and less Getting with DifferentKeys = true)
  • Commented-out code re ActualContentionEncountered was used to confirm the benchmarks really generated lock contention, not just that we believed they did. Analysing outputs, for a given ParallelTasks, around 50% (varying 45-55 with scale) of the tasks experienced lock contention.
  • A commented-out benchmark PureConcurrentDictionaryGetOrAdd() models how long .GetOrAdd() might take with no contention: in my environment, 40ns (nano-seconds)

Some results of this synthetic benchmark in my environment are below, and in the repo:

Below at contention of 100 tasks, ~50 experiencing lock contention:

| Method | Work | DifferentKeys | LockPerKey | Mean | Error | StdDev |
|------- |----- |-------------- |----------- |---------:|----------:|----------:|
| Contention 100 | NoWork | False | False | 135.9 us | 0.8185 us | 1.6533 us |
| Contention 100 | NoWork | False | True | 134.2 us | 0.3304 us | 0.6599 us |
| Contention 100 | NoWork | True | False | 137.1 us | 0.4739 us | 0.9131 us |
| Contention 100 | NoWork | True | True | 133.3 us | 0.3957 us | 0.7811 us |
| Contention 100 | NoExtraLock | False | False | 141.4 us | 0.5465 us | 1.0787 us |
| Contention 100 | NoExtraLock | False | True | 142.4 us | 0.5610 us | 1.1074 us |
| Contention 100 | NoExtraLock | True | False | 162.9 us | 0.4368 us | 0.8824 us |
| Contention 100 | NoExtraLock | True | True | 170.3 us | 0.4381 us | 0.8647 us |
| Contention 100 | ExtraLock | False | False | 175.9 us | 3.0013 us | 6.0627 us |
| Contention 100 | ExtraLock | False | True | 142.8 us | 0.6640 us | 1.3413 us |
| Contention 100 | ExtraLock | True | False | 195.9 us | 1.8988 us | 3.8356 us |
| Contention 100 | ExtraLock | True | True | 164.1 us | 0.4940 us | 0.9751 us |

The benchmark suggests:

  • Even with 50 tasks contending the lock, .NET's lock / Monitor negotiates the lock contention for our policy implementation in 10s of micro-seconds (ExtraLock with LockPerKey == false is 30us burden over NoExtraLock), ie 100s to 1000s times faster (orders-of-magnitude faster) than Thread.Sleep(1) yielding 1ms-15.6ms waits. (... eliding that 30us is the additional aggregate completion time; that makes it the max additional time an individual task could have accrued; individual call paths could also have accrued less)

    • Based on this I suggest: we do not go forward with a spin-wait algorithm involving Thread.Sleep(1).
  • Similarly, striped-lock/keyed-lock adds savings in the order of 10s of micro-seconds (ExtraLock with LockPerKey == false versus true) for parallel contentions of 100 requests (less savings for lower).

    • Based on this I suggest: The default lock implementation should be a single lock object, with lock-per-key offered as an alternative in a Polly-Contrib.

On the latter: the question is whether microsecond savings are worth the complexity (and extra memory) for it to be the default implementation. The majority of users are unlikely to have cache refreshes (across different cache keys) colliding at anything like this contention. Any microsecond gains on the locking (assuming we are caching some expensive query result obtained from a networked resource) are also likely to be dwarfed by order-of-millisecond waits going over the network and calculating the query result. I also know as an OSS maintainer that making the default option complex increases cognitive load for adoption, increases queries, etc.

All the above said, the standard ymmv caveats apply. Some users may have servers under pressures which cause more lock contention. Users might have cases for DuplicateCollapserPolicy where the underlying work itself only takes microseconds, meaning the striped-lock savings become more relevant. This is why Polly-Contrib is so great: it allows us to offer extensions for power users. So it would still be great to have a striped-lock implementation out in Polly-Contrib!

Thanks again for everything you are doing on this.

@jjxtra (Author) commented Jan 25, 2020

Wow, very extensive tests. I've removed the sleep call; it simplified the code quite a bit, so that's awesome! I had modeled that after the .NET Framework spin lock, but in this case, where we expect a lock to be acquired and released extremely fast, the sleep is not needed.

Also, can I assume your benchmarks were run with the sleep removed? :)

@reisenberger

> Also, can I assume your benchmarks were run with the sleep removed? :)

Yes, the benchmark results are in microseconds; they would have been in milliseconds with Thread.Sleep() in place.

@jjxtra (Author) commented Jan 25, 2020

OK, the only other question I have: I didn't see an Interlocked.CompareExchange in the benchmark; it looks like a simple lock object was used per key for LockPerKey = true. I will try running the benchmark on my box with an Interlocked.CompareExchange to see if that differs much from the lock call.

@jjxtra (Author) commented Jan 25, 2020

Baseline benchmark with no code modifications on my system, a Threadripper 1950X (16 cores / 32 threads @ 3.9 GHz), 64 GB RAM:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Contention100 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Job=Contention100  IterationCount=50  LaunchCount=1
WarmupCount=3

| Method |        Work | DifferentKeys | LockPerKey |     Mean |    Error |   StdDev |
|------- |------------ |-------------- |----------- |---------:|---------:|---------:|
| PCLOCD |      NoWork |         False |      False | 402.0 us |  1.52 us |  3.06 us |
| PCLOCD |      NoWork |         False |       True | 399.9 us |  1.58 us |  3.13 us |
| PCLOCD |      NoWork |          True |      False | 398.6 us | 14.90 us | 30.10 us |
| PCLOCD |      NoWork |          True |       True | 403.0 us |  1.17 us |  2.36 us |
| PCLOCD | NoExtraLock |         False |      False | 403.6 us |  1.56 us |  3.08 us |
| PCLOCD | NoExtraLock |         False |       True | 390.2 us |  2.24 us |  4.53 us |
| PCLOCD | NoExtraLock |          True |      False | 469.7 us |  1.32 us |  2.67 us |
| PCLOCD | NoExtraLock |          True |       True | 448.2 us |  8.93 us | 17.83 us |
| PCLOCD |   ExtraLock |         False |      False | 427.8 us |  8.06 us | 16.28 us |
| PCLOCD |   ExtraLock |         False |       True | 400.5 us |  2.10 us |  4.20 us |
| PCLOCD |   ExtraLock |          True |      False | 696.0 us |  2.13 us |  4.26 us |
| PCLOCD |   ExtraLock |          True |       True | 451.3 us |  1.31 us |  2.64 us |

// * Hints *
Outliers
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed, 4 outliers were detected (391.24 us, 392.55 us, 408.68 us, 412.96 us)
  Benchmarks.PCLOCD: Contention100 -> 3 outliers were detected (240.89 us..367.58 us)
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed, 4 outliers were detected (393.78 us, 395.88 us, 412.99 us, 413.29 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 4 outliers were detected (353.37 us..411.23 us, 459.61 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (316.06 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (412.71 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 2 outliers were detected (670.98 us, 707.59 us)

// * Legends *
  Work          : Value of the 'Work' parameter
  DifferentKeys : Value of the 'DifferentKeys' parameter
  LockPerKey    : Value of the 'LockPerKey' parameter
  Mean          : Arithmetic mean of all measurements
  Error         : Half of 99.9% confidence interval
  StdDev        : Standard deviation of all measurements
  1 us          : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:09:23 (563.65 sec), executed benchmarks: 12

Global total time: 00:09:28 (568.23 sec), executed benchmarks: 12
// * Artifacts cleanup *

I will run it again with the Interlocked change. The last two rows of the benchmark are of note, as the single lock takes roughly 50% longer than the per-key lock (696 us vs. 451 us).

@jjxtra (Author) commented Jan 25, 2020

Test using Interlocked.CompareExchange when LockPerKey is true...

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Contention100 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Job=Contention100  IterationCount=50  LaunchCount=1
WarmupCount=3

| Method |        Work | DifferentKeys | LockPerKey |     Mean |   Error |  StdDev |
|------- |------------ |-------------- |----------- |---------:|--------:|--------:|
| PCLOCD |      NoWork |         False |      False | 389.6 us | 1.73 us | 3.49 us |
| PCLOCD |      NoWork |         False |       True | 394.7 us | 1.82 us | 3.64 us |
| PCLOCD |      NoWork |          True |      False | 401.8 us | 1.32 us | 2.63 us |
| PCLOCD |      NoWork |          True |       True | 408.5 us | 1.72 us | 3.48 us |
| PCLOCD | NoExtraLock |         False |      False | 383.9 us | 1.23 us | 2.46 us |
| PCLOCD | NoExtraLock |         False |       True | 396.3 us | 1.77 us | 3.57 us |
| PCLOCD | NoExtraLock |          True |      False | 445.8 us | 1.22 us | 2.47 us |
| PCLOCD | NoExtraLock |          True |       True | 443.0 us | 1.66 us | 3.32 us |
| PCLOCD |   ExtraLock |         False |      False | 424.8 us | 1.18 us | 2.35 us |
| PCLOCD |   ExtraLock |         False |       True | 406.1 us | 2.21 us | 4.46 us |
| PCLOCD |   ExtraLock |          True |      False | 724.5 us | 1.71 us | 3.45 us |
| PCLOCD |   ExtraLock |          True |       True | 446.6 us | 1.64 us | 3.27 us |

// * Hints *
Outliers
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (406.22 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 4 outliers were detected (393.35 us..395.19 us, 409.50 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (399.65 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (392.95 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (439.35 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (459.03 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (440.60 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (713.84 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (456.40 us)

// * Legends *
  Work          : Value of the 'Work' parameter
  DifferentKeys : Value of the 'DifferentKeys' parameter
  LockPerKey    : Value of the 'LockPerKey' parameter
  Mean          : Arithmetic mean of all measurements
  Error         : Half of 99.9% confidence interval
  StdDev        : Standard deviation of all measurements
  1 us          : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:09:18 (558.94 sec), executed benchmarks: 12

Global total time: 00:09:23 (563.06 sec), executed benchmarks: 12
// * Artifacts cleanup *

No difference in performance when LockPerKey is true, which is great because the int array uses less memory than .NET lock objects. Doing another run with a single global lock on a single int using Interlocked.CompareExchange when LockPerKey is false...

@jjxtra (Author) commented Jan 25, 2020

Another test, using Interlocked.CompareExchange instead of lock when LockPerKey is false...

// * Summary *

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Contention100 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Job=Contention100  IterationCount=50  LaunchCount=1
WarmupCount=3

| Method |        Work | DifferentKeys | LockPerKey |     Mean |    Error |   StdDev |
|------- |------------ |-------------- |----------- |---------:|---------:|---------:|
| PCLOCD |      NoWork |         False |      False | 402.7 us |  2.28 us |  4.61 us |
| PCLOCD |      NoWork |         False |       True | 390.6 us |  1.38 us |  2.72 us |
| PCLOCD |      NoWork |          True |      False | 408.0 us |  1.80 us |  3.59 us |
| PCLOCD |      NoWork |          True |       True | 402.4 us | 13.15 us | 26.26 us |
| PCLOCD | NoExtraLock |         False |      False | 389.3 us |  1.56 us |  3.15 us |
| PCLOCD | NoExtraLock |         False |       True | 404.3 us |  1.60 us |  3.24 us |
| PCLOCD | NoExtraLock |          True |      False | 465.1 us |  1.68 us |  3.39 us |
| PCLOCD | NoExtraLock |          True |       True | 456.6 us |  4.91 us |  9.69 us |
| PCLOCD |   ExtraLock |         False |      False | 407.1 us |  1.78 us |  3.60 us |
| PCLOCD |   ExtraLock |         False |       True | 397.8 us |  1.96 us |  3.96 us |
| PCLOCD |   ExtraLock |          True |      False | 507.8 us |  2.48 us |  5.02 us |
| PCLOCD |   ExtraLock |          True |       True | 457.3 us |  5.71 us | 11.54 us |

// * Hints *
Outliers
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed, 3 outliers were detected (383.63 us, 397.16 us, 399.75 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (420.26 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 4 outliers were detected (279.86 us..351.63 us, 418.90 us)
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed, 4 outliers were detected (392.84 us, 451.00 us, 465.21 us, 468.22 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (387.70 us)

// * Legends *
  Work          : Value of the 'Work' parameter
  DifferentKeys : Value of the 'DifferentKeys' parameter
  LockPerKey    : Value of the 'LockPerKey' parameter
  Mean          : Arithmetic mean of all measurements
  Error         : Half of 99.9% confidence interval
  StdDev        : Standard deviation of all measurements
  1 us          : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:08:19 (499.64 sec), executed benchmarks: 12

Global total time: 00:08:23 (503.73 sec), executed benchmarks: 12
// * Artifacts cleanup *

@jjxtra (Author) commented Jan 25, 2020

The single lock with Interlocked.CompareExchange is interesting; it performs a bit better than lock (obj) { }...

I am running another test without a Thread.Yield to see if it changes things...

@jjxtra (Author) commented Jan 25, 2020

Test using Interlocked.CompareExchange for both LockPerKey values, without Thread.Yield... short answer: Thread.Yield is necessary :)

// * Summary *

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Contention100 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Job=Contention100  IterationCount=50  LaunchCount=1
WarmupCount=3

| Method |        Work | DifferentKeys | LockPerKey |       Mean |     Error |    StdDev |     Median |
|------- |------------ |-------------- |----------- |-----------:|----------:|----------:|-----------:|
| PCLOCD |      NoWork |         False |      False |   405.2 us |   1.98 us |   3.94 us |   405.8 us |
| PCLOCD |      NoWork |         False |       True |   399.6 us |   9.53 us |  18.81 us |   401.7 us |
| PCLOCD |      NoWork |          True |      False |   416.2 us |   2.07 us |   4.18 us |   415.9 us |
| PCLOCD |      NoWork |          True |       True |   402.7 us |  10.22 us |  20.64 us |   407.1 us |
| PCLOCD | NoExtraLock |         False |      False |   402.9 us |   2.05 us |   4.10 us |   402.9 us |
| PCLOCD | NoExtraLock |         False |       True |   401.0 us |  17.20 us |  34.35 us |   408.2 us |
| PCLOCD | NoExtraLock |          True |      False |   450.3 us |   1.22 us |   2.43 us |   449.9 us |
| PCLOCD | NoExtraLock |          True |       True |   433.2 us |   1.60 us |   3.23 us |   433.6 us |
| PCLOCD |   ExtraLock |         False |      False |   489.6 us |  48.57 us |  97.00 us |   438.3 us |
| PCLOCD |   ExtraLock |         False |       True |   464.0 us |  42.35 us |  83.59 us |   417.1 us |
| PCLOCD |   ExtraLock |          True |      False | 3,193.8 us | 427.48 us | 823.60 us | 3,067.3 us |
| PCLOCD |   ExtraLock |          True |       True |   477.8 us |   1.89 us |   3.81 us |   477.7 us |

// * Warnings *
MultimodalDistribution
  Benchmarks.PCLOCD: Contention100 -> It seems that the distribution is bimodal (mValue = 4.18)
MinIterationTime
  Benchmarks.PCLOCD: Contention100 -> The minimum observed iteration time is 1.7141 ms which is very small. It's recommended to increase it.

// * Hints *
Outliers
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 3 outliers were detected (393.07 us, 395.51 us, 414.95 us)
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed, 4 outliers were detected (276.70 us, 386.52 us, 417.56 us, 418.56 us)
  Benchmarks.PCLOCD: Contention100 -> 3 outliers were detected (305.68 us..362.29 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (414.74 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 5 outliers were detected (193.38 us..393.40 us, 419.50 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed, 2 outliers were detected (443.79 us, 458.09 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  detected (422.52 us)
  Benchmarks.PCLOCD: Contention100 -> 1 outlier  was  removed (1.34 ms)
  Benchmarks.PCLOCD: Contention100 -> 2 outliers were removed (777.81 us, 827.17 us)
  Benchmarks.PCLOCD: Contention100 -> 4 outliers were removed (5.62 ms..183.76 ms)

// * Legends *
  Work          : Value of the 'Work' parameter
  DifferentKeys : Value of the 'DifferentKeys' parameter
  LockPerKey    : Value of the 'LockPerKey' parameter
  Mean          : Arithmetic mean of all measurements
  Error         : Half of 99.9% confidence interval
  StdDev        : Standard deviation of all measurements
  Median        : Value separating the higher half of all measurements (50th percentile)
  1 us          : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:07:25 (445.53 sec), executed benchmarks: 12

Global total time: 00:07:29 (449.63 sec), executed benchmarks: 12
// * Artifacts cleanup *

@jjxtra (Author) commented Jan 25, 2020

Observations:

  • On my system, for the price of 4K of RAM, using Interlocked.CompareExchange with an array of 1024 ints, hashing the key to a slot, and looping with Thread.Yield, there are significant performance gains for locking per key versus not locking per key (the last two rows of each test). There was no significant difference between locking per key this way and locking per key with lock, except in memory: the int array needs 4K, whereas 1024 .NET lock objects would need roughly 4K * 3 or 4K * 6 (12K or 24K) depending on CPU architecture (https://stackoverflow.com/questions/14286421/c-sharp-object-size-overhead). Also reference my baseline test: 696 us (single lock) vs. 451 us (lock per key).
  • Replacing lock with an Interlocked.CompareExchange and Thread.Yield loop on a single global int yielded significant performance gains vs. lock on a single object. Compare the second-to-last row of my baseline test against the second-to-last row of my second-to-last test: 696 us vs. 507 us.
  • It appears that the higher the core/thread count, the more savings gained by locking per key. This makes sense, especially with lock, where contention will cause a kernel context switch. On a busy server handling hundreds or thousands of requests per second with a high core count, the CPU usage and wait time on a single lock will be significant. Imagine a busy C# DNS server using Polly with a key collapser policy: a single lock would make it unusable.
  • When using multiple keys, adding the per-key lock has no significant overhead versus not locking at all around ConcurrentDictionary, which is awesome!

Questions from above:

  • Is 4K RAM too much? No, and it won't be allocated if the user never creates a collapser policy.
  • Is the code more complex when locking per key? No, unless you consider uint hashCode = (uint)key.GetHashCode() % 1024; and while (Interlocked.CompareExchange(ref arrayIntLock[hashCode], 1, 0) == 1) { Thread.Yield(); } complex (a fuller acquire/release sketch follows this list).
  • Is .NET's lock ever an issue if there is a need to await inside the lock? Yes: you cannot await inside a lock statement. The Interlocked.CompareExchange approach does not have this restriction.
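Putting the quoted fragments together, a minimal self-contained sketch of the proposed striped lock (helper and field names are mine, not a proposed API). Because no Monitor is held, an async variant of the same pattern could await between acquire and release, which a lock statement forbids:

using System;
using System.Threading;

public static class StripedKeyLock
{
    // 1024 int slots = 4K of RAM; each key hashes to one slot.
    private static readonly int[] slots = new int[1024];

    public static void RunLocked(string key, Action guardedWork)
    {
        uint hash = (uint)key.GetHashCode() % (uint)slots.Length;

        // Acquire: swap the slot from 0 to 1; if another caller holds it, yield and retry.
        while (Interlocked.CompareExchange(ref slots[hash], 1, 0) == 1)
        {
            Thread.Yield();
        }
        try
        {
            guardedWork();
        }
        finally
        {
            // Release: put the slot back to 0 so the next caller hashing to it can proceed.
            Interlocked.Exchange(ref slots[hash], 0);
        }
    }
}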

My final analysis:

Core counts are increasing rapidly; AMD seems to have been doubling them every year or two. When I make framework or architecture decisions, I try to plan in years, even decades. I think the default behavior of this feature should be equally performant on many cores vs. few. I'd be OK using Interlocked.CompareExchange on a single global int, but this may break down in a few years, especially if 32/64/128-core systems become the norm in the data center. The lock per key only uses 4K of RAM, minuscule compared to default CLR overhead. I propose that lock per key, using Interlocked.CompareExchange over an array of 1024 ints, be the default behavior. Again, this 4K is only allocated if someone creates a key collapser policy.

Bottom line: the single lock per process degrades roughly linearly as core counts and requests per second increase.

Source code:

I've pasted in my source using Interlocked.CompareExchange instead of lock calls to see if it makes a difference on your machine. I would be curious to know the results.

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

namespace ConcurrentDictionaryLockContention
{
    [Config(typeof(BenchmarkConfig))]
    public class Benchmarks
    {
        public enum TheWork
        {
            NoWork,
            NoExtraLock,
            ExtraLock,
        }

        [ParamsAllValues]
        public TheWork Work { get; set; }

        [ParamsAllValues]
        public bool DifferentKeys { get; set; }

        [ParamsAllValues]
        public bool LockPerKey { get; set; }

        private string GetKey(int seed) => DifferentKeys ? seed.ToString() : commonKey;

        private object GetLockObject(int seed) => LockPerKey ? lockObjects[seed] : commonLockObject;

        public static int ParallelTasks = 100;

        public readonly Task<Lazy<object>>[] tasks;

        public readonly object[] lockObjects;
        public readonly int[] lockObjectInts = new int[1024];
        public int lockObjectInt;

        public Task[] tasksCast;

        public ConcurrentDictionary<string, Lazy<object>> collapser = new ConcurrentDictionary<string, Lazy<object>>(
            /*ParallelTasks, Enumerable.Empty<KeyValuePair<string, Lazy<object>>>(), EqualityComparer<string>.Default*/ // Explicitly setting the concurrencyLevel at creation did not significantly alter this benchmark.
            );

        private readonly Func<object, CancellationToken, object> commonAction = (o, token) => new object();
        private readonly object commonInputObject = new object();
        private readonly CancellationToken commonToken = default;

        private readonly object commonLockObject = new object();
        private const string commonKey = "SameKey";

        private ManualResetEventSlim startSignal;

        private readonly SpinWait spinner = new SpinWait();

        private int WaitingForStartSignal = 0;

        /*private int ActualContentionEncountered = 0;*/

        private readonly Lazy<object> flyweightWork = new Lazy<object>();

        public Benchmarks()
        {
            tasks = new Task<Lazy<object>>[ParallelTasks];

            lockObjects = new object[ParallelTasks];
            for (int i = 0; i < ParallelTasks; i++)
            {
                lockObjects[i] = new object();
            }

            // Avoid thread-pool starvation influencing test results.
            ThreadPool.GetMinThreads(out _, out int completionPortThreads);
            ThreadPool.SetMinThreads(ParallelTasks + 25, completionPortThreads);
        }

        public void Setup()
        {
            collapser = new ConcurrentDictionary<string, Lazy<object>>(
                /*ParallelTasks, Enumerable.Empty<KeyValuePair<string, Lazy<object>>>(), EqualityComparer<string>.Default*/ // Explicitly setting the concurrencyLevel at creation did not significantly alter this benchmark.
            );

            startSignal = new ManualResetEventSlim(false);

            Interlocked.Exchange(ref WaitingForStartSignal, 0);
            /*Interlocked.Exchange(ref ActualContentionEncountered, 0);*/

            Lazy<object> BenchmarkedActivity(string key, object lockObject)
            {
                Interlocked.Increment(ref WaitingForStartSignal);
                startSignal.WaitHandle.WaitOne();

                if (Work == TheWork.ExtraLock)
                {
                    // The commented-out line below can be commented back in (in conjunction with its Console.WriteLine and other related lines) to get a visual indication of how many tasks actually experience lock contention.
                    /*if (!Monitor.TryEnter(lockObject, 0)) { Interlocked.Increment(ref ActualContentionEncountered); } else { Monitor.Exit(lockObject); }*/
                    uint hashCode = 0;
                    if (LockPerKey)
                    {
                        unchecked
                        {
                            hashCode = (uint)key.GetHashCode() % 1024;
                        }
                        while (Interlocked.CompareExchange(ref lockObjectInts[hashCode], 1, 0) == 1)
                        {
                            System.Threading.Thread.Yield();
                        }
                    }
                    else
                    {
                        while (Interlocked.CompareExchange(ref lockObjectInt, 1, 0) == 1)
                        {
                            System.Threading.Thread.Yield();
                        }
                    }
                    try
                    {
                        return collapser.GetOrAdd(key, new Lazy<object>(
                            () => commonAction(commonInputObject, commonToken), /* we don't care what this factory does; we are not running it. We are benchmarking locking over the ConcurrentDictionary<,>.GetOrAdd(). But for similarity to proposed Polly code, it is a delegate of the same form Polly uses. */
                            LazyThreadSafetyMode.ExecutionAndPublication));
                    }
                    finally
                    {
                        if (LockPerKey)
                        {
                            lockObjectInts[hashCode] = 0;
                        }
                        else
                        {
                            lockObjectInt = 0;
                        }
                    }
                }

                else if (Work == TheWork.NoExtraLock)
                {
                    return collapser.GetOrAdd(key, new Lazy<object>(
                        () => commonAction(commonInputObject, commonToken), /* we don't care what this factory does; we are not running it. We are benchmarking locking over the ConcurrentDictionary<,>.GetOrAdd(). But for similarity to proposed Polly code, it is a delegate of the same form Polly uses. */
                        LazyThreadSafetyMode.ExecutionAndPublication));
                }

                else if (Work == TheWork.NoWork) { return flyweightWork; }

                else throw new InvalidOperationException($"Unknown value for {nameof(TheWork)}");
            }

            // Set up actions which will contend in parallel.
            for (int i = 0; i < ParallelTasks; i++)
            {
                var key = GetKey(i);
                var lockObject = GetLockObject(i);
                tasks[i] = Task.Run(() =>
                {
                    return BenchmarkedActivity(key, lockObject);
                });
            }

            tasksCast = tasks.Select(t => t as Task).ToArray();

            /*Console.WriteLine($"Potential contention after queueing tasks: {WaitingForStartSignal}");*/

            // To maximise contention, ensure all Tasks have actually started and are gated at the ManualResetEvent, before proceeding.  
            while (WaitingForStartSignal < ParallelTasks)
            {
                spinner.SpinOnce();
                Thread.Yield();
            }

            /*Console.WriteLine($"Potential contention at the starting gate: {WaitingForStartSignal}");*/
        }

        public void TearDown()
        {
            for (int i = 0; i < ParallelTasks; i++)
            {
                tasks[i] = null;
            }

            collapser = null;

            /*Console.WriteLine($"Actual lock contention encountered: {ActualContentionEncountered} ");*/

            startSignal.Dispose();
            spinner.Reset();
        }

        [Benchmark]
        public void PCLOCD() // ParallelContendLockOverConcurrentDictionary
        {
            Setup();

            startSignal.Set();
            Task.WaitAll(tasksCast);

            TearDown();
        }

        /*
        [Benchmark]
        public object PureConcurrentDictionaryGetOrAdd()
        {
            return collapser.GetOrAdd(commonKey, new Lazy<object>(
                () => commonAction(commonInputObject, commonToken), /* for the purposes of the benchmark we don't care what this factory does; we are not running it. We are benchmarking locking over the ConcurrentDictionary<,>.GetOrAdd() #1#
                LazyThreadSafetyMode.ExecutionAndPublication));
        }
        */

    }
}

@jjxtra (Author) commented Jan 25, 2020

I've shrunk the int array from the above down to 1024 ints (4K RAM) and performance is basically the same, so we can get away with just 4K RAM overhead...

@jjxtra (Author) commented Jan 27, 2020

Gist source code updated to only release the lock if it was acquired, along with a 1024 int array instead of 8192.

@reisenberger

@jjxtra The changes with hasLock look as if they introduce a number of bugs. hasLock is not utilised in a concurrency-safe way (races can invalidate the intent). hasLock is scoped across all locks in the keyLocks[] array, invalidating the design of having keyLocks[].

@jjxtra (Author) commented Jan 31, 2020

@reisenberger Lock state is now a flag: it can only move from 0 to 1 and from 1 to 2. Once at 2, InvalidOperationException is always thrown. The key array is now static; it would probably be better to pass a key array via dependency injection, but this is just a proof of concept. As any ILockProvider* returned from a lock factory would be a new instance and not considered thread safe (i.e. multiple thread entries into AcquireLock for the same instance of an ILockProvider would cause undefined behavior), I think I'm good with how it works now. Multiple entries into AcquireLock from the same thread would throw an exception.
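A rough sketch of the 0 -> 1 -> 2 flag described above (names are illustrative; per the following comments this was later simplified back to a plain bool):

using System;
using System.Threading;

internal sealed class FlaggedKeyLock : IDisposable
{
    private static readonly int[] keyLocks = new int[1024];
    private uint hash;
    private int state; // 0 = new, 1 = acquired, 2 = released

    public IDisposable Acquire(string key)
    {
        // 0 -> 1 is the only legal transition here; a second Acquire, or Acquire after Dispose, throws.
        if (Interlocked.CompareExchange(ref state, 1, 0) != 0)
        {
            throw new InvalidOperationException("Invalid lock state.");
        }
        hash = (uint)key.GetHashCode() % (uint)keyLocks.Length;
        while (Interlocked.CompareExchange(ref keyLocks[hash], 1, 0) == 1)
        {
            Thread.Yield();
        }
        return this;
    }

    public void Dispose()
    {
        // 1 -> 2 is the only legal transition here; once at 2, every further call throws.
        if (Interlocked.CompareExchange(ref state, 2, 1) != 1)
        {
            throw new InvalidOperationException("Invalid lock state.");
        }
        Interlocked.Exchange(ref keyLocks[hash], 0);
    }
}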

@jjxtra (Author) commented Jan 31, 2020

@reisenberger I went ahead and implemented the factory pattern. Let me know what you think! I went back to a simple bool gotLock because the factory creates a new instance each time and the contract returned (IDisposable) does not allow attempting to re-acquire the lock. Also, the returned IDisposable is not guaranteed to be thread safe, so calling dispose from multiple threads is unsupported.

@reisenberger

@jjxtra Again, thanks for everything on the locking!

Some notes just on what turned out different in the main Policy implementation, compared to what discussed previously and what we have in this gist:

  • I was wrong in this case about the benefit of using a struct for the releasers. The compiler does have a special transform for when a using statement directly references a struct (ie using (<some-struct>) { }), to avoid boxing, but here we are returning an IDisposable from a method due to the lock interfaces, so it gets boxed to IDisposable first anyway 🙁. So I switched back to class in the v0.1.0 basic lock implementation.
  • There were a couple of places in my early spike which used the non-thread-safe members on ConcurrentDictionary<,> incl. outside the lock - switched to thread-safe.
  • I'm thinking that AcquireLockAsync needs to return IAsyncDisposable, not just ValueTask<IDisposable>. If the lock-releasing is across network I/O (like to Redis), we want the implementation of that release to be able to be fully async. IAsyncDisposable uses ValueTask as the return type of its release method, so when we are just wrapping a sync implementation, we still get the benefit of ValueTask. (A rough sketch of that shape follows this list.)
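For example (names are illustrative, not the final Polly.Contrib API; Context is the Polly execution context type used in the gist above):

using System;
using System.Threading;
using System.Threading.Tasks;

// The async interface hands back an IAsyncDisposable so a networked lock (e.g. Redis)
// can release asynchronously; IAsyncDisposable.DisposeAsync() already returns ValueTask.
public interface IAsyncLockProvider
{
    ValueTask<IAsyncDisposable> AcquireLockAsync(string key, Context context, CancellationToken cancellationToken);
}

// Wrapping a synchronous releaser stays cheap: DisposeAsync can complete synchronously.
internal sealed class SyncReleaser : IAsyncDisposable
{
    private readonly IDisposable inner;

    public SyncReleaser(IDisposable inner) => this.inner = inner;

    public ValueTask DisposeAsync()
    {
        inner.Dispose();   // release the underlying (synchronous) lock
        return default;    // already-completed ValueTask; no Task allocation
    }
}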

Assuming you push on forward to bring the stripe-locking idea into Polly.Contrib.DuplicateRequestCollapser:

  • We probably wouldn't need the async implementation ProcessLockProviderAsync? There's nothing async in the work it does (setting aside the need to yield asynchronously given it's async), so we could just use a shallow async-over-sync wrapper (sketched after this list). (Just running it async when it doesn't have any need for async work will create extra allocations for the state machines, async continuations etc.)
  • I dug into the source code around Task.Yield(), to be sure I understood what Task.Yield().GetAwaiter().GetResult() would do. Looks like it doesn't offer any yielding behaviour if used sync like that: .GetAwaiter() returns a YieldAwaitable, and GetResult() on that is a no-op. Looks like the benefit of Task.Yield() only comes if it's genuinely used in an async context; the await then schedules the async continuation back onto one or other TaskScheduler, which looks like the mechanism by which await Task.Yield() gets its value - that async continuation has to compete with other async tasks in the scheduler. NB Moot now that we can drop .NET Standard 1.x (:+1:), just sharing what I found cos I had dug into this.
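A minimal sketch of such an async-over-sync wrapper, assuming the ILockProvider / ILockProviderAsync interfaces and Context type from the gist above:

using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SyncToAsyncLockProvider : ILockProviderAsync
{
    private readonly ILockProvider inner;

    public SyncToAsyncLockProvider(ILockProvider inner) => this.inner = inner;

    public ValueTask<IDisposable> AcquireLockAsync(string key, Context context, CancellationToken cancellationToken)
    {
        cancellationToken.ThrowIfCancellationRequested();
        // Acquire synchronously and return an already-completed ValueTask:
        // no async state machine and no Task allocation on this path.
        return new ValueTask<IDisposable>(inner.AcquireLock(key, context));
    }
}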

Thanks again for everything you are doing on the locking. 👍

@jjxtra (Author) commented Feb 7, 2020 via email

@reisenberger

Yep, PR onto Polly.Contrib.DuplicateRequestCollapser.

My recommendation would be a PR to add a new ISyncLockingProvider, not replace/modify InstanceScopedLockProvider, leaving both as options. Curating a widely-used OSS project really brings home (if it is not already obvious) that different users have different needs. Some will benefit from a striped-lock provider; some will have no need. Giving users options (provided they are not confusing), rather than pre-deciding on one option to suit all users, can work well. 👍

Thanks again for helping drive this forward!

@jjxtra (Author) commented Sep 20, 2020

You are welcome!
