Bitcoin Lock-Free CCheckQueue Design & Analysis

Summary

When profiling where bitcoind was spending time and could be optimized, I noticed that the sigcache was locking twice per read. This seemed wasteful, and with Alex Morcos I was able to measure significant contention. We put together a quick and dirty patch: we eliminated the unnecessary locking and used a lock-free queue to batch cleanup tasks. While Alex continued to look into the cache, I turned my attention to other sources of contention. I found that the CCheckQueue design was fairly inefficient, and decided to re-implement it with something better.

Previous Design

The old CCheckQueue worked roughly as follows:

Worker Loop:

  1. Lock
  2. Wait on a condition var for work to be ready
  3. Remove a batch of work from the queue, of size set by magic formula
  4. Unlock
  5. Execute all the work

Master Add:

  1. Lock
  2. Put some work into the queue
  3. Notify workers condvars of new work
  4. Unlock
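
For concreteness, here is a minimal sketch of that pattern; the names (OldCheckQueue, the Check alias) and the batch-size formula are illustrative stand-ins, not the actual Bitcoin Core code:

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

using Check = std::function<bool()>;  // stand-in for the script-check type

struct OldCheckQueue {
    std::mutex m;
    std::condition_variable cv;
    std::vector<Check> jobs;
    bool fQuit = false;

    void WorkerLoop() {
        while (true) {
            std::vector<Check> batch;
            {
                std::unique_lock<std::mutex> lock(m);                      // 1. Lock
                cv.wait(lock, [this] { return !jobs.empty() || fQuit; });  // 2. Wait for work
                if (fQuit && jobs.empty()) return;
                size_t n = std::max<size_t>(1, jobs.size() / 4);           // 3. "magic formula" stand-in
                batch.assign(jobs.end() - n, jobs.end());
                jobs.resize(jobs.size() - n);
            }                                                              // 4. Unlock
            for (Check& check : batch) check();                            // 5. Execute the batch
        }
    }

    void Add(std::vector<Check>& vChecks) {
        std::unique_lock<std::mutex> lock(m);                              // 1. Lock
        for (Check& check : vChecks) jobs.push_back(std::move(check));     // 2. Enqueue work
        cv.notify_all();                                                   // 3. Notify workers
    }                                                                      // 4. Unlock (on scope exit)
};
```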

Simply instrumenting the lock acquisitions showed that they accounted for a significant portion of the total time. Locking is a problem not only when there is contention, but also because it prevents threads from working at the same time.

There were a number of other issues as well, all contributing to the overhead:

  1. Dynamic (Heap) Memory Allocation
  2. Condition Variables are slow
  3. Batch Size Formula is bad

New Design Checklist

  1. No (or minimal) locking
  2. No condition variables
  3. Minimal-to-no heap allocation after initialization
  4. Batch sizes of 1

New Design Rev.1

For the first revision I tried simply dropping in boost::lockfree::queue. However, there were a few difficulties.

  1. Data structures stored in a boost::lockfree::queue must be trivially copyable. We can store a pointer to a structure instead, but if we do, we may run into ABA errors when freeing it. This leads to...
  2. Needing to maintain a to_free_list for cleanup after the round
  3. Contention for the queue head
  4. Needing to use malloc, or to maintain an internal free_memory_pool

Overall this suggested to me that using boost::lockfree::queue would be a very complicated design with a high likelihood of memory leaks. Furthermore, a queue or a stack implements FIFO or LIFO access, whereas we only need WGIMCO (what goes in must come out, that is, unordered) semantics.

New Design Rev.2

The new design works as follows:

We know in advance that the number of checks can be no larger than MAX_SCRIPTCHECKS = MAX_BLOCK_SERIALIZED_SIZE / 41, the number of serialized input structures that can fit into a block.

Therefore, a std::array<Check, MAX_SCRIPTCHECKS> checks is allocated at initialization/compile time. We also create a std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags.

Lastly, we keep an std::atomic<size_t> nTodo denoting up to what index we have filled the array.
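
A rough sketch of this layout, assuming MAX_BLOCK_SERIALIZED_SIZE = 4,000,000 bytes and a std::function stand-in for the check type (neither taken from the actual code):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>

using Check = std::function<bool()>;                        // stand-in for the script-check type
constexpr std::size_t MAX_BLOCK_SERIALIZED_SIZE = 4000000;  // value assumed for illustration
constexpr std::size_t MAX_SCRIPTCHECKS = MAX_BLOCK_SERIALIZED_SIZE / 41;

// Allocated once at initialization, not per block.
struct CheckQueueStorage {
    std::array<Check, MAX_SCRIPTCHECKS> checks;             // job slots, filled by the master
    std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags;   // one "claimed" marker per slot
    std::atomic<std::size_t> nTodo{0};                      // number of slots filled so far

    CheckQueueStorage() {
        for (auto& f : flags) f.clear();                    // all slots start unclaimed
    }
};
```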

A worker now has the following simple algorithm to do work:

  1. Read nTodo
  2. Try to set the flag corresponding to a job at index i < nTodo
  3. If we were first to set it, execute it
  4. Repeat until the master joins and the worker has tried all indices i

To add work, the master only has to:

  1. Fill the array with the incoming vector of checks
  2. Add the size of that vector to nTodo

Like the previous version, the master also joins and becomes a worker.
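
A hedged sketch of the two loops, continuing the storage sketch above; termination, result propagation, sleeping, and cleanup bookkeeping are omitted:

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Worker: claim-and-run loop over the shared storage.
void Worker(CheckQueueStorage& q, std::atomic<bool>& masterJoined)
{
    std::size_t i = 0;
    while (true) {
        std::size_t todo = q.nTodo.load();       // 1. Read nTodo (also publishes checks[0..todo))
        for (; i < todo; ++i) {
            if (!q.flags[i].test_and_set())      // 2./3. First thread to set the flag owns the job...
                q.checks[i]();                   //       ...so it executes it
        }
        // Read masterJoined *before* re-reading nTodo; the reverse order could
        // miss work added just before the master joined.
        if (masterJoined.load() && i == q.nTodo.load())
            break;                               // 4. Master joined and every index was tried
    }
}

// Master: publish a batch of checks, then join as a worker (not shown).
void MasterAdd(CheckQueueStorage& q, std::vector<Check>& vChecks)
{
    std::size_t base = q.nTodo.load();           // only the master writes beyond this point
    for (std::size_t j = 0; j < vChecks.size(); ++j)
        q.checks[base + j] = std::move(vChecks[j]);  // 1. Fill the pre-allocated array
    q.nTodo.fetch_add(vChecks.size());               // 2. Add the batch size to nTodo
}
```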

The C++11 standard guarantees that a std::atomic_flag is lock-free on all platforms. While reading nTodo may not be guaranteed to be lock-free on all platforms, the atomic library will seamlessly fall back to a lock and correctness will not be affected. Performance will not degrade significantly either, because a single read of nTodo can reveal many jobs. The write and subsequent read of nTodo synchronize the checks memory.

The actual code is a bit more complicated than what is described above. This is because the workers must keep track of the progress of other threads, to prevent premature returning or cleanup, and must sleep when there is no work to be done.

Note that this means the code is not actually lock-free. I'm not sure of the proper word to describe it, other than minimally serial sectioned. However, during the most performance-critical section it will be mostly lock-free (enqueuing is lock-free, as is dequeuing). Cleanup and result returning must be somewhat serial.

Optimizations

A number of optimizations are used, or could be used, to further improve performance or scalability. They are grouped below into good and bad (the bad ones are kept here for historical purposes).

Good

  1. Emplace scriptchecks. Currently, the code creates a vector of script checks and then puts them on the queue. However, because there is already memory for them available in the pre-allocated array, we can just put them in there directly.
  2. Smart ordering. Rather than have all workers contend for the same flags at the same time, each worker only tries jobs where job_id ≡ worker_id (mod n_workers) until the master joins. Once the master has joined, the worker scans the other residue classes as well, trying job_id ≡ worker_id + 1, worker_id + 2, ... (mod n_workers) until it wraps back around to job_id ≡ worker_id (see the sketch after this list).
  3. Distributed cleanup. Rather than have one thread take care of cleanup, we can let every thread do a portion of the cleanups required.
  4. Early master return. There are certain cleanup tasks that must be done before a new checkqueue controller can be created. Rather than have the master wait for these, we allow the master to exit immediately and have some worker do the master's cleanup. This decreases latency.
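
Below is a hedged sketch of the traversal order from optimization 2; SmartOrder and try_job are illustrative names, not the actual implementation:

```cpp
#include <cstddef>

// try_job(i) is a hypothetical callback that does the test_and_set on
// flags[i] and runs checks[i] if it won the claim.
template <typename TryJob>
void SmartOrder(std::size_t worker_id, std::size_t n_workers, std::size_t nTodo,
                bool masterJoined, TryJob try_job)
{
    if (!masterJoined) {
        // Before the master joins: only touch our own residue class.
        for (std::size_t i = worker_id; i < nTodo; i += n_workers)
            try_job(i);
    } else {
        // After the master joins: sweep worker_id + 1, worker_id + 2, ...,
        // wrapping around until we end on our own class again.
        for (std::size_t p = 1; p <= n_workers; ++p) {
            std::size_t cls = (worker_id + p) % n_workers;
            for (std::size_t i = cls; i < nTodo; i += n_workers)
                try_job(i);
        }
    }
}
```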

Bad

  1. Atomic lock-free instructions usually depend on the cache coherence protocol's MSI states at cache-line granularity. This means that two adjacent atomic<uint32_t>s may contend (false sharing) on an atomic load or store. Therefore, if we pad out all of the entries in our arrays we should be able to improve performance (see the sketch after this list). However, in testing I found this not to be the case: unpadded access was actually faster.
  2. Atomic reads are expensive, so we should cache them for performance reasons. In practice caching does not seem to help much, because the hardware implementation of an atomic load is likely near-zero overhead if the data is not dirty, and if the data is dirty we want to observe it immediately. Caching is still used where the ordering of observations is important (e.g., reading masterJoined after nTodo would be bad).
  3. Previously, we tried to do something smarter with the ordering of processing jobs, but the bookkeeping overhead was quite high. The right data structure to eventually move to is an efficient interval tree, with helpers processing from the max key and the main worker processing from the min keys.
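
For reference, a sketch of the padded layout from item 1, assuming a 64-byte cache line (this is the variant that measured slower in practice, and it costs roughly 64x the memory):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;      // assumed cache-line size

// Unpadded: many one-byte flags share a cache line, so claims by different
// workers can bounce the same line between cores.
// std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags;

// Padded: one flag per cache line (flags cleared before use, as in the
// unpadded version).
struct alignas(kCacheLine) PaddedFlag {
    std::atomic_flag flag;
};
// std::array<PaddedFlag, MAX_SCRIPTCHECKS> padded_flags;
```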

Future Work

There is currently ongoing work to make the sigcache lock-free; I may work to integrate that in a cleaner way with the lock-free checkqueue so that sigcache erases occur out of band of the master.

Acknowledgement

Thanks to Alex Morcos, Suhas Daftuar, Cory Fields, and Matt Corallo for preliminary review and feedback on this work.
