Bitcoin Lock-Free CCheckQueue Design & Analysis

Summary

When profiling where bitcoind was spending time and could be optimized, I noticed that the sigcache was locking twice per read. This seemed wasteful, and with Alex Morcos I was able to measure significant contention. We put together a quick and dirty patch: we eliminated the unnecessary locking and used a lock-free queue to batch cleanup tasks. While Alex continued to look into the cache, I turned my attention to other sources of contention. I found that the CCheckQueue design was fairly inefficient, and decided to re-implement it with something better.

Previous Design

The old CCheckQueue worked roughly as follows:

Worker Loop:

  1. Lock
  2. Wait on a condition var for work to be ready
  3. Remove a batch of work from the queue, of size set by magic formula
  4. Unlock
  5. Execute all the work

Master Add:

  1. Lock
  2. Put some work into the queue
  3. Notify workers condvars of new work
  4. Unlock
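
For concreteness, here is a minimal sketch of that pattern; the names (OldCheckQueue, the Check alias) and the batch-size formula are illustrative stand-ins, not the actual Bitcoin Core code:

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

using Check = std::function<bool()>;  // stand-in for the script-check type

struct OldCheckQueue {
    std::mutex m;
    std::condition_variable cv;
    std::vector<Check> jobs;
    bool fQuit = false;

    void WorkerLoop() {
        while (true) {
            std::vector<Check> batch;
            {
                std::unique_lock<std::mutex> lock(m);                      // 1. Lock
                cv.wait(lock, [this] { return !jobs.empty() || fQuit; });  // 2. Wait for work
                if (fQuit && jobs.empty()) return;
                size_t n = std::max<size_t>(1, jobs.size() / 4);           // 3. "magic formula" stand-in
                batch.assign(jobs.end() - n, jobs.end());
                jobs.resize(jobs.size() - n);
            }                                                              // 4. Unlock
            for (Check& check : batch) check();                            // 5. Execute the batch
        }
    }

    void Add(std::vector<Check>& vChecks) {
        std::unique_lock<std::mutex> lock(m);                              // 1. Lock
        for (Check& check : vChecks) jobs.push_back(std::move(check));     // 2. Enqueue work
        cv.notify_all();                                                   // 3. Notify workers
    }                                                                      // 4. Unlock (on scope exit)
};
```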

Simply instrumenting the lock acquisitions showed that they accounted for a significant portion of the total time. Locking is a problem not only when there is contention, but also because it prevents threads from working at the same time.

There were a number of other issues as well, all contributing to the overhead:

  1. Dynamic (Heap) Memory Allocation
  2. Condition Variables are slow
  3. Batch Size Formula is bad

New Design Checklist

  1. No (or minimal) locking
  2. No condition variables
  3. Minimal-to-no heap allocation after initialization
  4. Batch sizes of 1

New Design Rev.1

For the first revision I tried simply dropping in boost::lockfree::queue. However, there were a few difficulties.

  1. Data structures stored in a boost::lockfree::queue must be trivially copyable. We can store a pointer to a structure instead, but if we do, we may run into ABA errors when freeing it. This leads to...
  2. Needing to maintain a to_free_list for cleanup after the round
  3. Contention for the queue head
  4. Needing to use malloc, or to maintain an internal free_memory_pool

Overall this suggested to me that using boost::lockfree::queue would be a very complicated design with a high likelihood of memory leaks. Furthermore, a queue or a stack implements FIFO or LIFO access, whereas we only need WGIMCO (what goes in must come out, that is, unordered) semantics.

New Design Rev.2

The new design works as follows:

We know in advance that the number of checks can be no larger than MAX_SCRIPTCHECKS = MAX_BLOCK_SERIALIZED_SIZE / 41, the number of serialized input structures that can fit into a block.

Therefore, a std::array<Check, MAX_SCRIPTCHECKS> checks is allocated at initialization/compile time. We also create a std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags.

Lastly, we keep an std::atomic<size_t> nTodo denoting up to what index we have filled the array.
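
A rough sketch of this layout, assuming MAX_BLOCK_SERIALIZED_SIZE = 4,000,000 bytes and a std::function stand-in for the check type (neither taken from the actual code):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>

using Check = std::function<bool()>;                        // stand-in for the script-check type
constexpr std::size_t MAX_BLOCK_SERIALIZED_SIZE = 4000000;  // value assumed for illustration
constexpr std::size_t MAX_SCRIPTCHECKS = MAX_BLOCK_SERIALIZED_SIZE / 41;

// Allocated once at initialization, not per block.
struct CheckQueueStorage {
    std::array<Check, MAX_SCRIPTCHECKS> checks;             // job slots, filled by the master
    std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags;   // one "claimed" marker per slot
    std::atomic<std::size_t> nTodo{0};                      // number of slots filled so far

    CheckQueueStorage() {
        for (auto& f : flags) f.clear();                    // all slots start unclaimed
    }
};
```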

A worker now has the following simple algorithm to do work:

  1. Read nTodo
  2. Try to set the flag corresponding to a job at index i < nTodo
  3. If we were first to set it, execute it
  4. Repeat until the master joins and the worker has tried all indices i

To add work, the master only has to:

  1. Fill the array with the incoming vector of checks
  2. Add the size of that vector to nTodo

Like the previous version, the master also joins and becomes a worker.
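
A hedged sketch of the two loops, continuing the storage sketch above; termination, result propagation, sleeping, and cleanup bookkeeping are omitted:

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Worker: claim-and-run loop over the shared storage.
void Worker(CheckQueueStorage& q, std::atomic<bool>& masterJoined)
{
    std::size_t i = 0;
    while (true) {
        std::size_t todo = q.nTodo.load();       // 1. Read nTodo (also publishes checks[0..todo))
        for (; i < todo; ++i) {
            if (!q.flags[i].test_and_set())      // 2./3. First thread to set the flag owns the job...
                q.checks[i]();                   //       ...so it executes it
        }
        // Read masterJoined *before* re-reading nTodo; the reverse order could
        // miss work added just before the master joined.
        if (masterJoined.load() && i == q.nTodo.load())
            break;                               // 4. Master joined and every index was tried
    }
}

// Master: publish a batch of checks, then join as a worker (not shown).
void MasterAdd(CheckQueueStorage& q, std::vector<Check>& vChecks)
{
    std::size_t base = q.nTodo.load();           // only the master writes beyond this point
    for (std::size_t j = 0; j < vChecks.size(); ++j)
        q.checks[base + j] = std::move(vChecks[j]);  // 1. Fill the pre-allocated array
    q.nTodo.fetch_add(vChecks.size());               // 2. Add the batch size to nTodo
}
```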

The C++11 standard guarantees that a std::atomic_flag is lock-free on all platforms. While reading nTodo may not be guaranteed to be lock-free on all platforms, the atomic library will seamlessly fall back to a lock and correctness will not be affected. Performance will not degrade significantly either, because a single read of nTodo can reveal many jobs. The write and subsequent read of nTodo synchronize the checks memory.

The actual code is a bit more complicated than what is described above. This is because the workers must keep track of the progress of other threads, to prevent premature returning or cleanup, and must sleep when there is no work to be done.

Note that this means the code is not actually lock-free. I'm not sure of the proper word to describe it, other than minimally serial sectioned. However, during the most performance-critical section it will be mostly lock-free (enqueuing is lock-free, as is dequeuing). Cleanup and result returning must be somewhat serial.

Optimizations

A number of optimizations are used, or could be used, to further improve performance or scalability. They are grouped below into good and bad (the bad ones are kept here for historical purposes).

Good

  1. Emplace scriptchecks. Currently, the code creates a vector of script checks and then puts them on the queue. However, because there is already memory for them available in the pre-allocated array, we can just put them in there directly.
  2. Smart ordering. Rather than have all workers contend for the same flags at the same time, each worker only tries jobs where job_id ≡ worker_id (mod n_workers) until the master joins. Once the master has joined, the worker scans the other residue classes as well, trying job_id ≡ worker_id + 1, worker_id + 2, ... (mod n_workers) until it wraps back around to job_id ≡ worker_id (see the sketch after this list).
  3. Distributed cleanup. Rather than have one thread take care of cleanup, we can let every thread do a portion of the cleanups required.
  4. Early master return. There are certain cleanup tasks that must be done before a new checkqueue controller can be created. Rather than have the master wait for these, we allow the master to exit immediately and have some worker do the master's cleanup. This decreases latency.
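
Below is a hedged sketch of the traversal order from optimization 2; SmartOrder and try_job are illustrative names, not the actual implementation:

```cpp
#include <cstddef>

// try_job(i) is a hypothetical callback that does the test_and_set on
// flags[i] and runs checks[i] if it won the claim.
template <typename TryJob>
void SmartOrder(std::size_t worker_id, std::size_t n_workers, std::size_t nTodo,
                bool masterJoined, TryJob try_job)
{
    if (!masterJoined) {
        // Before the master joins: only touch our own residue class.
        for (std::size_t i = worker_id; i < nTodo; i += n_workers)
            try_job(i);
    } else {
        // After the master joins: sweep worker_id + 1, worker_id + 2, ...,
        // wrapping around until we end on our own class again.
        for (std::size_t p = 1; p <= n_workers; ++p) {
            std::size_t cls = (worker_id + p) % n_workers;
            for (std::size_t i = cls; i < nTodo; i += n_workers)
                try_job(i);
        }
    }
}
```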

Bad

  1. Atomic lock-free instructions usually depend on the cache coherence protocol's MSI states at cache-line granularity. This means that two adjacent atomic<uint32_t>s may contend (false sharing) on an atomic load or store. Therefore, if we pad out all of the entries in our arrays we should be able to improve performance (see the sketch after this list). However, in testing I found this not to be the case: unpadded access was actually faster.
  2. Atomic reads are expensive, so we should cache them for performance reasons. In practice caching does not seem to help much, because the hardware implementation of an atomic load is likely near-zero overhead if the data is not dirty, and if the data is dirty we want to observe it immediately. Caching is still used where the ordering of observations is important (e.g., reading masterJoined after nTodo would be bad).
  3. Previously, we tried to do something smarter with the ordering of processing jobs, but the bookkeeping overhead was quite high. The right data structure to eventually move to is an efficient interval tree, with helpers processing from the max key and the main worker processing from the min keys.
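
For reference, a sketch of the padded layout from item 1, assuming a 64-byte cache line (this is the variant that measured slower in practice, and it costs roughly 64x the memory):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;      // assumed cache-line size

// Unpadded: many one-byte flags share a cache line, so claims by different
// workers can bounce the same line between cores.
// std::array<std::atomic_flag, MAX_SCRIPTCHECKS> flags;

// Padded: one flag per cache line (flags cleared before use, as in the
// unpadded version).
struct alignas(kCacheLine) PaddedFlag {
    std::atomic_flag flag;
};
// std::array<PaddedFlag, MAX_SCRIPTCHECKS> padded_flags;
```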

Future Work

There is currently ongoing work to make the sigcache lock-free; I may work to integrate that in a cleaner way with the lock-free checkqueue so that sigcache erases occur out of band of the master.

Acknowledgement

Thanks to Alex Morcos, Suhas Daftuar, Cory Fields, and Matt Corallo for preliminary review and feedback on this work.
