Rakudo gets a new thread pool

Vienna.pm have funded me to work 50 hours on Perl 6. After some discussion, we decided I would first work on improving the thread pool scheduler, and then move on to continuing the work around non-blocking await, a feature of the upcoming Perl 6.d. In this post I'll discuss the work on the thread pool scheduler, which was merged shortly after the latest Rakudo release to maximize testing time, and thus will appear in the next release (2017.10).

What was wrong with the ThreadPoolScheduler before?

My (by now slightly hazy) memory is that I wrote the initial Perl 6 thread pool implementation in the space of an hour or two on a train, probably heading to some Perl event or other. It was one of those "simplest thing that could possibly work" implementations that turned out to work well enough that it survived with only isolated tweaks and fixes until January this year. At that point, I added the initial bits of support for non-blocking await. Even that change was entirely additive, leaving the existing scheduling mechanism completely intact.

When I first implemented the thread pool, there were approximately no people writing concurrent Perl 6 programs. Happily, that's changed, but with it came a range of bug reports that could be traced back to the thread pool. Along with those, there were things that, while not being outright bugs, were far from ideal.

Before I dug into a redesign, I first summarized all of the problems I was aware of with the thread pool, so that as I considered new designs I could "crash test" them against the problems that afflicted the previous one. Here's that list.

  1. It was wasteful, spawning far too many threads in quite a lot of cases. A single use of Proc::Async would cause some threads to be created. A second use after that would add some more. A third use would add more. Even if these uses were not concurrent, threads would still be added, up until the maximum pool size. It was a very conservative way to make sure a thread would be available for each piece of work, provided the pool didn't reach exhaustion point. But really, the pool doesn't need more than a thread or two for this. RT #131915 was a ticket resulting from this behavior: each thread adds to the memory consumption of the program, making memory use a good bit higher than it should have been for programs that really didn't need all of those threads.
  2. Very active async I/O could swamp the thread pool with events. Since there was a single queue, timer events - which might be used to kill off a process after a timeout - might not fire in anything like a timely manner, because they would be stuck behind all of the I/O events (the process-with-timeout pattern sketched just after this list is a typical victim). RT #130370
  3. Sometimes, despite the conservative mechanism described in #1, not enough threads got started, and this led to a deadlock. RT #122709
  4. It didn't try to scale the number of threads to match the available CPU cores in any way. Just because there is a lot of work in the queue does not mean that we need more threads; if we are making progress and the CPU load is pretty high then adding more threads is unlikely to lead to more progress.
  5. On high-core-count systems, the default limit of 16 threads was too low. 32-core CPUs very much exist by now!
  6. For programs that manage to really block a lot of threads with non-CPU bound work, we could deadlock. Most of that will go away with non-blocking await, but there will still be ways to write code that really blocks a bunch of OS threads and can't make progress without yet more threads being available to run something. At some point, people just have to write their programs differently, but the default limit of 16 threads was not very generous.
  7. Despite wishing to raise the maximum default pool size a good bit, we couldn't do it, because problems #1 and #4 meant we'd likely end up hitting the maximum in most programs, and memory use would have become totally unreasonable.
  8. We suffered from poor thread affinity for events that would inevitably queue up due to serial supplies. For example, if two packets arrived from a socket then they might be picked up by different threads in the pool. If the first packet was still being processed, then the second would contend for the lock used to enforce serial processing of messages by supplies, and block until it was available.
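
To make problems 1 and 2 concrete, here is a small sketch of the kind of program that hit them: each run of an external process goes through Proc::Async, and the timeout depends on a timer event that, under the old single queue, could end up stuck behind a flood of output events. The command name is made up for illustration.

```perl6
# Run an external command, but give up on it after a timeout. Under the
# old scheduler, each such run could grow the pool, and the Promise.in
# timer shared a queue with the stdout events, so a chatty process could
# delay the kill considerably. The command here is hypothetical.
sub run-with-timeout($command, $timeout = 5) {
    my $proc = Proc::Async.new($command);
    react {
        whenever $proc.stdout.lines -> $line {
            say "OUT: $line";
        }
        whenever $proc.start {
            done;               # the process finished by itself
        }
        whenever Promise.in($timeout) {
            $proc.kill;         # the timer fired first: kill the process
        }
    }
}

run-with-timeout('some-chatty-command');
```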

3 kinds of queue

Problems 2 (active I/O swamping the queue and delaying timers) and 8 (poor thread affinity) are resolved in the new thread pool by recognizing that not all work given to the pool is equal. Timers really want to be dealt with in a timely manner, so in programs that have timers it makes sense to have a queue, and one or more worker threads, exclusively for time-based events. Work items that will need processing sequentially anyway should be assigned to a single thread, implying that each such "affinity worker" gets a queue of its own. Finally, there is a general queue for everything else, and one or more threads all eat from it.
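
For the affinity case, here is a minimal sketch of the serial-supply situation from problem 8 (the address and port are invented): all lines arriving on one connection are processed in order, which is exactly the kind of work the new scheduler aims to hand to a single affinity worker rather than have general workers contend over.

```perl6
# A tiny echo server. Messages from a given connection arrive on a serial
# supply and must be handled one at a time; under the new scheduler such
# work is intended to land on one affinity worker and its private queue.
react {
    whenever IO::Socket::Async.listen('127.0.0.1', 4242) -> $conn {
        whenever $conn.Supply.lines -> $line {
            $conn.print("echoed: $line\n");
        }
    }
}
```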

Adding a supervisor

The other problems were addressed by the addition of a Sufficiently Smart Supervisor. The supervisor is a thread created by the thread pool upon its first use, and living from then until the end of the program. It spends most of its time sleeping, waking up around 100 times a second to check how things are going. It is the supervisor that makes most of the decisions about how many threads to add to the pool. It factors in:

  • Demand - that is, how much work there is that needs doing. The supervisor determines this by looking at the lengths of the timer queue and the general queue. If they are empty, then there's clearly no demand for extra workers. If they have something in them, that's a sign that we might want more workers.
  • Available workers - if there's a worker thread that's currently not doing anything, then there's little motivation to add another worker.
  • CPU core count - for CPU-bound work there's not much point in having more threads than there are CPU cores, because the OS will have to juggle them and it won't do CPU cache hit rates any good. Modern OS schedulers are good at dealing with lots of threads, and these days the MoarVM GC also copes well with having a load of threads to sync up. Still, each extra thread brings memory overhead. Making the CPU core count a "soft limit" on the number of threads to spawn is usually a reasonable choice.
  • Resource usage - suppose there's work in the queue, but the scheduler has already spawned as many threads as there are CPU cores. At this point, a couple of situations may apply. Perhaps there's a queue because all the threads are hard at work on CPU-bound tasks, in which case adding more threads beyond the core count will not help matters. Alternatively, the thread pool may be full of threads that are somehow blocked, so there isn't much, or any, progress at all. In the worst case, there may be a deadlock due to the thread pool being full of blocked threads, and adding an extra thread would break that deadlock. To tell these situations apart, the supervisor looks at the program's CPU usage. If it's really low, then the blocked or deadlocked situation is relatively likely, so threads may be added beyond the CPU core count.
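
Putting those factors together, the per-tick decision amounts to something like the following sketch. It's reduced to a pure function over the quantities just described; the names, the 0.1 CPU-usage threshold, and the example numbers are illustrative, not the actual Rakudo code or values.

```perl6
sub should-add-general-worker(
    Int     $queued,      # items waiting on the timer and general queues
    Bool    $any-idle,    # is some existing worker currently waiting for work?
    Int     $workers,     # current number of general worker threads
    Int     $cores,       # CPU core count
    Numeric $cpu-usage,   # rough fraction of CPU used since the last wakeup
    Int     $max = 64     # maximum pool size (the new default)
) {
    return False unless $queued > 0;      # no demand: nothing to do
    return False if $any-idle;            # spare capacity already exists
    return True  if $workers < $cores;    # under the core count: scale up
    # Beyond the core count only when progress appears to have stalled,
    # which is the probably-blocked (or even deadlocked) situation.
    return $cpu-usage < 0.1 && $workers < $max;
}

# The supervisor wakes around 100 times a second and applies a check of
# this shape, spawning another general worker whenever it answers True.
say should-add-general-worker(5, False, 4, 8, 0.9);    # True: under core count
say should-add-general-worker(5, False, 8, 8, 0.95);   # False: CPU already busy
say should-add-general-worker(5, False, 8, 8, 0.02);   # True: probably blocked
```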

Having the pool soft-limited to the CPU core count, and only reluctantly able to go beyond that, means that we can raise the maximum default pool size quite a lot; it now stands at 64 threads. In reality, many programs don't even reach the CPU core count, as there's never enough work to trigger the addition of more threads.
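
For programs where 64 isn't the right cap - say, ones that deliberately block a lot of threads - it can be overridden. As a sketch (assuming the max_threads named argument to ThreadPoolScheduler.new, and noting that the RAKUDO_MAX_THREADS environment variable adjusts the default pool), a program could install its own scheduler for a scope:

```perl6
# Install a scheduler with a higher cap as the dynamic $*SCHEDULER, so
# that start blocks (and other pool users) in this scope run on it.
# Setting RAKUDO_MAX_THREADS=128 in the environment is the other way to
# raise the limit of the default pool.
{
    my $*SCHEDULER = ThreadPoolScheduler.new(max_threads => 128);
    my @promises = (^200).map(-> $n { start { sleep 0.1; $n ** 2 } });
    my @results  = await @promises;
    say @results.elems;   # 200
}
```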

Affinity workers

Affinity worker threads aren't added by the supervisor. Instead, when an affinity queue is requested, any existing affinity worker threads will have their queue lengths inspected. The one with the shortest queue will be picked, provided that the queue length is below a threshold. If it's over the threshold (which increases per affinity worker thread added), then another affinity worker thread will be spawned and the work allocated to it.
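
Expressed as a sketch, the choice looks something like this; the threshold formula, names, and numbers are made up for illustration rather than taken from the real code.

```perl6
# Given the queue lengths of the existing affinity workers, return the
# index of the worker to hand the new affinity queue to, or -1 to signal
# that a fresh affinity worker should be spawned instead.
sub pick-affinity-worker(@queue-lengths, :$base-threshold = 8) {
    return -1 unless @queue-lengths;              # none yet: spawn the first

    # The threshold scales with the number of affinity workers already
    # present, making us ever more reluctant to add another one.
    my $threshold = $base-threshold * @queue-lengths.elems;

    my $shortest = @queue-lengths.minpairs.head;  # Pair of index => length
    return $shortest.value < $threshold
        ?? $shortest.key                          # use the least loaded worker
        !! -1;                                    # all too busy: spawn another
}

# With three affinity workers whose queues hold 3, 25 and 40 items, the
# first (index 0) is picked, since 3 is under the threshold of 24.
say pick-affinity-worker([3, 25, 40]);   # 0
```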

Outcomes

For all the cases I've tried so far, the new scheduler seems to do at least as well as the one it replaced - and sometimes much better. The deadlock caused by not enough threads being started is gone, thanks to the supervisor. Programs that do the occasional bit of work needing pool threads end up with just a thread or two, greatly reducing their memory consumption. A Cro web application that is processing just a handful of requests a second will now spawn just a few threads (and so use much less memory and get better locality), while one faced with some hundreds of requests per second will spawn more. And a program doing a lot of CPU-bound work now spawns as many threads as there are cores, which gives a small speedup compared to oversubscribing the CPU under the previous scheduler. Finally, timer events are delivered and handled in a timely way, even when there is a lot of output from a process.

And next?

Well, most immediately I should write some tests, so I can get the RT tickets mentioned earlier in the article closed up. Feel free to beat me to doing that, though! :-)

That aside, I'd like to mention that while the new scheduler is a decided improvement, it's also improvable! The heuristics it uses to decide when to add further threads can surely be tuned some more. The code doing that decision making is written in Perl 6, which I hope makes it accessible to those who would like to experiment and tweak.

Once again, thanks to Vienna.pm for making this work possible. And, finally, a mention that I'm still looking for funding to help me keep on doing Perl 6 things for a sizeable chunk of my work time; a handful of companies each sponsoring 10 hours a month would soon add up!
