@djspiewak
Last active September 16, 2024 07:18

Thread Pools

Thread pools on the JVM should usually be divided into the following three categories:

  1. CPU-bound
  2. Blocking IO
  3. Non-blocking IO polling

Each of these categories has a different optimal configuration and usage pattern.

For CPU-bound tasks, you want a bounded thread pool which is pre-allocated and fixed to exactly the number of CPUs. The only work you will be doing on this pool will be CPU-bound computation, and so there is no sense in exceeding the number of CPUs unless you happen to have a really particular workflow that is amenable to hyperthreading (in which case you could go with double the number of CPUs). Note that the old wisdom of "number of CPUs + 1" comes from mixed-mode thread pools where CPU-bound and IO-bound tasks were merged. We won't be doing that.
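A minimal sketch of such a pool using plain java.util.concurrent (the class and thread names are illustrative, not prescriptive):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ComputePool {
    static final int CPUS = Runtime.getRuntime().availableProcessors();

    // Pre-allocated, fixed to exactly the number of CPUs; daemon threads so
    // the pool never prevents JVM shutdown.
    static final ExecutorService COMPUTE =
        Executors.newFixedThreadPool(CPUS, r -> {
            Thread t = new Thread(r, "compute");
            t.setDaemon(true);
            return t;
        });

    public static void main(String[] args) throws Exception {
        // CPU-bound work only: never block on this pool.
        long sum = COMPUTE.submit(() -> {
            long acc = 0;
            for (long i = 0; i < 1_000_000; i++) acc += i;
            return acc;
        }).get();
        System.out.println(sum); // 499999500000
        COMPUTE.shutdown();
        COMPUTE.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```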

The problem with a fixed thread pool is that any blocking IO operation (well, any blocking operation at all) will eat a thread, which is an extremely finite resource. Thus, we want to avoid blocking at all costs on the CPU-bound pool. Unfortunately, this isn't always possible (e.g. when being forced to use a blocking IO library). When this is the case, you should always push your blocking operations (IO or otherwise) over to a separate thread pool. This separate thread pool should be caching and unbounded with no pre-allocated size. To be clear, this is a very dangerous type of thread pool. It isn't going to prevent you from just allocating more and more threads as the others block, which is a very dangerous state of affairs. You need to make sure that any data flow which results in running actions on this pool is externally bounded, meaning that you have semantically higher-level checks in place to ensure that only a fixed number of blocking actions may be outstanding at any point in time (this is often done with a non-blocking bounded queue).
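One way to sketch the external bounding described above is a cached pool paired with a Semaphore acting as the higher-level check (the bound of 16, and all names here, are illustrative assumptions):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingPool {
    // Caching, unbounded, no pre-allocated size: dangerous on its own.
    static final ExecutorService BLOCKING = Executors.newCachedThreadPool();

    // The external bound: at most 16 blocking actions outstanding at once.
    static final Semaphore PERMITS = new Semaphore(16);

    static void runBlocking(Runnable task) throws InterruptedException {
        PERMITS.acquire(); // back-pressure: the caller waits when saturated
        BLOCKING.execute(() -> {
            try { task.run(); } finally { PERMITS.release(); }
        });
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 32; i++) {
            runBlocking(() -> {
                try { Thread.sleep(10); } catch (InterruptedException e) { }
                done.incrementAndGet();
            });
        }
        BLOCKING.shutdown();
        BLOCKING.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(done.get()); // 32
    }
}
```

In real code the bound is more often a non-blocking bounded queue at a semantically higher level, as the text says; the semaphore is just the smallest illustration of the same idea.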

The final category of useful threads (assuming you're not a Swing/SWT application) is asynchronous IO polls. These threads basically just sit there asking the kernel whether or not there is a new outstanding async IO notification, and forward that notification on to the rest of the application. You want to handle this with a very small number of fixed, pre-allocated threads. Many applications handle this task with just a single thread! These threads should be given the maximum priority, since the application latency will be bounded around their scheduling. You need to be careful though to never do any work whatsoever on this thread pool! Never ever ever. The moment you receive an async notification, you should be immediately shifting back to the CPU pool. Every nanosecond you spend on the async IO thread(s) is added latency on your application. For this reason, some applications may find slightly better performance by making their async IO pool 2 or 4 threads in size, rather than the conventional 1.
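The shape of the pattern can be sketched as follows: a single max-priority polling thread that does nothing but shift incoming notifications to the compute pool. The onNotification hook here is a hypothetical stand-in; a real poller would be sitting on epoll/kqueue:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PollerPool {
    static final ExecutorService COMPUTE =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // One pre-allocated, maximum-priority thread; application latency is
    // bounded around its scheduling.
    static final ExecutorService POLLER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "io-poller");
        t.setPriority(Thread.MAX_PRIORITY);
        t.setDaemon(true);
        return t;
    });

    // Simulated notification: the poller does no work, only hands off.
    static Future<String> onNotification(String payload) throws Exception {
        return POLLER.submit(() ->
            COMPUTE.submit(() -> "processed: " + payload) // shift immediately
        ).get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(onNotification("event-1").get()); // processed: event-1
        COMPUTE.shutdown();
        POLLER.shutdown();
    }
}
```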

Global Thread Pools

I've seen a lot of advice floating around about not using global thread pools, such as scala.concurrent.ExecutionContext.global. This advice is rooted in the fact that global thread pools can be accessed by arbitrary code (often library code) and you cannot (easily) ensure that this code is using the thread pool appropriately. How much of a concern this is for you depends a lot on your classpath. Global thread pools are pretty darn convenient, but by the same token, it also isn't all that hard to have your own application-internal global pools. So… it doesn't hurt.

On that note, view with extreme suspicion any framework or library which either a) makes it difficult to configure the thread pool, or b) just straight-up defaults to a pool that you cannot control.

Either way, you're almost always going to have some sort of singleton object somewhere in your application which just has these three pools, pre-configured for use. If you subscribe to the "implicit ExecutionContext pattern", then you should make the CPU pool the implicit one, while the others must be explicitly selected.
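One possible shape for such a singleton in plain Java (names and configuration choices are assumptions; a Scala codebase would typically expose the CPU pool as an implicit ExecutionContext instead):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Application-wide holder for the three pool categories described above.
public final class Pools {
    private Pools() {}

    // 1. CPU-bound: fixed to the number of CPUs.
    public static final ExecutorService CPU =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // 2. Blocking IO: caching and unbounded; must be externally bounded.
    public static final ExecutorService BLOCKING =
        Executors.newCachedThreadPool();

    // 3. Non-blocking IO polling: one max-priority thread, no work done here.
    public static final ExecutorService POLLING =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "io-polling");
            t.setPriority(Thread.MAX_PRIORITY);
            t.setDaemon(true);
            return t;
        });
}
```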

@djspiewak

It's a good question to ask. :-) async does not block a thread; that's basically the whole point. However, if the underlying effect blocks the thread, then obviously async can't really save you. So in this case, if the Future is non-blocking (as it should be!) then wrapping it up with async will convert it into an IO which runs the Future and produces the result, without any blocking whatsoever. If the Future does block, then neither async nor IO will make the problem any worse, but the thread will be blocked nonetheless.
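The same non-blocking handoff can be illustrated with plain CompletableFuture: registering a continuation consumes no thread while the value is pending, which is the essence of what async exploits. This is an analogy in Java, not Cats Effect's actual implementation:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncWrap {
    public static void main(String[] args) throws Exception {
        // A value produced asynchronously elsewhere.
        CompletableFuture<Integer> fut =
            CompletableFuture.supplyAsync(() -> 21);

        // Non-blocking: no thread waits here; the continuation runs
        // whenever the value becomes available.
        CompletableFuture<Integer> doubled = fut.thenApply(n -> n * 2);

        // get() is used only to observe the result in this demo; it is
        // exactly the kind of blocking call you would avoid in real code.
        System.out.println(doubled.get()); // 42
    }
}
```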

Does that mostly answer the question?

@ashwinbhaskar

@djspiewak yes, that answers it:) thank you for your patience and explanation!:))

@djerraballi

Is Loom's promise to remove blocking threads, or to remove threads being a finite/scarce resource? There will always be blocking operations: locks not just on OS resources, but for safe concurrent access to shared memory objects.

My understanding is that Loom will help greatly expand the size of our thread pools such that what bottlenecks throughput would be the underlying contentious resources, and not the artificial constraint of the thread pool size and contention.

@djspiewak

Is Loom's promise to remove blocking threads, or to remove threads being a finite/scarce resource? There will always be blocking operations: locks not just on OS resources, but for safe concurrent access to shared memory objects.

Kernel threads are always going to be a scarce resource. If you really boil it down, the true underlying resources here are the physical threads provided by the hardware, which are physically limited by definition. Even ascending the abstraction tower though, kernel threads are relatively heavyweight both in the operating system itself and within the JVM. In general, it's difficult for a single process to have more than a few thousand of them, even with careful tuning, and it is always more optimal to have vastly fewer.

What Loom does is play the same trick as frameworks like Cats Effect, which is to say, it creates an abstraction on top of the underlying kernel threads (which it calls "carrier threads"). This abstraction is very lightweight and strictly (sort of…) non-blocking, which makes it possible to have many millions of them within a single process without causing problems. Perhaps confusingly, Loom defines this abstraction to be Thread itself and integrates it directly into the JVM, meaning that any code written on the JVM is able to take advantage of it (as opposed to frameworks like Cats Effect, where you need to explicitly opt-in to things like IO or Future).

So what's happening here is Thread is being redefined to be a more lightweight abstraction sitting on top of underlying carrier threads, which are just as scarce and heavyweight as they've ever been.

The tradeoff is that you need to be very careful with things that hard-block the underlying carrier thread. Loom tries to solve this problem by integrating very tightly into the JVM and the Java Standard Library, such that mechanisms which would normally block the carrier thread instead deschedule the virtual thread, allowing other threads to have access. More succinctly, it converts Unsafe.park into a callback which resumes the Thread continuation when run.

This is a clever trick, particularly integrated into the JVM, but it isn't perfect. As you pointed out, any blocking in native code is completely outside the realm of what Loom can protect you from, and this sort of blocking is far more common than you might expect. Netty, for example, very aggressively blocks in native code due to the fact that it implements its own OS-specific interfaces to asynchronous IO layers (such as epoll and io_uring). Even without third-party frameworks though, examples abound where native blocking is unavoidable. new URL("https://www.google.com").hashCode() is one example, since it delegates to the native OS DNS client, which in turn is blocking on all major operating systems. Another example is file IO, which is non-blocking on NTFS and can be non-blocking on versions of Linux which support io_uring, but which is fundamentally blocking on APFS and HFS+.

In other words, Loom is a classic leaky abstraction: it promises something which it cannot deliver, and in doing so invites you to write code which makes assumptions which do not hold in many common scenarios. This is where it really differs from frameworks like Cats Effect or Vert.x, which are very up front about the fact that blocking is bad and push you (the user) quite hard to declare your blocking so that it can be managed in a less dangerous way (in particular, via shunting strategies such as what is described in the OP).

@ooraini

ooraini commented Jul 19, 2021

I would rather write sane-looking code using a leaky abstraction than the monstrosity that is reactive frameworks.

As for your examples about DNS and the file system being a problem: I'm sure whatever code is within the JVM will "just work" with Loom. If you are a library author writing native code, either document to your users that it shouldn't be used with virtual threads, or work with whatever mechanism I'm sure Java will provide to cooperate with the Loom scheduler.

@djspiewak

As for your examples about DNS and the file system being a problem: I'm sure whatever code is within the JVM will "just work" with Loom

@omaraloraini This is exactly my point: it won't. URL is in the JVM and it will not "just work" with Loom. Ditto with InetAddress (for the same reason). FileInputStream is also in the JVM and it won't "just work" with Loom, at least not on macOS or older versions of the Linux kernel. And it's not like these are uncommon cases.

or work with mechanism that I'm sure will be provided by Java to cooperate with the Loom scheduler.

To my knowledge, Loom does not provide a shunting mechanism. Or if it does, it's very well hidden.

Which brings me back to…

the monstrosity that is reactive frameworks.

I mean, at some point this is an aesthetic concern. I agree that some frameworks in this space are kind of horrifying to work with, but others provide you with tools for building clean, composable and extremely high-performance code that handles scenarios that are entirely unhandled even in Loom (e.g. Thread#interrupt still doesn't work correctly in Loom, and that in turn means that basic things like timeouts and concurrent errors still result in resource leaks). You very much get what you pay for in many of these frameworks; they aren't just hacky ways of defining threads that aren't threads.

But, to each their own. I respect your choice even if I do believe that you will come to regret it.

@sergiojoker11

sergiojoker11 commented Jul 19, 2021

Would you be so kind to recommend some literature to expand on this topic? I am struggling to fully understand some concepts. For instance, the difference between a virtual and a physical thread.
Particularly interested in the threading and performance of JVM apps. Mainly, the apps I develop are on Scala enriched with Cats.

Never heard of Loom before. Is what everyone refers to as Loom this?

@djspiewak

djspiewak commented Jul 19, 2021

Would you be so kind to recommend some literature to expand on this topic? I am struggling to fully understand some concepts. For instance, the difference between a virtual and a physical thread.

There isn't a ton of literature on this topic, unfortunately. It's part of why I wrote this gist. I can give you some quick terminology though:

  • Physical thread: think a CPU. It's a bit more complex than that though due to hyperthreading, so the preferred term is "physical thread", which carries along with it registers, L1 and L2 cache space, etc.
  • Virtual thread: see also green threads, fibers. Virtual threads are a semantic user-space threading abstraction. Most reactive frameworks (like Cats Effect) use other terminology for this concept, such as "fibers".
  • Fiber: a sequence of actions, similar to a thread but higher-level, containing both synchronous and asynchronous (callback) actions. Fibers are scheduled onto underlying carrier threads by the runtime. A fiber which is not active on a carrier thread is said to be "suspended", which is to say, it is semantically blocked. This may be because it is waiting for a callback to run, or because it is waiting its turn on the carrier.
  • Carrier thread: related to physical threads. We don't have access to the raw physical threads in user space, in part because we don't work with a real-time operating system. Carrier threads are kernel-level threads (i.e. java.lang.Thread) which have their scheduling handled by the kernel itself. Virtual threads are mapped down to carrier threads in user space, and carrier threads are mapped down to physical threads in kernel space.

Mainly, the apps I develop are on Scala enriched with Cats.

The very short answer then is "use Cats Effect". As of 3.0 (and higher), it encodes all of the best practices in this gist (and more) and pushes you to use patterns which are optimal for performance.

Benchmarking this stuff gets very complicated and use-case dependent. We've done a lot of work on it to tune Cats Effect itself (and other elements of the ecosystem, like Fs2). Benchmarking your own application is recommended, though the defaults should still be very good if not optimal in most scenarios.

Edit: This might also help a bit in terms of understanding the general space of thread optimization: https://typelevel.org/blog/2021/02/21/fibers-fast-mkay.html And this gist goes into more details of the thread scheduling abstraction tower: https://gist.github.com/djspiewak/d9930891d419c26fac1d58b5274f45ba

@ooraini

ooraini commented Jul 20, 2021

@djspiewak I think you are being pessimistic about Loom, it's still being developed and not all is set in stone. Go has done it, and in Go it "just works".

@djspiewak

and in Go it "just works".

Except it doesn't. The same cases fail in Go, for the same reasons. These are problems at the kernel level, not problems in the language implementation. This is why I'm pessimistic about Loom.

Ultimately, it is choosing a layer of abstraction which cannot be made air-tight, and rather than accepting that fact, they're trying to claim that the exceptions are edge cases and will be resolved over time. This isn't a new set of problems though. People have been trying to build asynchronous DNS resolvers for decades without success. Can it be done? Yes. Has it been done? No. Will it be done on any sort of reasonable timeline for Loom's availability? No. And again, even if the DNS problem were resolved (which is a huge "if"), you still have issues with the fundamental capabilities of the macOS kernel.

I think my pessimism is justified.

@jumarko

jumarko commented Jul 22, 2021

This is such a gem! Thanks a ton, @djspiewak for putting the information together and responding to the comments patiently and with great care.

I had the exact same question in mind about "unbounded thread pool with bounded queue" vs "bounded thread pool" as was already asked.
My knowledge of the Java Executors API is a bit rusty so I was looking at this: https://stackoverflow.com/questions/6306132/java-thread-pool-with-a-bounded-queue.
Could we perhaps achieve the "propagation of hitting the limits" via RejectedExecutionHandler?
Or the problem is that it would still be too low-level and handled in an inappropriate place?
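For what it's worth, a RejectedExecutionHandler can indeed surface saturation to the submitter. Here is a minimal sketch with a deliberately tiny pool and queue so the rejection is deterministic (all sizes and names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedRejection {
    public static void main(String[] args) throws Exception {
        // One thread, queue of one: the first task runs, the second is
        // queued, and the third trips the handler.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1),
            (task, executor) -> {
                throw new RejectedExecutionException("pool saturated");
            });

        Runnable slow = () -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
        };

        pool.execute(slow); // runs on the single worker
        pool.execute(slow); // sits in the queue
        try {
            pool.execute(slow); // queue full: rejected
        } catch (RejectedExecutionException e) {
            System.out.println(e.getMessage()); // pool saturated
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

This does propagate the limit, but as the question anticipates, it does so at submission time and at a fairly low level; higher-level bounding (e.g. a bounded queue in front of the whole data flow) usually gives you a better place to decide what to do when saturated.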

Now, in my code, I like to limit the number of threads in a thread pool handling blocking IO operations (like HTTP connections).
I agree that having top-level resource control is great, but I think it can still be useful to combine that with lower-level resource control (imposed e.g. by a bounded thread pool handling at most 10 concurrent HTTP requests to a particular HTTP API).

@sshark

sshark commented Jul 23, 2021

@djspiewak Thanks for elucidating this complex subject for us. I have questions related to this paragraph,

very aggressively blocks in native code due to the fact that it implements its own OS-specific interfaces to asynchronous IO layers (such as epoll and io_uring). Even without third-party frameworks though, examples abound where native blocking is unavoidable. new URL("https://www.google.com").hashCode() is one example, since it delegates to the native OS DNS client, which in turn is blocking on all major

As you mentioned, for the case of asynchronous file IO, we have support from the OS's epoll and io_uring. Why is it not doable for network IO in new URL("https://www.google.com").hashCode()? Is it because such functions are not available in the OS? How do R2DBC drivers do reactive remote database calls?

@anx21

anx21 commented Jul 24, 2021

In the case of blocking IO, is Cats Effect 3 more efficient than Loom because "work stealing" is possible with a scheduler-per-carrier-thread? Are there any other features of Cats Effect 3 that make it more efficient in blocking IO?

@didibus

didibus commented Nov 4, 2021

This is exactly my point: it won't. URL is in the JVM and it will not "just work" with Loom. Ditto with InetAddress (for the same reason). FileInputStream is also in the JVM and it won't "just work" with Loom, at least not on macOS or older versions of the Linux kernel. And it's not like these are uncommon cases.

To be honest, I wouldn't have even thought of moving URL or InetAddress use into my blocking IO pool? Do you go to that extent of making sure all blocking IO runs in an unbounded IO pool? Normally I just block the thread on those, so I don't really see where Loom would be inferior to the status quo on that I guess.

@don41382

Hi @djspiewak, Thanks for the excellent article about ThreadPools.

I asked a question regarding the ForkJoinPool on Stack Overflow about the blocking {} block. I still didn't get a satisfying answer.

Should you use blocking {} combined with a ForkJoinPool for blocking operations, like API requests and database calls, or instead use a separate unbounded cached pool, as you mentioned?

Using the ForkJoinPool with blocking has the advantage that IO results can be processed on the same thread. But almost every framework requires/builds on a separate thread pool.

What is your take on this? I would appreciate it!

@amarjeetkapoor1

Hi @djspiewak, I have a question about why the CPU-bound thread pool needs to be sized to exactly the "number of CPUs".

My initial understanding of this reasoning is that it avoids switching threads out of CPUs, but wouldn't that still be happening, given that we have I/O threads, other processes, and the event loop?

@djspiewak

My initial understanding of this reasoning is that it avoids switching threads out of CPUs, but wouldn't that still be happening, given that we have I/O threads, other processes, and the event loop?

This is a very good point. In general you want to think about "active" threads and "inactive" ones. The GC involves some number of threads (2 or 4 is pretty standard), and certain JVM configurations also involve background threads for other things like JIT compilation. These are inactive since they consume a very small amount of CPU time. When they get scheduled, they absolutely create contention with the active threads and impede performance, but this happens so infrequently that it doesn't really matter. As such, we generally don't count these threads against your availableProcessors() limit (in the same way that we also generally don't consider other, usually-idle processes elsewhere in the OS).

Where things get exceptionally tricky is stuff like event loops. These are certainly less active threads than your compute workers, so in that sense we can often ignore them, but not always! Applications which have extremely high-frequency I/O events will often see their event dispatch threads using more and more CPU time, and at that point it becomes a real source of contention. This can be handled in one of two ways: you can either reduce the size of the compute worker pool to make room for the event dispatch threads (usually only 1 or 2), or you can use a more advanced scheduler which combines I/O event loops and compute work onto a single set of worker threads. This is what Cats Effect will do in 3.6.
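The first of those two options can be sketched trivially; the reservation of one thread for event dispatch is an illustrative choice, not a universal recommendation:

```java
public class PoolSizing {
    // Size the compute pool to leave headroom for busy event-loop threads.
    static int computePoolSize(int reservedForEventLoops) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Never drop below one worker, even on very small machines.
        return Math.max(1, cpus - reservedForEventLoops);
    }

    public static void main(String[] args) {
        System.out.println(computePoolSize(1));
    }
}
```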
