@eprothro · Last active December 8, 2017

MRI GIL isn't pure evil, a.k.a. Native Threads Aren't Magic

The GIL in MRI does not inherently mean inefficient concurrency. Likewise, using "native threads" does not inherently mean efficient concurrency.

To be clear, modern Ruby (>= 1.9) does use native threads. The key is that the GIL ensures that only one thread is executing at a time per process (regardless of the number of CPU cores available). So, MRI threads are native threads, but to draw the distinction, we'll use the terms "GIL thread" and "native thread" below.
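One quick way to see that MRI threads really are OS threads is to spawn a few sleeping threads and check the process's thread count. This is a Linux-specific sketch, and the count may also include MRI's own internal threads:

# Spawn three sleeping threads, then ask the OS how many threads this process has.
3.times { Thread.new { sleep } }
puts `ps -o nlwp= -p #{Process.pid}`  # at least 4: the main thread plus the three above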

Scheduling

Concurrency is not about magically getting two threads to execute at the same time on a single core (because that would be magic). It is about efficiently managing thread scheduling so the maximum amount of CPU cycles is spent doing work.

In other words, if we have work that never yields to another thread, a single thread already achieves 100% utilization of its core.

single thread saturation

So, adding another thread does not increase our CPU utilization, or increase the amount of work accomplished.

multi-thread saturation

Threading has its benefit because the code being executed doesn't only require operations performed by the CPU. It also has portions of execution in which it has to wait on something (I/O and sleep are two prime examples). These latter portions are chances for other work to get done, and are the points where a thread yields to other threads.

So, if work comes in that is half CPU work and half yielding work, we can imagine what this would look like. It makes sense that this results in 50% utilization of the CPU.

Single threaded execution profile

If the same work is split across two threads, we can likewise see what this looks like. The interleaving of working and yielding between the two threads results in 100% utilization, and more work getting done in the same amount of time.

Multi-threaded execution profile
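We can sketch this in irb with work that is roughly half CPU and half sleeping. The 50/50 split and the timings are only approximate, but the shape of the result holds even under the GIL:

require 'benchmark'

# Work that is roughly half CPU (the loop) and half waiting (the sleep).
work = proc do
  100_000.times { 2 + 2 }
  sleep(0.01)
end

# 100 units of work on one thread, then the same 100 split across two threads.
one_thread  = Benchmark.realtime { 100.times { work.call } }
two_threads = Benchmark.realtime do
  2.times.map { Thread.new { 50.times { work.call } } }.each(&:join)
end

puts "1 thread:  #{one_thread.round(2)} sec"
puts "2 threads: #{two_threads.round(2)} sec" # noticeably less, even under the GIL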

Back to the GIL

Now, if we have only one process with two threads and we have two cores, yes, GIL threads are at a disadvantage.

Remember, the GIL means only one thread executes at a time, per process. So, one MRI process will only be able to saturate the equivalent of one core.

GIL locking limits saturation

However, native threads can be scheduled on both cores at the same time and would therefore be able to saturate both.

Native scheduling across cores

We can validate this with htop and irb in a few seconds, using MRI and then JRuby:

# Two CPU-bound threads; watch per-core utilization in htop while this runs
2.times.map{ Thread.new{ 100_000_000.times{ 2+2 } } }.map(&:join)

However, if we instead have two processes (a fork is still a process), each with two threads, GIL threads and native threads have conceptually similar execution profiles, because the GIL lock is per-process.

This becomes clearer if we first imagine each process having only one thread. Both GIL threads execute in parallel, since they can each be scheduled on a different core (there are two processes).

GIL with two processes

Same story as with native threads -- two threads executing in parallel on two different cores.

Now, when we add a 2nd thread, the story doesn't get magically worse for GIL threads (or magically better for native threads). This is because each core can still only do one thing at a time. That second thread, in both the native and GIL scenarios, still has to wait for the first thread to finish or yield its timeslice in order to begin execution.

In other words, the rules of scheduling still apply. It is not until thread 1 is waiting for I/O, for example, that thread 2 has its chance to begin execution. This is true for both native threads and GIL threads. Conceptually, MRI does not have a disadvantage here.
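We can extend the earlier htop experiment to this two-process case with a quick script (a rough sketch using fork, so it needs a platform that supports it):

# Two forks, each running two CPU-bound threads (four threads total).
# Under MRI, expect roughly two cores busy in htop -- one per process --
# since the GIL is per-process; without a GIL, the same layout could use four.
2.times do
  fork do
    2.times.map { Thread.new { 100_000_000.times { 2 + 2 } } }.each(&:join)
  end
end
Process.waitall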

A Typical Web Server

The scenario we hint at -- one fork per CPU and a thread pool per fork -- may sound familiar. Puma? Unicorn? Whatever you're using?

In a typical production web server configuration, with a typical web app, we can now intuit that GIL threads and native threads have a very similar execution profile, and thus a very similar chance at using the available CPU most efficiently (i.e. efficient concurrency).

Each core is executing a request, and the rest of the threads in the thread pool are waiting for the executing thread to finish or yield (read: wait on I/O) so they can get a turn.
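For example, a Puma configuration along these lines (the numbers are illustrative, not a recommendation) gives you one fork per core plus a small thread pool per fork:

# config/puma.rb
workers 2        # one forked worker per core
threads 1, 5     # each worker runs a pool of up to 5 threads
preload_app!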

Caveats

This is not to say there is no difference between GIL threads and native threads in this multi-fork setup. At least a few differences come to mind immediately.

  1. The web server itself is executing in a GIL thread and is thus limited in how efficiently it can balance/schedule getting new work (selecting/reading/queueing new requests) vs. performing work.

  2. GIL threads are primarily scheduled by MRI whereas native threads are primarily scheduled by the OS, so how long a given request might actually take to fully complete execution across possible context switches is different.

However, my instinct is that these differences are relatively small factors in the equation of ensuring the "maximum amount of CPU cycles is spent doing work".

Finally, this assumes work is balanced across the MRI processes, and that each thread pool is sufficiently saturated. If all threads are yielding (think 'everyone is performing I/O') for process #1 then CPU might be wasted that could theoretically have been used by a thread that is sitting waiting in process #2.

Estimating yield % for MRI work

So, if you had work that was pure CPU work (e.g. work that never yields to other threads), theoretically you would get no throughput advantage by having multiple threads in your thread pool. Again, this assumes we're talking about this multi-fork + thread pool scenario, and assumes that work is always available.

The percentage of time that work spends yielding (i.e. not locking the GIL) is directly related to the amount of extra throughput we get by having multiple threads in our pool.

Work that doesn't acquire the GIL

work = Proc.new do
  sleep(0.001)
end
1094 iterations @ concurrency = 1  0.020000   0.030000   0.050000 (  1.510599)
547 iterations @ concurrency = 2  0.020000   0.020000   0.040000 (  0.753924)
Pure concurrency across 2 threads would have taken 0.755 sec
Executions split across 2 threads actually took 0.754 sec
This work yields 100% of its time.

Work that does acquire the GIL

work = Proc.new do
  2+2
end
13891668 iterations @ concurrency = 1  1.600000   0.000000   1.600000 (  1.609804)
6945834 iterations @ concurrency = 2  1.640000   0.010000   1.650000 (  1.653363)
Pure concurrency across 2 threads would have taken 0.805 sec
Executions split across 2 threads actually took 1.653 sec
This work yields 0% of its time.
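Plugging the measured times above into the calculation below shows where those percentages come from (numbers rounded):

sleep(0.001): ideal = 1.511 / 2 ≈ 0.755 sec, saved = 1.511 - 0.754 ≈ 0.757 sec, so the work yields ≈ 0.757 / 0.755 ≈ 100% of its time
2+2:          ideal = 1.610 / 2 ≈ 0.805 sec, saved = 1.610 - 1.653 ≈ -0.044 sec, so the yield percentage clamps to 0%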

Estimate Calculation

require 'benchmark'

work = Proc.new do
  sleep(0.001)
end
concurrency = 2

# heuristically determine iterations required
# to make work dominate benchmark harnessing
observation_sec = 1.5
iterations = 1
iterations = (1 / Benchmark.realtime{ iterations.times{ work.call} }).ceil
actual_sec = Benchmark.realtime{ iterations.times{ work.call} }
iterations = (observation_sec / (actual_sec / iterations)).ceil
# ensure iterations equally divide across threads
iterations = (iterations / concurrency) * concurrency

bm = Benchmark.bm do |x|
  x.report("#{iterations} iterations @ concurrency = 1") do
    1.times.map do
      Thread.new{ (iterations).times{ work.call } }
    end.map(&:join)
  end

  iterations = (iterations / concurrency).to_i
  x.report("#{iterations} iterations @ concurrency = #{concurrency}") do
    concurrency.times.map do
      Thread.new{ (iterations).times{ work.call } }
    end.map(&:join)
  end
end

# Perfect concurrency across N threads would take 1/N of the single-threaded time;
# the time actually saved by adding threads tells us how much of its time the work yields.
ideal_time = bm.first.real.to_f / concurrency
concurrency_time = bm.first.real.to_f - bm.last.real

puts "Pure concurrency across #{concurrency} threads would have taken #{ideal_time.round(3)} sec"
puts "Executions split across #{concurrency} threads actually took #{bm.last.real.round(3)} sec"

yield_percentage = (concurrency_time / ideal_time) * 100
yield_percentage = 0 if yield_percentage < 0
yield_percentage = 100 if yield_percentage > 100

puts "This work yields #{yield_percentage.round(0)}% of its time."