Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
brief summary of massive performance improvements with JRuby
# Thee will be more information here when I share the entire problem space I'm working on, but
# in short, this is preview material for my second talk in a series called "What Computer Scientists Know".
# The first talk is on recursion, and goes through several examples., leading up to a problem based
# on a simple puzzle that initial estimates based on performance of a previous puzzle would take years
# to solve on modern computers with the techniques shown in Ruby. That sets the stage for improving the
# performance of that problem with threading, concurrency, and related tuning.
# The second talk is on threading and concurrency, touching on algorithmic performance as well.
# Using some knowledge of the problem (board symmetry, illegal moves, etc), we reduce the problem space
# to about .5% of what we initially thought it was. Still, the initial single threaded solution took more
# than 24 hours to find the first solution. Watching that run, we can see with the tool 'htop' that only
# one core on my 8 core laptop was busy. So we introduce threading. Still, with Ruby 1.9.3/2.0, we can't get
# much better performance with threading, using a breadth-first search to decompose the problem into 40
# sub-problems nd giving each of those sub-problems a thread. (This is because of Ruby's GIL).
# Switching to JRuby gave us a huge threading performance, but knowing what I know about Ruby and Java, I
# suspected the implementation wasn't thread-safe with Ruby's built-in datatypes. Sure enough, I can prove
# this with a subset of the problem; the JRuby version, while faster, found fewer solutions. How fast it
# wrong?
# I switched to using native Java collections, and started getting ConcurrentModificationExceptions. I
# made these concurrent, but that reduced performance because the locking on the threadsafe versions of
# Java collections isn't much better than Ruby's GIL.
# So I got out my trusty copy of Brian Goetz's "Java Concurrency in Practice", and looked up the use of
# some of the java.util.concurrent classes. In particular, I had an array of hashes that was created
# in ruby with this line of code:
hash_collection = []
#I replaced that with:
java_import "java.util.concurrent.ConcurrentSkipListSet"
hash_collection =
# no other code changed, and there was a huge improvement (on the order of dozens of times. I'll have
# real numbers and hardware specs in a final version of this analysis).
# I also had a simple count of game moves processed - the worst kind of shared state - an integer that
# every thread was constantly incrementing. I would get huge improvements by taking it out, but then
# I had no visibility into how fast stuff was processing. I replaced an integer that was created in
# Ruby with this line:
@@count = 0
# with this:
java_import "java.util.concurrent.atomic.AtomicInteger"
@@count =
# and the code that I incremented that from this:
@@count = @@count + 1
# with this:
# and that got me about 80% of the performance as not having the integer at all.
# finally, I had been creating about 40 threads being executed on a machine with 16 cores.
# In an effort to further reduce the problem space as well as create more threads, I iterated
# one more time with the breadth first search and had about 2700 initial states I could explore
# in parallel. I did the naive thing, and just created that many threads:
threads = []
things.each do |thing|
threads << do
# and that was BLAZINGLY fast, but ran out of memory on my machine after several hours
# (16 gigabytes!) That many threads with recursion, and the stackframes just got to be
# too much.
# So I researched thread pools, so that I could just keep the jobs in a holding state until
# a thread was available to take on the job, then I could adjust the number of threads based
# on the number of cores, independently from the number of jobs. That led to this code,
# based on java.util.concurrent.Executors:
java_import java.util.concurrent.Executors
executor = Executors.newFixedThreadPool(64)
things.each do |thing|
executor.submit do
# end result is blazingly fast... tens of thousands of times faster than Ruby 1.9.3 with threads.
# I'll have real metrics when this is official but the ruby solution toiled for 24 hours, only
# finding 1 solution; the initial JRuby solution found 6 solutions in 24 hours; this most recent
# solution has found over 270 solutions in two hours, and is evaluating over 40 million game
# positions a minute:
irb(main):287:0> x = Board.evaluated; sleep 60; puts Board.evaluated - x
# the 16 cores on the fastest machine I have access to are pegged with this solution, as shown by
# this output from htop:
# 1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%] 5 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.1%] 9 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.6%] 13 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]
# 2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%] 6 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%] 10 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.1%] 14 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.2%]
# 3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%] 7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%] 11 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%] 15 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.2%]
# 4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.6%] 8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%] 12 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%] 16 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.8%]
# Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 5122/11901MB] Tasks: 53, 98 thr; 65 running
# Swp[ 0/4095MB] Load average: 56.92 56.51 45.51
# Finally, I was using an array (and .duping it) to keep track of the evaluated moves while recursing
# down the game tree. I switched this to using my own immutable linked list (since previous game moves
# don't change), removed the dup, and got another 15% performance improvement.
# There are potentially two other changes I can make to get similar performance improvements.
# Again, I realize this data in its current form is largly anecdotal; when this is cleaned up for a real
# presentation Ill have several code examples and their evolution, and performance characteristics on
# a few different machines.
# most of the small improvements (linked list, for example), would apply to MRI Ruby as well, but the
# two most massive improvements come from real threads and concurrent data structures. Thats where
# the JVM underneath JRuby shows how incredible it is.

ramonza commented Sep 25, 2013

Logically, even with perfect concurrency (i.e. no Amdahl's law), simply multiplying the number of cores utilized by 16x can't give your 10000x increases, can it? Seems like something else must be going on here.


bokmann commented Sep 26, 2013

it wasn't just the number of cores, it was the switch to JRuby, Java's JIT compiler, the removal of the GIL, etc.

bwalsh commented Oct 3, 2013

nice work

At some point I should try running RubyNEAT on JRuby. The thought of being able to leverage cores for NEAT is very alluring. I'll just have to make it JRuby-aware. Other than that, not much should have to change. Hmmmmm.....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment