bokmann/JRuby Awesome Performance

## JRuby Awesome Performance
# Thee will be more information here when I share the entire problem space I'm working on, but
# in short, this is preview material for my second talk in a series called "What Computer Scientists Know".
# The first talk is on recursion, and goes through several examples., leading up to a problem based
# on a simple puzzle that initial estimates based on performance of a previous puzzle would take years
# to solve on modern computers with the techniques shown in Ruby. That sets the stage for improving the
# performance of that problem with threading, concurrency, and related tuning.
#
# The second talk is on threading and concurrency, touching on algorithmic performance as well.
# Using some knowledge of the problem (board symmetry, illegal moves, etc), we reduce the problem space
# to about .5% of what we initially thought it was.  Still, the initial single threaded solution took more
# than 24 hours to find the first solution.  Watching that run, we can see with the tool 'htop' that only
# one core on my 8 core laptop was busy.  So we introduce threading.  Still, with Ruby 1.9.3/2.0, we can't get
# much better performance with threading, using a breadth-first search to decompose the problem into 40
# sub-problems nd giving each of those sub-problems a thread.  (This is because of Ruby's GIL).
#
# Switching to JRuby gave us a huge threading performance, but knowing what I know about Ruby and Java, I
# suspected the implementation wasn't thread-safe with Ruby's built-in datatypes.  Sure enough, I can prove
# this with a subset of the problem; the JRuby version, while faster, found fewer solutions.  How fast it
# wrong?
#
# I switched to using native Java collections, and started getting ConcurrentModificationExceptions.  I
# made these concurrent, but that reduced performance because the locking on the threadsafe versions of
# Java collections isn't much better than Ruby's GIL.
#
# So I got out my trusty copy of Brian Goetz's "Java Concurrency in Practice", and looked up the use of
# some of the java.util.concurrent classes.  In particular, I had an array of hashes that was created
# in ruby with this line of code:

hash_collection = []

#I replaced that with:

java_import "java.util.concurrent.ConcurrentSkipListSet"
hash_collection = ConcurrentSkipListSet.new

# no other code changed, and there was a huge improvement (on the order of dozens of times. I'll have
# real numbers and hardware specs in a final version of this analysis).

# I also had a simple count of game moves processed - the worst kind of shared state - an integer that
# every thread was constantly incrementing.  I would get huge improvements by taking it out, but then
# I had no visibility into how fast stuff was processing.  I replaced an integer that was created in
# Ruby with this line:

@@count = 0

# with this:

java_import "java.util.concurrent.atomic.AtomicInteger"
@@count = AtomicInteger.new

# and the code that I incremented that from this:

@@count = @@count + 1

# with this:

@@count.incrementAndGet

# and that got me about 80% of the performance as not having the integer at all.

# finally, I had been creating about 40 threads being executed on a machine with 16 cores.
# In an effort to further reduce the problem space as well as create more threads, I iterated
# one more time with the breadth first search and had about 2700 initial states I could explore
# in parallel.  I did the naive thing, and just created that many threads:

threads = []
things.each do |thing|
  threads << Thread.new do
    thing.solve
  end
end

# and that was BLAZINGLY fast, but ran out of memory on my machine after several hours
# (16 gigabytes!) That many threads with recursion, and the stackframes just got to be
# too much.

# So I researched thread pools, so that I could just keep the jobs in a holding state until
# a thread was available to take on the job, then I could adjust the number of threads based
# on the number of cores, independently from the number of jobs.  That led to this code,
# based on java.util.concurrent.Executors:

java_import java.util.concurrent.Executors
executor = Executors.newFixedThreadPool(64)

things.each do |thing|
  executor.submit do
    thing.solve
  end
end

# end result is blazingly fast... tens of thousands of times faster than Ruby 1.9.3 with threads.
# I'll have real metrics when this is official but the ruby solution toiled for 24 hours, only
# finding 1 solution; the initial JRuby solution found 6 solutions in 24 hours; this most recent
# solution has found over 270 solutions in two hours, and is evaluating over 40 million game
# positions a minute:

irb(main):287:0> x = Board.evaluated; sleep 60; puts Board.evaluated - x
40488395


# the 16 cores on the fastest machine I have access to are pegged with this solution, as shown by
# this output from htop:

#  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]    5  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.1%]     9  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.6%]    13 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]
#  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]    6  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%]     10 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.1%]    14 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.2%]
#  3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]    7  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%]     11 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]    15 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.2%]
#  4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.6%]    8  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%]     12 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]    16 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.8%]
#  Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                                       5122/11901MB]     Tasks: 53, 98 thr; 65 running
#  Swp[                                                                                                                             0/4095MB]     Load average: 56.92 56.51 45.51

#
# Finally, I was using an array (and .duping it) to keep track of the evaluated moves while recursing
# down the game tree.  I switched this to using my own immutable linked list (since previous game moves
# don't change), removed the dup, and got another 15% performance improvement.
#
# There are potentially two other changes I can make to get similar performance improvements.
#
# Again, I realize this data in its current form is largly anecdotal; when this is cleaned up for a real
# presentation Ill have several code examples and their evolution, and performance characteristics on
# a few different machines.
#
# most of the small improvements (linked list, for example), would apply to MRI Ruby as well, but the
# two most massive improvements come from real threads and concurrent data structures. Thats where
# the JVM underneath JRuby shows how incredible it is.
	# Thee will be more information here when I share the entire problem space I'm working on, but
	# in short, this is preview material for my second talk in a series called "What Computer Scientists Know".
	# The first talk is on recursion, and goes through several examples., leading up to a problem based
	# on a simple puzzle that initial estimates based on performance of a previous puzzle would take years
	# to solve on modern computers with the techniques shown in Ruby. That sets the stage for improving the
	# performance of that problem with threading, concurrency, and related tuning.
	#
	# The second talk is on threading and concurrency, touching on algorithmic performance as well.
	# Using some knowledge of the problem (board symmetry, illegal moves, etc), we reduce the problem space
	# to about .5% of what we initially thought it was. Still, the initial single threaded solution took more
	# than 24 hours to find the first solution. Watching that run, we can see with the tool 'htop' that only
	# one core on my 8 core laptop was busy. So we introduce threading. Still, with Ruby 1.9.3/2.0, we can't get
	# much better performance with threading, using a breadth-first search to decompose the problem into 40
	# sub-problems nd giving each of those sub-problems a thread. (This is because of Ruby's GIL).
	#
	# Switching to JRuby gave us a huge threading performance, but knowing what I know about Ruby and Java, I
	# suspected the implementation wasn't thread-safe with Ruby's built-in datatypes. Sure enough, I can prove
	# this with a subset of the problem; the JRuby version, while faster, found fewer solutions. How fast it
	# wrong?
	#
	# I switched to using native Java collections, and started getting ConcurrentModificationExceptions. I
	# made these concurrent, but that reduced performance because the locking on the threadsafe versions of
	# Java collections isn't much better than Ruby's GIL.
	#
	# So I got out my trusty copy of Brian Goetz's "Java Concurrency in Practice", and looked up the use of
	# some of the java.util.concurrent classes. In particular, I had an array of hashes that was created
	# in ruby with this line of code:

	hash_collection = []

	#I replaced that with:

	java_import "java.util.concurrent.ConcurrentSkipListSet"
	hash_collection = ConcurrentSkipListSet.new

	# no other code changed, and there was a huge improvement (on the order of dozens of times. I'll have
	# real numbers and hardware specs in a final version of this analysis).

	# I also had a simple count of game moves processed - the worst kind of shared state - an integer that
	# every thread was constantly incrementing. I would get huge improvements by taking it out, but then
	# I had no visibility into how fast stuff was processing. I replaced an integer that was created in
	# Ruby with this line:

	@@count = 0

	# with this:

	java_import "java.util.concurrent.atomic.AtomicInteger"
	@@count = AtomicInteger.new

	# and the code that I incremented that from this:

	@@count = @@count + 1

	# with this:

	@@count.incrementAndGet

	# and that got me about 80% of the performance as not having the integer at all.

	# finally, I had been creating about 40 threads being executed on a machine with 16 cores.
	# In an effort to further reduce the problem space as well as create more threads, I iterated
	# one more time with the breadth first search and had about 2700 initial states I could explore
	# in parallel. I did the naive thing, and just created that many threads:

	threads = []
	things.each do \|thing\|
	threads << Thread.new do
	thing.solve
	end
	end

	# and that was BLAZINGLY fast, but ran out of memory on my machine after several hours
	# (16 gigabytes!) That many threads with recursion, and the stackframes just got to be
	# too much.

	# So I researched thread pools, so that I could just keep the jobs in a holding state until
	# a thread was available to take on the job, then I could adjust the number of threads based
	# on the number of cores, independently from the number of jobs. That led to this code,
	# based on java.util.concurrent.Executors:

	java_import java.util.concurrent.Executors
	executor = Executors.newFixedThreadPool(64)

	things.each do \|thing\|
	executor.submit do
	thing.solve
	end
	end

	# end result is blazingly fast... tens of thousands of times faster than Ruby 1.9.3 with threads.
	# I'll have real metrics when this is official but the ruby solution toiled for 24 hours, only
	# finding 1 solution; the initial JRuby solution found 6 solutions in 24 hours; this most recent
	# solution has found over 270 solutions in two hours, and is evaluating over 40 million game
	# positions a minute:

	irb(main):287:0> x = Board.evaluated; sleep 60; puts Board.evaluated - x
	40488395



	# the 16 cores on the fastest machine I have access to are pegged with this solution, as shown by
	# this output from htop:

	# 1 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%] 5 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.1%] 9 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|95.6%] 13 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%]
	# 2 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%] 6 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|97.5%] 10 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.1%] 14 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|96.2%]
	# 3 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%] 7 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|97.5%] 11 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%] 15 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|96.2%]
	# 4 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|95.6%] 8 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|97.5%] 12 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|98.7%] 16 [\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|96.8%]
	# Mem[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| 5122/11901MB] Tasks: 53, 98 thr; 65 running
	# Swp[ 0/4095MB] Load average: 56.92 56.51 45.51

	#
	# Finally, I was using an array (and .duping it) to keep track of the evaluated moves while recursing
	# down the game tree. I switched this to using my own immutable linked list (since previous game moves
	# don't change), removed the dup, and got another 15% performance improvement.
	#
	# There are potentially two other changes I can make to get similar performance improvements.
	#
	# Again, I realize this data in its current form is largly anecdotal; when this is cleaned up for a real
	# presentation Ill have several code examples and their evolution, and performance characteristics on
	# a few different machines.
	#
	# most of the small improvements (linked list, for example), would apply to MRI Ruby as well, but the
	# two most massive improvements come from real threads and concurrent data structures. Thats where
	# the JVM underneath JRuby shows how incredible it is.