OK, Let's begin. Hello, Everybody.
My name is Narihiro Nakamura.
Today, I'm talking about "Parallel worlds of CRuby's GC".
I'm very happy now, because I'm meeting attendee in rubyconf.
And, one of my dreams is to talk in rubyconf.
So, I'm very happy and exciting.
Today is my first presentation in English.
My English is not good.
But, I'll do my best. Please accept me.
I'd like to introduce myself with time-line.
I was not a programmer before 2005.
I worked at a ice cream factory.
In ice-cream factory, I worked in an assembly line.
For example, I made many cardboard boxes.
I was a professional cardboard box maker.
I made 150 boxes per hour.
I was like a machine!!
I went the computer school since 2005.
I learned Java language over six month.
I worked at Computer Company since 2006.
I used Java language then.
I worked in a big company.
This work was similar to assembly line work.
Because, I made a part of a product. I didn't understand whole product.
I was still like a machine!!
Then, I met Ruby.
I've used Ruby in my hobby.
I changed job. This Company works using Ruby.
I have been using Ruby in my work.
Currently, I work at NaCl.
matz and shyouhei and takaokouji are my co-workers.
shugo is my boss.
They are CRuby committers.
When I started Ruby programming, I felt free.
This work wasn't similar to assembly line work.
I could make the whole product.
I was no longer a machine!!
I met Garbage Collection.
GC world is a very interesting for me.
GC is a garbage collecting machine.
I've been creating it since then. It's very fun!!
I'm making a machine!!
I'd like to talk about my relationship to GC.
I'm a CRuby Committer. I work on GC.
And, I wrote a book about GC.
But, it's only in Japanese.
And, I've been creating GC with RDD.
What is RDD?
RDD is RubyKaigi Driven Development.
Let me talk about my RDD history.
I created LonglifeGC in RubyKaigi 2009.
LonglifeGC treats long-life objects as special case.
It's similar to Generational GC.
LonglifeGC was rejected in CRuby 1.9.2 by some reason.
I'm so sad.
But, LonglifeGC has been used in Kiji.
Kiji is an optimized version of REE by Twitter developers.
The twitter team substantially extended LonglifeGC.
But, Kiji will be rejected also. I'm so sad.
Next, I created LazySweepGC in RubyKaigi 2008 and 2010.
Traditional M&S GC executes mark and sweep atomically.
Ruby application stops during GC. This is called 'stop-the-world'.
In Lazy sweeping, sweeping is lazy.
Each invocation of the object allocation sweeps Ruby's heap, until it finds an appropriate free object.
This improves the response time of GC. I.e. the worst case time of GC decreases.
And, You can use LazySweepGC since Ruby 1.9.3.
Next, I created Parallel Marking GC in current year. So, it's main topic in this presentation.
Now, I'd like to introduce about today's topics.
First, Why do we need Parallel Marking?
And, What to consider?
How to implement?
How much did performance improve?
First topic is why do we need Parallel Marking?
This is CRuby's current GC.
GC operates on only 1 core.
In multi-core environment, other cores don't help GC.
GC is alone.
We should run GC in parallel!!
First, I'd like to explain a few GC related concepts.
GC collects all dead objects.
What is a dead object?
A dead object is an object that is never referenced by the program.
In GC terms, we say a that dead object is unreachable from Roots.
What is Roots?
Roots is a set of pointers that directly reference objects in the program.
For example, Ruby's local variables and Ruby's global variables and so on..
For example, we executes this ruby's program.
String "b" is dead object, because this object is unreachable from Roots.
GC collects objects that are unreachable from Roots.
Let me introduce the current CRuby GC algorithm.
CRuby adopts the Mark & Sweep algorithm.
Collector works in separate Mark and Sweep phases.
In the Mark phase, collector marks live objects that are reachable from Roots.
For example, we executes example program.
'a' push new array, then new array push string 'a'.
then 'a' push string "b", Finally 'a'.pop.
So, it's objects relationship that is created by example program.
Please note that reference from array 'a' to string 'b' is cut. here.
Then, we executes GC.start.
In mark phase, all reachable objects from Roots are marked.
So, string 'b' isn't marked, because string 'b' is unreachable from roots.
Ruby Heap is divided into each block. This is called "Ruby Heap Block".
And, Ruby Heap Block is divided into each object.
In after marking, all objects be categorized as dead-object or live-object.
Here, black object describes marked object. gray object describes unmarked object.
In sweep phase, Collector sweeps "dead" objects.
A "dead" means unmarked. A "dead" means unreachable from Roots.
In sweep phase, collector scans entire ruby heap.
collector unmarks object if it's marked.
And, collector sweeps object if it's not marked.
In this case, string "b" is swept.
After Sweep, GC is completed.
Let me introduce characteristics of CRuby's GC.
First characteristics is the stop-the-world algorithm.
Next is single thread execution.
Today, I'd like to talk about single thread execution.
Recently, PC has multi-core processors. But, GC executes on a single thread.
Other cores don't work during GC.
What a waste!!
How can we fix this?
We should use parallel marking.
OK, What is Parallel Marking?
Parallel Marking run several marking processes in parallel by using native threads.
We will be happy on multi-core machine.
This shows flow diagram for parallel marking.
The line at top of this diagram is the ruby application execution flow.
Following lines are GC execution flow.
After the ruby application is stopped, GC is invoked.
And, GC execute in parallel by using native threads.
In this diagram, GC has three marking threads.
By the way, Why not perform sweeping in parallel?
The sweeping is much faster than the marking.
You can see kouichi's research, if you are interesting.
So, Mark phase improvement equals GC improvement.
And, we already have the lazy sweeping.
Next topic is What to consider when implementing Parallel Marking?
Actually, We should consider two problems.
First problem is workload balancing.
How can we divide the marking task into sub-tasks?
I tried think about a simple approach.
1 branch of Roots is marked by 1 thread.
A “branch” means an object which is referenced from Root, and its children.
'A' branch, and 'B' branch are shown here. Usually, there will be many branches. But, only two a shown here.
'A' branch is marked by 'A' thread.
And, 'B' branch is marked by 'B' thread.
So, Marking of single branch is performed by single mark thread.
This means tasks are distributed to multiple threads.
The task of marking the entire heap is divided into several tasks, each marking a single branch.
This seems to be no problem.
But actually, this solution suffers from the workload problem.
This illustration shows each mark thread has branch.
Each thread doesn't know what the other threads are doing.
For instance, if A and B finishes work early, then, they will stop doing anything.
I think "machines should work forever".
So, I think A and B should steals a other thread's task.
Let me introduce Parallel Marking with task stealing.
If A and B finishes work early, steal!!
This is called "Task Stealing".
OK, Next problem is wait-free algorithm.
What does "wait-free" mean?
A wait-free program does non-blocking execution.
It guarantees per-thread progress.
Why is wait-free important?
Amdahls's law gives an answer for this question.
Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved.
Amdahl's law is used in parallel computing.
If parallel portion of the system is X% and number of processors is Y, How much speedup can we expect?
This graph describes how much speedup overall system.
X-axis is a number of processors.
Y-axis is a speedup of system.
For example, if number of processor is 8 and parallel portion in the system is 95%, then 6-times faster than with one core and 0% parallelization.
But, if Parallel Portion is 90%, then Non-parallel portion is 10%, this system can be speed up by a maximum of a factor of 10, no matter how many a processes you use.
So, if you have any sequential part, that may be a bottleneck on multi-core processors.
It's worse than expected, right?
The conclusion so far.
We should consider how we can efficiently balance workloads.
We use Task Stealing.
We should eliminate non-parallel parts.
We use wait-free algorithm.
Next topic is How to implement Parallel Marking?
In Task Stealing, Threads steal tasks from each other.
And, Task Stealing is achieved with Arora's Deque.
Deque stands for the Double-Ended Queue.
In Arora's Deque, the deque contains tasks as elements.
And, It's a wait-free data structure.
Arora's Deque has only three operations.
First operation is pop.
pop operation removes last task from deque and returns it.
It's same ruby's pop method behavior, you know.
Second operation is push.
push operation appends the task on to the end of this deque.
Third operation is shift.
shift operation returns the first task of deque and removes it.
Each mark worker corresponds to a single mark thread.
This shows mark worker 1 and 2 and 3.
Each mark worker has single deque. like this.
So, only the owner worker can call pop and push operation.
And, worker can call shift operation to steal other workers' deque.
In this case, worker1's deque is empty. worker1 calls shift of worker2's deque.
Then, returns the first task of worker2's deque and removes it.
Worker1 gets the task of worker2's deque.
This is task stealing.
Perhaps, you think “doesn't shift operation have contention problems?”
In what ways could shift operation cause contention problems?
Multi-thread may call shift operation of same deque at the same time.
shift and pop operations could be called at the same time when deque has only one element.
But, Arora's Deque avoids these contention problems.
Shift operation is serialized by using Compare And Swap.
And, this serialization doesn't use a lock.
I omit details of the implementation of the serialization.
For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.
I'd like to summarize what I have said about Arora's deque.
A simple data structure for Task Stealing.
Each worker has a single deque.
Stealing is wait-free!
How to use Arora's Deque in Parallel Marking?
First try is "A task is an object".
Let's say that worker A has a branch that is composed of 4 objects.
We start by marking A and pushing it to the deque.
then, we pop 'A' from deque, and mark 'B' and 'C'.
'B' and 'C' are 'A' s children.
And we push 'B' and 'C' to deque.
then, we pop 'C' from deque. and mark 'D'. and push 'D' to deque.
Finally, we pop 'D' and 'B'.
This is a branch marking.
Next question is How do you steal?
Suppose that worker1 has task B and C. Worker2's has no task.
Worker2 steals task B on Worker1.
Marker uses Arora's Deque as a marking stack.
A "task" means an object.
The granularity of the task is very fine.
This is a naive implementation.
I implemented this approach.
But.. It's slower than original GC.
I was so sad.
I fell into the Pitfalls of Parallel Processing's.
I thought why is it slow.
pop and push and shift operations are called frequently.
Because deque has fine-grained tasks.
Their overhead is too big.
How to fix this?
We can make the tasks less fine-grained.
A task is a branch.
Now, all branches in Roots are divided roughly among the deques.
A worker has any branches on deque, like this. B worker has any branches too.
So now the deque holds only the top node of each branch.
And, each worker traverses the rest of the branch without using the deque.
As you may have noticed, branch marking is actually the same as the traditional marking in current CRuby's GC.
when deque is empty, A steal a task on other worker.
In this case, A's branch is empty.
So, A steals a branch on B's deque.
This approach has good point and bad point.
Good point is number of calls to Deque's operations was reduced.
So, marking speed of the worker is improved.
However, Coarse-grained tasks decrease parallelism.
Why do coarse-grained tasks decrease parallelism?
Tasks may involve a large branch.
If an object in B's branch has many child objects..
then A can't steal it while B is marking the large branch.
So, A must waits for B.
Therefore, parallelism is decreased.
So, the worker needs to treat large branches as special cases.
Almost all large branches hold large Array objects and/or large Hash objects.
Each marker has a special deque to manage them.
A marker divides them into fixed size tasks.
For example, first ten elements of the array, next ten elements of the array and so on.
By doing this, other workers can steal divided tasks.
This improves parallelism.
I'd like to summarize what I have said about Task Stealing implementation.
The naive implementation was slow, Because grain of the task was too fine.
A "task" means a branch in Roots. So, grain of the task is coarse.
Next topic is How much did performance improve?
These are my machine specs.
My machine has only 2 cores.
And, Parallel marking uses 4 marking threads.
First benchmark program is make benchmark.
This is the benchmark which used in CRuby development.
Next benchmark program is make rdoc.
make rdoc generates the Ruby documentation.
This benchmark measures execution time and the GC execution time of make rdoc.
This shows result of `make benchmark`.
This shows rate of increased in performance that is compared original and parallel GC version.
Each line is result of each benchmark.
If the line points upwards it means that performance is increased.
If the line points downwards it means that performance is decreased.
Why does this seem so slow?
I think it's affected by Parallel Marking's preparation.
For example, creating marking threads and allocation of deques..
In most of the benchmarks, the mark target objects are few.
In this case, Parallel Marking cost is expensive.
Next benchmark is make rdoc.
it takes about 80 seconds on my machine.
In fact, 30% of that time is spent on GC!!
How much did performance improve?
Original GC takes 26 sec. But, ParallelGC takes 16 sec.
It's 10 second faster than original gc.
All GC time is improved by 40%!
I expect we get a large improvement in many core environment.
For example, 8 core and 16 core..
But, My machine has just 2 cores.
I can't see it.
I explain best case for Parallel GC.
If the objects are many.
In this case, mark targets is many.
If the objects are long-lived.
For example, Server-side application.
I want to show the performance improvement with Parallel GC.
This demonstration is video game style.
Game title is SUPER NARIO GC.
It's a traditional side-scrolling video game.
Character has HP.
When GC runs,
the character loses HP while waiting for the GC to finish.
If character's HP is empty, character is died. Game over!
We must reach the goal before HP run out.
GC is running in fixed intervals.
A lot of objects are generated to increase GC's burden.
Burden equals Game Level.
Try to compare Original GC and Parallel GC.
Original GC pause time is long.
This game will be difficult.
Parallel GC pause time is short.
This game will be easy.
OK, Let's try!
original GC version.
Oops.. so difficult!!!
Parallel GC version.
Let's compare average time of GC.
In original GC version, GC average time is xx second.
And GC count is xx.
Next, in parallel GC version, GC average time is xx second.
It's faster than original GC!!
Finally, I'd like to talk about Remaining Problems.
First problem, Windows OS is not supported.
Mark Worker uses pthread as native thread.
And, uses some gcc built-in functions.
But, I'll support for Windows eventually.
Next problem, increased memory usage.
Size of 1 Deque is roughly 32KB.
But generally multi-core machine have plenty of memory.
I implemented Parallel Marking GC.
GC was improved!
I'll report to ruby-core soon.
But, Parallel Marking has some problems.
I'll fix these.
You can get source code here if you are interesting.
Acknowledgments, Following people helped me make this presentation!!
I couldn't do this without you!!
OK, That's all. Thank you for your attention!!
Do you have any questions?
Please short and simple questions.