Created
December 10, 2012 23:56
-
-
Save rurban/4254490 to your computer and use it in GitHub Desktop.
Day 11 - Parrot threads
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<?xml version="1.0" ?> | |
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> | |
<html xmlns="http://www.w3.org/1999/xhtml"> | |
<head> | |
<title>Parrot threads</title> | |
<meta http-equiv="content-type" content="text/html; charset=utf-8" /> | |
<link rev="made" href="mailto:rurban@cpanel.net" /> | |
</head> | |
<body style="background-color: white"> | |
<h1><a name="parrot_threads">Parrot threads</a></h1> | |
<p><em>Editors note: I, rurban, does know almost nothing about threads. Any errors are probably mine. I just tested them, fixed some deadlocks, added the numcpu code and merged the threads branch to master.</em></p> | |
<p>Parrot now supports fast and lightweight OS threads, based on Nat Tucks's | |
initial GSoC work together with Andrew "whiteknight" Whitworth on green threads | |
and finally Stefan Seifert's extension to true parallel OS threads as hybrid threads. | |
See <a href="http://wknight8111.blogspot.co.at/2010/08/gsoc-threads-chandons-results.html">http://wknight8111.blogspot.co.at/2010/08/gsoc-threads-chandons-results.html</a> | |
and <a href="http://niner.name/Hybrid_Threads_for_the_Parrot_VM.pdf">http://niner.name/Hybrid_Threads_for_the_Parrot_VM.pdf</a></p> | |
<p>Parrot always supported <em>"threads"</em>, i.e. concurrency models over the last years, | |
but we identified various problems with the particular designs and were continuously | |
improving them. In our case without changing the API too much, as the pdd25 concurrency spec | |
is pretty high level, describing the various models parrot should support, and also pretty | |
low-level in describing the two PMC's which export the threads API, the <strong>Task</strong> and the | |
<strong>Scheduler</strong> classes.</p> | |
<p>Being born at a time when Perl 6 still looked much more similar to | |
Perl 5 than it does nowadays, Parrot's threading support initially was very | |
close to Perl's ithreads model. Previous attempts to change this into the | |
more conventional model of data shared by default or implementing new | |
technologies like STM "Software Transactional Memory" failed. For example Parrot | |
has never supported running multiple threads and having garbage collection | |
at the same time.</p> | |
<p>In the year 2005 development of faster Central Processing Units (CPUs) | |
shifted from increased speed of a single core to adding more cores. Modern | |
processors contain up to 12 cores with even mobile phones having up to four. | |
To utilize a modern CPU's power, code needs to be run in parallel. In UNIX | |
(and thus Perl) tradition, this is accomplished using multiple processes being | |
a good solution for many use cases. For many others like auto threading of | |
hyper operators in Perl 6, the cost of process setup and communication would | |
be prohibitively high except for very large data sets.</p> | |
<p>During the years of back and forth and failed attempts of adding threading | |
support to Parrot, the Perl 6 specification evolved to a point where the largest | |
parts of the language were covered and its features were implemented in the | |
compilers. The lack of concurrency primitives in Parrot however prevents | |
any progress in the area of concurrency support.</p> | |
<p>Green threads were used to simplify the implementation of a | |
nearly lock free multithreading implementation. | |
<a href="http://niner.name/Hybrid_Threads_for_the_Parrot_VM.pdf">http://niner.name/Hybrid_Threads_for_the_Parrot_VM.pdf</a></p> | |
<p>Parrot supports now native Win32 threads and POSIX threads. Win32 alarm, sleep | |
and premption is unified with POSIX, it is handled on a common timer thread.</p> | |
<p>Parrot creates at startup a thread pool of <code>--numthreads</code> threads, | |
which defaults to the number of available CPU cores. Activating a new | |
thread at runtime causes no run-time penalties, until the number of | |
cores is utilized. When a user starts a new task, the scheduler first | |
looks for an idle thread. If one can be found, the task is scheduled | |
on the thread's interpreter. If more tasks are started than the | |
maximum number of threads, the tasks are distributed evenly among the | |
running interpreters. This is effectively an implementation of the | |
N:M threading model.</p> | |
<p> | |
</p> | |
<h2><a name="green_threads">Green threads</a></h2> | |
<p>Our GSOC student Nan "Chandor" Tuck worked in summer 2010 on green threads.</p> | |
<p><em>What I have working now is a pre-emptively scheduled green threads system for Parrot that allows programs to be written in a concurrent style. Individual green threads can do basic blocking file input without stopping other threads from running. These logical threads are accessed using the Task API that I described a couple weeks ago.</em> | |
<em>This functionality makes Parrot similarly powerful at threading as the standard version of Ruby or Python: Threads can do pretty much everything except run at the same time.</em> | |
<a href="http://parrot.org/content/hybrid-threads-gsoc-project-results">http://parrot.org/content/hybrid-threads-gsoc-project-results</a></p> | |
<p>What was missing from this green threads branch was true parallel execution | |
in OS threads, one global_interpreter structure that is shared and | |
protected by locks or other concurrent access rules and many | |
local_interpreters that run simultaneously in separate OS threads.</p> | |
<p> | |
</p> | |
<h2><a name="__os_threads">+ OS threads</a></h2> | |
<p>From Fall 2011 to Summer 2012 Stefan "nine" Seifert implemented true | |
OS threads on top of green threads to finally allow true parallel | |
execution of Tasks, to implement blocking IO, and to give perl6 some | |
more advantages over perl5.</p> | |
<p>The lightweight "green" threads are used as messages in a system where | |
reading shared variables is allowed but only the one owner thread may write | |
to it. That's why we call it hybrid threads.</p> | |
<p> | |
</p> | |
<h2><a name="why_is_multithreading_support_so_difficult_to_implement">Why is multithreading support so difficult to implement?</a></h2> | |
<p>Low level programming languages like C provide only the bare necessities, | |
leaving the responsibility for preventing data corruption and synchronization | |
entirely to the user. A high-level language like Perl 6 on the other hand | |
provides complex and compound data types, handles garbage collection and | |
a very dynamic object system. Even seemingly simple things like a method | |
call can become very complex. In a statically typed programming language | |
the definition of a class is immutable. Thus, calling a method on an object | |
contains just the steps of determining the object's class, fetching the required | |
method from this class and calling it. Calling the same method again can | |
then even omit the first two steps since their results cannot change.</p> | |
<p>In a dynamic language, the object may change its class at runtime. The | |
inheritance hierarchy of the class may be changed by adding or removing | |
parent classes. Methods may be added to or removed from classes (or objects) | |
at runtime and even the way how to find a method of a class may change. | |
So a simple method call results in the following steps:</p> | |
<pre> | |
· determining the class of the object, | |
· determining the method resolution method of the class, | |
· finding the actual method to call, | |
· calling the method.</pre> | |
<p>These steps have to be repeated for every method call, because the results | |
may change any time. In a threaded environment, a thread running in parallel | |
may change the underlying data and meta data in between those sequences | |
and even between those steps. As a consequence, this meta data has to be | |
protected from corruption introducing the need for locks in a performance | |
critical area.</p> | |
<p>Many interpreters for dynamic languages like Python or Ruby handle | |
this problem by using a global interpreter lock (GIL) to effectively | |
serialize all operations. This is a proven and reliable way but leaves | |
much of the hardware's potential unused.</p> | |
<p> | |
</p> | |
<h2><a name="java">Java</a></h2> | |
<p>In Java, the user is responsible for preventing concurrency issues. The lan- | |
guage provides synchronization primitives like mutexes, but the interpreter | |
(the Java Virtual Machine, JVM) does not protect the consistency of the | |
provided data structures. The class library provides the user with high-level | |
data structures explicitly designed for multithreaded scenarios.</p> | |
<p>Java version 1.1 used green threads to support multithreaded execution | |
of Java programs. Green threads are threads simulated by the virtual | |
machine (VM) but unable to use more than one CPU core for processing. | |
Version 1.2 introduced native Operating System (OS) threading support | |
which since has become the standard way to do multithreading in Java.</p> | |
<p> | |
</p> | |
<h2><a name="python">Python</a></h2> | |
<p>The CPython implementation of the Python runtime uses a Global Inter- | |
preter Lock (GIL) to protect its internal consistency. This is a single lock | |
taken whenever the interpreter executes Python bytecode. Because of this | |
lock, only one thread can execute bytecode at any time so all built-in types | |
and the object model are implicitly type safe. The drawback is that Python | |
code cannot benefit from having multiple CPU cores available. However I/O | |
operations and calls to external libraries are executed without holding the | |
GIL, so in applications with multiple I/O bounded threads, there may still | |
be a performance benefit from using multithreading.</p> | |
<p>To run Python code in parallel, multiple processes have to be used. The | |
multiprocessing module provides support for spawning processes exposed | |
through an API similar to the threading module. Since processes may | |
not directly access other processes' memory, the multiprocessing module pro- | |
vides several means of communication between processes: Queues, Pipes and | |
shared memory support.</p> | |
<p> | |
</p> | |
<h2><a name="parrot">Parrot</a></h2> | |
<p>Much of Parrot's previous threading related code has been removed to | |
clean up the code and improve performance. Since the existing | |
threading support was known to be unreliable and seriously flawed, | |
this was no trade off. The final parts were removed by the merging of | |
the <code>kill_threads</code> branch on September, 21st 2011.</p> | |
<p>In 2010, Nat Tuck began working on a <code>green_threads</code> branch during | |
his Google Summer of Code internship. The feature got prototyped using | |
pure PIR and then implemented in Parrot's core. He got it to work in | |
simple cases and started to work on OS thread support but the | |
internship ended before the code was ready to be merged into the | |
master branch. The code lay dormant until the work on hybrid threads | |
in the <code>threads</code> branch started in 2011.</p> | |
<p>In Parrot, green threads are called Tasks. Each task is assigned a fixed | |
amount of execution time. After this time is up a timer callback sets a flag | |
which is checked at execution of every branch operation. Since the inter- | |
preter's state is well defined at this point, its internal consistency is guaran- | |
teed. The same holds for the GC. Since task preemption is only done while | |
executing user-level code, the GC can do its work undisturbed and without | |
the need for measures like locking. Since user-level code is allowed to dis- | |
able the scheduler, it can be guaranteed to run undisturbed through critical | |
sections.</p> | |
<p>The scheduler is implemented as a PMC type. This allows the user to | |
subclass this PMC thus allowing fine-grained control over the scheduling | |
policy. Features, a user could add this way would be for example giving | |
different priorities to tasks or implementing the possibility to suspend and | |
resume a task.</p> | |
<p> | |
</p> | |
<h2><a name="shared_data">Shared data</a></h2> | |
<p>Cross-thread writes to shared variables may endanger the internal | |
consistency of the interpreter. Traditionally, the solution to this | |
problem is the use of locks of varying granularity. Fine-grained | |
locking allows code to run in parallel but taking and releasing locks | |
costs per- formance. It not only increases the instruction count and | |
memory accesses but it also forces the CPU cores to coordinate and | |
thus communicate. Even a seemingly simple operation like an atomic | |
increment can take two orders of magnitude longer than a normal | |
increment. While the gain through being able to utilize multiple | |
CPU cores may offset this cost, it is still impacting | |
the common case of having only a single thread running.</p> | |
<p>Too coarse locking on the other hand would reduce scalability and the | |
performance gains through parallel execution by having threads wait for | |
extended periods for locks to become available. In the extreme case of having | |
a global interpreter lock it would effectively serialize all computations costing | |
much of the benefits of using threads in the first place.</p> | |
<p>The other problem with locking is the possibility of introducing | |
deadlocks. For example, two functions F1 and F2 both use two resources | |
A and B protected by locks. If F1 first locks A and then tries to lock | |
B while F2 has already locked B and is now trying to lock A, the | |
program would come to a halt. Both functions would be left waiting for | |
the other to unlock the resource which will never happen. With | |
fine-grained locking, the possibilities for such bugs grow quickly. At | |
the same time, it is easy to miss a case where a lock would be | |
appropriate leading to difficult to diagnose corruption bugs.</p> | |
<p>The solution for these problems implemented in this thesis is to | |
sidestep them altogether by disallowing write access to shared | |
variables. The programmer (or in most cases the compiler) is obliged | |
to declare a list of all shared variables before a newly created task | |
is started. The interpreter would then create proxy objects for these | |
variables which the task can use to access the data. These proxies | |
contain references to the original objects. They use these references | |
to forward all reading vtable functions to the originals. Write access | |
on the other hand would lead to a runtime error.</p> | |
<p>In other words, all data is owned by the thread creating it and only the | |
owner may write to it. Other threads have only read access.</p> | |
<p>For threads to be able to communicate with their creators and other | |
threads, they still need to write to shared variables. This is where | |
green threads come into play. Since green threads are light weight, it | |
is feasible for a thread to create a task just for updating a | |
variable. This task is scheduled on the interpreter owning this | |
variable. To reduce latency, the task is flagged to run | |
immediately. The data-owning interpreter will preempt the currently | |
running task and process the new write task. Put another way, the | |
data-owning interpreter is told what to write to its variables, so | |
other threads don't have to.</p> | |
<p> | |
</p> | |
<h2><a name="proxies">Proxies</a></h2> | |
<p>Proxies are the arbiters between threads. They are the only means for a | |
thread to access another thread's data and are implemented by the Proxy | |
PMC type.</p> | |
<p>Proxy has default implementations for all functions, writing functions raise | |
a cant_do_write_method runtime exception. If a method returns a PMC from the | |
target's interp, another proxy object has to be created and wrapped around it so it | |
can be safely returned to the caller.</p> | |
<p> | |
</p> | |
<h2><a name="sub">Sub</a></h2> | |
<p>The Sub PMC represents executable subroutines. A Sub does not only con- | |
tain the code to execute but also the context in which to execute the code | |
such as visible globals and namespaces. If a proxy to such a Sub were created | |
and invoke called on it, the code would access this context directly since it | |
belongs to the same interp as the proxied Sub itself. Thus, an operation like | |
<code>get_global</code> fetches a global from an unproxied namespace and an unproxied | |
global is be put into the target register. Since this is happening while run- | |
ning invoke on the original Sub, Proxy cannot intercept the call and create | |
a Proxy for the result.</p> | |
<p>This is the reason why <code>Parrot_thread_create_proxy</code> does not create | |
a Proxy for a Sub but uses <code>Parrot_thread_create_local_sub</code> to create a | |
copy on the thread's interp with proxies for all PMC attributes.</p> | |
<p> | |
</p> | |
<h2><a name="writing_to_shared_variables">Writing to shared variables</a></h2> | |
<p>As described in chapter 5, to write to shared variables, a thread creates a | |
task and schedules it on the data owning interpreter. An example task looks | |
like this:</p> | |
<pre> | |
.sub write_to_variable | |
.param pmc variable | |
variable = 1 | |
.end</pre> | |
<p>This is a subroutine with just one parameter. The variable passed as this | |
parameter is the one the task should write to. In this case the constant value | |
1 would be written to the variable. In PIR, an assignment to a PMC gets | |
translated to a method call. In this case, the set_integer_native is called | |
changing the variable's value. Since PMCs are passed by reference, it is the | |
original variable which gets written to.</p> | |
<p>Code to create the task looks like:</p> | |
<pre> | |
1 write_task = new ['Task'] | |
2 setattribute write_task, 'code', write_to_variable | |
3 setattribute write_task, 'data', shared_variable | |
4 interp.'schedule_proxied'(write_task, shared_variable)</pre> | |
<p>Line 1 creates a new task object. The example subroutine is used for the | |
task's code attribute. <code>shared_variable</code> is used for data. At this point, | |
<code>shared_variable</code> is actually the proxy object created for the shared integer PMC. | |
The interpreter object contains a <code>schedule_proxied</code> method which is used | |
to schedule the <code>write_task</code> on the thread owning the original variable.</p> | |
<p><code>schedule_proxied</code> uses <code>Parrot_thread_create_local_task</code> which in this | |
case detects that the data given as parameter for the task's code is actu- | |
ally a proxy already and unwraps the proxied object. <code>Parrot_cx_schedule_immediate</code> | |
is then used to make the data owning interpreter execute the task | |
as soon as possible.</p> | |
<p>To protect a critical section, preemption can be disabled so the critical | |
section runs uninterrupted:</p> | |
<pre> | |
1 .sub swap_variables | |
2 .param pmc a, b | |
3 .local temp | |
4 disable_preemption | |
5 temp = a | |
6 a = b | |
7 b = temp | |
8 enable_preemption | |
9 .end</pre> | |
<p> | |
</p> | |
<h3><a name="wait">wait</a></h3> | |
<p>Using tasks to write to shared variables makes such actions inherently | |
asynchronous. This is not always what is needed by the implemented | |
algorithm. For example, when the shared variable is a lock, | |
processing should continue as soon as it's acquired. The wait | |
operation is used to wait for a task's completion. The waiting task | |
is added to the waited for task's waiters list and preempted | |
immediately. When a task finishes, all the tasks in the waiters list | |
are scheduled again for execution. Since for each task a local copy is | |
created on the target thread, the running task not only checks its own | |
waiters list but also its partner's.</p> | |
<p>If a task on the main thread was waiting for a task on another thread to | |
finish and no other tasks are in the scheduler's queue on the main thread, | |
the main thread exits if no alarms are pending. To prevent this unintended | |
exit, all tasks are added to the scheduler's foreign_tasks list when they are | |
scheduled on other threads. To end the program with other threads still | |
running, an explicit exit operation has to be used.</p> | |
<p> | |
</p> | |
<h2><a name="benchmarks">Benchmarks</a></h2> | |
<p>Preliminary benchmarks have shown Parrot's performance to be within | |
an order of magnitude of that of an optimized implementation in Perl 5.</p> | |
<p>Since Parrot does not yet offer the user any synchronization primitives, locks | |
had to be implemented using a shared variable which is written to only by | |
the main thread. Replacing this primitive method with a native semaphore | |
implementation would probably reduce runtime to a small fraction.</p> | |
<p> | |
</p> | |
<h3><a name="runtime_comparison_for_matrix_multiplication">Runtime comparison for matrix multiplication</a></h3> | |
<pre> | |
singlethreaded computation multithreaded computation | |
1. run 28.522 s 19.530 s 17.543 s 8.478 s | |
2. run 28.427 s 19.463 s 17.320 s 8.283 s | |
3. run 28.200 s 19.235 s 17.489 s 8.473 s | |
average 28.383 s 19.409 s 17.451 s 8.411 s</pre> | |
<pre> | |
This test implements matrix multiplication using four threads. For | |
simplicity the second matrix only has one column. The program is | |
written in the Winxed programming language. Winxed is a low-level | |
language with Javascript like syntax and the possibility to include | |
sections of PIR code verbatim making it possible to try experimental | |
opcodes while writing more readable and concise code than with PIR | |
alone. The complete source code is available in | |
F<examples/threads/matrix_part.winxed></pre> | |
<p>The program consists of the parts initialization, computation and ver- | |
ification. Computation is parallelized using four tasks each | |
calculating one fourth of the result vector. Runtime is compared to a | |
simple singlethreaded implementation. Run times were measured using | |
the time command and are recorded in the above table.</p> | |
<p>As can be seen, the multithreaded implementation gives an average | |
speedup of 2.31 for the computation and 1.61 in total.</p> | |
<p> | |
</p> | |
<h3><a name="runtime_comparison_for_mandelbrot_set_calculation">Runtime comparison for Mandelbrot set calculation</a></h3> | |
<pre> | |
singlethreaded 1 thread 2 threads 4 threads 8 threads | |
1. run 89.931 s 89.978 s 45.813 s 24.028 s 17.445 s | |
2. run 89.707 s 89.871 s 45.906 s 24.048 s 17.695 s | |
3. run 90.318 s 89.839 s 45.951 s 24.049 s 17.573 s | |
average 89.985 s 89.896 s 45.890 s 24.042 s 17.571 s | |
speedup 1.000 1.001 1.959 3.739 5.116</pre> | |
<p>The complete source code is available in <em class="file">examples/threads</em></p> | |
<p>Calculating an image of the Mandelbrot set is a common benchmark for | |
multithreading implementations since calculations of points are | |
independent of each other and are thus easily parallelizable. A simple | |
implementation of the escape time algorithm written in Winxed has been | |
used to determine scalability properties of the threading | |
implementation. The image is split into lines which are calculated | |
alternatedly by a configured number of tasks. Run times were measured | |
using the time command on an Intel Core i7 3770K processor with 16 GiB | |
RAM running openSUSE 12.1 and are recorded in the Mandelbrot table. As | |
can be seen, the implementation scales nearly linearly up to four | |
threads reflecting the CPUs four physical cores. Using eight threads, | |
the speedup is only 1.368 compared to four threads but this seems to | |
be more a limitation of the hardware than the implementation.</p> | |
</body> | |
</html> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment