pthreads.md

Multi-Threading in PHP with pthreads

A Brief Introduction to Multi-Threading in PHP

  • Foreword
  • Execution
  • Sharing
  • Synchronization
  • Pitfalls
  • WTF ??

Preface

If you are a PHP programmer who spends a lot of time at the console, or someone who is interested in high performance modern programming of PHP, this document is for you.

The intention here is to provide information concise and short enough that you (and the community at large) remember it; in the hope that one day all of this will be common knowledge among PHP programmers.

By the end of the document, you should have a clear understanding of how, and why, pthreads exists, and executes.

If you have any comments, suggestions or insults please forward them to krakjoe@php.net

Insults will be ignored.

Foreword

Since PHP4, May 22nd 2000, PHP has been equipped to execute isolated instances of the interpreter in multiple threads within a single process, without any context interfering with another. We call this TSRM (the Thread Safe Resource Manager); it is a rarely studied, omnipresent part of PHP that nobody really talks about.

If you have ever used XAMPP or PHP on Windows, it’s likely that you used a threaded PHP without even knowing it.

TSRM has the ability to create isolated instances of the interpreter, which is how pthreads executes userland threads in PHP. The instances of the interpreter are as isolated as they are when executing any threaded build of PHP, the Apache2 Worker MPM PHP5 Module for example. The job of pthreads is to facilitate communication and synchronization between the otherwise isolated contexts.

Exactly how TSRM works is beyond the scope of this document, and would only confuse the reader (and the subject); suffice to say that PHP has been able to work in a multi-threaded environment for more than a decade. The implementation is stable. There is, however, one well known but completely misunderstood pitfall, which I shall clarify: PHP is a wrapper around third-party libraries, and every part of PHP is implemented like this. If a third party does not implement their library in a re-entrant (thread safe) way, then the PHP wrapper for that library will fail and/or cause unexpected behaviour during execution. A well known example of such a library is locale. It should be clear that this is beyond the control of PHP or pthreads. Such libraries are well known (documented) and/or obvious; the vast majority of extensions will have no problem executing in a pthreads application.

Threading in user land was never a concern for the PHP team, and it remains as such today. You should understand that in the world where PHP does its business, there's already a defined method of scaling - add hardware. Over the many years PHP has existed, hardware has got cheaper and cheaper and so this became less and less of a concern for the PHP team. While it was getting cheaper, it also got much more powerful; today, our mobile phones and tablets have dual and quad core architectures and plenty of RAM to go with it, our desktops and servers commonly have 8 or 16 cores, 16 and 32 gigabytes of RAM, though we may not always be able to have two within budget and having two desktops is rarely useful for most of us.

In addition to the concerns of the PHP team, there are concerns of the programmer: PHP was written for the non-programmer; it is many hobbyists' native tongue. The reason PHP is so easily adopted is that it is an easy language to learn and write. Multi-threaded programming is not easy for most; even with the most coherent and reliable API, there are different things to think about, and many misconceptions. The PHP group do not wish for user land multi-threading to be a core feature, it has never been given serious attention - and rightly so. PHP should not be complex, for everyone.

All things considered, there are still benefits to be had from allowing PHP to use its production ready, tested features to make the most of the hardware we have. Adding more isn't always an option, and for a lot of tasks it is never really needed if you can take advantage of all you have.

A note about nothing, or more precisely, sharing nothing: the architecture of PHP is referred to as Shared Nothing. This simply means that whenever PHP services a request, via any SAPI, its environment, in the sense of the data structures PHP requires to operate, is isolated from that of every other request. On the surface, pthreads would appear to violate this standard and break the architecture that keeps PHP executing. Relax, this is not so. In fact, another job of pthreads (one that is never evident to the programmer) is to maintain that architecture; it does this using copy-on-read and copy-on-write semantics and carefully programmed mutex manipulation. The upshot is that any time a user reads from or writes to an object, or executes its methods, it is safe to assume that the operation was safe, and there is no need for further action such as the explicit use of a mutex by the programmer.

Terms in the foreword that are new to the reader should now be researched, as they may appear throughout this document.

Execution

Threading is about dividing your instructions into units of execution, and distributing those units among your processors and/or cores in such a way as to maximize the throughput of your application.

This should always be done using as few threads as possible.

pthreads exposes two models of execution: the Thread model and the Worker model. They expose much of the same functionality to the programmer, and are internally very similar, with one key difference: what they consider to be the unit of execution.

A Thread is representative of both an interpreter context and a unit of execution (that’s its ::run method).

A Worker is representative of an interpreter context; its ::run method is used to configure that context. The unit of execution for this model is the Stackable, more precisely Stackable::run.

When the programmer calls Thread::start, a new thread is created, a PHP interpreter context is initialized and then (safely) manipulated to mirror the context that made the call to ::start. Execution continues concurrently in both contexts at this point. Execution in the Thread is passed to the ::run method of the Thread. At the end of the ::run method the context for the Thread is destroyed.
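A minimal sketch of the Thread model described above, assuming the pthreads extension is loaded on a thread safe (ZTS) build of PHP. SquareTask and its members are hypothetical names chosen for illustration; Thread::join is used so the creating context waits for ::run to finish before reading the result.

```php
<?php
// Assumption: pthreads is installed and PHP was built with ZTS enabled.
// SquareTask is a hypothetical class used only for illustration.
class SquareTask extends Thread {
    public $input;
    public $result;

    public function __construct($input) {
        // Runs in the creating context, before the thread exists.
        $this->input = $input;
    }

    public function run() {
        // Runs in the new interpreter context: this is the unit of execution.
        $this->result = $this->input * $this->input;
    }
}

$task = new SquareTask(7);
$task->start(); // a new thread and interpreter context are created here
$task->join();  // block until ::run has returned

echo $task->result, PHP_EOL;
```

Both contexts run concurrently between ::start and ::join, so any work that does not depend on the result can be placed between the two calls.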

When the programmer calls Worker::start, a new thread is created and a PHP interpreter context is initialized in the same way as for a normal Thread. When execution in the Worker leaves Worker::run, the Worker begins to pop Stackables from the stack and execute them in the order they were stacked. If there are no items on the stack the Worker will wait for some to appear. The Worker will continue to do this until Worker::shutdown is called. If Worker::shutdown is called while items remain on the stack they will be executed first, and the context that called Worker::shutdown will block until shutdown can occur.
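The Worker model can be sketched in the same way. This assumes a pthreads release that still provides the Stackable class, on a ZTS build of PHP; SetupWorker and Job are hypothetical names used only for illustration.

```php
<?php
// Assumption: pthreads (a release that provides Stackable) on a ZTS build.
// SetupWorker and Job are hypothetical classes used only for illustration.
class SetupWorker extends Worker {
    public function run() {
        // Runs once when the Worker starts: configure the context here.
    }
}

class Job extends Stackable {
    public $id;
    public $done;

    public function __construct($id) {
        $this->id = $id;
        $this->done = false;
    }

    public function run() {
        // The unit of execution in the Worker model.
        $this->done = true;
    }
}

$worker = new SetupWorker();
$worker->start();

// Hold on to every job for as long as the Worker may touch it.
$jobs = array();
foreach (array(1, 2, 3) as $id) {
    $jobs[$id] = new Job($id);
    $worker->stack($jobs[$id]);
}

// Items still on the stack are executed before shutdown completes.
$worker->shutdown();

foreach ($jobs as $job) {
    echo "job {$job->id} done: ", var_export($job->done, true), PHP_EOL;
}
```

One context serves all three jobs here, which is the saving the Worker model offers over starting one Thread per job.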

Great care should be taken to avoid wasting contexts unnecessarily; starting a Thread or Worker is not free. Where you can, use the Worker model; this almost eliminates the tendency to be wasteful while multi-threading. Almost, but not completely ...

There is a tendency to be wasteful; it’s a common misunderstanding to think that threading anything can make it faster. It cannot. More threads do not always equate to more throughput, in the same way as more water does not always equate to wetter.

[Image: wetness]

Thinking outside the box is a prerequisite of a good multi-threaded programmer; common sense should dictate that more water does mean wetter, but if you consider the central point of the bottom of the bowl: Once it is wet, it does not matter how much water you place on top, it cannot get wetter ...

Too much water, or threads, and you will drown.

The author of pthreads will not take responsibility for drowning programmers, or their code.

Sharing

Threading would be rather useless if threads could not manipulate a common set of data, which appears to be a problem in a shared nothing architecture. I don’t see shared nothing as a hindrance, I see it as a rather big helpful push in the right direction.

One of the normal problems for a programmer writing multi-threaded code is the safety and synchronization of data; it is normally very very easy to corrupt an array if 10 threads manipulate it at once.

Shared Nothing solves this problem; if no two contexts ever manipulate the same data then they cannot corrupt each other's stack, and the architecture is maintained along with its stability.

Objects descending from pthreads utilize a thread safe member storage table that works slightly differently to that of any other object. When you write a member to such an object, the table is locked, the data is copied and stored in the table, and the lock is released. When a subsequent read of that member occurs, the table is locked, the data is copied for return, and the lock is released. This means that no two contexts ever manipulate the same physical data - Shared Nothing.

Some data does not lend itself to being easily copied; PHP has a solution to this in the form of the serialization API. Serialization is utilized on arrays, and on objects not descended from pthreads. Objects descended from pthreads are never serialized; as such, you should always use pthreads objects as containers for data you intend to manipulate in multiple contexts.
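The copy-on-write and copy-on-read behaviour can be pictured with an analogy in plain PHP. This is a sketch of the semantics only, not of how pthreads is implemented: CopyingStore is a hypothetical class, it does no locking, and it runs in a single context.

```php
<?php
// Plain PHP analogy only; CopyingStore is hypothetical and not part of pthreads.
class CopyingStore {
    private $table = array();

    public function write($name, $value) {
        // Copy on write: keep a serialized snapshot, not the caller's value.
        $this->table[$name] = serialize($value);
    }

    public function read($name) {
        // Copy on read: every reader receives a fresh copy.
        return unserialize($this->table[$name]);
    }
}

$store = new CopyingStore();

$point = new stdClass();
$point->x = 1;
$store->write('point', $point);

$point->x = 2;                   // changes only the caller's object

$first = $store->read('point');  // a copy of the stored snapshot
$first->x = 10;                  // changes only this copy

$second = $store->read('point'); // unaffected by either change above

echo $point->x, ' ', $first->x, ' ', $second->x, PHP_EOL;
```

A practical consequence of this model is that a change made to a copy only becomes visible to other readers once the changed value is written back.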

All objects descending from pthreads can be manipulated, by any context with a reference, as arrays and objects; they also include methods for manipulating members in a thread safe manner. There shouldn’t be a kind of data set you cannot implement with what is exposed by pthreads, and basic sets (arrays) are built in.

This is all done in such a way that minimizes memory usage while still maintaining architecture and safety. It may seem wasteful, but it’s a small price to pay, that diminishes with the price of memory.

Synchronization

Sharing isn’t enough; the last piece of the puzzle is synchronization. This is going to be a topic completely alien to a lot of programmers.

While you are executing, and sharing, you must also be able to control when to share, and when to execute; it is no good trying to manipulate data that does not exist!

Synchronization can be used to put a thread into a receptive, but sleepy state, known as waiting, and can be used to awaken such a thread, known as notifying.

Synchronizing with a unit of execution is easy, but it does come with a danger of misuse. I hope the brief, simple explanation that follows will stick in your mind and help you to avoid it.

Make this your mantra: Only ever wait FOR something

$this->synchronized(function(){
    $this->wait();
});

The above code looks simple enough, but what or who is it waiting for, and what happens if whatever it is waiting for has already sent notification? Waiting forever is the price for not paying attention to your own mantra.

The syntax of synchronization may look a bit strange; here’s an explanation that gives you a good reason to keep typing all that stuff: when you call ::synchronized a mutex (lock) is acquired. When you call ::wait, that mutex is atomically released, allowing other contexts to acquire it while the waiting context blocks on a condition, waiting for notification; the mutex is acquired again before ::wait returns.

Waiting for something looks like this:

$this->synchronized(function(){
    if (!$this->data) {
        $this->wait();
    }
});
/* I can manipulate $this->data and know it exists */

While notification looks like this:

$that->synchronized(function($that){
    $that->data = "some";
    $that->notify();
}, $that);

In the notification example, you ensure that the waiting context is not left hanging around forever. If the notifier acquires the synchronization lock before the other context has started waiting, then that context need not wait at all: by the time it can acquire the synchronization lock, the data is already set, and the check on $this->data skips the call to ::wait. If the notifier acquires the lock because it was atomically released by a waiting thread, the call to ::notify awakens that thread and it continues executing.
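The two halves can be put together in one runnable sketch, assuming pthreads on a ZTS build of PHP. Producer is a hypothetical class; here the thread notifies and the creating context waits. The wait is placed in a while loop rather than an if, a common defensive choice so that the condition is checked again after every wake-up.

```php
<?php
// Assumption: pthreads on a ZTS build. Producer is a hypothetical class.
class Producer extends Thread {
    public $data;

    public function run() {
        // Set the data and notify while holding the synchronization lock.
        $this->synchronized(function () {
            $this->data = "some";
            $this->notify();
        });
    }
}

$producer = new Producer();
$producer->start();

// Only ever wait FOR something: here, for $data to be set.
$producer->synchronized(function ($thread) {
    while (!$thread->data) {
        $thread->wait();
    }
}, $producer);

/* $producer->data is known to exist from here on */
echo $producer->data, PHP_EOL;

$producer->join();
```

Whichever context acquires the lock first, the outcome is the same: either the data is already set and the loop body never runs, or the waiter releases the lock inside ::wait and is awoken by ::notify.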

This kind of explicit synchronization can make for powerful programming, study it well.

Pitfalls

The garbage collection built into PHP was never prepared for this kind of prolonged execution. If pthreads followed the PHP way and edited the reference counts of objects when it accepted them (as an argument to a method, or as the data for a member property), memory usage would soar and it would become difficult to retain control of your own code.

So we do not do the done thing; in a pthreads application, you are responsible for the objects you create, you are also responsible for retaining a reference to objects that are going to be executed, or accessed from other executing contexts, until that execution or access has taken place.

This circumvents the problem of out of control memory usage, but it creates another problem; dreaded segfaults.

Segmentation faults occur when you instruct a processor to address memory that it cannot access; they result in execution being aborted. The prime suspect when you encounter segmentation faults during development is an object being referenced after it was destroyed in the context that originally created it.

Avoiding these segmentation faults sounds much more complex than it really is; this is best illustrated with a (bad) example:

class W extends Worker {
    public function run(){}
}
class S extends Stackable {
    public function run(){}
}
/* 1 */
$w = new W();
/* 2 */
$j = array(
    new S(), new S(), new S()
);
/* 3 */
foreach ($j as $job)
    $w->stack($job);
/* 4 */
$j = array();
$w->start();
$w->shutdown();

The above example will always segfault; steps 1-3 are perfectly normal, but before the Worker is started the stacked objects are deleted, resulting in a segfault when the Worker is allowed to start. Your code will not always look so explicit, but if you can see a route where this could conceivably happen, then program a different way.
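For contrast, a sketch of the same program with the references retained until execution has taken place, assuming a pthreads release that provides Stackable on a ZTS build of PHP. The $ran member and the $executed counter are additions for illustration, so that there is something to observe.

```php
<?php
// Assumption: pthreads (a release that provides Stackable) on a ZTS build.
// W and S mirror the bad example; $ran and $executed are illustrative additions.
class W extends Worker {
    public function run() {}
}

class S extends Stackable {
    public $ran;

    public function run() {
        $this->ran = true;
    }
}

$w = new W();
$j = array(
    new S(), new S(), new S()
);
foreach ($j as $job) {
    $w->stack($job);
}

$w->start();
$w->shutdown(); // blocks until the stacked objects have been executed

// The jobs still exist here, so they can be inspected safely.
$executed = 0;
foreach ($j as $job) {
    if ($job->ran) {
        $executed++;
    }
}
echo $executed, PHP_EOL;

// Only now is it safe to let go of the references.
$j = array();
```

The only change that matters is the order: the array is emptied after Worker::shutdown has returned, not before the Worker starts.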

Other symptoms of this kind of programming error are the fatal error

Call to a member function member() on a non-object in /my/code.php

and the notice

Trying to get property of non-object in /my/code.php

If you experience these errors, carefully look over your code and make sure everything you have passed to any other context exists all the time it is being referenced or executed in any other context.

This is probably the hardest part of creating applications with pthreads, but it doesn't take a lot to avoid; plan with care, and program with even more care.

WTF ??

I hear the criticism that I have taken something simple, that's PHP, and made it more complex by exposing this kind of functionality. I hear you; I would argue that I have taken something complex, and made it relatively simple.

Something being complex, or difficult, is no kind of justification for avoiding it. The complexity of anything should decrease as your knowledge increases, if it does not, then you are not taking in the right kind of information. This is the nature of learning.

To the idea that I haven't made anything simple; oh rly? If the task is simple: get two things done at once, the implementation is simple. The fact that you are even considering complex ideas is the thing you should be paying attention to!!

To the rest of the nay-sayers: Progress is made by pushing forwards, when we all push at once, we make more progress !!

Even if you hate the idea, I hope I've said enough to convince you to give it a try before you form a long lasting opinion that will affect your decisions in the future, what is the worst that can happen !?

Great write-up, thanks a lot! Makes me want to try threads in PHP.
One question on the garbage collection: did I understand correctly that it still happens, but only looks at what happens in the thread that created the variable? So no need for explicit 'free' calls or anything?

Sorry for the delay, bit busy ...

That's right yeah, garbage collection still occurs for each thread, and it's allowed to collect the original references you used to set members (because pthreads doesn't change refcount) and the data in variables as a result of reading the object context, but the data in the object is retained until the thread or context that created the object is destroyed, or the object is explicitly or implicitly overwritten, either because it goes out of scope or its container is reused ...

There's no need for explicit free, but if you look at some of the examples, some thought must be given to memory usage if you intend to execute for a long time (the normal php way will work for short lived executions as it does now), this is not so much because of pthreads as it is because of php: if you keep $staking[] = "onto"; arrays you will run out of memory eventually in any script ...

Joe, thanks for this writeup, and thanks for the quick response to the issue I was having today.

I do want to say one thing in response to this:

The PHP group do not wish for user land multi-threading to be a core feature, it has never been given serious attention - and rightly so. PHP should not be complex, for everyone.

Threading certainly gives you enough rope to hang yourself with, but no more than fork(), and that's built in via PCNTL. The stuff you need to have to simulate threading with fork() -- SysV IPC -- is all built in as well. At least with threading there's only one thing to kill to clean up, and no reason to expose new developers (who often are not sysadmins) to ipcs and ipcrm and the potential damage they can cause to other running processes with the latter.

Anyone suggesting threading in php is a bad idea, period, probably just hates it in general thanks to the bad rep it earned in v4 and earlier.

@alandsidel morning ...

Threading is hard in any language, and I stand by what I said; PHP should not be complex, for everyone.

What you say is (nearly) completely correct, depending on perspective multi-threading is simpler than multi-processing, it is faster, it is more efficient .... however, we don't see from our perspective what you see from yours; we see the implementation - that's not simple - there's only a handful of people on the php core team that are well equipped enough to debug or develop such a thing, and none of them did, in 15 odd years, after several requests from the community to introduce threading.
My estimation of a handful may well be incorrect; in reality, I write and develop it on my own and when I bring it up in IRC, arguments ensue, so I haven't, in many many months.

When I first wrote it, I was up for pushing through an RFC to include it, 18 months later, hundreds and hundreds of bug reports on git, email, irc etc; the fact is, most people don't get it, they don't read the documentation, or this article (which I link everyone to, all the time), they install it like they would a new database or redis driver and carry on writing however they normally write. There will never be an implementation of threading that is as relaxed as normal PHP in its operation; you do have to care about what you're doing, you cannot do everything 40 ways, there aren't ways to fix every leak or segfault - that's nothing to do with threading, PHP's master repository has known leaks and segfaults in it so I have not much chance of complete perfection.

I'm happy enough that pthreads got to be something we can call stable, and we can call it stable, despite the fact you can make it segfault or leak, PHP has a long tradition of allowing you to shoot yourself in the foot, and pthreads holds up that tradition.
It's worth mentioning, I've been writing concurrent java applications for a long long time, the JVM can leak, despite all the articles telling you it cannot, and that memory is completely managed ... it can leak and you can watch it, there are java profilers galore for that reason, it will throw fatal exceptions for no good reason ... you can make it misbehave in the same and even stupider ways as you can make pthreads misbehave - if you don't do things properly.

Right now, there's just nothing to gain from trying to get pthreads into the core, most users do not have a TS install and so cannot "just use it", nor could we make TS the default build configuration since it has a performance penalty (not much) that some users would be unhappy about, although it's got to be said that most won't actually notice the difference, and as soon as you start creating threads the penalty diminishes.
If, at some point in the future, there's good reason to push for threading to get into the core, then I'll start pushing; if some API is exposed or created that I require for better performance, or if the distinction between ZTS and NTS goes away and everyone can use it, or something of that nature, where I can see there is actual benefit ...

Thanks for your kind words, sorry bout the delay, don't check gist much :)

Joe, thanks a lot for this amazing article on threading and detailed comment replies. In my opinion, more people from among experienced developers should understand pthreads clearly as you mentioned and contribute back. Hope they do so soon. I also have a suggestion: this article could be linked to from the http://www.pthreads.org site or the github page (https://github.com/krakjoe/pthreads), as it's very useful in understanding how pthreads works internally and clarifies a number of doubts.

I have added the article to gh-pages and the README.md ... thanks for the suggestion ...

Regarding rebuilding PHP to use ZTS; is there a preference on the two options I see?

1). enable-roxen-zts
2). enable-maintainer-zts

EDIT: Answer in the php docs: "pthreads requires a build of PHP with ZTS (Zend Thread Safety) enabled ( --enable-maintainer-zts or --enable-zts on Windows )" Thanks for pthreads!

Roxen is a SAPI module ... the required option is --enable-maintainer-zts
