
@voodootikigod
Created June 22, 2015 15:26
JSConf US 2015 Track A Transcript for Naveed Ihsanullah: Parallelism experiments in JavaScript

That was a very kind introduction. Thank you very much. So how is everyone doing at JSConf? I heard I have no competition at the beach. Everyone is here. Full house. Thank you for coming by. So yeah. Like I said, I want to talk to you about concurrency and parallelism in JavaScript. Maybe a brief introduction is warranted. I do lead the JS team, and for the last several years we've been thinking about these problems in JavaScript and how we can address them. I would like to share some of the ways we've come around to improving concurrency and parallelism in JS. But before we go down that road, let's actually talk about some definitions. Concurrency and parallelism often come up together, but they're not the same thing. The way I think about it is: concurrency is about how you make an application more usable. Here we have two queues of people and a single revolving door. You can come up with some logic for them: have one whole queue go through first, or have them take turns -- some sort of multitasking that utilizes the one resource and makes sure everyone gets through. Examples inside a running program or application might be pre-emptive multitasking, as in Windows 95, or the async programming that we're all very familiar with. Parallelism is all about how you can be faster by leveraging or exploiting hardware availability. So in this case, we now have two revolving doors, and we can theoretically go twice as fast, because there's no contention and no complexity in my example. But the real world is not like that. The real world has complexity, and the reason why concurrency and parallelism are often mixed together is that in the real world, it's really hard to separate the two aspects.

And you have to be very sensitive in the way you build your application to avoid deadlocks and races. So now that we've talked about those definitions, I want to explain why improving this actually matters in JavaScript. And to do that, let's talk about the evolution of the web application. Back in the day, not quite the '80s, but way, way back, we all had a homepage that looked something like this. I'm sorry. Mine is under construction at the moment, but maybe it will evolve a little. Static HTML and GIFs can only take us so far. In 1995, though, something really big happened. I think most of us know what this is. Any guesses? I think I heard it. Yes. JavaScript was invented, and it was good. The universe of... I think I came to the right venue to make that comment. The universe of the modern interactive web was really born. And homepages were never the same again. In the next 20 years, home pages and websites evolved past window.alert, and now you have desktop applications that have actually gone non-native. Here we have an example of Excel, and most of the Microsoft Office suite is now available as a web application.

JavaScript and the web platform are actually capable of supporting even virtual reality technology. Here we have a couple of frames from the Polar Sea project that MozVR is working on. So today's JavaScript is really fast. We track most of the modern browsers, their performance, and how they're doing over time. And here you can see that we're all getting a lot faster. And that's continued to be true for other factors as well, because we're inventing other technologies, like asm.js, that allow us to bypass some of the inherent bottlenecks in regular JavaScript. In this slide, the bottom line is native: C++ compiled code. And JS running on Firefox is only about 2X of that. We have techniques for certain workloads that can get much closer to native -- 1.5X. So if JavaScript is so fast, why do we have a problem?

Well, to illustrate that, I have a snapshot here of my Windows desktop. Right now you can see that the desktop doesn't seem to be doing a whole lot. We're running at 5% CPU utilization, and I have a ton of applications open, including Firefox with, I believe, at the moment, 30 tabs. So I open one more tab with a really intensive JavaScript application, and you'll see that PerfMon seems to be indicating something. For those of you who are not familiar with Windows and PerfMon, each of these squares represents a CPU core and how much work it's doing. In this illustration, cores three and four -- they're virtual cores, but we'll not worry about that detail -- are doing a lot of work, but the rest of the CPU is mostly just sitting there, and overall utilization is at 24% out of a possible 100. Right?

So the computer is still 75% idle. Why does this matter? Well, the gigahertz race is effectively at an end. It's hard to make CPUs faster than they are now; power and heat have stopped our ability to innovate there. So for future hardware, the way we're scaling is by adding additional cores or execution units. Your modern smartphone typically has two cores, and some of them are even quad-core. Laptops and desktops have four to eight virtual cores, and workstations can have 4 to 20 cores per CPU, and they may have multiple CPUs on their motherboards. So far I've only talked about the parallelism aspect -- how to do work at the same time. But concurrency has its issues too. We're always waiting on something in a JavaScript application: waiting on events, on the user, on the cache, and of course always waiting on the internet to do something.

So improving parallelism and improving concurrency can really make a big difference to a host of applications. Some of the application types are of course games, AI, rendering, and code generation, and each of these specific application types has its own domain-specific way of dealing with these constructs -- something that, if we do solve this problem, we want to preserve if possible. While looking at this problem and investigating different ways to deal with it: web workers are already around. Web workers do promote parallelism. You can do something on a thread. They tend to be very heavy threads, but it works. The real issue for saturating a core doesn't seem to be spinning up another thread; it's communication between the threads. And coordination is really about concurrency. But concurrency does exist on the web today. It exists in JavaScript. We've had, at this point, Parallel JavaScript; postMessage, which will always be around and is very useful; and then the buffer transfer semantics. So let's dive into each one of these and see where they may have some limitations.

Parallel JS, for those who are not familiar with it, was an experimental effort to see if we could bring some of the semantics of MapReduce to JavaScript and enable a whole slew of application types. It looked like a casual API, but it really took some sophistication to use it, so the average programmer couldn't make that much out of it, and it had some really high implementation costs in the engine, so it was hard to support going forward. postMessage, obviously, is around and incredibly useful. But it also has its own set of issues. It's dependent on the event loop. The performance, which we'll go into a little bit more later, just isn't there, and you cannot share state. You're transferring state over, really one message at a time.

And if you care about the response, you have to transfer a new state back. It's possible with postMessage to transfer a buffer entirely by reference, and that solves some subset of problems, but you still can't work on the same data at the same time. This really is a bottleneck, a coordination bottleneck, and it's something that we want to address. So while looking at different possible ways of handling this, it sort of became apparent that native has obviously handled this already. The operating system libraries have a slew of constructs that allow us to solve concurrency problems, solve them well, and solve them in ways that each individual application can tailor for its own needs.
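To make the transfer semantics concrete, here is a minimal sketch, not from the talk, of handing an ArrayBuffer to a worker by reference; the worker file name is hypothetical. The key point is that ownership moves with the buffer, so the two sides still can't touch the same data at the same time.

```js
// Hedged sketch of postMessage transfer semantics (not the talk's code).
const worker = new Worker('worker.js');    // hypothetical worker script
const buf = new ArrayBuffer(1024 * 1024);  // 1 MiB buffer to hand over

// Listing buf as a transferable moves it instead of copying it.
worker.postMessage(buf, [buf]);

console.log(buf.byteLength);               // 0 -- the buffer is detached here,
                                           // so only the worker can use it now
```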

So that helped us come up with some design considerations for what our ideal solution might look like. Of course, we want native-like performance. That's the Holy Grail benchmark: let's be as fast as we can be. We don't want to be dependent on the main event loop, if possible, because everyone uses the event loop, and that, as Simon pointed out, can lead to some jankiness. And implementation versatility. What I mean by that is: every problem probably has an ideal answer, and so you want to allow the developer to come up with the answer that makes the most sense for them, instead of being constrained by a really high level set of utilities that forces them to talk in a particular vocabulary.

We want to support -- this one may seem a little arbitrary, but we'll explain more later -- the algorithms and applications that are based on threads and pthreads. Threads have been around for a long time, and there's a lot of good computer science research on them and on how to use them really effectively, and all those algorithms have a lot of value. Not every single application, not every single problem can benefit from those algorithms. But where it is possible to directly apply them with minimal changes, that does seem like a win for the user. So it's something that we would like to support. And finally, we want to support the extensible web philosophy.

Anyone out there familiar with the extensible web philosophy? Or have read the extensible web manifesto? A couple of people? It's worth explaining a little bit here. I'm sure every single person will have their own take on it, and I certainly invite you to go to the extensible web manifesto website to read about it in more detail. But at a high level, the thought is that the best innovation is going to come from developers -- come from you. Sometimes standards bodies and browser developers can spend too much time iterating on an idea, trying to come up with something perfect, and by the time it comes out to where developers can use it, we didn't really hit the mark. So it seems like a better model for us to provide low level primitives that you can then build on and iterate on quickly in JS, instead of providing rich high level implementations where you don't have a lot of flexibility to tailor them to the way you want to work.

So taking all this together, we came up with shared memory for JavaScript. And it looks something like this. This font might be a little bit small for some of the people in the back to read; I know you guys in the front can only read the first line. I would like to thank Lars Hansen for actually putting this document together. Obviously it's incredibly detailed, and there's a link at the end, so you can take a look at it yourself. But I think this might be one of those times where an example might be more beneficial than going through all the pages. So what I would like to do is build an example: the incrementing worker. And what I want to do here is implement that same example in a couple of different ways -- first with postMessage, and then using the shared memory techniques that we just came up with -- and you can get an idea of what that workflow would look like and what the trade-offs might be.

So first we're going to do this with postMessage, and what I did here was build a sequence diagram so you can really understand exactly what I'm trying to do. Create a worker, wait for that worker to tell us it's ready, and then start working. And the goal of this incrementing worker -- really all it's going to do is talk over some kind of channel mechanism back to the master thread, passing an integer each time and incrementing it when it gets an integer from the master. They're going to count up to some limit -- in this case I used 100,000 -- and when that message passing is done, the master will display some results. So the simplest scenario here is: the master says, hey, start at zero; go for 100,000 iterations; at the end, the master will have a variable that says 100,000. Not rocket science, but hopefully enough to illustrate some points. Here's an implementation of the master side. The code is small, and I think it's worth calling out some lines here.

So we agree on 100,000 iterations, and the master posts to the worker to start at zero. The worker and the master will talk inside this callback, and the worker's data is going to come back as event data and get unpacked. When the count has met the terminal condition, it will end; otherwise the master will pass the integer right back to the worker. On the worker's side, it's even simpler. All the worker does is unpack the master's integer, increment it, and send it right back on its way. So how fast does that work? Well, for postMessage, in this implementation on this laptop, I get about 54,000 messages per second.
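The slide code itself isn't in the transcript, so here is a hedged reconstruction of the master and worker described above; the file name and the exact message format are assumptions.

```js
// master side -- a sketch reconstructed from the description above,
// not the actual slide code. 'worker.js' is a hypothetical file name.
const LIMIT = 100000;                       // the agreed 100,000 iterations
const worker = new Worker('worker.js');

const start = performance.now();
worker.onmessage = (event) => {
  const n = event.data;                     // unpack the worker's integer
  if (n >= LIMIT) {                         // terminal condition reached
    const secs = (performance.now() - start) / 1000;
    console.log(`${n} round trips, ~${Math.round((2 * n) / secs)} messages/sec`);
  } else {
    worker.postMessage(n);                  // pass it right back to the worker
  }
};
worker.postMessage(0);                      // start at zero

// worker side: unpack, increment, send it back on its way.
onmessage = (event) => postMessage(event.data + 1);
```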

I think the example is pretty simple. There's probably a variant of this that you use all the time for message passing between the worker and the master. Let's do that again using shared memory, specifically using a shared Int32 array. So there are going to be some new constructs here, coming out of the shared memory document, that I would like to call out. This is actually too small for even me to read. Let me make that bigger, and make sure I'm on the same page with all of you. So the first line I'm highlighting here is the SharedArrayBuffer. This is one of the core pieces of the concurrency technique that we've come up with, which allows us to share memory.

Basically you specify a size. And here is a synchronic object, which is actually layered on top of what exists in the spec -- what's in the spec is a little too low level to cleanly show in a presentation. The purpose of this is basically: it's a waitable object. It's going to allow the master and the worker to coordinate. Using postMessage -- the only use of postMessage in this example -- we're passing the shared buffer to the worker, but because it's a shared buffer, the master will be able to use it as well. Then this is a SharedInt32Array. That's a view on the shared buffer. We're basically saying: at this address inside the shared memory region, it's an integer; treat it like an integer; give me integer semantics on it. And here, inside the loop, using the synchronic, after incrementing the buffer, the master is saying: all right, I'm done. So the worker will wake up. Now, in this example, unlike the one before, the master is also incrementing the count. The only real reason for doing that is that otherwise, in this loop, the master would do nothing other than saying "I'm done," without doing any real work. So I wanted it to do something to justify the existence of that loop, other than being the signal for the worker to continue. Otherwise the worker would count to 100,000 instantly and be done, without the master and the worker going back and forth. So here on the worker side we have once again that waitable object, the synchronic, and the shared Int32 view on the shared array buffer.
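The slide code isn't captured in the transcript, and the synchronic layer never shipped as such, so here is a hedged sketch of the same ping-pong using the Atomics API that was eventually standardized. The file names are hypothetical, and both halves run in workers, because today's browsers disallow blocking Atomics.wait on the main thread.

```js
// main page: allocate shared memory and hand it to both sides.
// A sketch under modern-Atomics assumptions, not the talk's slide code.
const sab = new SharedArrayBuffer(8);   // [0] = counter, [1] = whose turn
const master = new Worker('master.js');
const worker = new Worker('worker.js');
master.onmessage = (e) => console.log('final count:', e.data);
master.postMessage(sab);
worker.postMessage(sab);

// master.js -- runs off the main thread so it's allowed to block.
onmessage = (e) => {
  const i32 = new Int32Array(e.data);
  while (Atomics.load(i32, 0) < 100000) {
    Atomics.add(i32, 0, 1);             // master also increments the count
    Atomics.store(i32, 1, 1);           // worker's turn...
    Atomics.notify(i32, 1);             // ...wake it up ("I'm done")
    Atomics.wait(i32, 1, 1);            // sleep until the turn flag flips back
  }
  postMessage(Atomics.load(i32, 0));    // report the final count to the page
};

// worker.js -- wait for the master, increment, hand the turn back.
// (The page can terminate() this worker once the master reports done.)
onmessage = (e) => {
  const i32 = new Int32Array(e.data);
  for (;;) {
    Atomics.wait(i32, 1, 0);            // sleep while it's the master's turn
    Atomics.add(i32, 0, 1);             // increment the shared counter
    Atomics.store(i32, 1, 0);           // master's turn again
    Atomics.notify(i32, 1);
  }
};
```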

So in that loop on the worker side, we wait for the master to be done, increment the buffer, and then say we're done. It looks nothing like the postMessage version. It's incredibly specific -- integers only, where postMessage lets you pass in anything. But it does function, and it functions incredibly fast. Where before we had 54,000 messages per second, this time we have over 6 million messages per second. I think we can all agree that I cheated a little bit, right? Unlike the postMessage model -- where I can pass anything in and do anything I want with it, where it's synchronized for me, and where I don't have to worry about locking -- all that nitty-gritty low level stuff was exposed in the shared memory version, so it's not really a drop-in replacement for postMessage. I think I can do better, and I'm going to try to do better in this second version, because performance isn't everything; sometimes the ergonomics of the implementation matter.

So here what I do -- and I'm hiding a lot of the details -- is create a new object that wraps a channel sender and a channel receiver, all built on top of the same primitives that you saw before. And if we look inside our loop now, it's a lot clearer, and it maps much, much more directly, one to one, onto postMessage. We can basically treat the send like the post to the worker, and the receive like the post from the worker. There are no locks exposed here, and no waitable objects that you can screw up by signaling at the wrong time. And the worker side is just as clean, just as simple, and I think very, very easy to read -- arguably even easier to read than the native postMessage implementation, with receive, increment, and send. So how does this perform? Still a lot better than postMessage, though there is a real trade-off for using these high level constructs versus the raw shared Int32 array. But the choice is yours, right? The choice between high level constructs and low level performance is something that this implementation allows you to make.
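The channel wrapper itself isn't shown in the transcript; here is a hedged sketch of what a single-producer, single-consumer integer channel built on the same primitives could look like, using the Atomics names that later shipped. Everything here, including the class name, is an assumption for illustration.

```js
// Hypothetical channel built on the shared-memory primitives above.
// One Int32 slot for the value, one for a full/empty flag; safe for
// exactly one sender and one receiver, and meant to be used from
// workers, since wait() blocks the calling thread.
class IntChannel {
  constructor(sab, byteOffset = 0) {
    this.i32 = new Int32Array(sab, byteOffset, 2);  // [0] value, [1] full flag
  }
  send(value) {
    Atomics.wait(this.i32, 1, 1);       // block while the slot is still full
    Atomics.store(this.i32, 0, value);
    Atomics.store(this.i32, 1, 1);      // mark full
    Atomics.notify(this.i32, 1);        // wake the receiver
  }
  receive() {
    Atomics.wait(this.i32, 1, 0);       // block while the slot is empty
    const value = Atomics.load(this.i32, 0);
    Atomics.store(this.i32, 1, 0);      // mark empty
    Atomics.notify(this.i32, 1);        // wake the sender
    return value;
  }
}

// With one channel in each direction, the worker's loop reads just
// like the postMessage version: receive, increment, send.
//   for (;;) toMaster.send(fromMaster.receive() + 1);
```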

So going back to our design criteria, let's see how we did. Did we get native-like performance? Well, I didn't directly compare it against native C++, but we're 5X faster than postMessage, so I think we did pretty well there. We'll come back to that later. How about not being dependent on the main thread event loop? Aside from the initial postMessage that hands over the shared array buffer, we're not dependent on the event loop at all. Versatility? I give myself a check: we have two implementations, one really fast and really specific, and one general purpose that's still reasonably fast. Do we support the extensible web philosophy? Given that you can have these multiple implementations, and our iterations were done in JavaScript instead of by rebuilding the browser each time, I think I can give myself a check for that as well. And the final one: do we support applications and algorithms based on threads and pthreads? We didn't really talk about that at all here, so I'm going to table it again for a little bit later in the discussion.

So we agree on at least three of five? Maybe three and a half of five? Anyone? Okay. Hard graders out there in the audience. Okay. So... I have a couple of demos, and they're live demos, so we'll see how that goes. This is against a version of Nightly with one private patch applied. One of these demos will work as is in a Nightly version of Firefox; the other one is not quite ready yet. The first one I would like to show you is a rendering of the Mandelbrot set, visualized in JavaScript. Let's go out to my Nightly. I don't know why that's here, but we'll ignore it. So here we have Mandelbrot in JS, and you can do the usual thing -- zoom around. By the way, this is the first demo that I ever showed my wife, who's in the audience -- something that I actually made possible on the web. She was so excited to see this, thinking now she's going to understand what I do for work. After I was done showing her this, she had absolutely no idea what I do for work. So hopefully this will go over a little bit better for everybody else. Right now the default on this is to use four threads. I'm going to reduce it to one thread and just show our CPU utilization. That number is not exactly what it looks like: what it's showing is out of 800%, for 8 cores, so 100% would be full utilization of one core. To show that a little more clearly, let's go to this, which I have running in the background. I don't know how familiar you are with the htop application; it's similar to the PerfMon I was using on Windows. You can see in the top left corner there are eight effective CPU cores on this machine, and right now it looks like four of them are kind of doing something, and overall CPU utilization is around 100% out of a possible 800%. So back to our demo.

Let's change this to use four threads and increase the number of times this thing updates from 50 to 500. Now, my frame rate is dropping because of the update interval, but throughput is increasing because I added more threads. What I really want to show here is that it's possible for us to really make this computer come to a halt. It's not very often that you can run a JavaScript application that pegs the CPU on a modern computer. And if you were up here -- I don't know if the mic can pick this up -- the fans on this laptop are now at 100%. So yay! I came up with the most inefficient heater in the whole world. Hopefully this thing will shut down politely. I really don't know what that is; I'm just going to ignore it, and ask you to do the same. Back to our presentation. On a different machine, with 16 effective cores, I captured the performance of that demo so I could illustrate it here. In that capture I also enabled something called SIMD, which is the blue bar; just ignore that for now, and if you're really interested, you can find me afterwards and we can talk more about SIMD. The grey bars are the performance normalized against one-core performance on that Mandelbrot example. You can see we have nearly linear CPU scaling across all 16 cores -- adding a core added about 70% more processing power each time. And that's processing we could use for really awesome game demos like we saw in the presentations prior to this one.
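As an illustration of why Mandelbrot scales this way, here is a hedged sketch, not the demo's actual code, of one common work-splitting scheme: each worker owns an interleaved set of rows and writes pixels directly into one shared framebuffer, so the workers never need to communicate until the frame is done. The file name and color mapping are made up.

```js
// Page: carve a shared RGBA framebuffer into interleaved rows per worker.
// (The page would copy fb into a canvas ImageData each frame.)
const W = 800, H = 600, THREADS = 4;
const fb = new SharedArrayBuffer(W * H * 4);
for (let id = 0; id < THREADS; id++) {
  const w = new Worker('mandel-worker.js');   // hypothetical file name
  w.postMessage({ fb, id, step: THREADS, W, H });
}

// mandel-worker.js: this worker renders rows id, id+step, id+2*step, ...
onmessage = ({ data: { fb, id, step, W, H } }) => {
  const pix = new Uint32Array(fb);            // shared view -- no copying
  for (let y = id; y < H; y += step)
    for (let x = 0; x < W; x++)
      pix[y * W + x] = color(escapeTime(x, y, W, H));
};

function escapeTime(px, py, W, H) {
  // map the pixel into the complex plane and run the escape-time loop
  const cr = (px / W) * 3.5 - 2.5, ci = (py / H) * 2.0 - 1.0;
  let zr = 0, zi = 0, n = 0;
  while (n < 255 && zr * zr + zi * zi < 4) {
    const t = zr * zr - zi * zi + cr;
    zi = 2 * zr * zi + ci;
    zr = t;
    n++;
  }
  return n;
}

function color(n) {
  return 0xff000000 | (n << 8);               // opaque alpha, green ramp
}
```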

So the next demo I would like to show is something I'm really, really excited by -- and in fact, something we only just got completed. So hopefully this works. Like I said, I'm running a private build of Firefox, so I'll take a sip of water for good luck. This is the Unity WebGL benchmark. People out there familiar with Unity Technologies as a company? Pretty good percentage of the audience. For those who are not: Unity Technologies makes a 3D game engine called Unity. It's one of the most popular licensed 3D game engines in the world, if not the most popular one, and it's a company that I think does really, really exciting things. So for us to be able to work with them on enabling their benchmark, using our shared memory, was really exciting. Like I said, it's something we only got done Friday of last week. So we'll see how this goes.

Back to my Nightly, and let's start this up. This defaults to four cores. So... As you can see right away, we have a text rendering bug on the first screen, which we'll ignore. I'm also going to uncheck the first two demos, because they're graphically maybe not as interesting as some of the other ones. Let this go. So this benchmark is much, much more than a WebGL benchmark. What it's actually testing is the entire Unity 3D engine, and it's all running in the browser -- basically automatically ported from C++, with very minimal changes on the Unity side. All the changes were in our shared memory and in Firefox itself. Here we have a bunch of dancing bears. And what's more exciting than skinned dancing bears?

So like I said, it's much more than WebGL. They have AI running. Physics. Particles. Skinning. And also their job system, which allows for execution of threaded content. I think it's much prettier than any demo I could write. I'll let it go one more... Oh, yes, we have to wait for the snowman in the middle of the flurries. Like I said, the reason why I'm really excited by this is that the Unity 3D engine is a massive code base. Enabling that -- ported over basically automatically, directly from C++, on top of this -- is an incredible validation of the technology we put together. That it's working at all is amazing. But, as I'll show here... Let me stop this. Oops. Too many.

The performance is really quite good too. I'm going to focus on just the top half of the slide. Those are all the tests I was running, on the left side. The blue bar is native performance. We seem to have an anomaly on the first test -- the blue bar is shorter than anything running in JavaScript -- so I'm sure it's something I did, and we'll look at that later. But for all the other tests: green is normalized against one-core performance, yellow is four cores, and red -- reddish-orange -- is eight cores. So you can see that as we added four and then eight cores, we came much closer to native performance on a bunch of these tests. We're still behind on some of the other ones. Obviously there's always going to be work to do, and not every test is going to scale across cores in JS as well as others. So, I've covered this a little bit, but here is our design criteria again and how we did. We talked about native-like performance. Now we can say: yes, check. We do have native-like performance, especially on certain workloads -- we're almost at native on the Unity benchmark. Support algorithms and applications based on pthreads? It's a direct port of the pthread implementation of Unity, and we were able to make it work with almost no changes to the code base. I'd like to give myself a pat on the back. Think I deserve it? Thank you. So I've talked about some of the reasons why the Unity test is significant. Not everyone is a game dev; not everyone cares that much about games. I personally love them.

I think they're not just fun to play; they're fun to develop, and they really challenge software developers and hardware developers. So I think it's a fantastic test of any infrastructure. And we were able to make all this work -- the benchmark is not just functional, we were able to make it work fast. So, Jukka Jylänki did most of the implementation work on making the Unity benchmark functional. He worked on the Emscripten side and on the JS side. He's a Mozillian. I asked the team why this particular thing is significant to them, and he offered up this quote, which I would like to share with everyone: "With shared memory, the web lifts an important limitation around shared execution that it had compared to native. Shared memory is not comparable to a library call that can be emulated or polyfilled" -- I'm sorry for all the people who are asking this question: can we polyfill this? We cannot -- "but it is a fundamental concept of parallel execution architectures." I'm not sure if it can get any bigger than this. He's a very emphatic guy and he's very excited, but I think he really did capture the significance of what we accomplished here. So what's next for shared memory and for the work that we've done here?

Well, shared memory is in Nightly right now, in asm.js. For those that are... can I get another show of hands -- are people familiar with asm.js? Most of the room. asm.js is a low level subset of JS that we put together that you can really optimize for, and it's fantastic for certain workloads, especially cross-compilation ones. Shared memory is not going to move out of Nightly until we standardize it, but we are in talks to standardize it. The API -- the set of documents you saw that Lars put together -- is still subject to change; we're getting feedback from a lot of sources, inside Mozilla and out. But the good news is that Google has actually announced fairly recently that they're going to start implementing this too. So hopefully a year from now we're going to see applications built against this everywhere. I have a bunch of links that nobody can read, but I think there will be some opportunities to share this presentation later, so hopefully you can catch up on any details you may have missed. And you're always welcome to find me. Thank you very much for your time.
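Since this is Nightly-only for now, a quick way to adapt at runtime is to feature-detect the constructor. This snippet is a hedged sketch with hypothetical entry points, not anything from the talk.

```js
// Fall back to plain postMessage workers where shared memory is absent.
if (typeof SharedArrayBuffer === 'function') {
  startSharedMemoryPath();   // hypothetical app entry point
} else {
  startPostMessagePath();    // hypothetical fallback entry point
}
```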
