
@voodootikigod
Created June 22, 2015 15:26
JSConf US 2015 Track A Transcript for Naveed Ihsanullah: Parallelism experiments in JavaScript

That was a very kind introduction. Thank you very much. So how is everyone doing at JSConf? I heard I have no competition at the beach. Everyone is here. Full house. Thank you for coming by. So yeah. Like I said, I want to talk to you about concurrency and parallelism in JavaScript. Maybe a brief introduction is warranted. I do lead the JS team, and for the last several years we've been thinking about these problems in JavaScript and how we can address them. I would like to share some of the ways we've come around to improving concurrency and parallelism in JS. But before we go down that road, let's actually talk about some definitions. Concurrency and parallelism often come up together, but they're not the same thing. The way I think about it is: concurrency is about how you make an application more usable. Here we have two queues of people and a single revolving door. You can come up with some logic for them: have one whole queue go through first, or have them take turns -- some sort of multitasking that utilizes the one resource and makes sure everyone gets through. Examples inside a running program or application might be pre-emptive multitasking, as in Windows 95, or the async programming that we're all very familiar with. Parallelism is all about how you can be faster by leveraging or exploiting hardware availability. So in this case, we now have two revolving doors, and we can theoretically go twice as fast, because there's no contention and no complexity in my example. But the real world is not like that. The real world has complexity, and the reason why concurrency and parallelism are often mixed together is that in the real world, it's really hard to separate the two aspects.

And you have to be very sensitive in the way you build your application to avoid deadlocks and races. So now that we've talked about those definitions, I want to explain why improving this actually matters in JavaScript. And to do that, let's talk about the evolution of the web application. Back in the day, not quite the '80s, but way, way back, we all had a homepage that looked something like this. I'm sorry. Mine is under construction at the moment, but maybe it will evolve a little. Static HTML and GIFs can only take us so far. In 1995, though, something really big happened. I think most of us know what this is. Any guesses? I think I heard it. Yes. JavaScript was invented, and it was good. The universe of... I think I came to the right venue to make that comment. The universe of the modern interactive web was really born. And homepages were never the same again. In the next 20 years, home pages and websites evolved past window.alert, and now you have desktop applications that have actually gone non-native. Here we have an example of Excel, and most of the Microsoft Office suite is now available as a web application.

JavaScript and the web platform are actually capable of supporting even virtual reality technology. Here we have a couple of frames from the Polar Sea project that MozVR is working on. So today's JavaScript is really fast. We track most of the modern browsers, their performance, and how they're doing over time. And here you can see that we're all getting a lot faster. And that's continued to be true for other factors as well, because we're inventing other technologies, like asm.js, that allow us to bypass some of the inherent bottlenecks in regular JavaScript. In this slide, the bottom line is native: C++ compiled code. And JS running on Firefox is only about 2X of that. We have techniques for certain workloads that can get much closer to native -- 1.5X. So if JavaScript is so fast, why do we have a problem?

Well, to illustrate that, I have a snapshot here of my Windows desktop. Right now you can see that the desktop doesn't seem to be doing a whole lot. We're running at 5% CPU utilization, and I have a ton of applications open, including Firefox with, I believe, at the moment, 30 tabs. So I open one more tab with a really intensive JavaScript application, and you'll see that PerfMon seems to be indicating something. For those of you who are not familiar with Windows and PerfMon, each of these squares represents a CPU core and how much work it's doing. In this illustration, cores three and four -- they're virtual cores, but we'll not worry about that detail -- are doing a lot of work, but the rest of the CPU is mostly just sitting there, and overall utilization is at 24% out of a possible 100. Right?

So the computer is still 75% idle. Why does this matter? Well, the gigahertz race is effectively at an end. It's hard to make CPUs faster than they are now; power and heat have stopped our ability to innovate there. So for future hardware, the way we're scaling is by adding additional cores or execution units. Your modern smartphone typically has two cores, and some of them are even quad-core. Laptops and desktops have four to eight virtual cores, and workstations can have 4 to 20 cores per CPU, and they may have multiple CPUs on their motherboards. So far I've only talked about the parallelism aspect -- how to do work at the same time. But concurrency has its issues too. We're always waiting on something in a JavaScript application: waiting on events, on the user, on the cache, and of course always waiting on the internet to do something.

So improving parallelism and improving concurrency can really make a big difference to a host of applications. Some of the application types are of course games, AI, rendering, and code generation, and each of these specific application types has its own domain-specific way of dealing with these constructs -- something that, if we do solve this problem, we want to preserve if possible. While looking at this problem and investigating different ways to deal with it: web workers are already around. Web workers do promote parallelism. You can do something on a thread. They tend to be very heavy threads, but it works. The real issue for saturating a core doesn't seem to be spinning up another thread; it's communication between the threads. And coordination is really about concurrency. But concurrency does exist on the web today. It exists in JavaScript. We've had, at this point, Parallel JavaScript; postMessage, which will always be around and is very useful; and then the buffer transfer semantics. So let's dive into each one of these and see where they may have some limitations.

Parallel JS, for those who are not familiar with it, was an experimental effort to see if we could bring some of the semantics of MapReduce to JavaScript and enable a whole slew of application types. It looked like a casual API, but it really took some sophistication to use it, so the average programmer couldn't make that much out of it, and it had some really high implementation costs in the engine, so it was hard to support going forward. postMessage, obviously, is around and incredibly useful. But it also has its own set of issues. It's dependent on the event loop. The performance, which we'll go into a little bit more later, just isn't there, and you cannot share state. You're transferring state over, really one message at a time.

And if you care about the response, you have to transfer a new state back. It's possible with postMessage to transfer a buffer entirely by reference, and that solves some subset of problems, but you still can't work on the same data at the same time. This really is a bottleneck, a coordination bottleneck, and it's something that we want to address. So while looking at different possible ways of handling this, it sort of became apparent that native has obviously handled this already. The operating system libraries have a slew of constructs that allow us to solve concurrency problems, solve them well, and solve them in ways that each individual application can tailor for its own needs.
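To make the transfer semantics concrete, here is a minimal sketch, not from the talk, of handing an ArrayBuffer to a worker by reference; the worker file name is hypothetical. The key point is that ownership moves with the buffer, so the two sides still can't touch the same data at the same time.

```js
// Hedged sketch of postMessage transfer semantics (not the talk's code).
const worker = new Worker('worker.js');    // hypothetical worker script
const buf = new ArrayBuffer(1024 * 1024);  // 1 MiB buffer to hand over

// Listing buf as a transferable moves it instead of copying it.
worker.postMessage(buf, [buf]);

console.log(buf.byteLength);               // 0 -- the buffer is detached here,
                                           // so only the worker can use it now
```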

So that helped us come up with some design considerations for what our ideal solution might look like. Of course, we want native-like performance. That's the Holy Grail benchmark: let's be as fast as we can be. We don't want to be dependent on the main event loop, if possible, because everyone uses the event loop, and that, as Simon pointed out, can lead to some jankiness. And implementation versatility. What I mean by that is: every problem probably has an ideal answer, and so you want to allow the developer to come up with the answer that makes the most sense for them, instead of being constrained by a really high level set of utilities that forces them to talk in a particular vocabulary.

We want to support -- this one may seem a little arbitrary, but we'll explain more later -- the algorithms and applications that are based on threads and pthreads. Threads have been around for a long time, and there's a lot of good computer science research on them and on how to use them really effectively, and all those algorithms have a lot of value. Not every single application, not every single problem can benefit from those algorithms. But where it is possible to directly apply them with minimal changes, that does seem like a win for the user. So it's something that we would like to support. And finally, we want to support the extensible web philosophy.

Anyone out there familiar with the extensible web philosophy? Or have read the extensible web manifesto? A couple of people? It's worth explaining a little bit here. I'm sure every single person will have their own take on it, and I certainly invite you to go to the extensible web manifesto website to read about it in more detail. But at a high level, the thought is that the best innovation is going to come from developers -- come from you. Sometimes standards bodies and browser developers can spend too much time iterating on an idea, trying to come up with something perfect, and by the time it comes out to where developers can use it, we didn't really hit the mark. So it seems like a better model for us to provide low level primitives that you can then build on and iterate on quickly in JS, instead of providing rich high level implementations where you don't have a lot of flexibility to tailor them to the way you want to work.

So taking all this together, we came up with shared memory for JavaScript. And it looks something like this. This font might be a little bit small for some of the people in the back to read; I know you guys in the front can only read the first line. I would like to thank Lars Hansen for actually putting this document together. Obviously it's incredibly detailed, and there's a link at the end, so you can take a look at it yourself. But I think this might be one of those times where an example might be more beneficial than going through all the pages. So what I would like to do is build an example: the incrementing worker. And what I want to do here is implement that same example in a couple of different ways -- first with postMessage, and then using the shared memory techniques that we just came up with -- and you can get an idea of what that workflow would look like and what the trade-offs might be.

So first we're going to do this with postMessage, and what I did here was build a sequence diagram so you can really understand exactly what I'm trying to do. Create a worker, wait for that worker to tell us it's ready, and then start working. And the goal of this incrementing worker -- really all it's going to do is talk over some kind of channel mechanism back to the master thread, passing an integer each time and incrementing it when it gets an integer from the master. They're going to count up to some limit -- in this case I used 100,000 -- and when that message passing is done, the master will display some results. So the simplest scenario here is: the master says, hey, start at zero; go for 100,000 iterations; at the end, the master will have a variable that says 100,000. Not rocket science, but hopefully enough to illustrate some points. Here's an implementation of the master side. The code is small, and I think it's worth calling out some lines here.

So we agree on 100,000 iterations, and the master posts to the worker to start at zero. The worker and the master will talk inside this callback, and the worker's data is going to come back as event data and get unpacked. When the count has met the terminal condition, it will end; otherwise the master will pass the integer right back to the worker. On the worker's side, it's even simpler. All the worker does is unpack the master's integer, increment it, and send it right back on its way. So how fast does that work? Well, for postMessage, in this implementation on this laptop, I get about 54,000 messages per second.
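The slide code itself isn't in the transcript, so here is a hedged reconstruction of the master and worker described above; the file name and the exact message format are assumptions.

```js
// master side -- a sketch reconstructed from the description above,
// not the actual slide code. 'worker.js' is a hypothetical file name.
const LIMIT = 100000;                       // the agreed 100,000 iterations
const worker = new Worker('worker.js');

const start = performance.now();
worker.onmessage = (event) => {
  const n = event.data;                     // unpack the worker's integer
  if (n >= LIMIT) {                         // terminal condition reached
    const secs = (performance.now() - start) / 1000;
    console.log(`${n} round trips, ~${Math.round((2 * n) / secs)} messages/sec`);
  } else {
    worker.postMessage(n);                  // pass it right back to the worker
  }
};
worker.postMessage(0);                      // start at zero

// worker side: unpack, increment, send it back on its way.
onmessage = (event) => postMessage(event.data + 1);
```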

I think the example is pretty simple. There's probably a variant of this that you use all the time for message passing between the worker and the master. Let's do that again using shared memory, specifically using a shared Int32 array. So there are going to be some new constructs here, coming out of the shared memory document, that I would like to call out. This is actually too small for even me to read. Let me make that bigger, and make sure I'm on the same page with all of you. So the first line I'm highlighting here is the SharedArrayBuffer. This is one of the core pieces of the concurrency technique that we've come up with, which allows us to share memory.

Basically you specify a size. And here is a synchronic object, which is actually layered on top of what exists in the spec -- what's in the spec is a little too low level to cleanly show in a presentation. The purpose of this is basically: it's a waitable object. It's going to allow the master and the worker to coordinate. Using postMessage -- the only use of postMessage in this example -- we're passing the shared buffer to the worker, but because it's a shared buffer, the master will be able to use it as well. Then this is a SharedInt32Array. That's a view on the shared buffer. We're basically saying: at this address inside the shared memory region, it's an integer; treat it like an integer; give me integer semantics on it. And here, inside the loop, using the synchronic, after incrementing the buffer, the master is saying: all right, I'm done. So the worker will wake up. Now, in this example, unlike the one before, the master is also incrementing the count. The only real reason for doing that is that otherwise, in this loop, the master would do nothing other than saying "I'm done," without doing any real work. So I wanted it to do something to justify the existence of that loop, other than being the signal for the worker to continue. Otherwise the worker would count to 100,000 instantly and be done, without the master and the worker going back and forth. So here on the worker side we have once again that waitable object, the synchronic, and the shared Int32 view on the shared array buffer.
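The slide code isn't captured in the transcript, and the synchronic layer never shipped as such, so here is a hedged sketch of the same ping-pong using the Atomics API that was eventually standardized. The file names are hypothetical, and both halves run in workers, because today's browsers disallow blocking Atomics.wait on the main thread.

```js
// main page: allocate shared memory and hand it to both sides.
// A sketch under modern-Atomics assumptions, not the talk's slide code.
const sab = new SharedArrayBuffer(8);   // [0] = counter, [1] = whose turn
const master = new Worker('master.js');
const worker = new Worker('worker.js');
master.onmessage = (e) => console.log('final count:', e.data);
master.postMessage(sab);
worker.postMessage(sab);

// master.js -- runs off the main thread so it's allowed to block.
onmessage = (e) => {
  const i32 = new Int32Array(e.data);
  while (Atomics.load(i32, 0) < 100000) {
    Atomics.add(i32, 0, 1);             // master also increments the count
    Atomics.store(i32, 1, 1);           // worker's turn...
    Atomics.notify(i32, 1);             // ...wake it up ("I'm done")
    Atomics.wait(i32, 1, 1);            // sleep until the turn flag flips back
  }
  postMessage(Atomics.load(i32, 0));    // report the final count to the page
};

// worker.js -- wait for the master, increment, hand the turn back.
// (The page can terminate() this worker once the master reports done.)
onmessage = (e) => {
  const i32 = new Int32Array(e.data);
  for (;;) {
    Atomics.wait(i32, 1, 0);            // sleep while it's the master's turn
    Atomics.add(i32, 0, 1);             // increment the shared counter
    Atomics.store(i32, 1, 0);           // master's turn again
    Atomics.notify(i32, 1);
  }
};
```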

So in that loop on the worker side, we wait for the master to be done, increment the buffer, and then say we're done. It looks nothing like the postMessage version. It's incredibly specific -- integers only, where postMessage lets you pass in anything. But it does function, and it functions incredibly fast. Where before we had 54,000 messages per second, this time we have over 6 million messages per second. I think we can all agree that I cheated a little bit, right? Unlike the postMessage model -- where I can pass anything in and do anything I want with it, where it's synchronized for me, and where I don't have to worry about locking -- all that nitty-gritty low level stuff was exposed in the shared memory version, so it's not really a drop-in replacement for postMessage. I think I can do better, and I'm going to try to do better in this second version, because performance isn't everything; sometimes the ergonomics of the implementation matter.

So here what I do -- and I'm hiding a lot of the details -- is create a new object that wraps a channel sender and a channel receiver, all built on top of the same primitives that you saw before. And if we look inside our loop now, it's a lot clearer, and it maps much, much more directly, one to one, onto postMessage. We can basically treat the send like the post to the worker, and the receive like the post from the worker. There are no locks exposed here, and no waitable objects that you can screw up by signaling at the wrong time. And the worker side is just as clean, just as simple, and I think very, very easy to read -- arguably even easier to read than the native postMessage implementation, with receive, increment, and send. So how does this perform? Still a lot better than postMessage, though there is a real trade-off for using these high level constructs versus the raw shared Int32 array. But the choice is yours, right? The choice between high level constructs and low level performance is something that this implementation allows you to make.
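The channel wrapper itself isn't shown in the transcript; here is a hedged sketch of what a single-producer, single-consumer integer channel built on the same primitives could look like, using the Atomics names that later shipped. Everything here, including the class name, is an assumption for illustration.

```js
// Hypothetical channel built on the shared-memory primitives above.
// One Int32 slot for the value, one for a full/empty flag; safe for
// exactly one sender and one receiver, and meant to be used from
// workers, since wait() blocks the calling thread.
class IntChannel {
  constructor(sab, byteOffset = 0) {
    this.i32 = new Int32Array(sab, byteOffset, 2);  // [0] value, [1] full flag
  }
  send(value) {
    Atomics.wait(this.i32, 1, 1);       // block while the slot is still full
    Atomics.store(this.i32, 0, value);
    Atomics.store(this.i32, 1, 1);      // mark full
    Atomics.notify(this.i32, 1);        // wake the receiver
  }
  receive() {
    Atomics.wait(this.i32, 1, 0);       // block while the slot is empty
    const value = Atomics.load(this.i32, 0);
    Atomics.store(this.i32, 1, 0);      // mark empty
    Atomics.notify(this.i32, 1);        // wake the sender
    return value;
  }
}

// With one channel in each direction, the worker's loop reads just
// like the postMessage version: receive, increment, send.
//   for (;;) toMaster.send(fromMaster.receive() + 1);
```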

So going back to our design criteria, let's see how we did. Did we get native-like performance? Well, I didn't directly compare it against native C++, but we're 5X faster than postMessage, so I think we did pretty well there. We'll come back to that later. How about not being dependent on the main thread event loop? Aside from the initial postMessage that hands over the shared array buffer, we're not dependent on the event loop at all. Versatility? I give myself a check: we have two implementations, one really fast and really specific, and one general purpose that's still reasonably fast. Do we support the extensible web philosophy? Given that you can have these multiple implementations, and our iterations were done in JavaScript instead of by rebuilding the browser each time, I think I can give myself a check for that as well. And the final one: do we support applications and algorithms based on threads and pthreads? We didn't really talk about that at all here, so I'm going to table it again for a little bit later in the discussion.

So we agree on at least three of five? Maybe three and a half of five? Anyone? Okay. Hard graders out there in the audience. Okay. So... I have a couple of demos, and they're live demos, so we'll see how that goes. This is against a version of Nightly with one private patch applied. One of these demos will work as is in a Nightly version of Firefox; the other one is not quite ready yet. The first one I would like to show you is a rendering of the Mandelbrot set, visualized in JavaScript. Let's go out to my Nightly. I don't know why that's here, but we'll ignore it. So here we have Mandelbrot in JS, and you can do the usual thing -- zoom around. By the way, this is the first demo that I ever showed my wife, who's in the audience -- something that I actually made possible on the web. She was so excited to see this, thinking now she's going to understand what I do for work. After I was done showing her this, she had absolutely no idea what I do for work. So hopefully this will go over a little bit better for everybody else. Right now the default on this is to use four threads. I'm going to reduce it to one thread and just show our CPU utilization. That number is not exactly what it looks like: what it's showing is out of 800%, for 8 cores, so 100% would be full utilization of one core. To show that a little more clearly, let's go to this, which I have running in the background. I don't know how familiar you are with the htop application; it's similar to the PerfMon I was using on Windows. You can see in the top left corner there are eight effective CPU cores on this machine, and right now it looks like four of them are kind of doing something, and overall CPU utilization is around 100% out of a possible 800%. So back to our demo.

Let's change this to use four threads and increase the number of times this thing updates from 50 to 500. Now, my frame rate is dropping because of the update interval, but throughput is increasing because I added more threads. What I really want to show here is that it's possible for us to really make this computer come to a halt. It's not very often that you can run a JavaScript application that pegs the CPU on a modern computer. And if you were up here -- I don't know if the mic can pick this up -- the fans on this laptop are now at 100%. So yay! I came up with the most inefficient heater in the whole world. Hopefully this thing will shut down politely. I really don't know what that is; I'm just going to ignore it, and ask you to do the same. Back to our presentation. On a different machine, with 16 effective cores, I captured the performance of that demo so I could illustrate it here. In that capture I also enabled something called SIMD, which is the blue bar; just ignore that for now, and if you're really interested, you can find me afterwards and we can talk more about SIMD. The grey bars are the performance normalized against one-core performance on that Mandelbrot example. You can see we have nearly linear CPU scaling across all 16 cores -- adding a core added about 70% more processing power each time. And that's processing we could use for really awesome game demos like we saw in the presentations prior to this one.
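As an illustration of why Mandelbrot scales this way, here is a hedged sketch, not the demo's actual code, of one common work-splitting scheme: each worker owns an interleaved set of rows and writes pixels directly into one shared framebuffer, so the workers never need to communicate until the frame is done. The file name and color mapping are made up.

```js
// Page: carve a shared RGBA framebuffer into interleaved rows per worker.
// (The page would copy fb into a canvas ImageData each frame.)
const W = 800, H = 600, THREADS = 4;
const fb = new SharedArrayBuffer(W * H * 4);
for (let id = 0; id < THREADS; id++) {
  const w = new Worker('mandel-worker.js');   // hypothetical file name
  w.postMessage({ fb, id, step: THREADS, W, H });
}

// mandel-worker.js: this worker renders rows id, id+step, id+2*step, ...
onmessage = ({ data: { fb, id, step, W, H } }) => {
  const pix = new Uint32Array(fb);            // shared view -- no copying
  for (let y = id; y < H; y += step)
    for (let x = 0; x < W; x++)
      pix[y * W + x] = color(escapeTime(x, y, W, H));
};

function escapeTime(px, py, W, H) {
  // map the pixel into the complex plane and run the escape-time loop
  const cr = (px / W) * 3.5 - 2.5, ci = (py / H) * 2.0 - 1.0;
  let zr = 0, zi = 0, n = 0;
  while (n < 255 && zr * zr + zi * zi < 4) {
    const t = zr * zr - zi * zi + cr;
    zi = 2 * zr * zi + ci;
    zr = t;
    n++;
  }
  return n;
}

function color(n) {
  return 0xff000000 | (n << 8);               // opaque alpha, green ramp
}
```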

So the next demo I would like to show is something I'm really, really excited by -- and in fact, something we only just got completed. So hopefully this works. Like I said, I'm running a private build of Firefox, so I'll take a sip of water for good luck. This is the Unity WebGL benchmark. People out there familiar with Unity Technologies as a company? Pretty good percentage of the audience. For those who are not: Unity Technologies makes a 3D game engine called Unity. It's one of the most popular licensed 3D game engines in the world, if not the most popular one, and it's a company that I think does really, really exciting things. So for us to be able to work with them on enabling their benchmark, using our shared memory, was really exciting. Like I said, it's something we only got done Friday of last week. So we'll see how this goes.

Back to my Nightly, and let's start this up. This defaults to four cores. So... As you can see right away, we have a text rendering bug on the first screen, which we'll ignore. I'm also going to uncheck the first two demos, because they're graphically maybe not as interesting as some of the other ones. Let this go. So this benchmark is much, much more than a WebGL benchmark. What it's actually testing is the entire Unity 3D engine, and it's all running in the browser -- basically automatically ported from C++, with very minimal changes on the Unity side. All the changes were in our shared memory and in Firefox itself. Here we have a bunch of dancing bears. And what's more exciting than skinned dancing bears?

So like I said, it's much more than WebGL. They have AI running. Physics. Particles. Skinning. And also their job system, which allows for execution of threaded content. I think it's much prettier than any demo I could write. I'll let it go one more... Oh, yes, we have to wait for the snowman in the middle of the flurries. Like I said, the reason why I'm really excited by this is that the Unity 3D engine is a massive code base. Enabling that -- ported over basically automatically, directly from C++, on top of this -- is an incredible validation of the technology we put together. That it's working at all is amazing. But, as I'll show here... Let me stop this. Oops. Too many.

The performance is really quite good too. I'm going to focus on just the top half of the slide. Those are all the tests I was running, on the left side. The blue bar is native performance. We seem to have an anomaly on the first test -- the blue bar is shorter than anything running in JavaScript -- so I'm sure it's something I did, and we'll look at that later. But for all the other tests: green is normalized against one-core performance, yellow is four cores, and red -- reddish-orange -- is eight cores. So you can see that as we added four and then eight cores, we came much closer to native performance on a bunch of these tests. We're still behind on some of the other ones. Obviously there's always going to be work to do, and not every test is going to scale across cores in JS as well as others. So, I've covered this a little bit, but here is our design criteria again and how we did. We talked about native-like performance. Now we can say: yes, check. We do have native-like performance, especially on certain workloads -- we're almost at native on the Unity benchmark. Support algorithms and applications based on pthreads? It's a direct port of the pthread implementation of Unity, and we were able to make it work with almost no changes to the code base. I'd like to give myself a pat on the back. Think I deserve it? Thank you. So I've talked about some of the reasons why the Unity test is significant. Not everyone is a game dev; not everyone cares that much about games. I personally love them.

I think they're not just fun to play; they're fun to develop, and they really challenge software developers and hardware developers. So I think it's a fantastic test of any infrastructure. And we were able to make all this work -- the benchmark is not just functional, we were able to make it work fast. So, Jukka Jylänki did most of the implementation work on making the Unity benchmark functional. He worked on the Emscripten side and on the JS side. He's a Mozillian. I asked the team why this particular thing is significant to them, and he offered up this quote, which I would like to share with everyone: "With shared memory, the web lifts an important limitation around shared execution that it had compared to native. Shared memory is not comparable to a library call that can be emulated or polyfilled" -- I'm sorry for all the people who are asking this question: can we polyfill this? We cannot -- "but it is a fundamental concept of parallel execution architectures." I'm not sure if it can get any bigger than this. He's a very emphatic guy and he's very excited, but I think he really did capture the significance of what we accomplished here. So what's next for shared memory and for the work that we've done here?

Well, shared memory is in Nightly right now, in asm.js. For those that are... can I get another show of hands -- are people familiar with asm.js? Most of the room. asm.js is a low level subset of JS that we put together that you can really optimize for, and it's fantastic for certain workloads, especially cross-compilation ones. Shared memory is not going to move out of Nightly until we standardize it, but we are in talks to standardize it. The API -- the set of documents you saw that Lars put together -- is still subject to change; we're getting feedback from a lot of sources, inside Mozilla and out. But the good news is that Google has actually announced fairly recently that they're going to start implementing this too. So hopefully a year from now we're going to see applications built against this everywhere. I have a bunch of links that nobody can read, but I think there will be some opportunities to share this presentation later, so hopefully you can catch up on any details you may have missed. And you're always welcome to find me. Thank you very much for your time.
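Since this is Nightly-only for now, a quick way to adapt at runtime is to feature-detect the constructor. This snippet is a hedged sketch with hypothetical entry points, not anything from the talk.

```js
// Fall back to plain postMessage workers where shared memory is absent.
if (typeof SharedArrayBuffer === 'function') {
  startSharedMemoryPath();   // hypothetical app entry point
} else {
  startPostMessagePath();    // hypothetical fallback entry point
}
```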
