
@shawwn

shawwn/email.md (secret gist)

Created October 4, 2020 15:01
subject: Model Parallelism is Awesome

Hiya,

It's been bugging me that I only email you guys when something goes wrong, or when we need something or whatever. So I just wanted to say that I came across https://cloud.google.com/tpu/docs/spatial-partitioning last night, and within about an hour we had started using that feature in production. Or to put it more simply: I twiddled a few options to enable the feature, and my jaw hit the floor when our stylegan2 models started training at the full 1024x1024 resolution. Previously, 512x512 was the maximum resolution our codebase could handle before running out of memory on each TPU core.
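
(For the curious, here's roughly what that twiddling amounted to. This is a minimal sketch following the TPUEstimator API from the spatial-partitioning docs; the TPU name and the particular 2x2 split of the NHWC features tensor are just illustrative values, not our actual config.)

```python
import tensorflow.compat.v1 as tf

# Minimal sketch of enabling spatial partitioning via TPUEstimator, per
# https://cloud.google.com/tpu/docs/spatial-partitioning.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')  # hypothetical TPU name

tpu_config = tf.estimator.tpu.TPUConfig(
    iterations_per_loop=100,
    num_cores_per_replica=4,  # spread each replica across 4 cores
    per_host_input_for_training=tf.estimator.tpu.InputPipelineConfig.PER_HOST_V2,
    # Partition the [batch, height, width, channels] features tensor
    # 2-way along height and 2-way along width; `None` means the labels
    # are fed to every core unpartitioned.
    input_partition_dims=[[1, 2, 2, 1], None])

run_config = tf.estimator.tpu.RunConfig(cluster=resolver, tpu_config=tpu_config)
```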

It was mind-blowing to discover that TPUs have this magical feature. It's amazing, incredible, wonderful, you name it. From an end-user standpoint, I can't think of any other feature I've ever tried to use in any software library that (a) has been so painless, (b) worked perfectly on the first try, and (c) had as much impact on my day-to-day work as this just did.

It's also, to me, a technical marvel – one does not simply train a larger stylegan2 model by training 4 smaller stylegan2 models on different parts of the image. That's not how stylegan works. It's not how any of this works! Yet apparently, someone figured out how to carefully organize every computation such that the XLA compiler correctly "farms out" the work to each core, then gathers the results such that it's completely transparent that anything is different at all. Now, speaking as a programmer who once devoted a sizable portion of my life to "multithreading / GPU / performance stuff," I am uniquely qualified to understand just how hard of a problem that is. Somehow XLA pulls it off!
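
To make the difficulty concrete: think about what a single 3x3 convolution means once the image is split across cores. The output pixels along each cut depend on input pixels owned by a neighboring core, so the compiler has to arrange a "halo exchange" at every layer. Here's a toy numpy sketch of that idea (my illustration of the concept, not XLA's actual code):

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D convolution (cross-correlation) with a 3x3 kernel."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * k)
    return out

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
reference = conv2d_valid(image, kernel)  # unpartitioned result

# Split the image into a top and bottom shard along height.
top, bottom = image[:4], image[4:]

# A naive per-shard convolution would lose the output rows that straddle
# the cut. Exchanging a 1-row "halo" with the neighbor recovers them.
top_with_halo = np.concatenate([top, bottom[:1]], axis=0)
bottom_with_halo = np.concatenate([top[-1:], bottom], axis=0)

sharded = np.concatenate([conv2d_valid(top_with_halo, kernel),
                          conv2d_valid(bottom_with_halo, kernel)], axis=0)

assert np.allclose(reference, sharded)  # same computation, just partitioned
```

XLA has to do the equivalent of that bookkeeping for every op in the graph, automatically, which is what makes it so impressive.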

I spent some time digging into how this magic works, reading through the tensorflow source code. But ultimately the magic seems to happen in the XLA compiler (I think?) and most of that processing happens inside the TPU software stack, on the TPU itself. And short of somehow tricking the TPU into dumping out its own source code, I haven't found a way of peeking inside the TPU yet, so the magic will have to remain a mystery for now.
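
The closest I got to the mechanism: from my reading of the TF source, the estimator machinery just tags tensors with XlaSharding annotations and leaves everything else to the compiler, which propagates the shardings through the graph and inserts the communication between cores. Something like this, using the experimental xla_sharding helpers in the tensorflow tree (a hedged sketch from my reading of the source, not a documented public API):

```python
from tensorflow.compiler.xla.experimental.xla_sharding import xla_sharding

# Sketch based on my reading of the TF source; treat as illustrative.
# Attach a sharding annotation that splits a tensor 4 ways along its
# height dimension (the height must be statically known and divisible).
def annotate_spatial_split(images):
    # images: [batch, height, width, channels]
    return xla_sharding.split(images, split_dimension=1, num_devices=4)
```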

So I wanted to give a shoutout / kudos to all of the people who made this feature possible, and to whoever wrote the documentation calling it out as a possibility. (In fact, the docs seem way, way better than the last time I went through them carefully, several months ago. But I somehow doubt the docs changed all that much; more likely it's my skill level that did.)

Sometimes it feels like I happen to be the first programmer on earth outside of Google to notice how absolutely delightful TPUs are. After all, automatically parallelizing code was one of the grand challenges in software development, so much so that it even appears as #6 on pg's "Frighteningly Ambitious Startup Ideas" list: http://www.paulgraham.com/ambitious.html

> It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low.

To my knowledge, work in this area has been mostly theoretical. So picture me casually browsing the TPU docs, reading each page one by one, doing a double take at https://cloud.google.com/tpu/docs/spatial-partitioning, then once more, then a third time, saying "No way it's that easy," trying it, and then being completely floored that this apparently does work and really is completely transparent. It was like a scene out of a novel where an apprentice sorcerer, poring over ancient tomes of long-forgotten magic spells, unearths some knowledge that instantly makes them twice as powerful.

Discovering this was truly one of the most delightful experiences I have ever had as a programmer, so I wanted to share a bit of that delight while it's still fresh. Thank you, Cloud TPU team!

Best,
Shawn

@shoyer commented Feb 9, 2021

You can find the source code for XLA's spatial partitioning here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla/service/spmd

@sanjoy commented Feb 9, 2021

Note that XLA's SPMD support is not inherently TPU-specific -- we're actively looking into bringing SPMD support to XLA on GPU as well.
