
@shawwn

shawwn/email.md (secret gist)

Created October 4, 2020 15:01
subject: Model Parallelism is Awesome

Hiya,

It's been bugging me that I only email you guys when something goes wrong, or when we need something or whatever. So I just wanted to say that I came across https://cloud.google.com/tpu/docs/spatial-partitioning last night, and within about an hour we had started using that feature in production. Or to put it more simply: I twiddled a few options to enable the feature, and my jaw hit the floor when our stylegan2 models started training at the full 1024x1024 resolution. Previously, 512x512 was the maximum resolution our codebase could handle before running out of memory on each TPU core.
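
(For the curious, here's roughly what that twiddling amounted to. This is a minimal sketch following the TPUEstimator API from the spatial-partitioning docs; the TPU name and the particular 2x2 split of the NHWC features tensor are just illustrative values, not our actual config.)

```python
import tensorflow.compat.v1 as tf

# Minimal sketch of enabling spatial partitioning via TPUEstimator, per
# https://cloud.google.com/tpu/docs/spatial-partitioning.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')  # hypothetical TPU name

tpu_config = tf.estimator.tpu.TPUConfig(
    iterations_per_loop=100,
    num_cores_per_replica=4,  # spread each replica across 4 cores
    per_host_input_for_training=tf.estimator.tpu.InputPipelineConfig.PER_HOST_V2,
    # Partition the [batch, height, width, channels] features tensor
    # 2-way along height and 2-way along width; `None` means the labels
    # are fed to every core unpartitioned.
    input_partition_dims=[[1, 2, 2, 1], None])

run_config = tf.estimator.tpu.RunConfig(cluster=resolver, tpu_config=tpu_config)
```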

It was mind-blowing to discover that TPUs have this magical feature. It's amazing, incredible, wonderful, you name it. From an end-user standpoint, I can't think of any other feature I've ever tried to use in any software library that (a) has been so painless, (b) worked perfectly on the first try, and (c) had as much impact on my day-to-day work as this just did.

It's also, to me, a technical marvel – one does not simply train a larger stylegan2 model by training 4 smaller stylegan2 models on different parts of the image. That's not how stylegan works. It's not how any of this works! Yet apparently, someone figured out how to carefully organize every computation such that the XLA compiler correctly "farms out" the work to each core, then gathers the results such that it's completely transparent that anything is different at all. Now, speaking as a programmer who once devoted a sizable portion of my life to "multithreading / GPU / performance stuff," I am uniquely qualified to understand just how hard of a problem that is. Somehow XLA pulls it off!
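
To make the difficulty concrete: think about what a single 3x3 convolution means once the image is split across cores. The output pixels along each cut depend on input pixels owned by a neighboring core, so the compiler has to arrange a "halo exchange" at every layer. Here's a toy numpy sketch of that idea (my illustration of the concept, not XLA's actual code):

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D convolution (cross-correlation) with a 3x3 kernel."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * k)
    return out

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
reference = conv2d_valid(image, kernel)  # unpartitioned result

# Split the image into a top and bottom shard along height.
top, bottom = image[:4], image[4:]

# A naive per-shard convolution would lose the output rows that straddle
# the cut. Exchanging a 1-row "halo" with the neighbor recovers them.
top_with_halo = np.concatenate([top, bottom[:1]], axis=0)
bottom_with_halo = np.concatenate([top[-1:], bottom], axis=0)

sharded = np.concatenate([conv2d_valid(top_with_halo, kernel),
                          conv2d_valid(bottom_with_halo, kernel)], axis=0)

assert np.allclose(reference, sharded)  # same computation, just partitioned
```

XLA has to do the equivalent of that bookkeeping for every op in the graph, automatically, which is what makes it so impressive.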

I spent some time digging into how this magic works, reading through the tensorflow source code. But ultimately the magic seems to happen in the XLA compiler (I think?) and most of that processing happens inside the TPU software stack, on the TPU itself. And short of somehow tricking the TPU into dumping out its own source code, I haven't found a way of peeking inside the TPU yet, so the magic will have to remain a mystery for now.
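
The closest I got to the mechanism: from my reading of the TF source, the estimator machinery just tags tensors with XlaSharding annotations and leaves everything else to the compiler, which propagates the shardings through the graph and inserts the communication between cores. Something like this, using the experimental xla_sharding helpers in the tensorflow tree (a hedged sketch from my reading of the source, not a documented public API):

```python
from tensorflow.compiler.xla.experimental.xla_sharding import xla_sharding

# Sketch based on my reading of the TF source; treat as illustrative.
# Attach a sharding annotation that splits a tensor 4 ways along its
# height dimension (the height must be statically known and divisible).
def annotate_spatial_split(images):
    # images: [batch, height, width, channels]
    return xla_sharding.split(images, split_dimension=1, num_devices=4)
```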

So I wanted to give a shoutout / kudos to all of the people who made this feature possible, and to whoever wrote the documentation calling it out as a possibility. (In fact, the docs seem way, way better than the last time I went through them carefully, several months ago. But I somehow doubt the docs changed all that much; more likely it's my skill level that did.)

Sometimes it feels like I happen to be the first programmer on earth outside of Google to notice how absolutely delightful TPUs are. After all, automatically parallelizing code was one of the grand challenges in software development, so much so that it even appears as #6 on pg's "Frighteningly Ambitious Startup Ideas" list: http://www.paulgraham.com/ambitious.html

> It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low.

To my knowledge, work in this area has been mostly theoretical. So picture me casually browsing the TPU docs, reading each page one by one, doing a double take at https://cloud.google.com/tpu/docs/spatial-partitioning, then once more, then a third time, saying "No way it's that easy," trying it, and then being completely floored that this apparently does work and really is completely transparent. It was like a scene out of a novel where an apprentice sorcerer, poring over ancient tomes of long-forgotten magic spells, unearths some knowledge that instantly makes them twice as powerful.

Discovering this was truly one of the most delightful experiences I have ever had as a programmer, so I wanted to share a bit of that delight while it's still fresh. Thank you, Cloud TPU team!

Best,
Shawn

@shoyer commented Feb 9, 2021

You can find the source code for XLA's spatial partitioning here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla/service/spmd

@sanjoy commented Feb 9, 2021

Note that XLA's SPMD support is not inherently TPU-specific -- we're actively looking into bringing SPMD support to XLA on GPU as well.
