Tiny corp master plan - By George Hotz (This gist exists purely for accessibility purposes)

Tiny corp master plan

Our mission is to “commoditize the petaflop.” By building an ML framework where the engineering cost to add a new accelerator is 10-100x lower than competitors, we lower the cost for new players to enter the market.

Similarly, we lower the cost of new operations and optimizers, allowing ML to avoid local minima. In older versions of PyTorch, batchnorm and groupnorm had very different performance. There's no fundamental reason for this; it was just because more work had been put into the batchnorm kernels. In tinygrad, all the kernels are compiled on the fly, so things like this won't happen.
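To make that concrete: groupnorm is just a handful of reshapes, reductions, and elementwise ops, so a framework that generates kernels at runtime fuses it the same way it fuses batchnorm. A minimal sketch (tinygrad-flavored Python; the Tensor methods used here are assumed from the public API, and this is an illustration, not the actual tinygrad implementation):

```python
# Sketch: groupnorm expressed as generic tensor ops that a JIT-compiling
# framework can fuse at runtime. Tensor API assumed, not taken from the plan.
from tinygrad.tensor import Tensor

def group_norm(x: Tensor, groups: int, eps: float = 1e-5) -> Tensor:
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)                   # split channels into groups
    mean = g.mean(axis=(2, 3, 4), keepdim=True)                   # per-group mean
    var = ((g - mean) ** 2).mean(axis=(2, 3, 4), keepdim=True)    # per-group variance
    return ((g - mean) / (var + eps).sqrt()).reshape(n, c, h, w)  # normalize, restore shape

x = Tensor.randn(8, 32, 16, 16)
y = group_norm(x, groups=8)   # no dedicated hand-written "groupnorm kernel" required
```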

The key is factorization: OP x TYPE x DEVICE x JIT x MULTI

All axes are independent of each other. This avoids issues like pytorch/pytorch#11936, where batch_mm on CUDA doesn't support uint8. Our ops are much lower level, our devices are abstracted, and all kernels are compiled live, so dtype gaps like this shouldn't happen.
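One way to picture the factorization: each axis is a small independent set, and every kernel is just a point in their cross product, so supporting a new dtype or device never requires writing a new op by hand. An illustrative sketch (the axis members below are made-up examples, not tinygrad's actual op set or device list):

```python
# Illustrative sketch of the OP x TYPE x DEVICE x JIT x MULTI factorization.
from itertools import product

OPS     = ["add", "mul", "reduce_sum", "matmul"]
DTYPES  = ["float32", "float16", "uint8"]
DEVICES = ["CPU", "GPU", "METAL"]
JIT     = [False, True]
MULTI   = [1, 2, 4]   # number of devices sharded across

# Every combination is reachable because kernels are generated, not hand-written,
# so nothing like "matmul on CUDA doesn't support uint8" falls through the cracks.
combos = list(product(OPS, DTYPES, DEVICES, JIT, MULTI))
print(len(combos), "combinations from",
      len(OPS) + len(DTYPES) + len(DEVICES) + len(JIT) + len(MULTI), "axis entries")
```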

NVIDIA benefits from platform lock-in because things like flash attention target CUDA. While they may eventually be ported to other accelerators, the others will always be playing catch-up. In tinygrad, for example, the optimizers are pure, easy-to-read Python code, with no specialization for dtypes or devices. Eventually the optimizing compiler will auto-discover things like flash attention.
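For reference, the attention that the compiler would have to find a fast schedule for is just two matmuls and a softmax; flash attention is the same math with a tiling that never materializes the full N x N score matrix. A plain-ops sketch of the standard formula (numpy for clarity; this is not tinygrad code):

```python
# Scaled dot-product attention written as plain, fusable tensor ops.
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (N, N) score matrix
    weights = np.exp(scores - scores.max(-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v                                        # (N, d) output

q = k = v = np.random.randn(128, 64).astype(np.float32)
out = attention(q, k, v)
```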

At tiny corp, we believe all FLOPS are created equal, and we want hardware to boil down to two numbers: compute (FLOPS) and memory bandwidth (GB/s).

The framework play

The crown jewel of tiny corp is tinygrad, a competitor to TensorFlow and PyTorch. The main advantage of the framework to the user is debuggability. When TF and PyTorch have internal issues, it is a long journey to fix them; once you do, you have to rebuild and deploy a complex package across your cluster. tinygrad is pure Python.

Of course, ease of use and speed also matter. Our goal is to at least match PyTorch on both fronts before tinygrad is 1.0. People migrated from TensorFlow to PyTorch because the former started to collapse under its own weight. The migration from PyTorch to tinygrad will be similar.

There are still 6 months to 1 year of development work before tinygrad is 1.0. A major milestone will be getting AMD (in the form of a tinybox) onto MLPerf.

The hardware play

We sell a $15,000 deep learning box (the tinybox) with 738 FP16 TFLOPS and 5.76 TB/s of RAM bandwidth. After extensive research, we haven't found a cheaper off-the-shelf way to achieve those $/TFLOPS and $/GB/s numbers. The first generation tinybox is $20.33/TFLOP and $2.60/GB/s. While gaming PCs may have higher numbers, they do not scale; tinyboxes can be linked with a full 16x PCI-E 4 fabric.
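Those unit costs follow directly from the list price and the two headline numbers (a quick back-of-the-envelope check, using only figures stated above):

```python
# Back-of-the-envelope check of the tinybox unit costs.
price_usd      = 15_000
fp16_tflops    = 738           # FP16 TFLOPS
bandwidth_gbps = 5.76 * 1000   # 5.76 TB/s expressed in GB/s

print(f"${price_usd / fp16_tflops:.2f} per TFLOP")    # ~$20.33/TFLOP
print(f"${price_usd / bandwidth_gbps:.2f} per GB/s")  # ~$2.60/GB/s
```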

What’s inside the box doesn’t really matter, and will be updated over time with the best way to lower $/TFLOP and $/GB/s. We will push down the stack as appropriate, potentially going as far as taping out chips if it makes sense. All our hardware is sold at a profit such that we can be sustainable.

The cloud play

After we have a framework in widespread use, it becomes easy to introduce the cloud play. Currently, you set the default device in tinygrad with an environment variable, like GPU=1. With CLOUD=1, all your compute will be in the cloud. The cloud also supports storage in the form of DiskTensors, meaning you can store your datasets in the cloud and access them quickly.
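The intended workflow is that the training script never changes, only how it is launched. A hypothetical sketch (the Tensor calls are assumed from tinygrad's public API; the CLOUD backend and cloud-side DiskTensors are planned features, not shipped ones):

```python
# train.py -- the same script runs locally or against the cloud, depending only
# on how it is launched. Illustrative sketch, not the actual cloud implementation.
from tinygrad.tensor import Tensor

x = Tensor.randn(1024, 1024)
y = (x @ x).sum()
print(y.numpy())

# GPU=1   python train.py   -> compute on the local GPU
# CLOUD=1 python train.py   -> same code, compute (and DiskTensor storage) in the cloud
```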

Currently, people wait hours/days/weeks for training jobs. It shouldn't work like this: a job should simply be priced by the number of operations and memory accesses used, expanding across many compute units and completing as quickly as possible. The cost is the same whether it runs for 1 hour at 10 exaflops or 1000 hours at 10 petaflops; obviously the user would prefer to complete the job in one hour. If S3 is "infinitely scalable" storage, tinycloud is "infinitely scalable" compute.
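The pricing claim is just conservation of total work: the bill depends on operations performed, not wall-clock time. A quick check of the two scenarios above:

```python
# Same total work, same bill; only the wall-clock time differs.
import math

EXAFLOPS, PETAFLOPS = 1e18, 1e15   # operations per second
HOUR = 3600                        # seconds

ops_fast = 10 * EXAFLOPS  * (1 * HOUR)      # 1 hour at 10 exaflops
ops_slow = 10 * PETAFLOPS * (1000 * HOUR)   # 1000 hours at 10 petaflops
assert math.isclose(ops_fast, ops_slow)     # both ~3.6e22 operations
```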

The key to enabling this is not offering a bare metal cloud, but a tinygrad-specific cloud. There's zero lock-in: since it's the exact same tinygrad code, you just run with CLOUD=1 instead of GPU=1.

And since the cloud is tinygrad specific, we can run our datacenters in a very different way. Current datacenters spend way too much money to ensure as close to 100% uptime as possible. If you are doing a training job and the datacenter goes down for a couple minutes, it doesn’t really matter beyond adding a few minutes to the completion time.

We take lessons in running datacenters from crypto miners, not traditional cloud providers.

In short:

  1. build tinygrad framework (give it away)
  2. build very cost efficient hardware to run tinygrad (sell it to make money)
  3. build a massive cloud of this very cost efficient hardware (rent it to make money)
  4. train the largest models in history

Don’t tell the doomers.
