Tiny corp master plan - By George Hotz (This gist exists purely for accessibility purposes)

Tiny corp master plan

Our mission is to “commoditize the petaflop.” By building an ML framework where the engineering cost to add a new accelerator is 10-100x lower than competitors, we lower the cost for new players to enter the market.

Similarly, we lower the cost of new operations and optimizers, allowing ML to avoid local minima. In older versions of PyTorch, batchnorm and groupnorm had very different performance. There's no fundamental reason for this; it was just because more work had been put into the batchnorm kernels. In tinygrad, all the kernels are compiled on the fly, so things like this won't happen.
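To make that concrete: groupnorm is just a handful of reshapes, reductions, and elementwise ops, so a framework that generates kernels at runtime fuses it the same way it fuses batchnorm. A minimal sketch (tinygrad-flavored Python; the Tensor methods used here are assumed from the public API, and this is an illustration, not the actual tinygrad implementation):

```python
# Sketch: groupnorm expressed as generic tensor ops that a JIT-compiling
# framework can fuse at runtime. Tensor API assumed, not taken from the plan.
from tinygrad.tensor import Tensor

def group_norm(x: Tensor, groups: int, eps: float = 1e-5) -> Tensor:
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)                   # split channels into groups
    mean = g.mean(axis=(2, 3, 4), keepdim=True)                   # per-group mean
    var = ((g - mean) ** 2).mean(axis=(2, 3, 4), keepdim=True)    # per-group variance
    return ((g - mean) / (var + eps).sqrt()).reshape(n, c, h, w)  # normalize, restore shape

x = Tensor.randn(8, 32, 16, 16)
y = group_norm(x, groups=8)   # no dedicated hand-written "groupnorm kernel" required
```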

The key is factorization: OP x TYPE x DEVICE x JIT x MULTI

All axes are independent of each other. This avoids issues like pytorch/pytorch#11936, where batch_mm on CUDA doesn't support uint8. Our ops are much lower level, our devices are abstracted, and all kernels are compiled live, so dtype gaps like this shouldn't happen.
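One way to picture the factorization: each axis is a small independent set, and every kernel is just a point in their cross product, so supporting a new dtype or device never requires writing a new op by hand. An illustrative sketch (the axis members below are made-up examples, not tinygrad's actual op set or device list):

```python
# Illustrative sketch of the OP x TYPE x DEVICE x JIT x MULTI factorization.
from itertools import product

OPS     = ["add", "mul", "reduce_sum", "matmul"]
DTYPES  = ["float32", "float16", "uint8"]
DEVICES = ["CPU", "GPU", "METAL"]
JIT     = [False, True]
MULTI   = [1, 2, 4]   # number of devices sharded across

# Every combination is reachable because kernels are generated, not hand-written,
# so nothing like "matmul on CUDA doesn't support uint8" falls through the cracks.
combos = list(product(OPS, DTYPES, DEVICES, JIT, MULTI))
print(len(combos), "combinations from",
      len(OPS) + len(DTYPES) + len(DEVICES) + len(JIT) + len(MULTI), "axis entries")
```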

NVIDIA benefits from platform lock-in because things like flash attention target CUDA. While they may eventually be ported to other accelerators, the others will always be playing catch-up. In tinygrad, for example, the optimizers are pure, easy-to-read Python code, with no specialization for dtypes or devices. Eventually the optimizing compiler will auto-discover things like flash attention.
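For reference, the attention that the compiler would have to find a fast schedule for is just two matmuls and a softmax; flash attention is the same math with a tiling that never materializes the full N x N score matrix. A plain-ops sketch of the standard formula (numpy for clarity; this is not tinygrad code):

```python
# Scaled dot-product attention written as plain, fusable tensor ops.
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (N, N) score matrix
    weights = np.exp(scores - scores.max(-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v                                        # (N, d) output

q = k = v = np.random.randn(128, 64).astype(np.float32)
out = attention(q, k, v)
```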

At tiny corp, we believe all FLOPS are created equal, and we want hardware to boil down to two numbers: compute (FLOPS) and memory bandwidth (GB/s).

The framework play

The crown jewel of tiny corp is tinygrad, a competitor to TensorFlow and PyTorch. The main advantage of the framework to the user is debuggability. When TF and PyTorch have internal issues, it is a long journey to fix them; once you do, you have to rebuild and deploy a complex package across your cluster. tinygrad is pure Python.

Of course, ease of use and speed also matter. Our goal is to at least match PyTorch on both fronts before tinygrad is 1.0. People migrated from TensorFlow to PyTorch because the former started to collapse under its own weight. The migration from PyTorch to tinygrad will be similar.

There are still 6 months to 1 year of development work before tinygrad is 1.0. A major milestone will be getting AMD (in the form of a tinybox) onto MLPerf.

The hardware play

We sell a $15,000 deep learning box (the tinybox) with 738 FP16 TFLOPS and 5.76 TB/s of RAM bandwidth. After extensive research, we haven't found a cheaper off-the-shelf way to achieve those $/TFLOPS and $/GB/s numbers. The first generation tinybox is $20.33/TFLOP and $2.60/GB/s. While gaming PCs may have higher numbers, they do not scale; tinyboxes can be linked with a full 16x PCI-E 4 fabric.
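Those unit costs follow directly from the list price and the two headline numbers (a quick back-of-the-envelope check, using only figures stated above):

```python
# Back-of-the-envelope check of the tinybox unit costs.
price_usd      = 15_000
fp16_tflops    = 738           # FP16 TFLOPS
bandwidth_gbps = 5.76 * 1000   # 5.76 TB/s expressed in GB/s

print(f"${price_usd / fp16_tflops:.2f} per TFLOP")    # ~$20.33/TFLOP
print(f"${price_usd / bandwidth_gbps:.2f} per GB/s")  # ~$2.60/GB/s
```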

What’s inside the box doesn’t really matter, and will be updated over time with the best way to lower $/TFLOP and $/GB/s. We will push down the stack as appropriate, potentially going as far as taping out chips if it makes sense. All our hardware is sold at a profit such that we can be sustainable.

The cloud play

After we have a framework in widespread use, it becomes easy to introduce the cloud play. Currently, you set the default device in tinygrad with an environment variable, like GPU=1. With CLOUD=1, all your compute will be in the cloud. The cloud also supports storage in the form of DiskTensors, meaning you can store your datasets in the cloud and access them quickly.
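The intended workflow is that the training script never changes, only how it is launched. A hypothetical sketch (the Tensor calls are assumed from tinygrad's public API; the CLOUD backend and cloud-side DiskTensors are planned features, not shipped ones):

```python
# train.py -- the same script runs locally or against the cloud, depending only
# on how it is launched. Illustrative sketch, not the actual cloud implementation.
from tinygrad.tensor import Tensor

x = Tensor.randn(1024, 1024)
y = (x @ x).sum()
print(y.numpy())

# GPU=1   python train.py   -> compute on the local GPU
# CLOUD=1 python train.py   -> same code, compute (and DiskTensor storage) in the cloud
```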

Currently, people wait hours/days/weeks for training jobs. It shouldn't work like this: a job should simply be priced by the number of operations and memory accesses used, expanding across many compute units and completing as quickly as possible. The cost is the same whether it runs for 1 hour at 10 exaflops or 1000 hours at 10 petaflops; obviously the user would prefer to complete the job in one hour. If S3 is "infinitely scalable" storage, tinycloud is "infinitely scalable" compute.
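The pricing claim is just conservation of total work: the bill depends on operations performed, not wall-clock time. A quick check of the two scenarios above:

```python
# Same total work, same bill; only the wall-clock time differs.
import math

EXAFLOPS, PETAFLOPS = 1e18, 1e15   # operations per second
HOUR = 3600                        # seconds

ops_fast = 10 * EXAFLOPS  * (1 * HOUR)      # 1 hour at 10 exaflops
ops_slow = 10 * PETAFLOPS * (1000 * HOUR)   # 1000 hours at 10 petaflops
assert math.isclose(ops_fast, ops_slow)     # both ~3.6e22 operations
```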

The key to enabling this is not offering a bare metal cloud, but a tinygrad-specific cloud. There's zero lock-in: since it's the exact same tinygrad code, you just run with CLOUD=1 instead of GPU=1.

And since the cloud is tinygrad specific, we can run our datacenters in a very different way. Current datacenters spend way too much money to ensure as close to 100% uptime as possible. If you are doing a training job and the datacenter goes down for a couple minutes, it doesn’t really matter beyond adding a few minutes to the completion time.

We take lessons in running datacenters from crypto miners, not traditional cloud providers.

In short:

  1. build tinygrad framework (give it away)
  2. build very cost efficient hardware to run tinygrad (sell it to make money)
  3. build a massive cloud of this very cost efficient hardware (rent it to make money)
  4. train the largest models in history

Don’t tell the doomers.
