Our mission is to “commoditize the petaflop.” By building an ML framework where the engineering cost to add a new accelerator is 10-100x lower than competitors, we lower the cost for new players to enter the market.
Similarly, we lower the cost of new operations and optimizers, allowing ML to avoid local minima. In older versions of PyTorch, batchnorm and groupnorm had very different performance. There's no fundamental reason for this; it was just that more work had been put into the batchnorm kernels. In tinygrad, all kernels are compiled on the fly, so disparities like this don't happen.
The key is factorization: OP x TYPE x DEVICE x JIT x MULTI
All of these are independent of each other. This avoids issues like pytorch/pytorch#11936, where batch_mm on CUDA doesn't support uint8. Our ops are much lower level, our devices are abstracted, and all kernels are compiled on the fly, so dtype gaps like this shouldn't happen.
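To see why the factorization matters, a back-of-the-envelope sketch (the counts below are made up for illustration, not tinygrad's real numbers): if every op/dtype/device combination needs a hand-written kernel, the work grows multiplicatively; if they are independent axes composed by a compiler, it grows additively.

```python
# Illustrative axis sizes only; the point is multiplicative vs. additive growth.
N_OPS, N_DTYPES, N_DEVICES = 30, 8, 6

# Hand-written kernels: every (op, dtype, device) triple is its own kernel,
# so coverage gaps like "batch_mm on CUDA doesn't support uint8" are inevitable.
hand_written = N_OPS * N_DTYPES * N_DEVICES

# Factorized: implement each axis once and let the compiler compose them.
factorized = N_OPS + N_DTYPES + N_DEVICES

print(hand_written, factorized)  # 1440 44
```

With five independent axes (OP x TYPE x DEVICE x JIT x MULTI) the multiplicative blowup is even worse, which is why nobody fully hand-tunes that grid.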
NVIDIA benefits from platform lock-in because things like flash-attention target CUDA. While they may eventually be ported to other accelerators, the others will always be playing catch-up. In tinygrad, for example, the optimizers are pure, easy-to-read Python code, with no specialization for dtypes or devices. Eventually the optimizing compiler will auto-discover things like flash attention.
At tiny corp, we believe all FLOPS are created equal, and we want hardware to boil down to two numbers, compute and memory bandwidth. FLOPS and GB/s.
The crown jewel of tiny corp is tinygrad, a competitor to TensorFlow and PyTorch. The main advantage to the user of the framework is debuggability. When TF and PyTorch have internal issues, it is a long journey to fix them, and once you do, you have to rebuild and deploy a complex package across your cluster. tinygrad is pure Python.
Of course, ease of use and speed matter also. Our goal is to at least match PyTorch on both fronts before tinygrad is 1.0. People migrated from TensorFlow to PyTorch because the former started to collapse under its own weight. The migration from PyTorch to tinygrad will be similar.
There’s still 6 months to a year of development work before tinygrad is 1.0. A major milestone will be getting AMD (in the form of a tinybox) on MLPerf.
We sell a $15,000 deep learning box (the tinybox) with 738 FP16 TFLOPS and 5.76 TB/s of RAM bandwidth. After extensive research, we haven’t found a cheaper off-the-shelf way to achieve those numbers. What’s inside the box doesn’t really matter, and it will be updated over time with whatever best lowers the cost per FLOP.
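Those two numbers are enough for a back-of-the-envelope roofline model: whether a kernel is compute-bound or memory-bound depends only on peak FLOPS, GB/s, and the kernel's arithmetic intensity. A sketch using the tinybox's headline figures (the example kernel intensities are hypothetical, for illustration only):

```python
# tinybox headline numbers from above
peak_flops = 738e12   # 738 FP16 TFLOPS
mem_bw     = 5.76e12  # 5.76 TB/s

# Ridge point: arithmetic intensity (FLOPs per byte moved) at which a kernel
# stops being memory-bound and starts being compute-bound.
ridge = peak_flops / mem_bw  # ~128 FLOPs/byte

def attainable_flops(intensity):
    """Roofline model: min(compute roof, memory roof * intensity)."""
    return min(peak_flops, mem_bw * intensity)

# Hypothetical kernel intensities, illustration only:
for name, intensity in [("elementwise add", 0.25), ("big matmul", 300.0)]:
    print(f"{name}: {attainable_flops(intensity) / 1e12:.1f} TFLOPS attainable")
```

An elementwise op at 0.25 FLOPs/byte is pinned to the memory roof, while a large matmul can saturate the compute roof; both ceilings follow from just the two numbers.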
After we have a framework in widespread use it becomes easy to introduce the cloud play. Currently, you set the default device in tinygrad with an environment variable, like GPU=1. With CLOUD=1, all your compute will be in the cloud. The cloud also supports storage in the form of DiskTensors, meaning you can store your datasets in the cloud and quickly access them.
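The mechanism is just environment-variable dispatch. A minimal sketch of the idea, assuming a hypothetical device table and `pick_device` helper (not tinygrad internals):

```python
import os

# Hypothetical device priority list, for illustration; tinygrad's real
# backend selection lives inside the framework itself.
DEVICES = ["CLOUD", "GPU", "CPU"]

def pick_device(environ=os.environ):
    """Return the first device whose env var is set to "1", defaulting to CPU."""
    for dev in DEVICES:
        if environ.get(dev) == "1":
            return dev
    return "CPU"

# The same script targets a different deployment with no code changes:
print(pick_device({"GPU": "1"}))    # GPU
print(pick_device({"CLOUD": "1"}))  # CLOUD
```

The point is that switching from local GPU to the cloud is a one-variable change, not a rewrite.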
Currently, people wait hours/days/weeks for training jobs. It shouldn’t work like this: a job should simply be priced by the number of operations and memory accesses used, expanding across as many compute units as needed and completing as quickly as possible. The cost is the same whether it runs for 1 hour at 10 exaflops or 1000 hours at 10 petaflops; obviously the user prefers the job done in one hour. If S3 is “infinitely scalable” storage, tinycloud is “infinitely scalable” compute.
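The cost invariance in that example is just arithmetic: bill by total operations executed and wall-clock time drops out. A sketch, with a made-up price per exaFLOP-hour (not a real tinycloud rate):

```python
import math

# Hypothetical rate, illustration only; what matters is that cost
# depends solely on total work done, not on how long it took.
PRICE_PER_EXAFLOP_HOUR = 100.0

def cost(flops_per_sec, hours):
    """Cost of a job billed purely by total operations executed."""
    exaflop_hours = flops_per_sec / 1e18 * hours
    return exaflop_hours * PRICE_PER_EXAFLOP_HOUR

fast = cost(10e18, 1)     # 1 hour at 10 exaflops
slow = cost(10e15, 1000)  # 1000 hours at 10 petaflops
print(math.isclose(fast, slow))  # True: both jobs are 10 exaFLOP-hours of work
```

Under that pricing, the scheduler is free to fan a job out across the whole datacenter, since finishing 1000x faster costs the user nothing extra.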
The key to enabling this is not offering a bare-metal cloud, but a tinygrad-specific cloud. There’s zero lock-in: it’s the exact same tinygrad code, you just run with CLOUD=1 instead of GPU=1.
And since the cloud is tinygrad specific, we can run our datacenters in a very different way. Current datacenters spend way too much money to ensure as close to 100% uptime as possible. If you are doing a training job and the datacenter goes down for a couple minutes, it doesn’t really matter beyond adding a few minutes to the completion time.
We take lessons in running datacenters from crypto miners, not traditional cloud providers.
In short:
- build tinygrad framework (give it away)
- build very cost efficient hardware to run tinygrad (sell it to make money)
- build a massive cloud of this very cost efficient hardware (rent it to make money)
- train the largest models in history
Don’t tell the doomers.