Hello! I've updated tpunicorn (https://github.com/shawwn/tpunicorn) with support for TPU VMs. But before I do a full release, I was hoping that a TPU VM user (like you!) would help me test the pre-release version. If it seems to work for you, let me know and I'll do a release announcement on my twitter (https://twitter.com/theshawwn) and plug TRC while I'm at it (https://blog.gpt4.org/jaxtpu).
I wanted an effortless way to manage all my TPUs, so I made tpunicorn. It's basically a TPU devops power tool.
SSH into one of your existing TPU VMs, then run:
pip3 install --pre -U tpunicorn
(Note the --pre flag. Without it, you'll get the old version of tpunicorn without TPU VM support. The current prerelease test version is 0.6.0.rc6, which you can verify with ~/.local/bin/pu --version.)
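To confirm you got the prerelease:
~/.local/bin/pu --version   # should report 0.6.0.rc6 (or a newer prerelease)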
Here's what it looks like on a fresh TPU that I SSH'ed into just now:
Here's a handy snippet I use to add ~/.local/bin to my PATH:
pip3 install userpath
~/.local/bin/userpath append ~/.local/bin
exec $SHELL
Then you can run pu rather than ~/.local/bin/pu.
(userpath is what pipx ensurepath uses (https://github.com/pypa/pipx). I like it for its simplicity.)
Once pu is installed, you can:
- pu list to see a nice graphical view of all TPUs across your project.
- pu top clears the terminal and runs pu list every few seconds. Handy for watching the status of a TPU pod you've just created.
- pu create to fire up a new preemptible v2-8 in us-central1-f.
- pu create tpu-v2-512-usc1a-42 to create a 512-core TPU pod in us-central1-a.
- pu create -a v3-8 -np to create a non-preemptible v3-8 in europe-west4-a.
- pu {start,stop,delete,recreate,ssh} 0 to start/stop/delete/recreate/ssh into the TPU whose ID ends with -0.
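To give a feel for how these fit together, here's a sketch of a typical session (the pod name and ID are just examples from the list above):
pu create tpu-v2-512-usc1a-42   # spin up a 512-core pod in us-central1-a
pu top                          # watch until the pod shows up (Ctrl-C to exit)
pu ssh 42                       # ssh into the TPU whose ID ends with -42
pu delete 42                    # tear it down when you're done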
You'll notice that pu create tries to infer the proper zone automatically. (I wanted it to be totally effortless to create TPU pods.) You can specify the zone using -z or --zone. To see all possible zones, run pu create --help:
(NOTE FOR TESTERS: If you have access to any secret zones, please verify that they show up in this list. I fetch the list dynamically via the API, so I'm keen to see whether pu is suitable for e.g. the Cloud TPU team.)
For all commands, you can also use shorthand zone names (e.g. usc1f instead of us-central1-f).
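For example, if the shorthand pattern holds (I'm inferring euw4a from the usc1f/us-central1-f pair above), these two should be equivalent:
pu create -a v3-8 --zone europe-west4-a
pu create -a v3-8 -z euw4a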
One nice feature of pu is that it prints out the corresponding command to undo the last action. For example, when I ran pu delete on a TPU with a hard drive attached, it printed out the full gcloud command to run (against the new TPU VM API) to recreate that TPU:
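The exact command pu prints will vary, but it has roughly this shape (flags per the alpha TPU VM API; the disk path is a made-up placeholder):
gcloud alpha compute tpus tpu-vm create tpu-v2-8-usc1f-0 \
  --zone us-central1-f \
  --accelerator-type v2-8 \
  --version v2-alpha \
  --data-disk source=projects/my-project/zones/us-central1-f/disks/my-disk,mode=read-write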
You can spit out all JSON info about a specific TPU using pu list -t that-tpu --format json. Using jq, you can extract basically any possible information you might need (for shell scripts, etc).
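For instance (assuming the JSON mirrors the Cloud TPU API's node format, with fields like state and networkEndpoints):
pu list -t that-tpu --format json | jq -r .state
pu list -t that-tpu --format json | jq -r '.networkEndpoints[0].ipAddress'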
For example, before I added pu ssh, I used this tpu-ssh script to SSH into my TPUs: https://github.com/shawwn/scrap/blob/master/tpu-ssh
It's quite lovely to run tpu-ssh 0 instead of gcloud alpha compute tpus tpu-vm ssh tpu-v2-8-usc1f-0 --zone us-central1-f.
The pu ssh command exists now, but this script serves as a nice demo of using pu to grab some info about your TPUs.
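If you don't want to click through, here's a minimal sketch of the same idea (not the actual tpu-ssh script; it assumes the node's name field has the usual projects/.../locations/ZONE/nodes/NAME format, and hardcodes my naming scheme):
#!/usr/bin/env bash
# usage: tpu-ssh 0  (ssh into the TPU whose ID ends with -0)
set -e
info="$(pu list -t "tpu-v2-8-usc1f-$1" --format json)"
zone="$(echo "$info" | jq -r .name | cut -d/ -f4)"   # projects/P/locations/ZONE/...
name="$(echo "$info" | jq -r .name | cut -d/ -f6)"   # .../nodes/NAME
gcloud alpha compute tpus tpu-vm ssh "$name" --zone "$zone"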
If you get permission errors such as "Permission 'tpu.locations.list' denied on 'projects/..., you can fix it in one of the following ways:
- Run gcloud auth list and then give the default account a "TPU Admin" role via your IAM dashboard (https://console.cloud.google.com/iam-admin/iam). For example, here's how I set up mine: https://i.imgur.com/TPXNdwG.png (There's also a gcloud one-liner for this; see the sketch after this list.)
- Or run gcloud auth login (and perhaps gcloud auth application-default login), then log in as yourself
- Or use a service account key which you've set up as a TPU Admin or TPU Viewer: https://cloud.google.com/iam/docs/creating-managing-service-account-keys
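A minimal sketch of option #1 from the command line (YOUR_PROJECT is a placeholder; roles/tpu.admin is the standard Cloud TPU role, with roles/tpu.viewer as the read-only variant):
# On a TPU VM the active account is usually the default service account;
# use a "user:" member prefix instead if you're logged in as yourself.
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:$(gcloud auth list --filter=status:ACTIVE --format='value(account)')" \
  --role="roles/tpu.admin"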
I like option #1, since it lets me control all my TPUs from any of my TPU VMs. But that means anyone with access to any TPU VM can create/delete TPUs in your project. (You could use "TPU Viewer" role instead of "TPU Admin" if you're okay with giving read-only access to all your TPU VMs.)
Here's what a permission error looks like, followed by option #1, at which point the error went away: https://i.imgur.com/XRjKYlZ.png
Option #2 (gcloud auth login) is convenient, but watch out! Anyone who obtains access to that box can run GCP commands as you. Which means they probably have total control over your GCP account.
Option #3 (a custom service account key) is the most secure, but it's a complete pain. On the other hand, you'd be able to give specific VMs "TPU Admin" or "TPU Viewer" roles, which you can't do via any other technique.
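For completeness, a sketch of option #3 (the account name and YOUR_PROJECT are placeholders; google-auth picks up the key via GOOGLE_APPLICATION_CREDENTIALS):
# Create a read-only service account, grant it TPU Viewer, and mint a key.
gcloud iam service-accounts create tpu-viewer-bot
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:tpu-viewer-bot@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/tpu.viewer"
gcloud iam service-accounts keys create key.json \
  --iam-account="tpu-viewer-bot@YOUR_PROJECT.iam.gserviceaccount.com"
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/key.json"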
This is the current list of dependencies for tpunicorn:
install_requires=[
'Click>=7.1.2',
'six>=1.11.0',
'ring>=0.7.3',
'moment>=0.0.10',
'google-auth>=0.11.0',
'google-api-python-client>=1.7.11',
'cachier>=1.5.0',
'braceexpand>=0.1.7',
],
Theoretically any of those projects could inject a backdoor into your TPUs.
And it seems to be a realistic attack vector: https://arstechnica.com/gadgets/2021/03/more-top-tier-companies-targeted-by-new-type-of-potentially-serious-attack/
I plan to address this by reducing the dependency list down to Click (required for CLI) and google-auth / google-api-python-client (required for TPU operations). But for now, that's a TODO.
Lastly, you're trusting me (https://github.com/shawwn) not to backdoor your TPUs. Anyone who compromises my pypi account (https://pypi.org/user/shawwn/) could ship a new version of tpunicorn with a backdoor.
I wish I had a better answer to this than "Trust me." But, you could glance over the code (https://github.com/shawwn/tpunicorn/blob/master/tpunicorn/program.py) and then do a local install:
git clone https://github.com/shawwn/tpunicorn
cd tpunicorn
pip3 install -e .
At that point, ~/.local/bin/pu will point to your local tpunicorn repository.
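One way to sanity-check the editable install (assuming pip's standard show output):
pip3 show tpunicorn | grep -i '^location'   # should print your clone's path
~/.local/bin/pu --version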
I appreciate your time! Thanks for helping to test the tpunicorn pre-release.