Hello! I've updated tpunicorn (https://github.com/shawwn/tpunicorn) with support for TPU VMs. But before I do a full release, I was hoping that a TPU VM user (like you!) would help me test the pre-release version. If it seems to work for you, let me know and I'll do a release announcement on my twitter (https://twitter.com/theshawwn) and plug TRC while I'm at it (https://blog.gpt4.org/jaxtpu).
I wanted an effortless way to manage all my TPUs, so I made tpunicorn. It's basically a TPU devops power tool.
SSH into one of your existing TPU VMs, then run:
pip3 install --pre -U tpunicorn
(Note the --pre flag. Without it, you'll get the old version of tpunicorn without TPU VM support. The current prerelease test version is 0.6.0.rc6, which you can verify with ~/.local/bin/pu --version.)
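To confirm you got the prerelease:
~/.local/bin/pu --version   # should report 0.6.0.rc6 (or a newer prerelease)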
Here's what it looks like on a fresh TPU that I SSH'ed into just now:
Here's a handy snippet I use to add ~/.local/bin to my PATH:
pip3 install userpath
~/.local/bin/userpath append ~/.local/bin
exec $SHELL
Then you can run pu rather than ~/.local/bin/pu.
(userpath is what pipx ensurepath uses (https://github.com/pypa/pipx). I like it for its simplicity.)
Once pu is installed, you can:
- pu list to see a nice graphical view of all TPUs across your project.
- pu top clears the terminal and runs pu list every few seconds. Handy for watching the status of a TPU pod you've just created.
- pu create to fire up a new preemptible v2-8 in us-central1-f.
- pu create tpu-v2-512-usc1a-42 to create a 512-core TPU pod in us-central1-a.
- pu create -a v3-8 -np to create a non-preemptible v3-8 in europe-west4-a.
- pu {start,stop,delete,recreate,ssh} 0 to start/stop/delete/recreate/ssh into the TPU whose ID ends with -0.
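To give a feel for how these fit together, here's a sketch of a typical session (the pod name and ID are just examples from the list above):
pu create tpu-v2-512-usc1a-42   # spin up a 512-core pod in us-central1-a
pu top                          # watch until the pod shows up (Ctrl-C to exit)
pu ssh 42                       # ssh into the TPU whose ID ends with -42
pu delete 42                    # tear it down when you're done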
You'll notice that pu create tries to infer the proper zone automatically. (I wanted it to be totally effortless to create TPU pods.) You can specify the zone using -z or --zone. To see all possible zones, run pu create --help:
(NOTE FOR TESTERS: If you have access to any secret zones, please verify that they show up in this list. I fetch the list dynamically via the API, so I'm keen to see whether pu is suitable for e.g. the Cloud TPU team.)
For all commands, you can also use shorthand zone names (e.g. usc1f instead of us-central1-f).
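For example, if the shorthand pattern holds (I'm inferring euw4a from the usc1f/us-central1-f pair above), these two should be equivalent:
pu create -a v3-8 --zone europe-west4-a
pu create -a v3-8 -z euw4a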
One nice feature of pu is that it prints out the corresponding command to undo the last action. For example, when I ran pu delete on a TPU with a hard drive attached, it printed out the full gcloud command to run (against the new TPU VM API) to recreate that TPU:
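The exact command pu prints will vary, but it has roughly this shape (flags per the alpha TPU VM API; the disk path is a made-up placeholder):
gcloud alpha compute tpus tpu-vm create tpu-v2-8-usc1f-0 \
  --zone us-central1-f \
  --accelerator-type v2-8 \
  --version v2-alpha \
  --data-disk source=projects/my-project/zones/us-central1-f/disks/my-disk,mode=read-write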
You can spit out all JSON info about a specific TPU using pu list -t that-tpu --format json. Using jq, you can extract basically any possible information you might need (for shell scripts, etc).
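For instance (assuming the JSON mirrors the Cloud TPU API's node format, with fields like state and networkEndpoints):
pu list -t that-tpu --format json | jq -r .state
pu list -t that-tpu --format json | jq -r '.networkEndpoints[0].ipAddress'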
For example, before I added pu ssh, I used this tpu-ssh script to SSH into my TPUs: https://github.com/shawwn/scrap/blob/master/tpu-ssh
It's quite lovely to run tpu-ssh 0 instead of gcloud alpha compute tpus tpu-vm ssh tpu-v2-8-usc1f-0 --zone us-central1-f.
The pu ssh command exists now, but this script serves as a nice demo of using pu to grab some info about your TPUs.
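If you don't want to click through, here's a minimal sketch of the same idea (not the actual tpu-ssh script; it assumes the node's name field has the usual projects/.../locations/ZONE/nodes/NAME format, and hardcodes my naming scheme):
#!/usr/bin/env bash
# usage: tpu-ssh 0  (ssh into the TPU whose ID ends with -0)
set -e
info="$(pu list -t "tpu-v2-8-usc1f-$1" --format json)"
zone="$(echo "$info" | jq -r .name | cut -d/ -f4)"   # projects/P/locations/ZONE/...
name="$(echo "$info" | jq -r .name | cut -d/ -f6)"   # .../nodes/NAME
gcloud alpha compute tpus tpu-vm ssh "$name" --zone "$zone"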
If you get permission errors such as "Permission 'tpu.locations.list' denied on 'projects/..., you can fix it in one of the following ways:
- Run gcloud auth list and then give the default account a "TPU Admin" role via your IAM dashboard (https://console.cloud.google.com/iam-admin/iam). For example, here's how I set up mine: https://i.imgur.com/TPXNdwG.png (There's also a gcloud one-liner for this; see the sketch after this list.)
- Or run gcloud auth login (and perhaps gcloud auth application-default login), then log in as yourself
- Or use a service account key which you've set up as a TPU Admin or TPU Viewer: https://cloud.google.com/iam/docs/creating-managing-service-account-keys
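A minimal sketch of option #1 from the command line (YOUR_PROJECT is a placeholder; roles/tpu.admin is the standard Cloud TPU role, with roles/tpu.viewer as the read-only variant):
# On a TPU VM the active account is usually the default service account;
# use a "user:" member prefix instead if you're logged in as yourself.
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:$(gcloud auth list --filter=status:ACTIVE --format='value(account)')" \
  --role="roles/tpu.admin"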
I like option #1, since it lets me control all my TPUs from any of my TPU VMs. But that means anyone with access to any TPU VM can create/delete TPUs in your project. (You could use "TPU Viewer" role instead of "TPU Admin" if you're okay with giving read-only access to all your TPU VMs.)
Here's what a permission error looks like, followed by option #1, at which point the error went away: https://i.imgur.com/XRjKYlZ.png
Option #2 (gcloud auth login) is convenient, but watch out! Anyone who obtains access to that box can run GCP commands as you. Which means they probably have total control over your GCP account.
Option #3 (a custom service account key) is the most secure, but it's a complete pain. On the other hand, you'd be able to give specific VMs "TPU Admin" or "TPU Viewer" roles, which you can't do via any other technique.
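For completeness, a sketch of option #3 (the account name and YOUR_PROJECT are placeholders; google-auth picks up the key via GOOGLE_APPLICATION_CREDENTIALS):
# Create a read-only service account, grant it TPU Viewer, and mint a key.
gcloud iam service-accounts create tpu-viewer-bot
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:tpu-viewer-bot@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/tpu.viewer"
gcloud iam service-accounts keys create key.json \
  --iam-account="tpu-viewer-bot@YOUR_PROJECT.iam.gserviceaccount.com"
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/key.json"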
This is the current list of dependencies for tpunicorn:
install_requires=[
'Click>=7.1.2',
'six>=1.11.0',
'ring>=0.7.3',
'moment>=0.0.10',
'google-auth>=0.11.0',
'google-api-python-client>=1.7.11',
'cachier>=1.5.0',
'braceexpand>=0.1.7',
],
Theoretically any of those projects could inject a backdoor into your TPUs.
And it seems to be a realistic attack vector: https://arstechnica.com/gadgets/2021/03/more-top-tier-companies-targeted-by-new-type-of-potentially-serious-attack/
I plan to address this by reducing the dependency list down to Click (required for CLI) and google-auth / google-api-python-client (required for TPU operations). But for now, that's a TODO.
Lastly, you're trusting me (https://github.com/shawwn) not to backdoor your TPUs. Anyone who compromises my pypi account (https://pypi.org/user/shawwn/) could ship a new version of tpunicorn with a backdoor.
I wish I had a better answer to this than "Trust me." But, you could glance over the code (https://github.com/shawwn/tpunicorn/blob/master/tpunicorn/program.py) and then do a local install:
git clone https://github.com/shawwn/tpunicorn
cd tpunicorn
pip3 install -e .
At that point, ~/.local/bin/pu will point to your local tpunicorn repository.
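One way to sanity-check the editable install (assuming pip's standard show output):
pip3 show tpunicorn | grep -i '^location'   # should print your clone's path
~/.local/bin/pu --version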
I appreciate your time! Thanks for helping to test the tpunicorn pre-release.