Last active: October 31, 2022
Getting Deepracer 'local' training running on Google Cloud Compute
Follow the instructions from the older version of https://course.fast.ai/start_gcp.html
Step 1: Creating your account
Cloud computing gives users access to virtual CPU or GPU resources at an hourly rate that depends on the hardware configuration. You can find more information in the Google Cloud Platform documentation. If you don't have a GCP account yet, you can create one here, which comes with $300 worth of usage credits for free.
Potential roadblock: even though GCP provides a $300 initial credit, you must enable billing to use it. You can add a credit card or a bank account, but the latter takes several days to activate.
The project on which you are going to run the image needs to be linked to your billing account. To do this, navigate to the billing dashboard, click the '…' menu and choose 'Change billing account'.
Step 2: Install the Google Cloud CLI
To create and then connect to your instance, you'll need to install Google Cloud's command line interface (CLI) software. For Windows users, we recommend using the Ubuntu terminal and following the same instructions as Ubuntu users (see the link to learn how to paste into your terminal).
To install on Linux or Windows (in an Ubuntu terminal), follow these four steps:
# Create environment variable for correct distribution
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
# Add the Cloud SDK distribution URI as a package source
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud Platform public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
You can find more details on the installation process here.
To install the Google Cloud CLI on macOS, run this in the terminal:
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
In both cases, once the installation is done, run:
gcloud init
You should then be prompted with this message:
To continue, you must log in. Would you like to log in (Y/n)?
Type Y, then copy the link and paste it into your browser. Choose the Google account you used during step 1, click 'Allow', and you will get a confirmation code to copy and paste into your terminal.
Then, if you already have more than one project on your GCP account, you'll be prompted to choose one:
Pick cloud project to use:
[1] [my-project-1]
[2] [my-project-2]
...
Please enter your numeric choice:
Just enter the number next to the project you created in step 1. If you just created your account, the project will likely have a randomly generated name for its Project ID. If you select 'Create a new project', you will be reminded that you also have to run gcloud projects create my-project-3.
In order to set a default region you'll need to enable the Compute Engine API; the CLI will output a link you can follow to do this.
Once you've enabled the Compute Engine API, you'll be asked whether you want to choose a default region. Choose us-west1-b if you don't have any particular preference, as it will make the command to connect to this server easier.
You can modify this later with gcloud config set compute/zone NAME
Once this is done, you should see this message in your terminal:
Your Google Cloud SDK is configured and ready to use!
* Commands that require authentication will use your.email@gmail.com by default
* Commands will reference project `my-project-1` by default
Run `gcloud help config` to learn how to change individual settings
This gcloud configuration is called [default].
But when you get to step 3 of the fast.ai guide, use these instructions instead:
# Use this instead of the fast.ai image
export IMAGE_FAMILY="tf-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="my-deepracer-instance-test"
export INSTANCE_TYPE="n1-highmem-8" # budget: "n1-highmem-4"
gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-k80,count=1" \
  --machine-type=$INSTANCE_TYPE \
  --boot-disk-size=200GB \
  --metadata="install-nvidia-driver=True" \
  --preemptible
# Nip into VPC Network -> Firewall Rules and open ports 9000, 8080, 6379, 8081, 5800, 5901
# Connect via SSH (after about 5 mins to let it build), then run:
# Use the Canada track till I get the v1.1 enhancements working on GCP.
wget https://raw.githubusercontent.com/fmacrae/AI-Learning/master/GCPDeepracerSetup_Canada.sh
bash GCPDeepracerSetup_Canada.sh
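The firewall step mentioned above can also be done from the command line instead of the console. A sketch, assuming the default network; the rule name deepracer-ports is a hypothetical choice:

```shell
# Hypothetical rule name; opens the ports the setup uses.
# Consider restricting --source-ranges to your own IP rather than 0.0.0.0/0.
gcloud compute firewall-rules create deepracer-ports \
  --allow=tcp:9000,tcp:8080,tcp:6379,tcp:8081,tcp:5800,tcp:5901 \
  --source-ranges=0.0.0.0/0
```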
It should install everything and then print the three sets of commands you need to run.
The first time through, minio is already running, so you can skip that line.
The second set of commands runs SageMaker.
Open another terminal/SSH connection to run the third set of commands.
You can then monitor Gazebo on port 8081 via VNC.
You can use screen or nohup to run these in the background in case you disconnect from the VM during training.
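For example, nohup keeps a command running after you disconnect. A minimal sketch, where the bash -c '...' part stands in for one of the real command sets printed by the setup script:

```shell
# Run a stand-in "training" command immune to hangups, logging to a file.
# Replace the quoted command with the real command set from the setup script.
nohup bash -c 'echo training started; sleep 2; echo training done' > training.log 2>&1 &
echo $! > training.pid      # keep the PID so you can kill or wait on it later
wait "$(cat training.pid)"  # in real use you would simply disconnect here
cat training.log
```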
Have a look at the autoShutdown.sh script, which is also found here:
https://github.com/fmacrae/AI-Learning/blob/master/autoShutdown.sh
This is useful if you want your VM to shut down when you disconnect or training completes.
Another option is to just use the shutdown command like this (the argument is the number of minutes until shutdown, i.e. how long to train):
sudo shutdown +360
Feed back if you have any issues. I've tested it a few times and it seems to work OK.
Other useful info:
Also note, by default this will restart training from scratch each time you restart the VM.
Refer to this for instructions on reusing a model:
https://github.com/crr0004/deepracer/wiki/Retraining-a-Model
Basically it tells you to uncomment the pretrained settings:
sed -i 's/#"pretrain/"pretrain/g' ~/deepracer/rl_coach/rl_deepracer_coach_robomaker.py
Then copy the latest *.ckpt* files and the checkpoint file from bucket/rl-deepracer-sagemaker/model
to bucket/rl-deepracer-pretrained/model.
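The copy step above can be sketched as follows, assuming the default crr0004/deepracer layout under ~/deepracer/data (paths may differ on your setup). The first touch lines are demo scaffolding so the sketch runs standalone; on a real VM those files already exist and you would skip them:

```shell
SRC=~/deepracer/data/bucket/rl-deepracer-sagemaker/model
DST=~/deepracer/data/bucket/rl-deepracer-pretrained/model

# Demo scaffolding only: fake checkpoint files so this runs standalone.
mkdir -p "$SRC"
touch "$SRC/checkpoint" "$SRC/model_10.ckpt.index" \
      "$SRC/model_10.ckpt.meta" "$SRC/model_10.ckpt.data-00000-of-00001"

# Promote the newest SageMaker checkpoint to the pretrained bucket.
mkdir -p "$DST"
cp "$SRC/checkpoint" "$DST/"   # checkpoint index file
# A TF checkpoint is usually several files sharing one *.ckpt* prefix;
# copy the three newest, which together form the latest checkpoint.
for f in $(ls -t "$SRC"/*.ckpt* | head -n 3); do
  cp "$f" "$DST/"
done
ls "$DST"
```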
And check out https://github.com/crr0004/deepracer/wiki/Uploading-to-Leaderboard for details on how to put your model into AWS for racing.
If you find you can't get resources in your region, try opening a Cloud Shell from the GCP console and running this to see where you can get a K80. Or swap to a better GPU (it costs more but speeds up SageMaker).
Example:
deepracer_drunkenmonkey@cloudshell:~ (infinite-matter-253420)$ gcloud beta compute accelerator-types list | grep k80
nvidia-tesla-k80 europe-west1-d NVIDIA Tesla K80
...
nvidia-tesla-k80 us-central1-c NVIDIA Tesla K80
You can move your instance to another zone by following these instructions:
https://googlecloud.tips/tips/004-moving-instances-between-zones-in-one-command/
I did another gist that shows moving zones a bit better than the instructions above:
https://gist.github.com/fmacrae/623650d7840c70474515e508b9022185
If you want to log into the Docker containers, list the container IDs:
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
463640ecb1e5 crr0004/sagemaker-rl-tensorflow:nvidia "/bin/bash -c 'start…" 6 hours ago Up 6 hours 5800/tcp, 6006/tcp, 6379/tcp tmpw4nwk440_algo-1-3h6l9_1
0e19ac3879fb crr0004/deepracer_robomaker:console "/bin/bash -c './run…" 6 hours ago Up 6 hours 0.0.0.0:8081->5900/tcp dr
The robomaker one is probably the one you want:
docker exec -it 0e19ac3879fb /bin/bash
Logs seem to be stored here:
root@0e19ac3879fb:/app/robomaker-deepracer/simulation_ws/log
and
/root/.ros/log
The logs you need for log analysis of the actual racing can be grabbed with this simple command:
docker logs 0e19ac3879fb > ~/aws-deepracer-workshops/log-analysis/logs/my-deepracer-sim-logs.log
Thanks for that one, Tomasz Ptak.
If you want to delete your older .pb files while training, this will help:
cd ~/deepracer/data/bucket/rl-deepracer-sagemaker/model
# Refresh the timestamps on the files you want to keep
touch model_metadata.json
touch checkpoint
touch *.ckpt*
# Then delete everything not modified in the last 59 minutes
find . -mmin +59 -type f -exec rm -fv {} \;