Getting Deepracer 'local' training running on Google Cloud Compute
Follow the instructions below, taken from an older version of https://course.fast.ai/start_gcp.html.
Step 1: Creating your account
Cloud computing gives users access to virtual CPU or GPU resources at an hourly rate that depends on the hardware configuration. You can find more information in the Google Cloud Platform documentation. If you don't have a GCP account yet, you can create one here; it comes with $300 of free usage credits.
Potential roadblock: even though GCP provides a $300 initial credit, you must enable billing to use it. You can add a credit card or a bank account, but the latter takes several days to activate.
The project you are going to run the image on needs to be linked to your billing account. To do this, navigate to the billing dashboard, click the '…' menu and choose 'Change billing account'.
Step 2: Install Google CLI
To create and then connect to your instance, you'll need to install Google Cloud's command-line interface (CLI) software. For Windows users, we recommend using the Ubuntu terminal and following the same instructions as Ubuntu users (see the link to learn how to paste into your terminal).
To install on Linux or Windows (in an Ubuntu terminal), follow these four steps:
# Create environment variable for correct distribution
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
# Add the Cloud SDK distribution URI as a package source
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud Platform public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
You can find more details on the installation process here.
To install the Google CLI on macOS, run the following in the terminal:
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
In both cases, once the installation is done, run:
gcloud init
You should then be prompted with this message:
To continue, you must log in. Would you like to log in (Y/n)?
Type Y, then copy the link and paste it into your browser. Choose the Google account you used during step 1, click 'Allow', and you will get a confirmation code to copy and paste back into your terminal.
Then, if you already have more than one project on your GCP account, you'll be prompted to choose one:
Pick cloud project to use:
[1] [my-project-1]
[2] [my-project-2]
...
Please enter your numeric choice:
Just enter the number next to the project you created in step 1. If you just created your account, the project will likely have a randomly generated name for its Project ID. If you select the choice "Create a new project", you will be reminded that you also have to run "gcloud projects create my-project-3".
In order to set a default region you'll need to enable the Compute Engine API; the CLI will output a link you can follow to do this.
Once you've enabled the Compute Engine API, you'll be asked whether you want to choose a default region. Choose us-west1-b if you don't have any particular preference, as it will make the commands to connect to this server easier.
You can modify this later with gcloud config set compute/zone NAME
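For example, to switch the default zone to the one suggested above:
gcloud config set compute/zone us-west1-b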
Once this is done, you should see this message on your terminal:
Your Google Cloud SDK is configured and ready to use!
* Commands that require authentication will use your.email@gmail.com by default
* Commands will reference project `my-project-1` by default
Run `gcloud help config` to learn how to change individual settings
This gcloud configuration is called [default].
But when you get to step 3 of that guide, use these instructions instead:
# Use this instead of the fast.ai image
export IMAGE_FAMILY="tf-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="my-deepracer-instance-test"
export INSTANCE_TYPE="n1-highmem-8" # budget: "n1-highmem-4"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-k80,count=1" \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=200GB \
--metadata="install-nvidia-driver=True" \
--preemptible
# Nip into VPC Network -> Firewall Rules in the GCP console and open ports 9000, 8080, 6379, 8081, 5800, 5901
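# Alternatively, a sketch of opening the same ports from the CLI (the rule name "deepracer-ports"
# is just an example; 0.0.0.0/0 opens the ports to the whole internet, so tighten --source-ranges
# to your own IP if you can):
gcloud compute firewall-rules create deepracer-ports \
    --direction=INGRESS \
    --allow=tcp:9000,tcp:8080,tcp:6379,tcp:8081,tcp:5800,tcp:5901 \
    --source-ranges=0.0.0.0/0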
# Connect via SSH (after about 5 mins to let it build), then run:
# Use the Canada track till I get the v1.1 enhancements working on GCP.
wget https://raw.githubusercontent.com/fmacrae/AI-Learning/master/GCPDeepracerSetup_Canada.sh
bash GCPDeepracerSetup_Canada.sh
It should install everything and then give you the three sets of commands you need to run it.
The first time round, MinIO is already running, so you can skip that line.
The second set of commands runs SageMaker.
Open another terminal/SSH connection to run the third set of commands.
You can then monitor Gazebo on port 8081 via VNC.
You can use screen or nohup to run these in the background in case you disconnect from the VM during training; a sketch is below.
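For example (the session name, placeholder script and log file are just illustrations; substitute the actual command sets the setup script prints):
# run the SageMaker step inside a detachable screen session
screen -S sagemaker          # start a named session and run the second set of commands inside it
# detach with Ctrl-a d, then reattach later with:
screen -r sagemaker
# or run a command set under nohup so it keeps going after you log out
nohup bash second_set_of_commands.sh > ~/sagemaker.log 2>&1 &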
Have a look at the autoShutdown.sh script, which is also found here:
https://github.com/fmacrae/AI-Learning/blob/master/autoShutdown.sh
This is useful if you want your VM to shut down when you disconnect or training completes.
The other option is to just use the shutdown command like this (the parameter is the number of minutes to train for):
sudo shutdown +360
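If you change your mind, the scheduled shutdown can be cancelled with:
sudo shutdown -c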
Feed back if you have any issues. I've tested it a few times and it seems to work OK.
Other useful info:
Also note that, by default, this will restart training from scratch each time you restart the VM.
Refer to this for instructions on reusing a model:
https://github.com/crr0004/deepracer/wiki/Retraining-a-Model
Basically, it tells you to uncomment the pretrained settings:
sed -i 's/#"pretrain/"pretrain/g' ~/deepracer/rl_coach/rl_deepracer_coach_robomaker.py
Then copy the last *.ckpt* files and the checkpoint file from bucket/rl-deepracer-sagemaker/model
to bucket/rl-deepracer-pretrained/model, as sketched below.
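A minimal sketch of that copy, assuming the local bucket lives under ~/deepracer/data/bucket (the same path used further down); the exact checkpoint file names depend on your latest training step, so keep only the most recent ones if you prefer:
SRC=~/deepracer/data/bucket/rl-deepracer-sagemaker/model
DST=~/deepracer/data/bucket/rl-deepracer-pretrained/model
mkdir -p "$DST"
cp "$SRC"/checkpoint "$DST"/
cp "$SRC"/*.ckpt* "$DST"/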
And check out https://github.com/crr0004/deepracer/wiki/Uploading-to-Leaderboard for details on how
to put your model into AWS for racing.
If you find you can't get resources in your region, try opening a Cloud Shell from the GCP console
and running this to see where you can get a K80,
or swap to a better GPU (it costs more but speeds up SageMaker); see the sketch after the example output.
Example:
deepracer_drunkenmonkey@cloudshell:~ (infinite-matter-253420)$ gcloud beta compute accelerator-types list | grep k80
nvidia-tesla-k80 europe-west1-d NVIDIA Tesla K80
...
nvidia-tesla-k80 us-central1-c NVIDIA Tesla K80
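For example, to create the instance with a faster GPU in a zone that has one (the zone and the V100 accelerator type here are only illustrative; check the list above for availability, and note that GPU quota still applies):
export ZONE="us-central1-c"
gcloud compute instances create $INSTANCE_NAME \
    --zone=$ZONE \
    --image-family=$IMAGE_FAMILY \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-v100,count=1" \
    --machine-type=$INSTANCE_TYPE \
    --boot-disk-size=200GB \
    --metadata="install-nvidia-driver=True" \
    --preemptible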
You can move your instance to another zone following these instructions:
https://googlecloud.tips/tips/004-moving-instances-between-zones-in-one-command/
I did another gist to show moving zones a bit better than the instructions above:
https://gist.github.com/fmacrae/623650d7840c70474515e508b9022185
If you want to log into the Docker containers, list the container IDs:
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
463640ecb1e5 crr0004/sagemaker-rl-tensorflow:nvidia "/bin/bash -c 'start…" 6 hours ago Up 6 hours 5800/tcp, 6006/tcp, 6379/tcp tmpw4nwk440_algo-1-3h6l9_1
0e19ac3879fb crr0004/deepracer_robomaker:console "/bin/bash -c './run…" 6 hours ago Up 6 hours 0.0.0.0:8081->5900/tcp dr
The robomaker one is the one you probably want:
docker exec -it 0e19ac3879fb /bin/bash
Logs seem to be stored here:
root@0e19ac3879fb:/app/robomaker-deepracer/simulation_ws/log
and
/root/.ros/log
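For example, to peek at those log directories from the host (using the robomaker container ID from the docker container ls output above):
docker exec -it 0e19ac3879fb ls /app/robomaker-deepracer/simulation_ws/log /root/.ros/log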
The logs you need for log analysis of the actual racing are obtained with this simple command:
docker logs 0e19ac3879fb > ~/aws-deepracer-workshops/log-analysis/logs/my-deepracer-sim-logs.log
Thanks for that one, Tomasz Ptak.
If you want to delete your older .pb files while training, this will help:
cd ~/deepracer/data/bucket/rl-deepracer-sagemaker/model
# refresh the timestamps on the files you want to keep
touch model_metadata.json
touch checkpoint
touch *.ckpt*
# then delete anything not modified in the last 59 minutes, i.e. the older .pb files
find . -mmin +59 -type f -exec rm -fv {} \;
fmacrae commented Dec 9, 2019

Broken again. Going to have to merge this with the main repo.

fmacrae commented Dec 18, 2019

Working again :)
