fmacrae / Deep Racer on GCP
Last active Dec 18, 2019

Getting Deepracer 'local' training running on Google Cloud Compute
Follow the instructions at https://course.fast.ai/start_gcp.html,
but when you get to step 3 use these commands instead:
#Use this instead of the fast.ai image:
export IMAGE_FAMILY="tf-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="my-deepracer-instance-test"
export INSTANCE_TYPE="n1-highmem-8" # budget: "n1-highmem-4"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-k80,count=1" \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=200GB \
--metadata="install-nvidia-driver=True" \
--preemptible
#Nip into VPC Network - Firewall Rules and open ports 9000, 8080, 6379, 8081, 5800, 5901
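#If you prefer the CLI, a rough gcloud equivalent is below (the rule name is just an example and it assumes the default network):
gcloud compute firewall-rules create deepracer-ports \
--direction=INGRESS \
--allow=tcp:9000,tcp:8080,tcp:6379,tcp:8081,tcp:5800,tcp:5901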
#Connect via SSH (give it about 5 mins to finish building) then run the wget below on the VM.
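#To SSH in from your own machine with the gcloud CLI (assumes the variables exported above are still set):
gcloud compute ssh $INSTANCE_NAME --zone=$ZONE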
#Use the Canada track till I get the v1.1 enhancements working on GCP.
wget https://raw.githubusercontent.com/fmacrae/AI-Learning/master/GCPDeepracerSetup_Canada.sh
bash GCPDeepracerSetup_Canada.sh
It should install everything and then print the three sets of commands you need to run.
The first time around minio is already running, so you can skip that line.
The second set of commands runs SageMaker.
Open another terminal/SSH connection to run the third set of commands.
You can then watch the Gazebo simulation over VNC on port 8081.
You can use screen or nohup to run these in the background so training keeps going if you disconnect from the VM.
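For example, with screen you can give each set of commands its own named session (the session name here is just an example), detach with Ctrl-A then D, and reattach later:
screen -S sagemaker
screen -r sagemaker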
Have a look at the autoShutdown.sh script which is also found here:
https://github.com/fmacrae/AI-Learning/blob/master/autoShutdown.sh
This is useful if you want the VM to shut down when you disconnect or when training completes.
Another option is to use the shutdown command like this (the parameter is the number of minutes until the VM shuts down):
sudo shutdown +360
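If you need to cancel a scheduled shutdown:
sudo shutdown -c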
Feed back if you have any issues. I've tested it a few times and it seems to work OK.
Other useful info:
Also note that by default this will restart training from scratch each time you restart the VM.
Refer to this for instructions to reuse a model:
https://github.com/crr0004/deepracer/wiki/Retraining-a-Model
Basically it tells you to uncomment the pretrained settings:
sed -i 's/#"pretrain/"pretrain/g' ~/deepracer/rl_coach/rl_deepracer_coach_robomaker.py
Then copy the latest *.ckpt* files and the checkpoint file from bucket/rl-deepracer-sagemaker/model
to bucket/rl-deepracer-pretrained/model.
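A rough sketch of that copy, assuming the bucket lives under ~/deepracer/data/bucket as in the cleanup snippet further down (adjust the paths if yours differ):
SRC=~/deepracer/data/bucket/rl-deepracer-sagemaker/model
DST=~/deepracer/data/bucket/rl-deepracer-pretrained/model
mkdir -p "$DST"
cp "$SRC/checkpoint" "$DST/"
# copy every file belonging to the newest checkpoint (everything sharing its prefix before ".ckpt")
LATEST=$(ls -t "$SRC"/*.ckpt* | head -n 1)
cp "${LATEST%%.ckpt*}".ckpt* "$DST/"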
And check out https://github.com/crr0004/deepracer/wiki/Uploading-to-Leaderboard for details on how
to put your model into AWS for racing.
If you find you can't get resources in your region, try opening a Cloud Shell from the GCP console
and run this to see which zones offer the K80,
or swap to a better GPU (it costs more but speeds up SageMaker; see the sketch after the zone listing).
Example:
deepracer_drunkenmonkey@cloudshell:~ (infinite-matter-253420)$ gcloud beta compute accelerator-types list | grep k80
nvidia-tesla-k80 europe-west1-d NVIDIA Tesla K80
...
nvidia-tesla-k80 us-central1-c NVIDIA Tesla K80
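If you want a faster GPU, check availability the same way and swap the accelerator flag in the create command above, for example (P100 shown purely as an illustration):
gcloud beta compute accelerator-types list | grep p100
--accelerator="type=nvidia-tesla-p100,count=1"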
You can move your instance to another zone following these instructions:
https://googlecloud.tips/tips/004-moving-instances-between-zones-in-one-command/
I did another gist that shows moving zones a bit better than the instructions above:
https://gist.github.com/fmacrae/623650d7840c70474515e508b9022185
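For reference, the single-command move described in those links looks roughly like this (the destination zone is just the example from the listing above; check the links for caveats such as stopping the instance first):
gcloud compute instances move $INSTANCE_NAME --zone=$ZONE --destination-zone=us-central1-c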
If you want to log into the Docker containers, list the container IDs first:
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
463640ecb1e5 crr0004/sagemaker-rl-tensorflow:nvidia "/bin/bash -c 'start…" 6 hours ago Up 6 hours 5800/tcp, 6006/tcp, 6379/tcp tmpw4nwk440_algo-1-3h6l9_1
0e19ac3879fb crr0004/deepracer_robomaker:console "/bin/bash -c './run…" 6 hours ago Up 6 hours 0.0.0.0:8081->5900/tcp dr
The robomaker one is the one you probably want:
docker exec -it 0e19ac3879fb /bin/bash
Logs seem to be stored here:
root@0e19ac3879fb:/app/robomaker-deepracer/simulation_ws/log
and
/root/.ros/log
The logs you need for log analysis of the actual racing can be grabbed with this simple command:
docker logs 0e19ac3879fb > ~/aws-deepracer-workshops/log-analysis/logs/my-deepracer-sim-logs.log
Thanks for that one, Tomasz Ptak.
If you want to delete older .pb files while training, this will help (the touch commands refresh the timestamps of the files you want to keep, then find removes anything not modified in the last 59 minutes):
cd ~/deepracer/data/bucket/rl-deepracer-sagemaker/model
touch model_metadata.json
touch checkpoint
touch *.ckpt*
find . -mmin +59 -type f -exec rm -fv {} \;
fmacrae commented Dec 9, 2019

Broken again. Going to have to merge this with the main repo.

fmacrae commented Dec 18, 2019

Working again :)
