Last active: October 31, 2022
Getting Deepracer 'local' training running on Google Cloud Compute
Follow the instructions from the older version of https://course.fast.ai/start_gcp.html
Step 1: Creating your account
Cloud computing gives users access to virtual CPU or GPU resources at an hourly rate that depends on the hardware configuration. You can find more information in the Google Cloud Platform documentation. If you don't have a GCP account yet, you can create one here, which comes with $300 worth of usage credits for free.
Potential roadblock: even though GCP provides a $300 initial credit, you must enable billing to use it. You can add a credit card or a bank account, but the latter takes several days to activate.
The project on which you are going to run the image needs to be linked to your billing account. To do this, navigate to the billing dashboard, click the '…' menu and choose 'Change billing account'.
Step 2: Install the Google Cloud CLI
To create and then connect to your instance, you'll need to install Google Cloud's command line interface (CLI) software. For Windows users, we recommend using the Ubuntu terminal and following the same instructions as Ubuntu users (see the link to learn how to paste into your terminal).
To install on Linux or Windows (in an Ubuntu terminal), follow these four steps:
# Create environment variable for correct distribution
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
# Add the Cloud SDK distribution URI as a package source
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud Platform public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
You can find more details on the installation process here.
To install the Google Cloud CLI on macOS, run this in the terminal:
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
In both cases, once the installation is done, run:
gcloud init
You should then be prompted with this message:
To continue, you must log in. Would you like to log in (Y/n)?
Type Y, then copy the link and paste it into your browser. Choose the Google account you used during step 1, click 'Allow', and you will get a confirmation code to copy and paste into your terminal.
Then, if you already have more than one project on your GCP account, you'll be prompted to choose one:
Pick cloud project to use:
[1] [my-project-1]
[2] [my-project-2]
...
Please enter your numeric choice:
Just enter the number next to the project you created in step 1. If you just created your account, the project will likely have a randomly generated name for its Project ID. If you select 'Create a new project', you will be reminded that you also have to run gcloud projects create my-project-3.
In order to set a default region you'll need to enable the Compute Engine API; the CLI will output a link you can follow to do this.
Once you've enabled the Compute Engine API, you'll be asked whether you want to choose a default region. Choose us-west1-b if you don't have any particular preference, as it will make the command to connect to this server easier.
You can modify this later with gcloud config set compute/zone NAME
Once this is done, you should see this message in your terminal:
Your Google Cloud SDK is configured and ready to use!
* Commands that require authentication will use your.email@gmail.com by default
* Commands will reference project `my-project-1` by default
Run `gcloud help config` to learn how to change individual settings
This gcloud configuration is called [default].
But when you get to step 3 of the fast.ai guide, use these instructions instead:
# Use this instead of the fast.ai image
export IMAGE_FAMILY="tf-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="my-deepracer-instance-test"
export INSTANCE_TYPE="n1-highmem-8" # budget: "n1-highmem-4"
gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-k80,count=1" \
  --machine-type=$INSTANCE_TYPE \
  --boot-disk-size=200GB \
  --metadata="install-nvidia-driver=True" \
  --preemptible
# Nip into VPC Network -> Firewall Rules and open ports 9000, 8080, 6379, 8081, 5800, 5901
# Connect via SSH (after about 5 mins to let it build), then run:
# Use the Canada track till I get the v1.1 enhancements working on GCP.
wget https://raw.githubusercontent.com/fmacrae/AI-Learning/master/GCPDeepracerSetup_Canada.sh
bash GCPDeepracerSetup_Canada.sh
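The firewall step mentioned above can also be done from the command line instead of the console. A sketch, assuming the default network; the rule name deepracer-ports is a hypothetical choice:

```shell
# Hypothetical rule name; opens the ports the setup uses.
# Consider restricting --source-ranges to your own IP rather than 0.0.0.0/0.
gcloud compute firewall-rules create deepracer-ports \
  --allow=tcp:9000,tcp:8080,tcp:6379,tcp:8081,tcp:5800,tcp:5901 \
  --source-ranges=0.0.0.0/0
```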
It should install everything and then print the three sets of commands you need to run.
The first time through, minio is already running, so you can skip that line.
The second set of commands runs SageMaker.
Open another terminal/SSH connection to run the third set of commands.
You can then monitor Gazebo on port 8081 via VNC.
You can use screen or nohup to run these in the background in case you disconnect from the VM during training.
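For example, nohup keeps a command running after you disconnect. A minimal sketch, where the bash -c '...' part stands in for one of the real command sets printed by the setup script:

```shell
# Run a stand-in "training" command immune to hangups, logging to a file.
# Replace the quoted command with the real command set from the setup script.
nohup bash -c 'echo training started; sleep 2; echo training done' > training.log 2>&1 &
echo $! > training.pid      # keep the PID so you can kill or wait on it later
wait "$(cat training.pid)"  # in real use you would simply disconnect here
cat training.log
```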
Have a look at the autoShutdown.sh script, which is also found here:
https://github.com/fmacrae/AI-Learning/blob/master/autoShutdown.sh
This is useful if you want your VM to shut down when you disconnect or training completes.
Another option is to just use the shutdown command like this (the argument is the number of minutes until shutdown, i.e. how long to train):
sudo shutdown +360
Feed back if you have any issues. I've tested it a few times and it seems to work OK.
Other useful info:
Also note, by default this will restart training from scratch each time you restart the VM.
Refer to this for instructions on reusing a model:
https://github.com/crr0004/deepracer/wiki/Retraining-a-Model
Basically it tells you to uncomment the pretrained settings:
sed -i 's/#"pretrain/"pretrain/g' ~/deepracer/rl_coach/rl_deepracer_coach_robomaker.py
Then copy the latest *.ckpt* files and the checkpoint file from bucket/rl-deepracer-sagemaker/model
to bucket/rl-deepracer-pretrained/model.
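The copy step above can be sketched as follows, assuming the default crr0004/deepracer layout under ~/deepracer/data (paths may differ on your setup). The first touch lines are demo scaffolding so the sketch runs standalone; on a real VM those files already exist and you would skip them:

```shell
SRC=~/deepracer/data/bucket/rl-deepracer-sagemaker/model
DST=~/deepracer/data/bucket/rl-deepracer-pretrained/model

# Demo scaffolding only: fake checkpoint files so this runs standalone.
mkdir -p "$SRC"
touch "$SRC/checkpoint" "$SRC/model_10.ckpt.index" \
      "$SRC/model_10.ckpt.meta" "$SRC/model_10.ckpt.data-00000-of-00001"

# Promote the newest SageMaker checkpoint to the pretrained bucket.
mkdir -p "$DST"
cp "$SRC/checkpoint" "$DST/"   # checkpoint index file
# A TF checkpoint is usually several files sharing one *.ckpt* prefix;
# copy the three newest, which together form the latest checkpoint.
for f in $(ls -t "$SRC"/*.ckpt* | head -n 3); do
  cp "$f" "$DST/"
done
ls "$DST"
```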
And check out https://github.com/crr0004/deepracer/wiki/Uploading-to-Leaderboard for details on how to put your model into AWS for racing.
If you find you can't get resources in your region, try opening a Cloud Shell from the GCP console and running this to see where you can get a K80. Or swap to a better GPU (it costs more but speeds up SageMaker).
Example:
deepracer_drunkenmonkey@cloudshell:~ (infinite-matter-253420)$ gcloud beta compute accelerator-types list | grep k80
nvidia-tesla-k80 europe-west1-d NVIDIA Tesla K80
...
nvidia-tesla-k80 us-central1-c NVIDIA Tesla K80
You can move your instance to another zone by following these instructions:
https://googlecloud.tips/tips/004-moving-instances-between-zones-in-one-command/
I did another gist that shows moving zones a bit better than the instructions above:
https://gist.github.com/fmacrae/623650d7840c70474515e508b9022185
If you want to log into the Docker containers, list the container IDs:
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
463640ecb1e5 crr0004/sagemaker-rl-tensorflow:nvidia "/bin/bash -c 'start…" 6 hours ago Up 6 hours 5800/tcp, 6006/tcp, 6379/tcp tmpw4nwk440_algo-1-3h6l9_1
0e19ac3879fb crr0004/deepracer_robomaker:console "/bin/bash -c './run…" 6 hours ago Up 6 hours 0.0.0.0:8081->5900/tcp dr
The robomaker one is probably the one you want:
docker exec -it 0e19ac3879fb /bin/bash
Logs seem to be stored here:
root@0e19ac3879fb:/app/robomaker-deepracer/simulation_ws/log
and
/root/.ros/log
The logs you need for log analysis of the actual racing can be grabbed with this simple command:
docker logs 0e19ac3879fb > ~/aws-deepracer-workshops/log-analysis/logs/my-deepracer-sim-logs.log
Thanks for that one, Tomasz Ptak.
If you want to delete your older .pb files while training, this will help:
cd ~/deepracer/data/bucket/rl-deepracer-sagemaker/model
# Refresh the timestamps on the files you want to keep
touch model_metadata.json
touch checkpoint
touch *.ckpt*
# Then delete everything not modified in the last 59 minutes
find . -mmin +59 -type f -exec rm -fv {} \;