Setting up a development environment for machine learning in Google Cloud

Instructions to create a Google Cloud account and set up an environment for machine learning on it.

NOTE: the command line examples assume a Linux environment.

Setting up the cloud environment

Google Cloud environment

At this time (November 2018) Google gives a US$300 credit when opening an account. It may ask for a credit card number when creating the account, but the card won't be billed until the credits are used up.

  1. Create a Google Cloud account at https://cloud.google.com/
  2. Install the Cloud SDK - it will be needed later to transfer files between the virtual machines and your local computer (see the example after this list)
  3. Create a project (all resources you create in Google Cloud, e.g. a virtual machine, must be associated with a project)
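
After installing the Cloud SDK, initialize it and point it at the project you created. A minimal sketch, assuming a hypothetical project ID my-ml-project (use your own project ID):

# Authenticate and choose default settings interactively
gcloud init

# Point the SDK at the project created in step 3
gcloud config set project my-ml-project

# Verify the active configuration
gcloud config list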

Creating a virtual machine suitable for machine learning

Creating an instance (a virtual machine)

GPUs can speed up machine learning tasks (especially training) by an order of magnitude.

We will create a machine with GPU support in Google Cloud. This section has the basic instructions for creating a machine. The next section shows how to add GPU support for the first time.

  1. Choose Compute Engine on the left side
  2. Choose Create Instance at the top
  3. Click on Change in the Boot disk section
  4. Pick an image prepared for machine learning, such as one of the Intel optimized Deep Learning...

Do not create the machine yet. Continue on to the next section to add GPU support.

Adding GPU support for the first time

Adding a GPU for the first time will require a quota update.

Updating the quota is done as part of creating the virtual machine. To continue the creation of the virtual machine and trigger the quota update request:

  1. Click on Customize in the Machine type section
  2. Select a number of GPUs in Number of GPUs (I suggest starting with one)

At this point the machine will be created, but it will fail to start with a quota error.

A GPU quota has to be added to your account before the machine can start.

To request the quota update you will have to first upgrade your account. Google Cloud shows an option to upgrade the account at this point. Go through those steps to enable GPU quotas. It will ask for a credit card, but it won't start billing it yet.

After upgrading the account:

  1. Select Quotas on the left side
  2. Select Global in the Location filter at the top of the list (unselect all, select Global)
  3. Find GPUs (all regions) and select it
  4. Click on Edit quotas at the top of the page
  5. Go through the steps to request the change

GPU access is somewhat restricted in Google Cloud (to prevent abuse, e.g. cryptocurrency mining, I guess).

It will ask for a reason to change the quota. Explain that you are using it for machine learning research. If you are attending college classes related to that, add it to the explanation as well.

Within an hour or two you will get an email from someone in Google support. The email is a generic "we are working on the request", but at the bottom it also invites you to explain in more detail why you asked for the quota change. I replied to it, explaining again that I'm using it to experiment with machine learning for college classes. Within an hour the quota was approved. I can't say replying to the email made a difference, but it's a small gesture that can (potentially) speed up the request. It's worth doing.

Once the quota has been updated you will be able to start the machine.

Making the most of the free credit

Stopping machines when not in use

Machines are billed while they are running. Once you are done with your experiments, explicitly stop the machine to stop billing.
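
The machine can be stopped and started from the Compute Engine page, or from the command line with the Cloud SDK. A minimal sketch, assuming a hypothetical instance named myinstance in zone us-east1-b (use your own instance name and zone):

# Stop the instance - billing for the machine stops, the disk is preserved
gcloud compute instances stop myinstance --zone us-east1-b

# Start it again when needed
gcloud compute instances start myinstance --zone us-east1-b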

Setting up a budget notification

Eventually the U$300 credit will run out.

To avoid the nasty surprise of a large bill, set up a budget and turn on notifications for it.

  1. Click on the "hamburger menu" (three horizontal bars) in the top-left corner to bring up the main menu
  2. Click on Billing
  3. Click on Budgets & alerts
  4. Configure the alerts as you wish

Developing with Google Cloud

Google offers SSH (command line) access to the machines. As far as I know, there is no way to run an IDE directly on the machine (there is a code editor, but it's not the same as running an IDE on the machine you are using for the experiments).

This is the workflow I use:

  1. Create a project in GitHub (see below)
  2. Clone that project on the virtual machines: git clone https://<url to the project>
  3. Make changes to the project on my computer, with a smaller/faster test configuration (e.g. train with fewer epochs, fewer layers, fewer nodes per layer, etc. - see the sketch after this list)
  4. Review results from those changes
  5. If they are satisfactory, push the changes to GitHub
  6. Pull the changes into the machine: git pull
  7. Adjust the code, e.g. increase the number of epochs, number of layers, etc.
  8. Run the experiment on the machine
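
One way to keep the smaller/faster test configuration (step 3) and the full configuration (step 7) in sync is a simple switch in the code. This is only a hypothetical sketch - the flag name and the values are assumptions, adjust them to your model:

# Hypothetical switch between a quick local run and the full cloud run
QUICK_TEST = True  # set to False before running on the virtual machine

epochs = 2 if QUICK_TEST else 50
hidden_layers = 1 if QUICK_TEST else 4
nodes_per_layer = 32 if QUICK_TEST else 512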

Using GitHub to save code changes

All contents from a virtual machine disappear once you delete it.

If you make code changes on the virtual machine, create a repository in GitHub to save those changes.

Then save changes frequently to that repository:

  1. Save the local changes: git add ., followed by git commit -m "<description>"
  2. Move changes to GitHub: git push
  3. Get the changes on your computer: git pull

Running tests in background, while disconnected

With long-running tests it's important to make sure that the tests will keep running even when the SSH connection to the Google Cloud machine is lost.

nohup can be used to ensure that. It detaches the command from the terminal. When the terminal is closed, the program keeps running.

nohup <yourscript> &

Output from the script is stored in the file nohup.out. If you have a very verbose script you may run the risk of running out of disk space. Run the script with a non-verbose option, if it has one, or delete nohup.out every so often.

If you are not interested in the output at all, you can also run it as:

nohup <yourscript> >/dev/null 2>&1 &

If you want to follow what the script is doing, let it write to nohup.out, then inspect it with more nohup.out every so often, or follow it with tail -f nohup.out.
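
If you prefer a named log file over nohup.out, redirect the output yourself. A sketch, assuming a hypothetical training script train.py and log file train.log:

# Run detached, writing all output to train.log instead of nohup.out
nohup python train.py > train.log 2>&1 &

# Follow the log from another SSH session
tail -f train.log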

Specifically for Keras, it's a good idea to change to verbose mode 2. It will display progress for each epoch, but not progress within each epoch. This reduces output by an order of magnitude or more:

model.fit(..., verbose=2)

Analyzing results offline

Analyzing experiment results on the machine itself is not very productive because:

  1. It costs money: the machine will be running, and thus billed
  2. It's inconvenient: the machine doesn't have a graphical interface, so it can't show graphs, e.g. from Python's matplotlib

This section explains how to analyze results offline, on your own computer.

The workflow for offline analysis:

  1. Save results from the experiments in files
  2. Download the files to your computer
  3. Analyze the results on your computer

Saving results from the experiments in files

This section assumes you are using Keras. If you are not using Keras and know how to do the equivalent of what is documented here, I'd appreciate a pull request to enhance this section.

When training and testing a model, there are a few pieces of information we need to save for offline analysis.

Training history

The training history shows how the training and validation loss/accuracy behave during training. This shows if the model is underfitting or overfitting.

To collect the history during training we need to pass validation data to model.fit() and save the History object it returns:

history = model.fit(train_images, train_labels, epochs=p.epochs,
          batch_size=p.batch_size,
          validation_data=(test_images, test_labels), # <- This is needed for history
          verbose=verbose)

Now we can save it for offline analysis, on our local machine:

import json

with open('history.json', 'w') as f:
    json.dump(history.history, f)

Once the file is copied to our local machine (see the section below on how to use gcloud compute scp ... for that), we can load the file and use it:

with open('history.json') as f:
    history = json.load(f)
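
For example, to plot the loss curves on the local machine once history.json has been downloaded - a minimal sketch, assuming the default Keras 'loss' and 'val_loss' keys (the exact keys depend on the metrics you compiled into the model):

import json
import matplotlib.pyplot as plt

with open('history.json') as f:
    history = json.load(f)

# Training vs. validation loss, one point per epoch
plt.plot(history['loss'], label='training loss')
plt.plot(history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()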

The model

Once created and trained, a model can be saved with save():

model.save('model.h5')

The model is loaded back with load_model():

from keras.models import load_model

model = load_model('model.h5')

If the model has been trained, it will include the weights calculated during training. We can now use that model to evaluate it on test data with model.evaluate() or to generate predictions with model.predict().
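
Putting it together on the local machine, a minimal sketch, assuming test_images and test_labels are loaded the same way as during training and the model was compiled with an accuracy metric:

from keras.models import load_model

# Load the trained model copied from the virtual machine
model = load_model('model.h5')

# Evaluate on the test set (returns the loss plus the metrics compiled into the model)
loss, accuracy = model.evaluate(test_images, test_labels, verbose=0)
print('Test loss:', loss, 'Test accuracy:', accuracy)

# Generate predictions for new inputs
predictions = model.predict(test_images)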

Tensorboard data

Tensorboard is Tensorflow's analysis tool. It can display a wealth of detailed data for the training process.

Keras can save data that Tensorboard reads, either in real time during training or offline afterwards.

Data is saved using a Keras callback. See details in the Keras callback documentation.

Note that it generates a lot of data. You may want to keep an eye on the size of the log directory on the remote machine during training.
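
A minimal sketch of the callback, assuming the same training variables used earlier and a hypothetical ./logs directory for the Tensorboard files:

from keras.callbacks import TensorBoard

# Write Tensorboard logs to ./logs on the virtual machine
tensorboard = TensorBoard(log_dir='./logs')

history = model.fit(train_images, train_labels, epochs=p.epochs,
          batch_size=p.batch_size,
          validation_data=(test_images, test_labels),
          callbacks=[tensorboard],
          verbose=verbose)

After downloading the log directory to your computer (see the next section), point Tensorboard at it with tensorboard --logdir ./logs and open the URL it prints.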

Downloading the files to your computer

Google Cloud SDK has an scp command to transfer files to and from virtual machines. Install the Cloud SDK if you haven't done so yet.

Once the Cloud SDK is installed, use the scp command to transfer files from the virtual machine to your computer.

The syntax to transfer one file is:

gcloud compute scp <cloud user>@<instance name>:<remote file name> <local file name>

Example:

gcloud compute scp john_doe@myinstance:~/test.txt .

To make the process faster and simpler, compress files before transferring them.

On the virtual machine:

  • tar cvzf alltxt.tar.gz *.txt (adjust file pattern as needed)

On your local computer:

  • gcloud compute scp john_doe@myinstance:~/alltxt.tar.gz .
  • tar xvzf alltxt.tar.gz
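
To copy a whole directory (for example the Tensorboard log directory) without creating an archive first, gcloud compute scp also accepts a --recurse flag. A sketch, reusing the example instance and assuming the logs are in ~/logs on the virtual machine:

gcloud compute scp --recurse john_doe@myinstance:~/logs ./logs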