Instructions to create a Google Cloud account and set up an environment for machine learning on it.
NOTE: the command line examples assume a Linux environment.
At this time (November 2018) Google gives a US$300 credit when opening an account. It may ask for a credit card number when creating the account, but the card won't be charged until the credits are used up.
- Create a Google Cloud account at https://cloud.google.com/
- Install the Cloud SDK - it will be needed later to transfer files between the virtual machines and your local computer
- Create a project (all resources you create in Google Cloud, e.g. a virtual machine, must be associated with a project)
GPUs can speed up machine learning tasks (especially training) by an order of magnitude.
We will create a machine with GPU support in Google Cloud. This section has the basic instructions for creating a machine. The next section shows how to add GPU support for the first time.
- Choose `Compute Engine` on the left side
- Choose `Create Instance` at the top
- Click on `Change` in the `Boot disk` section
- Pick an image prepared for machine learning, such as one of the `Intel optimized Deep Learning...` images
Do not create the machine yet. Continue on to the next section to add GPU support.
Adding a GPU for the first time will require a quota update.
Updating the quota is done as part of creating the virtual machine. To continue the creation of the virtual machine and trigger the quota update request:
- Click on `Customize` in the `Machine type` section
- Select a number of GPUs in `Number of GPUs` (I suggest starting with one)
At this point the machine will be created, but it will fail to start with a quota error. A GPU quota has to be added to your account before the machine can start.
To request the quota update you will have to first upgrade your account. Google Cloud shows an option to upgrade the account at this point. Go through those steps to enable GPU quotas. It will ask for a credit card, but it won't start billing it yet.
After upgrading the account:
- Select `Quotas` on the left side
- Select `Global` in the `Location` filter at the top of the list (unselect all, then select `Global`)
- Find `GPUs (all regions)` and select it
- Click on `Edit quotas` at the top of the page
- Go through the steps to request the change
GPU access is somewhat restricted in Google Cloud (to prevent abuse, e.g. cryptocurrency mining, I guess).
You will be asked for a reason to change the quota. Explain that you are using it for machine learning research. If you are attending college classes related to machine learning, add that to the explanation as well.
Within an hour or two you will get an email from someone in Google support. The email is a generic "we are working on the request", but at the bottom it also invites you to explain in more detail why you asked for the quota change. I replied to it, explaining again that I'm using the GPU to experiment with machine learning for college classes. Within an hour the quota was approved. I can't say replying to the email made a difference, but it's a small gesture that can (potentially) speed up the request. It's worth doing.
Once the quota has been updated you will be able to start the machine.
Machines are billed while running. Once you are done with your experiments, explicitly stop the machine to stop billing, either from the console or with `gcloud compute instances stop <instance name>`.
Eventually the US$300 credit will run out.
To avoid the nasty surprise of a large bill, set up a budget and turn on notifications for it.
- Click on the "hamburger menu" (three horizontal bars) in the top-left corner to bring up the main menu
- Click on `Billing`
- Click on `Budgets & alerts`
- Configure the alerts as you wish
Google offers SSH (command-line) access to the machines. There is no way that I know of to run an IDE directly on the machine (there is a code editor, but it's not the same as running an IDE on the machine you are using for the experiments).
This is the workflow I use:
- Create a project in GitHub (see below)
- Clone that project on the virtual machine: `git clone https://<url to the project>`
- Make changes to the project on my computer, with a smaller/faster test configuration (e.g. train with fewer epochs, fewer layers, fewer nodes per layer, etc.)
- Review the results from those changes
- If they are satisfactory, push the changes to GitHub
- Pull the changes into the virtual machine: `git pull`
- Adjust the code, e.g. increase the number of epochs, number of layers, etc.
- Run the experiment on the machine
All contents of a virtual machine disappear once you delete it.
If you make code changes on the virtual machine, create a repository in GitHub to save those changes.
Then save changes frequently to that repository:
- Save the local changes: `git add .`, followed by `git commit -m "<description>"`
- Move the changes to GitHub: `git push`
- Get the changes on your computer: `git pull`
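The commit and pull steps of this workflow can be exercised end to end without touching GitHub, using a throwaway local repository as a stand-in for the remote (all names here are placeholders; the `git push` leg needs a real remote and is skipped in this sketch):

```shell
# Stand-in for the shared repository (a plain local repo instead of GitHub)
mkdir laptop && cd laptop && git init -q
echo "epochs = 100" > config.py
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Initial experiment config"
cd ..

# Clone it, as the virtual machine would clone your GitHub project
git clone -q laptop vm

# Commit a new change on the "laptop" side...
cd laptop
echo "layers = 4" >> config.py
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add layer count"
cd ..

# ...and pull it into the "virtual machine" clone
cd vm && git pull -q && cd ..
cat vm/config.py
```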
With long-running tests it's important to make sure that the tests will keep running even when the SSH connection to the Google Cloud machine is lost.
`nohup` can be used to ensure that. It detaches the command from the terminal: when the terminal is closed, the program keeps running.

`nohup <yourscript> &`
Output from the script is stored in the file `nohup.out`. If you have a very verbose script you may run the risk of running out of disk space. Run the script with a non-verbose option if it has one, or truncate `nohup.out` every so often with `truncate -s 0 nohup.out` (simply deleting the file does not free the space while the script is still running, because the script keeps it open).
If you are not interested in the output at all, you can also run it as:

`nohup <yourscript> >/dev/null 2>&1 &`
If you want to follow what the script is doing, let it write to `nohup.out`, then inspect it with `more nohup.out` every so often, or follow it with `tail -f nohup.out`.
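As a concrete, runnable sketch of this pattern (the training script here is a stand-in for your own):

```shell
# Stand-in for a long-running training script
printf '#!/bin/sh\necho "epoch 1 done"\necho "training finished"\n' > train.sh
chmod +x train.sh

# Start it detached from the terminal, logging to train.log instead of nohup.out
nohup ./train.sh > train.log 2>&1 &

# Record the background PID so the job can be checked or killed later
echo $! > train.pid

# Wait only for this demo; normally you would log out and come back later
wait

# Inspect the log (or follow it live with: tail -f train.log)
cat train.log
```

Redirecting to a named log file (instead of the default `nohup.out`) makes it easier to keep logs from several experiments apart.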
Specifically for Keras, it's a good idea to change to verbose mode 2. It will display progress for each epoch, but not progress within each epoch. This reduces output by an order of magnitude or more:

`model.fit(..., verbose=2)`
Analyzing experiment results on the machine itself is not very productive because:
- It costs money: the machine will be running, and thus billed
- It's inconvenient: it doesn't have a graphical interface, so it can't show graphs from e.g. Python's `matplotlib`
This section explains how to analyze results offline, on your own computer.
The workflow for offline analysis:
- Save results from the experiments in files
- Download the files to your computer
- Analyze the results on your computer
This section assumes you are using Keras. If you are not using Keras and know how to do the equivalent of what is documented here, I'd appreciate a pull request to enhance this section.
When training and testing a model, there are a few pieces of information we need to save for offline analysis.
The training history shows how training and validation loss/accuracy behave during training. This shows whether the model is underfitting or overfitting.
To collect the history during training, we need to pass validation data to `model.fit()` and save the `History` object it returns:
```python
history = model.fit(train_images, train_labels, epochs=p.epochs,
                    batch_size=p.batch_size,
                    validation_data=(test_images, test_labels),  # <- needed for history
                    verbose=verbose)
```
Now we can save it for offline analysis on our local machine:

```python
import json

with open('history.json', 'w') as f:
    json.dump(history.history, f)
```
Once the file is copied to our local machine (see the section below on how to use `gcloud compute scp` for that), we can load the file and use it:

```python
import json

with open('history.json') as f:
    history = json.load(f)
```
Once created and trained, a model can be saved with `save()`:

`model.save('model.h5')`
The model is loaded with `load_model()`:

```python
from keras.models import load_model

model = load_model('model.h5')
```
If the model has been trained, it will include the weights calculated during training. We can now use that model to make predictions with `model.predict()`, or measure its performance with `model.evaluate()`.
TensorBoard is TensorFlow's analysis tool. It can display a wealth of detailed data about the training process.
Keras can save data for TensorBoard to read, in real time or offline.
Data is saved using a Keras callback. See details in the Keras callbacks documentation.
Note that it generates a lot of data. You may want to monitor the size of the log directory on the remote machine during training.
The Google Cloud SDK has an `scp` command to transfer files to and from virtual machines. Install the Cloud SDK if you haven't done so yet.
Once the Cloud SDK is installed, use the `scp` command to transfer a file from the virtual machine to your computer.
The syntax to transfer one file is:

`gcloud compute scp <cloud user>@<instance name>:<remote file name> <local file name>`

Example:

`gcloud compute scp john_doe@myinstance:~/test.txt .`
To make the process faster and simpler, compress files before transferring them.
On the virtual machine:
`tar cvzf alltxt.tar.gz *.txt` (adjust the file pattern as needed)
On your local computer:
`gcloud compute scp john_doe@myinstance:~/alltxt.tar.gz .`
`tar xvzf alltxt.tar.gz`
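The compress/extract round trip can be checked end to end; the `scp` step is skipped here, and the result files are made up for the demonstration:

```shell
# On the virtual machine: collect the result files into one archive
echo "run 1: loss 0.52" > run1.txt
echo "run 2: loss 0.48" > run2.txt
tar czf alltxt.tar.gz *.txt

# Optional sanity check: list the archive contents without extracting
tar tzf alltxt.tar.gz

# On your local computer (after gcloud compute scp): extract into a clean directory
mkdir -p results
tar xzf alltxt.tar.gz -C results
ls results
```

Extracting into a dedicated directory (`-C results`) keeps downloaded experiment output from mixing with your own files.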