- Write your training script so that it can be killed, then automatically resume from the beginning of the current epoch when restarted. (See `train-example.py` for an example training loop incorporating these recommendations.)
- Save checkpoints at every epoch (see `utils.py` for the `save_training_state` helper function), including:
    - model(s)
    - optimizer(s)
    - any hyperparameter schedules. I usually write the epoch number to a JSON file and compute the hyperparameter schedules as a function of the epoch number.
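As a sketch of that last point, a schedule computed as a pure function of the epoch number needs nothing but the epoch to resume (the function names and decay values here are illustrative, not from the gist):

```python
import json
from pathlib import Path


def learning_rate_for_epoch(epoch, base_lr=1e-3, decay=0.5, step_size=10):
    """Step decay: halve the learning rate every `step_size` epochs.

    Because the rate is a pure function of the epoch number, restoring the
    schedule after a restart only requires knowing the epoch.
    """
    return base_lr * decay ** (epoch // step_size)


def save_epoch(path, epoch):
    # Persisting just the epoch number is enough to rebuild the schedule.
    Path(path).write_text(json.dumps({"epoch": epoch}))


def load_epoch(path, default=0):
    p = Path(path)
    return json.loads(p.read_text())["epoch"] if p.exists() else default
```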
- At the beginning of training, check for any saved training checkpoints and load all relevant info (models, optimizers, hyperparameter schedules). (See `utils.py` for the `load_training_state` helper function.)
- Consider using smaller epochs by limiting the number of batches pulled from your (shuffled) dataloader during each epoch.
    - This will cause your training to be checkpointed more often, so if your spot instance is shut down, you limit the amount of lost training.
    - When using shorter epochs, consider computing validation metrics only every 3-5 epochs.
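One way to cap the batches per epoch is with `itertools.islice`; this sketch works with any iterable dataloader:

```python
from itertools import islice


def limited_epoch(dataloader, max_batches):
    """Yield at most `max_batches` batches from the dataloader.

    With a shuffled dataloader, each shortened epoch still sees a fresh
    random subset of the data, and a checkpoint lands after every
    `max_batches` batches instead of after the full dataset.
    """
    yield from islice(iter(dataloader), max_batches)
```

In the training loop, `for batch in limited_epoch(loader, 500): ...` replaces iterating over the full loader.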
- Test by running your script, killing it in the middle of the 2nd epoch, and restarting your script with the same command. Verify that it loads the model checkpoint and continues at the beginning of the second epoch.
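The real helpers live in `utils.py`; as a rough, framework-agnostic sketch of the save/load pattern (the actual implementation, e.g. using `torch.save`, may differ, and the checkpoint directory name here is an assumption):

```python
import json
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # hypothetical location; utils.py may differ


def save_training_state_sketch(epoch, state_bytes):
    """Write the checkpoint first, then the epoch marker.

    Writing the small epoch file last means a kill mid-save leaves the
    previous epoch marker intact, so a restart resumes from a complete
    checkpoint rather than a torn one.
    """
    CKPT_DIR.mkdir(exist_ok=True)
    (CKPT_DIR / "state.bin").write_bytes(state_bytes)  # stand-in for torch.save(...)
    (CKPT_DIR / "epoch.json").write_text(json.dumps({"epoch": epoch}))


def load_training_state_sketch():
    """Return (start_epoch, state_bytes), or (0, None) when no checkpoint exists."""
    marker = CKPT_DIR / "epoch.json"
    if not marker.exists():
        return 0, None
    epoch = json.loads(marker.read_text())["epoch"]
    # Saved epoch N means epoch N completed; resume at the next one.
    return epoch + 1, (CKPT_DIR / "state.bin").read_bytes()
```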
- Make sure your training script will not continue training if restarted after training has already completed.
- Your instance will shut down when your training script completes (assuming you follow the rest of these instructions), and AWS will then attempt to restart or relaunch your instance in order to maintain your spot "fleet". When this happens, you would like the training script to recognize that training is completed and return quickly, so that your instance will shut down again. AWS will give up on maintaining your fleet after it fails to start up a few times (i.e. when it shuts down right away after starting up).
- One idea is to base your training schedule on the epoch number (as mentioned above), and then make sure that your training script will abort if restarted at the final epoch. If you follow the other examples here, it will already work like that, but if you do it a different way (or use early stopping, for example), you will need to set this up another way.
- Another idea is to write a final checkpoint `checkpoint_final.pt` after your training completes, or even just write an empty file `training_completed.txt`, and check for these files before launching training.
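That check can be sketched as follows (file and directory names match the ones suggested above, but are otherwise arbitrary):

```python
from pathlib import Path


def training_already_completed(ckpt_dir="checkpoints"):
    """Return True if a previous run left a completion marker behind.

    Call this at the top of the training script and exit immediately if it
    returns True, so a relaunched spot instance shuts straight back down.
    """
    d = Path(ckpt_dir)
    return (d / "checkpoint_final.pt").exists() or (d / "training_completed.txt").exists()


def mark_training_completed(ckpt_dir="checkpoints"):
    """Drop an empty marker file after the final epoch finishes."""
    d = Path(ckpt_dir)
    d.mkdir(exist_ok=True)
    (d / "training_completed.txt").touch()
```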
- Create a machine image (AMI) with your repo and training data on it, OR put your code on GitHub and your data in S3 so that you can download them on startup.
- Make sure `git status` doesn't show any changes which need to be committed; verify that you can check out a different branch or commit without git complaining.
- Make sure that any training checkpoints are removed; verify that if you were to restart your training script now, it would start at epoch 0.
- Make sure that git credentials are available on the system if you need to push to or pull from a remote git repository at any point. Use `git config --global credential.helper store`, or inject your username and developer access token into the remote URL, e.g. `https://username:abc123@github.com/orgname/reponame.git`.
- Make sure that AWS credentials are available on the system (at `~/.aws/credentials`) if you need to push to or pull from S3.
- Open the AWS EC2 service console.
- Go to `Spot Requests` in the left sidebar menu, then click `Request Spot Instances` in the header.
- Select `Load balancing workloads`.
- Choose your AMI (click `Search for AMI`).
- Ignore the `Minimum Compute Unit` settings.
- Choose your key pair.
- Expand the `Additional Configurations` section.
    - Leave the `Delete` column checked next to your volume.
    - Check the box for `EBS-optimized` instances if read/write performance is important to your application.
    - Add security groups for SSH, tensorboard, etc.
    - Add tags such as `project=lulc`, `owner=collin`.
- Find the `User Data` field at the bottom, which is where you enter the startup script for your instance.
    - An example startup script is provided in this gist (see `user-data-template.txt` and `user-data-example.sh`).
    - Start by pasting the contents of `user-data-template.txt` as plain text.
    - Then scroll to the bottom and replace the contents of the `userdata.txt` file, starting with `#!/bin/bash` and up to but not including `--//`, with your customized bash script, based on the example provided in `user-data-example.sh`.
    - Make sure the first line in your script is `#!/bin/bash` (or an alternative shebang).
    - Make sure the last line in your script is `sudo shutdown`. This ensures that the instance will shut down after your training is completed. When AWS attempts to restart or relaunch your instance, the training should hopefully already be complete, so the instance will quickly shut down again.
    - Make sure the last line in the textarea, after your script, is `--//`.
    - This clunky format ensures that your script is run every time the instance is restarted. If you directly paste in the contents of `user-data-example.sh` without the template from `user-data-template.txt`, then your User Data script will only run when the instance is launched (for the first time).
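The template's clunky format is cloud-init's MIME multi-part convention: a `text/cloud-config` part sets `cloud_final_modules` to `[scripts-user, always]` so the user-data script runs on every boot, and a `text/x-shellscript` part carries the script itself. As a sketch, the same structure can be generated programmatically; the part file names and details here are assumptions, and `user-data-template.txt` in the gist remains the authoritative format:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(script: str) -> str:
    """Wrap a shell script in cloud-init's MIME multi-part user-data format
    so that it runs on every boot, not just the first launch."""
    msg = MIMEMultipart(boundary="//")  # produces the --// separators
    # Re-enable the scripts-user module on every boot ("always").
    cloud_config = MIMEText(
        "#cloud-config\ncloud_final_modules:\n- [scripts-user, always]\n",
        "cloud-config",
    )
    cloud_config.add_header("Content-Disposition", "attachment", filename="cloud-config.txt")
    msg.attach(cloud_config)
    # The shell script itself (must start with #!/bin/bash or similar).
    shellscript = MIMEText(script, "x-shellscript")
    shellscript.add_header("Content-Disposition", "attachment", filename="userdata.txt")
    msg.attach(shellscript)
    return msg.as_string()
```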
- Target capacity
    - How many spot instances do you want to keep running simultaneously? Typically the answer is `1`, unless you're trying to do distributed training across multiple instances.
    - Check the box for `Maintain target capacity`.
    - Change interruption behavior to `Stop` for easier debugging via access to boot logs, which will contain any console logs from execution of your User Data script. Also useful in case there were any hiccups in persisting your checkpoints after training.
- Fleet request settings
    - Uncheck `Apply Recommendations`.
    - Remove all recommended instance types.
    - Click `Select Instance Types` and add your preferred instance type(s).
    - For `Fleet allocation strategy`, choose `Lowest price`.
- Click 🎉
- What's next?
- Monitor your spot requests in the `Spot Requests` section in the left sidebar of the EC2 console. Expand the spot request to see launched instances, or select it to see logs, status, and other details in the bottom pane.
- Any instance(s) launched by your spot requests also appear in the `Instances` section of the EC2 console, but with limited control.
- Check logs and/or your tensorboard server to make sure training starts successfully.
- User Data documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
- How can I execute user data to automatically run with every restart of my Amazon EC2 instance?
- System Logs from User Data
- First, access your system logs:
    - From the EC2 console under `Instances`, right-click the instance, expand `Instance Settings`, and choose `System Logs`.
    - SSH into the instance and find the logs in `/var/log/cloud-init-output.log`.
    - From the command line: `aws ec2 get-console-output --region us-east-1 --instance-id i-123 | python -c 'import sys, json; print(json.load(sys.stdin)["Output"])'`
- Search them for the text `login:`, which will take you to the console output from your User Data script.
- Training not running?
- Check system logs for errors in the startup script.
- Instances are terminated?
- Check spot request details for clues about why they were terminated.
- A bug in your training script would cause the rest of your User Data script to execute, which usually ends in a `sudo shutdown` command. So if it looks like the instance starts up and then quickly shuts down again, this could be the problem. Check the system logs for error messages from your training script.