Google Cloud Storage backup tutorial

Introduction

This tutorial shows how to make backups to Google Cloud Storage. The backups are:

  • automatic
  • stored off site
  • incremental
  • versioned
  • encrypted

The tutorial and backup script are intended for single-user machines. The backup system uses the Cloud Platform Console, the gsutil tool, the command line, a bash script, cron, JSON, and Python regular expressions. You don't have to be proficient in any of these; the tutorial guides you through the setup and backup process step by step, and all commands and scripts are provided.

I have been using this system for my personal backups since July 2015 and really like it. The tutorial and script were tested on Fedora 24 Linux. Porting it to Windows or Mac OS would not be terribly hard.

Why Google Cloud Storage (GCS)

Compared to backup to USB drive, the main advantage of backups to the cloud is automatic off-site storage.

Google Cloud Storage is cost effective.

Alternatives

Amazon's AWS S3 and Google Cloud Storage with gsutil are similar. For backups, the main advantages of Google Cloud Storage are:

  • Google Nearline retrieves data with about a 3-second delay, compared to AWS Glacier's delay of around 5 hours
  • gsutil rsync excludes patterns recursively, so you can exclude some file types from the backup

There is also GUI backup software that can back up your files to Google Cloud Storage. CloudBerry is one example:

  • Backs up to Google Cloud Storage, Google Drive, or AWS.
  • CloudBerry Desktop Backup Free edition is designed for personal use only.
  • Windows, Linux, Mac OS.

I have not tried CloudBerry, but the reviews are generally positive.

Encryption

Google Cloud Storage features:

  • server side encryption
  • client side encryption

The main advantage of server side encryption is that it is simple to set up. Client side encryption increases complexity and is beyond the scope of this tutorial.

More about Google Cloud Storage encryption is at https://cloud.google.com/storage/docs/encryption.

Create a test_source directory

The tutorial's example backup_script will back up a small test_source directory. Create and populate the test_source directory:

$ mkdir test_source 
$ cd test_source
$ mkdir Documents
$ touch .bashrc Documents/f.txt Documents/f.exe

Display the test_source directory:

$ tree -a --dirsfirst
.
├── Documents
│   ├── f.exe
│   └── f.txt
└── .bashrc 

By the end of the tutorial the test_source directory will look like this:

.
├── Documents
│   ├── f.exe
│   └── f.txt
├── .bashrc
├── backup.log
├── backup_script
└── lifecycle_config.json

The lifecycle_config.json and backup_script files are at the end of this page.

Set up Google Cloud project and install gsutil

Get a Google account if you don't already have one: https://accounts.google.com/signUp

Create a Cloud Platform Console project on https://console.cloud.google.com/project Make a note of your project ID. I'll use "backup-proj-140016" for the remainder of this tutorial.

Enable billing for your project on https://support.google.com/cloud/answer/6293499#enable-billing

gsutil is a Python application that accesses Google Cloud Storage from the command line. Follow the "Install gsutil" instructions on https://cloud.google.com/storage/docs/gsutil_install

If it is not already installed, install Python 2.7 https://www.python.org/downloads/release/python-2711/

(The preceding instructions were modified from https://cloud.google.com/storage/docs/quickstart-gsutil)

Create a bucket

A bucket is the basic container used to store files.

The following gsutil command makes a bucket of the "Nearline" storage class. The bucket is located in a US data center and the bucket's name is "wolfv-backup-tutorial":

$ gsutil mb -c nearline -l US -p backup-proj-140016 gs://wolfv-backup-tutorial

Of course, you will have to modify the parameters for your situation. Bucket names share a single global namespace across all of Google Cloud, so you will need to choose your own unique bucket name.
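
Before calling mb, you can sanity-check your chosen name locally. The pattern below is a simplified version of the common rules (lowercase letters, digits, dashes, underscores and dots; 3-63 characters; starts and ends with a letter or digit) — see the Cloud Storage bucket-naming documentation for the full rules:

```shell
# Rough, client-side check of bucket name syntax (simplified rules).
BUCKET="wolfv-backup-tutorial"
if echo "$BUCKET" | grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'; then
    echo "bucket name looks valid"
else
    echo "bucket name is invalid"
fi
```

This only catches syntax errors; a syntactically valid name can still be rejected if another project already owns it.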

List the files in the bucket:

$ gsutil ls -l gs://wolfv-backup-tutorial

The gsutil mb command is documented at https://cloud.google.com/storage/docs/gsutil/commands/mb.

Alternatively, the bucket can be created and viewed from the Cloud Platform Console Browser.

gsutil versioning

Object versioning allows you to restore objects if you accidentally delete them. It is turned on or off at the bucket level. If versioning is turned on, uploading to an existing object creates a new version. If versioning is turned off, uploading to an existing object overwrites the current version. Enable object versioning in your bucket:

$ gsutil versioning set on gs://wolfv-backup-tutorial

The gsutil versioning command is documented at https://cloud.google.com/storage/docs/object-versioning.

Object Lifecycle Management

Time to Live (TTL) automatically deletes older versions of objects. TTL is turned on or off at the bucket level. In a bucket with versioning enabled, deleting a live object creates an archived object; deleting an archived object deletes the object permanently.

The following conditions are supported:

  • Age (days)
  • CreatedBefore (date)
  • NumberOfNewerVersions
  • IsLive (false if object is archived)

Typical backup lifecycles are months or years. The following example uses very short lifecycles and a small numNewerVersions so you can see the effects of the conditions within a day. It deletes an archived version of a file if there are 2 newer versions, or if the version is more than a day old. There may be a lag between when the conditions are satisfied and when the object is deleted.

Example test_source/lifecycle_config.json:

{
  "rule":
  [
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "numNewerVersions": 2 }
   },
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "age": 1 }
   }
  ]
}

Copy the 2_lifecycle_config.json to your test_source directory and rename it lifecycle_config.json.
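
Alternatively, you can create the file from the command line and sanity-check that it parses as valid JSON before uploading it (python3 is assumed here; gsutil itself runs on Python):

```shell
# Write the example lifecycle rules and verify the file is well-formed JSON.
cat > lifecycle_config.json <<'EOF'
{
  "rule":
  [
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "numNewerVersions": 2 }
   },
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "age": 1 }
   }
  ]
}
EOF
python3 -m json.tool lifecycle_config.json > /dev/null && echo "valid JSON"
```

gsutil lifecycle set will reject a malformed file anyway, but checking locally gives a clearer error message.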

Verify that versioning is turned on:

$ gsutil versioning get gs://wolfv-backup-tutorial

Enable Lifecycle Management:

$ gsutil lifecycle set ~/test_source/lifecycle_config.json gs://wolfv-backup-tutorial

Verify the Lifecycle Configuration:

$ gsutil lifecycle get gs://wolfv-backup-tutorial

If you ever want to disable a bucket's lifecycle management, use an empty lifecycle configuration file:

$ gsutil lifecycle set ~/test_source/lifecycle_config_empty.json gs://wolfv-backup-tutorial

where test_source/lifecycle_config_empty.json file is:

{}

The gsutil lifecycle command is documented at https://cloud.google.com/storage/docs/lifecycle.

Create backup script

Copy the 4_backup_script file to your test_source directory and rename it backup_script.

Give yourself permission to execute the script:

$ chmod u+x backup_script

The backup_script takes two arguments:

SOURCE - will be the test_source directory.
DESTINATION - will be the gs://wolfv-backup-tutorial bucket.

The backup_script uses gsutil rsync to synchronize DESTINATION to SOURCE.

The gsutil rsync command is documented in https://cloud.google.com/storage/docs/gsutil/commands/rsync.

Backup named files

Display the test_source directory:

$ tree -a --dirsfirst
.
├── Documents
│   ├── f.exe
│   └── f.txt
├── .bashrc 
├── backup_script
└── lifecycle_config.json

We want to backup these files:

.bashrc 
backup_script
lifecycle_config.json

Any other files added to the test_source directory will not be backed up. Logs are excluded because they keep growing.

gsutil rsync does not currently have an --include option. A workaround is to exclude the negation of file names. The backup_script defines the "negation of file names" in a Python regular expression named EXCLUDE_HOME_FILES:

EXCLUDE_HOME_FILES='^(?!(backup_script|\.bashrc|lifecycle_config\.json|lifecycle_config_empty\.json)$).*'

where

  • "^" matches only at the beginning of the string.
  • "?!" is a negative lookahead assertion, described in https://docs.python.org/2.7/howto/regex.html#lookahead-assertions
  • the inner parentheses contain the list of files to back up.
  • "|" separates the file names.
  • "\" escapes the "." in file names.
  • "$" matches only at the end of the string.
  • ".*" then matches the rest of the name, so any file not in the list matches the whole pattern and is excluded.
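
You can test the pattern locally before trusting it with a backup. gsutil uses Python regular expressions, and GNU grep's -P option understands the same lookahead syntax; the names printed below are the ones the pattern matches, i.e. the ones rsync would exclude:

```shell
# Names matching EXCLUDE_HOME_FILES are excluded from the backup.
EXCLUDE_HOME_FILES='^(?!(backup_script|\.bashrc|lifecycle_config\.json|lifecycle_config_empty\.json)$).*'
printf '%s\n' .bashrc backup_script lifecycle_config.json backup.log notes.txt \
    | grep -P "$EXCLUDE_HOME_FILES"
```

Only backup.log and notes.txt are printed; the three names from the keep list do not match the pattern and are therefore backed up.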

The backup_script excludes files matching EXCLUDE_HOME_FILES via the rsync -x exclude option:

########################## directories to backup #########################
# Backup files in ~/ home directory
$GSUTIL rsync $DRYRUN -c -C -x $EXCLUDE_HOME_FILES $SOURCE/ $DESTINATION/

The gsutil rsync -x exclude option is documented in https://cloud.google.com/storage/docs/gsutil/commands/rsync.

Dry run backup_script

After modifying a backup script, it's a good idea to do a dry run first. The rsync -n option causes rsync to output what would be copied or deleted without actually copying or deleting anything.

Run the backup_script as a dry run:

$ ~/test_source/backup_script ~/test_source gs://wolfv-backup-tutorial -n

The gsutil rsync -n option is documented in https://cloud.google.com/storage/docs/gsutil/commands/rsync.

Live run backup_script

If the dry run was as expected, run the backup_script again without the "-n" option:

$ ~/test_source/backup_script ~/test_source gs://wolfv-backup-tutorial

The rsync command copies files from SOURCE to DESTINATION. Verify the contents of the wolfv-backup-tutorial bucket:

$ gsutil ls -lr gs://wolfv-backup-tutorial

Alternatively, use the Cloud Platform Console Browser to view the contents of your bucket.

View the backup.log:

$ more backup.log

gsutil will suggest the -m multi-threaded option:

==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rsync ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

The multi-threaded option uploads faster on high-bandwidth connections, but can actually slow uploads on typical residential bandwidth.

Schedule a cron job

An example cron job is in the backup_script comments. Schedule your backup for times when your computer is usually on; if the computer is off when the cron job is scheduled, that run will be missed.

For testing, you can schedule a cron job one or two minutes into the future. This cron job will run every day at 23:19:

$ crontab -e
19 23 * * * ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial

Linux cron job information is on http://www.adminschoice.com/crontab-quick-reference
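
For reference, the five leading fields of a crontab line are minute, hour, day-of-month, month, and day-of-week, followed by the command. A quick way to see the tutorial's entry split into its fields:

```shell
# Split a crontab entry into its schedule fields and the command.
ENTRY='19 23 * * * ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial'
echo "$ENTRY" | { read -r MIN HOUR DOM MON DOW CMD
    echo "minute=$MIN hour=$HOUR day-of-month=$DOM month=$MON day-of-week=$DOW"
    echo "command=$CMD"; }
```

So "19 23 * * *" means 23:19 every day of every month, regardless of weekday.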

Restore a previous version of a file

You can browse backed-up files in the Cloud Platform Console Browser, but it only displays the most recent file versions.

You can list the previous versions of a file from the terminal:

$ gsutil ls -al gs://wolfv-backup-tutorial/Documents/f.txt
         0  2016-08-11T19:02:48Z  gs://wolfv-backup-tutorial/Documents/f.txt#1470942169019000  metageneration=1
         0  2016-08-11T19:05:25Z  gs://wolfv-backup-tutorial/Documents/f.txt#1470942325689000  metageneration=1
         3  2016-08-12T01:58:59Z  gs://wolfv-backup-tutorial/Documents/f.txt#1470967139801000  metageneration=1
TOTAL: 3 objects, 3 bytes (3 B)
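
The long number after "#" is the object's generation. In practice it appears to be a microsecond timestamp, so you can convert it back to a human-readable date (GNU date assumed):

```shell
# Convert a generation number (from the listing above) back to a UTC date.
GENERATION=1470967139801000
date -u -d "@$((GENERATION / 1000000))" +%Y-%m-%dT%H:%M:%SZ
```

This prints 2016-08-12T01:58:59Z, matching the timestamp gsutil ls showed for that generation.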

Copy the version you want to a new file name:

$ gsutil cp gs://wolfv-backup-tutorial/Documents/f.txt#1470967139801000 ~/test_source/Documents/f_2016-08-12T01.58.59.txt

If you omit the destination file name, cp will overwrite the existing file of the same name.
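
If you restore old versions often, a small helper can derive a collision-free name like the one used above from the version's timestamp (the f.txt name here is just the tutorial's example):

```shell
# Build a restore name like f_2016-08-12T01.58.59.txt from the version's
# timestamp, so the restored copy never collides with the live file.
VERSION_TIME="2016-08-12T01:58:59Z"              # from gsutil ls -al
SAFE_TIME=$(echo "$VERSION_TIME" | tr ':' '.')   # ':' is awkward in file names
echo "f_${SAFE_TIME%Z}.txt"                      # prints f_2016-08-12T01.58.59.txt
```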

The gsutil cp command is documented at https://cloud.google.com/storage/docs/gsutil/commands/cp

Restore all your data

If you want to copy an entire directory tree, use the gsutil cp -r option:

$ gsutil cp -r gs://wolfv-backup-tutorial/ ~/
$ cd wolfv-backup-tutorial
$ tree -a --dirsfirst
.
├── Documents
│   └── f.txt
├── .bashrc 
├── backup_script
└── lifecycle_config.json

The gsutil cp command is documented at https://cloud.google.com/storage/docs/gsutil/commands/cp

Deploy your backup system on real data

Now that you see how the backup system works, repeat the tutorial steps "Create a bucket" through "Schedule a cron job" using backup parameters for your real data.

5_backup_script_real is the backup script my cron job calls twice a day.

Monitor your backups

Your data should back up automatically at the scheduled cron job times. It's a good idea to periodically confirm that the automatic system is actually saving data. Check the date of a recently changed file in the Cloud Platform Console Browser.

You can also download logs via the command line and view them in a spreadsheet. That's too much work for me, but it can be useful for troubleshooting.

(backup.log only shows when backup_script ran; it does not tell you whether the data was successfully saved to Google Cloud.)

Troubleshooting

Check the backup.log:

$ tail ~/backup.log

Check the system logs around the time of the backup, e.g. use journalctl to view the systemd logs.

Make sure you quit the crontab -e editor. Just saving the file in vi or nano is not enough; the new crontab is installed only when you exit the editor.

See what happens when backup_script is run from the command line instead of from the cron job.

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, delete the bucket and its contents recursively from the terminal:

$ gsutil rm -r gs://wolfv-backup-tutorial

If successful, the command returns a message similar to:

Removing gs://wolfv-backup-tutorial/just-a-folder/cloud-storage.logo.png#1456530077282000...
Removing gs://wolfv-backup-tutorial/...

Delete or comment the test_source cron job:

$ crontab -e
#19 23 * * * ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial

(The preceding instructions were modified from https://cloud.google.com/storage/docs/quickstart-gsutil)

Spread the word

If you think others can benefit from this low-cost online backup system, link to this page and tell others about it.

Creative Commons License

Google Cloud Storage backup tutorial by Wolfram Volpi is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at https://gist.github.com/wolfv6/5e8d32495b9f4bf5224bbfe114c15864 in the comments below.

2_lifecycle_config.json:

{
  "rule":
  [
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "numNewerVersions": 2 }
   },
   {
    "action": { "type": "Delete" },
    "condition": { "isLive": false, "age": 1 }
   }
  ]
}
4_backup_script:

#!/bin/bash
################################### usage ####################################
# Option "-n" is "dry run". Dry run example:
# $ ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial -n
# Real backups omit the third argument. Live run example:
# $ ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial
# Example cron job that backs up twice a day at 03:52 and 15:52:
# $ crontab -e
# 52 03,15 * * * ~/test_source/backup_script ~/test_source/ gs://wolfv-backup-tutorial
############################### configuration ################################
SOURCE=$1
DESTINATION=$2
DRYRUN=$3
# export so that gsutil (a child process) can see it; note ~ does not expand inside quotes
export BOTO_CONFIG="$HOME/.boto"
# Google storage utility (requires full path, ~/gsutil/gsutil: No such file or directory).
GSUTIL="/home/wolfv/gsutil/gsutil"
# gsutil sends confirmation messages to stderr. The quiet option -q suppresses confirmations.
# if not dry run
if [[ "$DRYRUN" != "-n" ]]
then
    GSUTIL="$GSUTIL -q"
fi
# Exclude patterns are Python regular expressions (not wildcards).
EXCLUDES='.+\.exe$'
# the inner parenthesis contains a list of files to backup.
EXCLUDE_HOME_FILES='^(?!(backup_script|\.bashrc|lifecycle_config\.json|lifecycle_config_empty\.json)$).*'
############################ directories to backup ##########################
# Backup files in ~/ home directory
$GSUTIL rsync $DRYRUN -c -C -x $EXCLUDE_HOME_FILES $SOURCE/ $DESTINATION/
# Backup ~/Documents
$GSUTIL rsync $DRYRUN -c -C -e -r -x $EXCLUDES $SOURCE/Documents/ $DESTINATION/Documents/
############################### confirmation #################################
# if not dry run
if [[ "$DRYRUN" != "-n" ]]
then
    CONFIRMATION="$(date) $SOURCE to $DESTINATION $DRYRUN"
    echo "$CONFIRMATION" >> ~/test_source/backup.log
    echo "$CONFIRMATION"
fi
5_backup_script_real:

#!/bin/bash
################################ usage ###################################
# Option "-n" is "dry run". Dry run example:
# $ ~/scripts/backup_script ~ gs://wolfv-backup -n
# Real backups omit the third argument. Live run example:
# $ ~/scripts/backup_script ~ gs://wolfv-backup
# Example cron job that backs up twice a day at 03:52 and 15:52:
# $ crontab -e
# 52 03,15 * * * ~/scripts/backup_script ~ gs://wolfv-backup
############################### configuration ################################
SOURCE=$1
DESTINATION=$2
DRYRUN=$3
# export so that gsutil (a child process) can see it
export BOTO_CONFIG="/home/wolfv/.boto"
# Google storage utility (requires full path, ~/gsutil/gsutil: No such file or directory).
GSUTIL="/home/wolfv/gsutil/gsutil"
# gsutil sends confirmation messages to stderr. The quiet option -q suppresses confirmations.
# if not dry run
if [[ "$DRYRUN" != "-n" ]]
then
    GSUTIL="$GSUTIL -q"
fi
# These exclude patterns are Python regular expressions (not wildcards).
# The exclude patterns are named after the applications that generate them.
ARCHIVERS='.+\.7z$|.+\.dmg$|.+\.gz$|.+\.iso$|.+\.jar$|.+\.rar$|.+\.tar$|.+\.zip$'
ASTYLE='.+\.orig$'
COMPILERS='.+\.o$|.+\.exe$|.+\.hex$|.+\.out$'
DATABASES='.+\.sql$|.+\.sqlite$'
LOGS='.+\.log$'
MY_TAGS='.+_nobackup/|.+_nobackup$|.+_nobackup\..+|.+_old/|.+_old$|.+_old\..+|.+_book\..+'
NAUTILUS='.+copy\)'
VIM='.+~$|.+\.swp$|.+\.swo$|.+\.swn$'
EXCLUDES="$ARCHIVERS|$ASTYLE|$COMPILERS|$DATABASES|$LOGS|$MY_TAGS|$NAUTILUS|$VIM"
# the inner parenthesis contains a list of files to backup.
EXCLUDE_HOME_FILES='^(?!(\.ackrc|\.bashrc|\.gitconfig|\.gitignore_global|\.vimrc)$).*'
########################## directories to backup #############################
# ~/ home directory
$GSUTIL rsync $DRYRUN -c -C -x $EXCLUDE_HOME_FILES $SOURCE/ $DESTINATION/
# ~/Documents
$GSUTIL rsync $DRYRUN -c -C -e -r -x $EXCLUDES $SOURCE/Documents/ $DESTINATION/Documents/
# ~/scripts
$GSUTIL rsync $DRYRUN -c -C -e -r -x $EXCLUDES $SOURCE/scripts/ $DESTINATION/scripts/
# ~/Pictures
$GSUTIL rsync $DRYRUN -c -C -e -r -x $EXCLUDES $SOURCE/Pictures/ $DESTINATION/Pictures/
# bookmarks in Nautilus file manager, left pane
$GSUTIL rsync $DRYRUN -c -C -e -x $EXCLUDES $SOURCE/.config/gtk-3.0/ $DESTINATION/.config/gtk-3.0/
############################### confirmation #################################
# if not dry run
if [[ "$DRYRUN" != "-n" ]]
then
    CONFIRMATION="$(date) $SOURCE to $DESTINATION $DRYRUN"
    echo "$CONFIRMATION" >> ~/backup.log
    echo "$CONFIRMATION"
fi
Comments

@Rosain commented Aug 17, 2018:

Hey, I've been following this tutorial and am having trouble with my gsutil path. You seem to be running gsutil on line 18 of 4_backup_script. I either have a different path and have no idea where it is, or this just isn't working for me.

@ferengi82 commented Aug 23, 2018:

Nice tutorial, thanks.

Try /usr/bin/gsutil.

@KalpeshPopat commented Mar 9, 2020:

Hi, thanks for this excellent tutorial. I am assuming that the utility will only sync files that have changed. Is that correct?

Thanks again.

@matteolavaggi commented Mar 15, 2021:

Hi, any idea why I get this exception?

CommandException: arg (/HIDE/SOURCE/PATH/) does not name a directory, bucket, or bucket subdir.
If there is an object with the same path, please add a trailing
