HBase backup solutions

Introduction

This is a proposed procedure for HBase table backups in a secure HBase cluster. Requirements:

  • Live backups (cannot disable the table or take HBase offline)
  • Automated procedure (Oozie-controlled)
  • On a secure cluster
  • On a cluster with no world-readable /hbase folder
  • Off-cluster backups
  • The off-cluster backup location might not have an installed instance of HBase, just HDFS
  • The off-cluster backup location does not have credentials for the hbase user
  • The off-cluster backup location is secured (could be in a different Kerberos realm)

Setup

There are two clusters:

  • hbcluster is a live HBase cluster
  • bkcluster is an HDFS cluster

We want to periodically back up table snapshots from hbcluster to bkcluster, and we want an automated procedure to restore such snapshots from bkcluster to hbcluster.

The backup and restore operations are initiated by the table admin user foousr, who belongs to the same krb5 realm used by both clusters.

+-------------+              +--------------+
|             |    Backup    |              |
| Live Hbase/ +-------------->  Backup HDFS |
| HDFS Cluster|              |  Cluster     |
|             |    Restore   |              |
| 'hbcluster' <--------------+  'bkcluster' |
|             |              |              |
+-------------+              +--------------+

Proposed solutions

Solution A - ExportSnapshot-based solution

Backup procedure

  1. On hbcluster, foousr creates a snapshot of table tableA called tableA-snapshot-2015-11-23 by running the command foousr@hbcluster-worker $> echo "snapshot 'tableA', 'tableA-snapshot-2015-11-23'" | hbase shell
  2. On hbcluster, foousr calls the export Master Coprocessor Endpoint RPC on the HBase master with arguments ('table-snapshot-name', 'remote-hdfs-uri') -> ('tableA-snapshot-2015-11-23', 'hdfs://bkcluster/user/foousr/hbase-backups')
  3. On hbcluster, the HBase Master Coprocessor Endpoint runs the ExportSnapshot MapReduce job to export the snapshot to the target remote URI. The HBase master checks that foousr is allowed to perform this operation.
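The snapshot-creation step above can be scripted for automation; a minimal sketch, where the date-stamped snapshot name is a naming convention assumed here (not mandated by HBase), and the hbase CLI call is guarded so the script is a no-op on hosts without it:

```shell
# Build a date-stamped snapshot name (naming convention assumed, not prescribed)
TABLE='tableA'
SNAPSHOT="${TABLE}-snapshot-$(date +%Y-%m-%d)"

# 'snapshot' is the HBase shell command; it does not take the table offline
if command -v hbase >/dev/null 2>&1; then
  echo "snapshot '${TABLE}', '${SNAPSHOT}'" | hbase shell
fi
```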

Restore procedure

  1. On hbcluster, foousr calls the import Master Coprocessor Endpoint RPC on the HBase master with arguments ('table-snapshot-name', 'remote-hdfs-uri') -> ('tableA-snapshot-2015-11-23', 'hdfs://bkcluster/user/foousr/hbase-backups')
  2. On hbcluster, the HBase Master Coprocessor Endpoint runs the ExportSnapshot MapReduce job to import the snapshot from the target remote URI. The HBase master checks that foousr is allowed to perform this operation.
  3. On hbcluster, foousr calls the clone_snapshot or restore_snapshot HBaseAdmin API to restore the snapshot.
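Step 3 can also be done from the shell; a sketch, where the target table name 'tableA-restored' is an assumption:

```shell
# clone_snapshot materialises the imported snapshot as a new table,
# leaving any existing tableA untouched
SNAPSHOT='tableA-snapshot-2015-11-23'
RESTORED='tableA-restored'  # assumed target name

# Guarded so the sketch is a no-op on hosts without the hbase CLI
if command -v hbase >/dev/null 2>&1; then
  echo "clone_snapshot '${SNAPSHOT}', '${RESTORED}'" | hbase shell
fi
```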

Work items

  • In order for HBase to run ExportSnapshot, the hbase user needs to be enabled to run YARN jobs (we can handle that with a chef-bach recipe adjustment)
  • We also need to be able to read the files as the hbase user and write them as the foousr user (and vice versa). As far as I know it's not possible to hold two UGI objects at the same time. We could instead leverage WebHDFS as a destination endpoint. https://issues.apache.org/jira/browse/HDFS-7984 could allow us to do this by obtaining the delegation token before calling into HBase.
  • We need to write the two coprocessor endpoints for import/export
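The WebHDFS idea above can be sketched as an ExportSnapshot invocation whose destination is a WebHDFS URI, so writes go through the REST gateway using foousr's delegation token rather than a second UGI. The NameNode host, port, and path are assumptions:

```shell
SNAPSHOT='tableA-snapshot-2015-11-23'
# Assumed WebHDFS endpoint on the backup cluster's NameNode
DEST='webhdfs://bkcluster-nn:50070/user/foousr/hbase-backups'

# Guarded so the sketch is a no-op on hosts without the hbase CLI
if command -v hbase >/dev/null 2>&1; then
  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot "$SNAPSHOT" -copy-to "$DEST"
fi
```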

Pros

  • No extra load on region servers
  • No extra copy of the data on source cluster
  • All HDFS operations on the /hbase folder are performed by hbase user

Cons

  • There's no clear way to do exports across authentication zones (Kerberos realms)
  • Have to write two coprocessor endpoints and a client
  • Security is at table-level granularity (no cell-level backups)

Solution B - Export/Import-based solution

Backup procedure

  1. On hbcluster, foousr creates a snapshot of table tableA called tableA-snapshot-2015-11-23 by running the command foousr@hbcluster-worker $> echo "snapshot 'tableA', 'tableA-snapshot-2015-11-23'" | hbase shell
  2. On hbcluster, foousr calls the clone_snapshot command to create a new table from the snapshot: foousr@hbcluster-worker $> echo "clone_snapshot 'tableA-snapshot-2015-11-23', 'tableA-2015-11-23'" | hbase shell
  3. On hbcluster, foousr exports the data with an IdentityTableMapper-based (like Export) MapReduce job to the remote HDFS location.
  4. On hbcluster, foousr deletes the tableA-2015-11-23 table
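Steps 3 and 4 above can be sketched with the stock Export MapReduce job; the destination path is an assumption:

```shell
CLONE='tableA-2015-11-23'
DEST="hdfs://bkcluster/user/foousr/hbase-backups/${CLONE}"  # assumed path

# Guarded so the sketch is a no-op on hosts without the hbase CLI
if command -v hbase >/dev/null 2>&1; then
  # Step 3: Export writes the table's cells as sequence files under DEST
  hbase org.apache.hadoop.hbase.mapreduce.Export "$CLONE" "$DEST"
  # Step 4: a table must be disabled before it can be dropped
  printf "disable '%s'\ndrop '%s'\n" "$CLONE" "$CLONE" | hbase shell
fi
```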

Restore procedure

  1. On hbcluster, foousr imports the data with the Import MapReduce job into a table
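A sketch of the restore step. Import replays the exported cells into an existing table, so the target must be created beforehand; the table and path names are assumptions:

```shell
TARGET='tableA'  # assumed: this table already exists on hbcluster
SRC='hdfs://bkcluster/user/foousr/hbase-backups/tableA-2015-11-23'  # assumed path

# Guarded so the sketch is a no-op on hosts without the hbase CLI
if command -v hbase >/dev/null 2>&1; then
  hbase org.apache.hadoop.hbase.mapreduce.Import "$TARGET" "$SRC"
fi
```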

Work items

  • If the backup cluster is in another krb5 realm, we need to make sure Export and Import work across two realms. HDFS-7984 could also potentially fix this by using a WebHDFS endpoint.

Pros

  • Simpler implementation
  • Supports cell-level security
  • No extra security checks needed
  • No extra copy of the data on source cluster

Cons

  • Extra load on region servers
  • Region servers will have to serve twice as many regions (the original table plus its clone)
  • Potentially slow

Solution C - ExportSnapshot in Oozie action

Backup procedure

  1. On hbcluster, foousr creates a snapshot of table tableA called tableA-snapshot-2015-11-23 by running the command foousr@hbcluster-worker $> echo "snapshot 'tableA', 'tableA-snapshot-2015-11-23'" | hbase shell
  2. On hbcluster, foousr triggers the exportSnapshot custom Oozie action with arguments ('table-snapshot-name', 'hdfs-uri') -> ('tableA-snapshot-2015-11-23', '/user/foousr/hbase-backups')
  3. On hbcluster, Oozie runs the ExportSnapshot MapReduce job to export the snapshot to the target local URI. Oozie has HDFS admin privileges and can read all files on HDFS, including files in the /hbase folder. Oozie will chown the newly created files to foousr. (/usr/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'tableA-snapshot-2015-11-23' -copy-to '/user/foousr/hbase-backups' -chuser 'foousr')
  4. foousr can now copy the files in /user/foousr/hbase-backups off cluster with either distcp or ExportSnapshot.
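Step 4 can be sketched with distcp, run as foousr with its own credentials (so no hbase keytab is needed on bkcluster); the destination path is an assumption:

```shell
SRC='/user/foousr/hbase-backups/tableA-snapshot-2015-11-23'
DEST='hdfs://bkcluster/user/foousr/hbase-backups/'  # assumed path

# Guarded so the sketch is a no-op on hosts without the hadoop CLI
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp "$SRC" "$DEST"
fi
```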

Restore procedure

  1. foousr populates /user/foousr/hbase-backups with the previously backed-up snapshot (using distcp or ExportSnapshot)
  2. On hbcluster, foousr triggers the importSnapshot custom Oozie action with arguments ('table-snapshot-name', 'hdfs-uri') -> ('tableA-snapshot-2015-11-23', '/user/foousr/hbase-backups')
  3. On hbcluster, Oozie runs the ExportSnapshot MapReduce job to import the snapshot from the source URI and changes ownership of the files to the hbase user (/usr/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'tableA-snapshot-2015-11-23' -copy-from '/user/foousr/hbase-backups' -copy-to '/hbase' -chuser 'hbase')
  4. On hbcluster, foousr calls the clone_snapshot or restore_snapshot HBaseAdmin API to restore the snapshot.
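For step 4, a sketch of the restore_snapshot variant: unlike clone_snapshot, restore_snapshot rewrites the live table in place and requires it to be disabled first:

```shell
SNAPSHOT='tableA-snapshot-2015-11-23'
TABLE='tableA'

# Guarded so the sketch is a no-op on hosts without the hbase CLI.
# Note the table is offline between the disable and the enable.
if command -v hbase >/dev/null 2>&1; then
  printf "disable '%s'\nrestore_snapshot '%s'\nenable '%s'\n" \
    "$TABLE" "$SNAPSHOT" "$TABLE" | hbase shell
fi
```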

Work items

  • In order for Oozie to run ExportSnapshot, the oozie user needs to be enabled to run YARN jobs (we can handle that with a chef-bach recipe adjustment)
  • We need to write the two custom Oozie actions for import/export

Pros

  • No extra load on region servers
  • No hbase code change needed

Cons

  • Extra copy of the data on the source cluster
  • Have to write two custom Oozie actions
  • Oozie has to replicate security checks for foousr

Oozie action pseudocode

ExportSnapshot Action

Parameters: user (inferred), destinationPath, snapshotName, numMappers

<< Check that <user> has write permissions on <destinationPath> >>
/usr/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot '<snapshotName>' -copy-to '<destinationPath>' -chuser '<user>' -mappers <numMappers>
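The permission check in the pseudocode above can be sketched as a write probe: create and delete a marker file under <destinationPath> as the submitting user. This assumes the Oozie action can run HDFS commands as that user (e.g. via impersonation); the user and path values are placeholders:

```shell
user='foousr'                                # inferred from the Oozie submission
destinationPath='/user/foousr/hbase-backups' # action parameter (placeholder value)
marker="${destinationPath}/.perm-check-$$"

# Guarded so the sketch is a no-op on hosts without the hdfs CLI
if command -v hdfs >/dev/null 2>&1; then
  # Probe write access by creating and removing a zero-length marker file
  if hdfs dfs -touchz "$marker" && hdfs dfs -rm -skipTrash "$marker"; then
    echo "write check passed for ${user} on ${destinationPath}"
  else
    echo "write check failed for ${user} on ${destinationPath}" >&2
    exit 1
  fi
fi
```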

ImportSnapshot Action

Parameters: user (inferred), sourcePath, snapshotName, numMappers

<< Check that <user> has read permissions on <sourcePath> >>
/usr/bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot '<snapshotName>' -copy-from '<sourcePath>' -copy-to '/hbase' -chuser 'hbase' -mappers <numMappers>

bijugs commented Mar 16, 2016

Nice write-up @mlongob
