I'm in the midst of trying to wrangle an HBase backup/restore to/from S3 or
HDFS, built around exporting/backing up one table at a time using
org.apache.hadoop.hbase.mapreduce.Export from HBASE-1684.
Just a reminder:
Usage: Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
In the pseudocode below:

- persistant_store is some kind of non-HBase store in the cloud that you can
  just push stuff onto.
- all_my_Hbase_tables_to_be_backedup is a list of table names.
- create_table is a function that would properly create a new HBase table from
  the name and schema passed in as arguments.
Can I assume that if I do the following pseudocode on HBase 0.20.3 or 0.90.x
to get an initial full backup to S3:

starttime = beginning_of_time
endtime = NOW_Minus_60_seconds
versions = 100000  (the largest number of versions we keep; we do some weird
things with versions in some tables)
for table in all_my_Hbase_tables_to_be_backedup
do
    $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar export \
        $table \
        s3n://somebucket/$table/ \
        $versions \
        $starttime \
        $endtime
    store_times_for_table_in_persistant_store( $table $starttime $endtime )
    store_schema_for_table_in_persistant_store( $table get_schema_from_HBase($table) )
done
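
For what it's worth, here is that first loop as a minimal runnable bash
sketch. It assumes Export's starttime/endtime are milliseconds since the epoch
(HBase cell timestamps are in ms); the TABLES list and the store_* helpers are
placeholders of my own, not anything HBase provides:

#!/usr/bin/env bash
# Hypothetical table list; stands in for all_my_Hbase_tables_to_be_backedup.
TABLES="table_foo table_baz"
VERSIONS=100000
STARTTIME=0                                   # beginning_of_time
ENDTIME=$(( ( $(date +%s) - 60 ) * 1000 ))    # now minus 60 seconds, in ms

for table in $TABLES; do
    $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar export \
        "$table" "s3n://somebucket/$table/" "$VERSIONS" "$STARTTIME" "$ENDTIME"
    # Placeholder helpers; one possible implementation is sketched after the
    # incremental loop below.
    store_times_for_table_in_persistant_store "$table" "$STARTTIME" "$ENDTIME"
    store_schema_for_table_in_persistant_store "$table" "$(get_schema_from_HBase "$table")"
done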
Then do incremental backups from that point on:
endtime = NOW_Minus_60_seconds
versions = 100000
for table in all_my_Hbase_tables_to_be_backedup
do
    starttime = get_last_endtime_from_persistant_store( $table )
    $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar export \
        $table \
        s3n://somebucket/$table/ \
        $versions \
        $starttime \
        $endtime
    store_times_for_table_in_persistant_store( $table $starttime $endtime )
    store_schema_for_table_in_persistant_store( $table get_schema_from_HBase($table) )
done
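
One simple way to implement those persistent-store helpers, sketched below, is
a per-table marker file on S3 itself written with hadoop fs. The meta/ layout
and helper bodies are entirely hypothetical:

# Hypothetical helpers backed by marker files under s3n://somebucket/meta/.
store_times_for_table_in_persistant_store() {
    local table=$1 starttime=$2 endtime=$3
    # Only the last endtime is needed to pick up the next incremental window.
    echo "$endtime" > "/tmp/${table}.last_endtime"
    # Older hadoop fs -put has no overwrite flag, so remove any previous marker.
    $HADOOP_HOME/bin/hadoop fs -rm "s3n://somebucket/meta/${table}.last_endtime" 2>/dev/null
    $HADOOP_HOME/bin/hadoop fs -put "/tmp/${table}.last_endtime" \
        "s3n://somebucket/meta/${table}.last_endtime"
}

get_last_endtime_from_persistant_store() {
    local table=$1
    $HADOOP_HOME/bin/hadoop fs -cat "s3n://somebucket/meta/${table}.last_endtime"
}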
The Import usage:
Usage: Import <tablename> <inputdir>
If I wanted to restore a backed-up table (table_foo) to a destination table
(table_bar) in the HBase cluster running this command, which may or may not be
the same HBase the table was originally backed up from, I can do the following
against the exports on S3:

create_table( table_bar, get_schema_from_persistant_store(table_foo) )
$HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar import \
    table_bar \
    s3n://somebucket/table_foo/
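
Spelled out a bit as a sketch: the create step is hand-waved, since the stored
schema format is whatever the persistent store holds; here I just issue an
hbase shell create with a hypothetical column family that would have to match
what was exported:

# Hypothetical restore of table_foo's backup into a new table, table_bar.
# 'cf1' is a stand-in; families and VERSIONS must match the exported schema.
echo "create 'table_bar', {NAME => 'cf1', VERSIONS => 100000}" | \
    $HBASE_HOME/bin/hbase shell

$HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar import \
    table_bar \
    s3n://somebucket/table_foo/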
If I wanted to do a full restore, I would just loop through all the tables
with the above import process on an HBase cluster that didn't yet have those
tables, something like the sketch below.
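
Roughly, reusing the hypothetical TABLES list and helpers from the sketches
above:

for table in $TABLES; do
    # Recreate each table from its stored schema first (placeholder helper),
    # then import the export back in under the same name.
    create_table "$table" "$(get_schema_from_persistant_store "$table")"
    $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar import \
        "$table" "s3n://somebucket/$table/"
done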
Would I pretty much be guaranteed to get a proper backup snapshotted at the
specified endtime of each run?

Could this be used to copy the data from one HBase cluster to another (in
particular, to go from a production HBase 0.20.3 to a fresh new 0.90.1)?
One normal backup/restore feature that is missing: there is no easy way to
restore to an arbitrary point in time, as opposed to the last backup. I
presume the worst case would be to restore everything and then delete the rows
with timestamps after the point in time one wanted?
Please let me know what I might be missing, or what the downsides would be to
such a way of doing backups.
Thanks!
Rob
__________________
Robert J Berger - CTO
Runa Inc.
520 San Antonio Rd Suite 210, Mountain View, CA 94040
+1 408-838-8896
http://blog.ibd.com
http://workatruna.com