@MLnick
Last active June 29, 2020 04:14
$ sbt/sbt assembly/assembly
$ sbt/sbt examples/assembly
$ SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 ./bin/pyspark
...
14/06/03 15:34:11 INFO SparkUI: Started SparkUI at http://10.0.0.4:4040
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/
Using Python version 2.7.6 (default, Jan 10 2014 11:23:15)
SparkContext available as sc.
In [1]: conf = {"hbase.zookeeper.quorum": "localhost","hbase.mapreduce.inputtable": "data"}
In [2]: rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
14/06/03 15:34:54 INFO MemoryStore: ensureFreeSpace(33603) called with curMem=0, maxMem=309225062
14/06/03 15:34:54 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.8 KB, free 294.9 MB)
In [3]: rdd.collect()
14/06/03 15:35:07 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
14/06/03 15:35:07 INFO ZooKeeper: Client environment:host.name=localhost
14/06/03 15:35:07 INFO ZooKeeper: Client environment:java.version=1.7.0_60
...
14/06/03 16:38:40 INFO NewHadoopRDD: Input split: localhost:,
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as key: ImmutableBytesWritable;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.io.ImmutableBytesWritable
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as value: Result;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.client.Result
14/06/03 16:38:40 INFO Executor: Serialized size of result for 0 is 738
14/06/03 16:38:40 INFO Executor: Sending result for 0 directly to driver
14/06/03 16:38:40 INFO Executor: Finished task ID 0
14/06/03 16:38:40 INFO TaskSetManager: Finished TID 0 in 80 ms on localhost (progress: 1/1)
14/06/03 16:38:40 INFO DAGScheduler: Completed ResultTask(0, 0)
14/06/03 16:38:40 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/06/03 16:38:40 INFO DAGScheduler: Stage 0 (collect at <ipython-input-3-20868699513c>:1) finished in 0.093 s
14/06/03 16:38:41 INFO SparkContext: Job finished: collect at <ipython-input-3-20868699513c>:1, took 0.197537 s
Out[3]:
[(u'72 6f 77 31', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'),
(u'72 6f 77 32', u'keyvalues={row2/f2:/1401639169638/Put/vlen=6/ts=0}')]
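Note that the SerDeUtil warnings above mean PySpark could not pickle ImmutableBytesWritable or Result, so it fell back to their toString() output: the row key comes back as space-separated hex bytes and the value as a Result summary string. A minimal sketch of decoding those hex keys back into readable strings on the Python side (written for the Python 2.7 shell shown above; decode_hex_key is just an illustrative helper name):

def decode_hex_key(hex_key):
    # hex_key looks like u'72 6f 77 31'; turn each hex byte back into a character
    return ''.join(chr(int(b, 16)) for b in hex_key.split())

decoded = rdd.map(lambda kv: (decode_hex_key(kv[0]), kv[1]))
print decoded.collect()
# [('row1', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'), ...]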
======
I created a test table in HBase (0.94.6, to match the Spark examples):
hbase(main):002:0> scan 'data'
ROW                  COLUMN+CELL
 row1                column=f1:, timestamp=1401639141180, value=value
 row2                column=f2:, timestamp=1401639169638, value=value2
2 row(s) in 0.4190 seconds
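By default the conf dict above scans the whole table. org.apache.hadoop.hbase.mapreduce.TableInputFormat also reads optional scan properties from the job configuration, so the same read can be narrowed to one column family or to specific columns. A hedged sketch; the property names hbase.mapreduce.scan.column.family and hbase.mapreduce.scan.columns are the ones TableInputFormat defines, but verify them against the HBase version you build against:

# Sketch: narrow the HBase scan via TableInputFormat's optional scan properties
# (property names assumed from org.apache.hadoop.hbase.mapreduce.TableInputFormat;
#  verify against your HBase version)
conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapreduce.inputtable": "data",
    "hbase.mapreduce.scan.column.family": "f1",        # only column family f1
    # "hbase.mapreduce.scan.columns": "f1:c1 f2:c2",   # or specific columns
}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                         "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                         "org.apache.hadoop.hbase.client.Result",
                         conf=conf)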
@Raider06

Excuse me, but I have a problem with the line "rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)". When I execute that line, Spark gives me "AttributeError: 'SparkContext' object has no attribute 'newAPIHadoopRDD'". Can you help me?

@MLnick
Author

MLnick commented Aug 25, 2014

@Raider06 this was more of a sketch for new functionality that will be released in Spark 1.1 in a few weeks' time.

It is currently in the Spark master branch.
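A quick way to confirm whether the PySpark you launched actually has the new method (it only exists from Spark 1.1 on), which also explains the AttributeError above:

# Check that this PySpark build exposes the Spark 1.1+ Hadoop input-format API
print hasattr(sc, "newAPIHadoopRDD")   # False on Spark 1.0.x, True on 1.1+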

@bvarghese1

@Raider06 I tried the example above; however, I am getting the following exception. (I am a beginner in Spark.)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14.0 in stage 6.0 (TID 23) had a not serializable result: org.apache.hadoop.hbase.io.ImmutableBytesWritable
Serialization stack:
    - object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68)
    - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
    - object (class scala.Tuple2, (6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68,keyvalues={last_entity_batch/c:d/1441414881172/Put/vlen=5092/mvcc=0}))
    - element of array (index: 0)
    - array (class [Lscala.Tuple2;, size 1)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
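This not-serializable ImmutableBytesWritable/Result failure usually goes away if the HBase types are converted to plain strings on the JVM side before they reach Python. The Spark examples jar ships converter classes for this in org.apache.spark.examples.pythonconverters; a sketch assuming that jar is on the classpath (as with the SPARK_CLASSPATH launch at the top of this gist) and a Spark version that has the keyConverter/valueConverter arguments (1.1+):

conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapreduce.inputtable": "data"}
# Convert ImmutableBytesWritable/Result to strings on the JVM side using the
# converters bundled with the Spark examples jar (must be on the classpath)
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print rdd.collect()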

@xiao2227

xiao2227 commented Jun 28, 2017

@MLnick: I would like to use hadoopRDD (not newAPIHadoopRDD) to read an HBase table in Python.
Can you give me some suggestions? Thanks very much.
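For the old MapReduce API there is an org.apache.hadoop.hbase.mapred.TableInputFormat, and PySpark's sc.hadoopRDD takes the same kind of arguments as newAPIHadoopRDD. A rough, untested sketch; the configuration keys (mapred.input.dir for the table name and hbase.mapred.tablecolumns for the column list) are assumptions about the old-style TableInputFormat and should be checked against your HBase version:

# Rough sketch only: old-API (mapred) HBase input format via sc.hadoopRDD
conf = {
    "hbase.zookeeper.quorum": "localhost",
    "mapred.input.dir": "data",              # table name, read as the job's "input path" (assumed)
    "hbase.mapred.tablecolumns": "f1: f2:",  # space-separated column list (assumed)
}
rdd = sc.hadoopRDD("org.apache.hadoop.hbase.mapred.TableInputFormat",
                   "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                   "org.apache.hadoop.hbase.client.Result",
                   keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
                   valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
                   conf=conf)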

@iamparv

iamparv commented Jan 26, 2018

We're also struggling with the same issue in a Kerberized cluster. @MLnick, are you facing it in a Kerberized cluster as well?
