@MLnick
Last active June 29, 2020 04:14
$ sbt/sbt assembly/assembly
$ sbt/sbt examples/assembly
$ SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 ./bin/pyspark
...
14/06/03 15:34:11 INFO SparkUI: Started SparkUI at http://10.0.0.4:4040
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/
Using Python version 2.7.6 (default, Jan 10 2014 11:23:15)
SparkContext available as sc.
In [1]: conf = {"hbase.zookeeper.quorum": "localhost","hbase.mapreduce.inputtable": "data"}
In [2]: rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
14/06/03 15:34:54 INFO MemoryStore: ensureFreeSpace(33603) called with curMem=0, maxMem=309225062
14/06/03 15:34:54 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.8 KB, free 294.9 MB)
In [3]: rdd.collect()
14/06/03 15:35:07 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
14/06/03 15:35:07 INFO ZooKeeper: Client environment:host.name=localhost
14/06/03 15:35:07 INFO ZooKeeper: Client environment:java.version=1.7.0_60
...
14/06/03 16:38:40 INFO NewHadoopRDD: Input split: localhost:,
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as key: ImmutableBytesWritable;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.io.ImmutableBytesWritable
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as value: Result;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.client.Result
14/06/03 16:38:40 INFO Executor: Serialized size of result for 0 is 738
14/06/03 16:38:40 INFO Executor: Sending result for 0 directly to driver
14/06/03 16:38:40 INFO Executor: Finished task ID 0
14/06/03 16:38:40 INFO TaskSetManager: Finished TID 0 in 80 ms on localhost (progress: 1/1)
14/06/03 16:38:40 INFO DAGScheduler: Completed ResultTask(0, 0)
14/06/03 16:38:40 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/06/03 16:38:40 INFO DAGScheduler: Stage 0 (collect at <ipython-input-3-20868699513c>:1) finished in 0.093 s
14/06/03 16:38:41 INFO SparkContext: Job finished: collect at <ipython-input-3-20868699513c>:1, took 0.197537 s
Out[3]:
[(u'72 6f 77 31', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'),
(u'72 6f 77 32', u'keyvalues={row2/f2:/1401639169638/Put/vlen=6/ts=0}')]
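Since ImmutableBytesWritable and Result can't be pickled, the keys come back as space-separated hex byte strings (e.g. u'72 6f 77 31' is "row1"). A minimal plain-Python sketch of decoding them, no Spark required:

```python
def decode_row_key(hex_key):
    """Convert a space-separated hex string like '72 6f 77 31' to 'row1'."""
    return "".join(chr(int(byte, 16)) for byte in hex_key.split())

keys = [u"72 6f 77 31", u"72 6f 77 32"]
print([decode_row_key(k) for k in keys])  # ['row1', 'row2']
```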
======
I created a test table in HBase (0.94.6, to match the Spark examples):
hbase(main):002:0> scan 'data'
ROW                  COLUMN+CELL
 row1                column=f1:, timestamp=1401639141180, value=value
 row2                column=f2:, timestamp=1401639169638, value=value2
2 row(s) in 0.4190 seconds
@xiao2227

xiao2227 commented Jun 28, 2017

@MLnick: I want to use hadoopRDD (not newAPIHadoopRDD) to read an HBase table from Python.
Can you give me some suggestions?
Thanks very much.
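For reference, PySpark's sc.hadoopRDD has the same shape as newAPIHadoopRDD, just with the old-API (org.apache.hadoop.hbase.mapred) classes. A hedged, untested sketch: the class names follow the HBase 0.94 packages, but the exact JobConf keys the old TableInputFormat expects (hbase.mapred.tablecolumns, the table name as input path) are assumptions, not verified.

```python
# Hedged sketch, untested: reading an HBase table via the OLD mapred API
# from PySpark. Conf keys below are assumptions about the old TableInputFormat.
def read_hbase_old_api(sc, table="data", quorum="localhost"):
    """Build an old-API Hadoop RDD over an HBase table (sketch only)."""
    conf = {
        "hbase.zookeeper.quorum": quorum,
        # Assumption: the old mapred TableInputFormat takes the column list
        # and table name through the JobConf rather than via setter calls.
        "hbase.mapred.tablecolumns": "f1: f2:",
        "mapred.input.dir": table,
    }
    return sc.hadoopRDD(
        "org.apache.hadoop.hbase.mapred.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf)
```

The same pickling caveat as in the log above applies: without key/value converters, ImmutableBytesWritable and Result fall back to their string representations.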

@iamparv

iamparv commented Jan 26, 2018

We're also struggling with the same issue on a Kerberized cluster. @MLnick, are you seeing it on a Kerberized cluster as well?
