@MLnick
Last active June 29, 2020 04:14
$ sbt/sbt assembly/assembly
$ sbt/sbt examples/assembly
$ SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 ./bin/pyspark
...
14/06/03 15:34:11 INFO SparkUI: Started SparkUI at http://10.0.0.4:4040
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/
Using Python version 2.7.6 (default, Jan 10 2014 11:23:15)
SparkContext available as sc.
In [1]: conf = {"hbase.zookeeper.quorum": "localhost","hbase.mapreduce.inputtable": "data"}
In [2]: rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
14/06/03 15:34:54 INFO MemoryStore: ensureFreeSpace(33603) called with curMem=0, maxMem=309225062
14/06/03 15:34:54 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.8 KB, free 294.9 MB)
In [3]: rdd.collect()
14/06/03 15:35:07 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
14/06/03 15:35:07 INFO ZooKeeper: Client environment:host.name=localhost
14/06/03 15:35:07 INFO ZooKeeper: Client environment:java.version=1.7.0_60
...
14/06/03 16:38:40 INFO NewHadoopRDD: Input split: localhost:,
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as key: ImmutableBytesWritable;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.io.ImmutableBytesWritable
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as value: Result;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.client.Result
14/06/03 16:38:40 INFO Executor: Serialized size of result for 0 is 738
14/06/03 16:38:40 INFO Executor: Sending result for 0 directly to driver
14/06/03 16:38:40 INFO Executor: Finished task ID 0
14/06/03 16:38:40 INFO TaskSetManager: Finished TID 0 in 80 ms on localhost (progress: 1/1)
14/06/03 16:38:40 INFO DAGScheduler: Completed ResultTask(0, 0)
14/06/03 16:38:40 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/06/03 16:38:40 INFO DAGScheduler: Stage 0 (collect at <ipython-input-3-20868699513c>:1) finished in 0.093 s
14/06/03 16:38:41 INFO SparkContext: Job finished: collect at <ipython-input-3-20868699513c>:1, took 0.197537 s
Out[3]:
[(u'72 6f 77 31', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'),
(u'72 6f 77 32', u'keyvalues={row2/f2:/1401639169638/Put/vlen=6/ts=0}')]
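Note that the SerDeUtil warnings above mean PySpark could not pickle ImmutableBytesWritable or Result, so it fell back to their toString() output: the row key comes back as space-separated hex bytes and the value as a Result summary string. A minimal sketch of decoding those hex keys back into readable strings on the Python side (written for the Python 2.7 shell shown above; decode_hex_key is just an illustrative helper name):

def decode_hex_key(hex_key):
    # hex_key looks like u'72 6f 77 31'; turn each hex byte back into a character
    return ''.join(chr(int(b, 16)) for b in hex_key.split())

decoded = rdd.map(lambda kv: (decode_hex_key(kv[0]), kv[1]))
print decoded.collect()
# [('row1', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'), ...]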
======
I created a test table in HBase (0.94.6, to match the Spark examples):
hbase(main):002:0> scan 'data'
ROW                  COLUMN+CELL
 row1                column=f1:, timestamp=1401639141180, value=value
 row2                column=f2:, timestamp=1401639169638, value=value2
2 row(s) in 0.4190 seconds
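By default the conf dict above scans the whole table. org.apache.hadoop.hbase.mapreduce.TableInputFormat also reads optional scan properties from the job configuration, so the same read can be narrowed to one column family or to specific columns. A hedged sketch; the property names hbase.mapreduce.scan.column.family and hbase.mapreduce.scan.columns are the ones TableInputFormat defines, but verify them against the HBase version you build against:

# Sketch: narrow the HBase scan via TableInputFormat's optional scan properties
# (property names assumed from org.apache.hadoop.hbase.mapreduce.TableInputFormat;
#  verify against your HBase version)
conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapreduce.inputtable": "data",
    "hbase.mapreduce.scan.column.family": "f1",        # only column family f1
    # "hbase.mapreduce.scan.columns": "f1:c1 f2:c2",   # or specific columns
}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                         "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                         "org.apache.hadoop.hbase.client.Result",
                         conf=conf)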
@Raider06

Excuse me, but I have a problem with the line "rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)". When I execute that line, Spark gives me "AttributeError: 'SparkContext' object has no attribute 'newAPIHadoopRDD'". Can you help me?

@MLnick
Author

MLnick commented Aug 25, 2014

@Raider06 this was more of a sketch for new functionality that will be released in Spark 1.1 in a few weeks' time.

It is currently in the Spark master branch.
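A quick way to confirm whether the PySpark you launched actually has the new method (it only exists from Spark 1.1 on), which also explains the AttributeError above:

# Check that this PySpark build exposes the Spark 1.1+ Hadoop input-format API
print hasattr(sc, "newAPIHadoopRDD")   # False on Spark 1.0.x, True on 1.1+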

@bvarghese1

@Raider06 I tried the example above; however, I am getting the following exception. (I am a beginner in Spark.)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14.0 in stage 6.0 (TID 23) had a not serializable result: org.apache.hadoop.hbase.io.ImmutableBytesWritable
Serialization stack:
    - object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68)
    - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
    - object (class scala.Tuple2, (6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68,keyvalues={last_entity_batch/c:d/1441414881172/Put/vlen=5092/mvcc=0}))
    - element of array (index: 0)
    - array (class [Lscala.Tuple2;, size 1)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
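This not-serializable ImmutableBytesWritable/Result failure usually goes away if the HBase types are converted to plain strings on the JVM side before they reach Python. The Spark examples jar ships converter classes for this in org.apache.spark.examples.pythonconverters; a sketch assuming that jar is on the classpath (as with the SPARK_CLASSPATH launch at the top of this gist) and a Spark version that has the keyConverter/valueConverter arguments (1.1+):

conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapreduce.inputtable": "data"}
# Convert ImmutableBytesWritable/Result to strings on the JVM side using the
# converters bundled with the Spark examples jar (must be on the classpath)
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print rdd.collect()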

@xiao2227

xiao2227 commented Jun 28, 2017

@MLnick: I would like to use hadoopRDD (not newAPIHadoopRDD) to read an HBase table in Python.
Can you give me some suggestions? Thanks very much.
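For the old MapReduce API there is an org.apache.hadoop.hbase.mapred.TableInputFormat, and PySpark's sc.hadoopRDD takes the same kind of arguments as newAPIHadoopRDD. A rough, untested sketch; the configuration keys (mapred.input.dir for the table name and hbase.mapred.tablecolumns for the column list) are assumptions about the old-style TableInputFormat and should be checked against your HBase version:

# Rough sketch only: old-API (mapred) HBase input format via sc.hadoopRDD
conf = {
    "hbase.zookeeper.quorum": "localhost",
    "mapred.input.dir": "data",              # table name, read as the job's "input path" (assumed)
    "hbase.mapred.tablecolumns": "f1: f2:",  # space-separated column list (assumed)
}
rdd = sc.hadoopRDD("org.apache.hadoop.hbase.mapred.TableInputFormat",
                   "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                   "org.apache.hadoop.hbase.client.Result",
                   keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
                   valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
                   conf=conf)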

@iamparv

iamparv commented Jan 26, 2018

We're also struggling with the same issue in a Kerberized cluster. @MLnick, are you facing it in a Kerberized cluster as well?
