@rjurney
Created May 28, 2014 00:06
Failure to read Avro RDD
scala> val avroRdd = sc.newAPIHadoopFile("hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/27/19/*", classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]], classOf[NullWritable])
14/05/27 17:02:49 INFO storage.MemoryStore: ensureFreeSpace(167954) called with curMem=0, maxMem=308713881
14/05/27 17:02:49 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 164.0 KB, free 294.3 MB)
avroRdd: org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:23
scala> avroRdd.take(1)
14/05/27 17:03:05 INFO input.FileInputFormat: Total input paths to process : 21
14/05/27 17:03:05 INFO spark.SparkContext: Starting job: take at <console>:26
14/05/27 17:03:05 INFO scheduler.DAGScheduler: Got job 0 (take at <console>:26) with 1 output partitions (allowLocal=true)
14/05/27 17:03:05 INFO scheduler.DAGScheduler: Final stage: Stage 0 (take at <console>:26)
14/05/27 17:03:05 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/27 17:03:05 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/27 17:03:05 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/27 17:03:05 INFO rdd.NewHadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/27/19/part-m-00000.avro:0+3864
Exception in thread "Local computation of job 0" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
	at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:94)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
	at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
[hivedata@hivecluster2 ~]$ ^C
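
The IncompatibleClassChangeError above is the classic symptom of a Hadoop 1 vs. Hadoop 2 binary mismatch: org.apache.hadoop.mapreduce.TaskAttemptContext was a class in Hadoop 1 but became an interface in Hadoop 2, so an avro-mapred jar compiled against Hadoop 1 fails at runtime on a Hadoop 2 cluster. A commonly reported fix is to depend on avro-mapred with the hadoop2 classifier. A minimal sbt sketch (the version numbers are assumptions; match them to the cluster):

// build.sbt -- sketch, assuming a Hadoop 2 cluster and Avro 1.7.x
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"

For reference, the newAPIHadoopFile call in the session assumes these imports in the REPL:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable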
@zlgonzalez

I am also running into this issue with Spark 1.0.1. To work around it in Eclipse, I had to remove the Spark assembly from the classpath, which lets the execution complete. Were you able to find a solution to this problem?
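
Removing the Spark assembly fits the diagnosis above: the prebuilt assembly can bundle an Avro artifact built against Hadoop 1, which shadows the application's hadoop2 variant on the classpath. If the assembly must stay, one illustrative workaround is to exclude the transitive Avro artifact and pin the hadoop2 build yourself; whether spark-core actually pulls in avro-mapred transitively depends on the Spark version, so treat this sbt sketch as an assumption to verify:

// build.sbt -- sketch: drop any transitive avro-mapred and pin the hadoop2 classifier
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.0.1")
  .exclude("org.apache.avro", "avro-mapred")
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"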
