@rjurney
Created May 28, 2014 02:45
Line 15 for me...
scala> val avroRdd = sc.newAPIHadoopFile("hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/27/19/part-m-00019.avro",
| classOf[AvroKeyInputFormat[GenericRecord]],
| classOf[AvroKey[GenericRecord]],
| classOf[NullWritable])
14/05/27 19:44:58 INFO storage.MemoryStore: ensureFreeSpace(167562) called with curMem=369637, maxMem=308713881
14/05/27 19:44:58 INFO storage.MemoryStore: Block broadcast_3 stored as values to memory (estimated size 163.6 KB, free 293.9 MB)
avroRdd: org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = NewHadoopRDD[7] at newAPIHadoopFile at <console>:41
scala> val genericRecords = avroRdd.map{case (ak, _) => ak.datum()}
genericRecords: org.apache.spark.rdd.RDD[org.apache.avro.generic.GenericRecord] = MappedRDD[8] at map at <console>:43
scala> genericRecords.first()
14/05/27 19:45:07 INFO input.FileInputFormat: Total input paths to process : 1
14/05/27 19:45:07 INFO spark.SparkContext: Starting job: first at <console>:46
14/05/27 19:45:07 INFO scheduler.DAGScheduler: Got job 0 (first at <console>:46) with 1 output partitions (allowLocal=true)
14/05/27 19:45:07 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first at <console>:46)
14/05/27 19:45:07 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/27 19:45:07 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/27 19:45:07 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/27 19:45:07 INFO rdd.NewHadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/27/19/part-m-00019.avro:0+35046663
Exception in thread "Local computation of job 0" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:94)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
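For reference, a minimal sketch of the same calls with the imports the shell session assumes (the fully qualified names are taken from the RDD type printed above and from the stack trace):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read the Avro file as (AvroKey[GenericRecord], NullWritable) pairs via the
// new (mapreduce) Hadoop API, then pull the datum out of each key.
val avroRdd = sc.newAPIHadoopFile(
  "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/27/19/part-m-00019.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

val genericRecords = avroRdd.map { case (ak, _) => ak.datum() }
genericRecords.first()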
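The IncompatibleClassChangeError at AvroKeyInputFormat.createRecordReader usually means avro-mapred was built against the Hadoop 1 mapreduce API (where TaskAttemptContext was a class) but is running against Hadoop 2 (where it is an interface). A likely fix, assuming an sbt build and an Avro 1.7.x release that publishes a hadoop2 classifier, is to depend on that artifact instead; a sketch, with the version as an assumption:

// build.sbt -- version is an assumption; the key part is the "hadoop2" classifier
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"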