I've been working with fairly complex Avro objects generated by heterogeneous systems (e.g. C# and Java).
The objects have arrays of maps, maps with arrays, maps with complex values, maps with complex values that have arrays of maps... and so on.
In trying to query these objects, I ran into a surprising number of issues that took some time and effort to investigate.
LATERAL VIEW cannot be used with JOIN, which means multi-step queries.
The Avro object has an embedded array representing a 1-N relationship like x -> [y1, y2, y3, y4], and we want to explode this into [(x,y1), (x,y2), (x,y3), (x,y4)].
In Hive, this can be done using EXPLODE and LATERAL VIEW. We also want to join the result with other tables, but LATERAL VIEW cannot be combined with JOIN. This forces the generation of an intermediate table, which can be quite large.
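A minimal sketch of the two-step workaround; the table and column names (`avro_table`, `ys`, `other_table`) are hypothetical stand-ins for ours:

```sql
-- Step 1: explode the embedded array into an intermediate table.
-- (The LATERAL VIEW cannot live in the same query as the JOIN below.)
CREATE TABLE x_y_pairs AS
SELECT t.x, y
FROM avro_table t
LATERAL VIEW explode(t.ys) ys_view AS y;

-- Step 2: join the exploded pairs with the other table.
SELECT p.x, p.y, o.extra
FROM x_y_pairs p
JOIN other_table o ON p.y = o.y;
```

If the source table is big, that intermediate table multiplies every x by the size of its array, which is exactly the cost we were hoping to avoid.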
If HiveQL can't deal with LATERAL VIEW and JOIN at the same time, maybe Spark SQL can. So let's try running Spark SQL on Hive tables.
If the Avro schema is in mixed case, Spark SQL cannot query it through the Hive store, because HCatalog converts the Avro schema to lower case.
When Spark SQL reads the Avro files, it picks up the embedded schema, which may be in mixed case (it was in my case - no pun intended), and then complains that it can't find those names in HCatalog.
Lesson learned - keep your Avro schema lower cased if you want to stick it in Hive. (Too late for us.)
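For example, a schema declared with all-lowercase record and field names (the names here are illustrative) survives the HCatalog round trip unchanged, because there is nothing for the lower-casing to break:

```json
{
  "type": "record",
  "name": "userevent",
  "fields": [
    {"name": "userid", "type": "string"},
    {"name": "tags", "type": {"type": "map", "values": "string"}}
  ]
}
```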
Then how about Pig? Pig always works, doesn't it?
Pig's AvroStorage() does not handle Avro maps of complex values.
This looks like an omission; it was just fixed in PIG-4448. Try again in Pig 0.15.
So let's try running SparkSQL directly on Avro. DataStax Spark-Avro library lets Spark open Avro files and create SQLContext. This is pretty neat.
You need Spark 1.3 to even get this to work, but Spark SQL doesn't seem fully functional yet.
First, Spark-Avro 0.1 does not support Avro's FIXED type. DataStax just fixed this in 0.2, but 0.2 requires Spark 1.3. Even on Spark 1.3, however, some simple queries were still getting errors.
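For reference, this is roughly what querying Avro directly looks like. The `com.databricks.spark.avro` package name and the `avroFile` method are assumptions based on the library's early API, and the path is hypothetical, so adjust to the version you actually have:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // assumption: early spark-avro package name

object AvroQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-query"))
    val sqlContext = new SQLContext(sc)

    // Open the Avro files directly; the schema is read from the files
    // themselves, so HCatalog (and its lower-casing) is out of the picture.
    val events = sqlContext.avroFile("hdfs:///data/events.avro")
    events.registerTempTable("events")

    sqlContext.sql("SELECT count(*) FROM events").collect().foreach(println)
  }
}
```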
Spark SQL can also run HiveQL, but there we run into the same issue with LATERAL VIEW.
So it comes down to writing low-level Scala code. Let's do this on Spark 1.2 (the current version in production) to process the Avro data directly.
- The default Java serializer doesn't work with complex Avro types. Avro records and fields are not Java-serializable, so the Java serialization that Spark uses by default fails on them.
Okay, so let's try the Kryo serializer, which Spark also supports.
- Kryo crashes once it sees an array of complex types.
Not sure why, but Kryo doesn't come with a registration for GenericData.Array, so add a registrator that maps it to ArrayList, and... it works!
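The fix can be sketched as a custom KryoRegistrator. The class names here are mine, and the exact `create(...)` override signature varies between Kryo versions, so treat this as a sketch rather than drop-in code:

```scala
import java.util.{ArrayList => JArrayList, Collection => JCollection}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.serializers.CollectionSerializer
import org.apache.avro.generic.GenericData
import org.apache.spark.serializer.KryoRegistrator

// A CollectionSerializer that materializes the collection as a plain
// java.util.ArrayList on read. GenericData.Array can't be instantiated
// without its Avro schema, so we substitute ArrayList instead; it still
// behaves as a java.util.List downstream.
class ArrayListBackedSerializer extends CollectionSerializer {
  override protected def create(kryo: Kryo, input: Input,
                                typ: Class[JCollection[_]]): JCollection[_] =
    new JArrayList[AnyRef]()
}

// Hypothetical registrator name; wire it in with:
//   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//       .set("spark.kryo.registrator", "com.example.AvroKryoRegistrator")
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[GenericData.Array[_]], new ArrayListBackedSerializer)
  }
}
```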
So we got it going in local mode. Let's run it in a standalone Spark cluster...!
- Kryo crashes in the Spark cluster. Spark 1.2 uses a different class loader there and can't load the Avro classes in the assembly jar...! Use Spark 1.3.
<---- we are here now