File format benchmark: Avro, JSON, ORC, and Parquet (slides: https://cdn.oreillystatic.com/en/assets/1/event/160/File%20format%20benchmark_%20Avro,%20JSON,%20ORC,%20and%20Parquet%20Presentation%201.pptx)
- ORC has some built-in tuning for better performance with double and timestamp types
- Both ORC and Parquet support predicate pushdown
- Avro is a good choice for very wide tables with lots of text fields
- For future investigation: look into “schema evolution” for both columnar formats
- Snappy compresses and decompresses faster than Zlib, at the cost of larger files on disk
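
To make the predicate pushdown point concrete, here is a minimal pure-Python sketch of the general idea: a columnar writer stores min/max statistics per chunk of rows, and a reader uses them to skip chunks that cannot match a filter. It is not the ORC or Parquet implementation. `RowGroup`, `write_row_groups`, and `scan_greater_than` are hypothetical names made up for this illustration, and real formats keep these statistics in file metadata rather than in Python objects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of column values plus the min/max statistics stored with it."""
    values: List[int]
    min_value: int
    max_value: int


def write_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size chunks and record min/max for each one."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of chunks actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushdown step: the statistics alone prove no value here can match,
        # so the chunk is skipped without looking at its values.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Illustrative data only: 100 sorted values in chunks of 10.
groups = write_row_groups(list(range(100)), group_size=10)
matches, groups_read = scan_greater_than(groups, threshold=84)
print(f"{len(matches)} matching values, read {groups_read} of {len(groups)} chunks")
```

The skipping only pays off when the filtered column is sorted or clustered. If every chunk spans the full value range, the statistics rule nothing out and every chunk is read.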
Data science at eHarmony: A generalized framework for personalization