aslotnick/Strata+Hadoop World 2016 Notes.md

## Strata+Hadoop World 2016 Notes.md

      
File format benchmark: Avro, JSON, ORC, and Parquet
(slides: https://cdn.oreillystatic.com/en/assets/1/event/160/File%20format%20benchmark_%20Avro,%20JSON,%20ORC,%20and%20Parquet%20Presentation%201.pptx)

ORC has some built-in tuning for better performance with double and timestamp types
Both ORC and Parquet support predicate pushdown
Avro was a good choice for very wide tables with lots of text fields
For future investigation: look into “schema evolution” for both columnar formats
Snappy is faster than Zlib at the cost of more disk space
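
As a rough illustration of what predicate pushdown buys, the toy below models a columnar file as a list of row groups and skips any group whose max statistic rules out a match. `RowGroup` and `scan_greater_than` are hypothetical names for this sketch, not part of the ORC or Parquet APIs, and real files store the statistics precomputed in their metadata rather than deriving them on the fly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for an ORC stripe or Parquet row group (hypothetical)."""
    values: List[int]

    @property
    def max_value(self) -> int:
        # Real formats read this from file metadata instead of recomputing it.
        return max(self.values)


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values above threshold and how many row groups were read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushdown: if even the largest value fails the predicate, skip the group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [RowGroup([1, 5, 9]), RowGroup([10, 14, 19]), RowGroup([20, 25, 30])]
matches, groups_read = scan_greater_than(groups, 18)
print(matches, groups_read)
```

Only two of the three groups are read here; the first is skipped without touching its values.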

Data science at eHarmony: A generalized framework for personalization
https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20science%20at%20eHarmony_%20A%20generalized%20framework%20for%20personalization%20Presentation.pdf

Aloha DSL gives a high-level abstraction layer for data scientists
eHarmony makes use of features extracted from member photos
The metric for evaluating success is called “affinity”. It was previously defined as P(communication | match), then revised to P((communication OR view) AND NOT close | match), which was a more specific and useful measure.
Biggest days for their business are Valentine’s Day, New Year’s, and Christmas
Using “contextual bandits”
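
The two definitions of affinity above can be compared on a handful of made-up match outcomes. This is a minimal sketch: the `MatchOutcome` record layout and the sample data are assumptions for illustration, not eHarmony's actual schema or numbers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MatchOutcome:
    """What happened after one match was shown (hypothetical record layout)."""
    communicated: bool
    viewed: bool
    closed: bool


def affinity_v1(outcomes: Iterable[MatchOutcome]) -> float:
    """Estimate P(communication | match) as a simple fraction of matches."""
    rows: List[MatchOutcome] = list(outcomes)
    if not rows:
        raise ValueError("need at least one match outcome")
    return sum(o.communicated for o in rows) / len(rows)


def affinity_v2(outcomes: Iterable[MatchOutcome]) -> float:
    """Estimate P((communication OR view) AND NOT close | match)."""
    rows: List[MatchOutcome] = list(outcomes)
    if not rows:
        raise ValueError("need at least one match outcome")
    hits = sum((o.communicated or o.viewed) and not o.closed for o in rows)
    return hits / len(rows)


# Made-up sample data, one record per match.
outcomes = [
    MatchOutcome(communicated=True, viewed=True, closed=False),
    MatchOutcome(communicated=False, viewed=True, closed=False),
    MatchOutcome(communicated=True, viewed=True, closed=True),
    MatchOutcome(communicated=False, viewed=True, closed=False),
]
print(affinity_v1(outcomes), affinity_v2(outcomes))
```

In this sample the revised metric credits the two view-only matches and discounts the match that was closed after communication.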

Tuning Spark machine-learning workloads
https://cdn.oreillystatic.com/en/assets/1/event/160/Tuning%20Spark%20machine-learning%20workloads%20Presentation.pdf

Methodology v1: Brute-force through every possible combination of options (executor-memory, executor-cores, GC-cores, etc.)
Methodology v2: intelligently pick which options to use by iteratively profiling the job
Capture metrics with default settings to find a baseline
Understand the cluster “floorplan” - memory, CPUs/cores/threads, network
Start changing settings to optimize the easiest metric. Then keep those settings and change others to optimize other metrics
Goal should be to make the workload compute-bound because once you are up against the compute limit, network and memory are probably well-utilized
Specific learnings


Num-executors should be based on # threads * # CPU per node


Reserve some threads for garbage collection; they found 1:1 executor:GC worked well in shuffle-heavy workloads


When using Tungsten sort, no need to make so many GC threads available as more work happens “off-heap” which relieves GC pressure
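
One possible reading of the sizing notes above, written out as arithmetic. This is a sketch under stated assumptions: it interprets "1:1 executor:GC" as one GC thread per executor thread, the `plan_node_threads` helper is hypothetical, and the 0.25 ratio used for the Tungsten case is an arbitrary illustration, not a figure from the talk.

```python
from typing import Dict


def plan_node_threads(cpus_per_node: int, threads_per_cpu: int,
                      gc_threads_per_executor_thread: float = 1.0) -> Dict[str, int]:
    """Split one node's hardware threads between executor work and GC."""
    if cpus_per_node < 1 or threads_per_cpu < 1:
        raise ValueError("cpus_per_node and threads_per_cpu must be positive")
    if gc_threads_per_executor_thread < 0:
        raise ValueError("GC ratio cannot be negative")
    # Hardware threads available on the node: # threads * # CPU per node.
    total_threads = cpus_per_node * threads_per_cpu
    # Assumption: each executor thread is paired with `ratio` GC threads.
    executor_threads = int(total_threads / (1 + gc_threads_per_executor_thread))
    gc_threads = total_threads - executor_threads
    return {
        "total_threads": total_threads,
        "executor_threads": executor_threads,
        "gc_threads": gc_threads,
    }


# Shuffle-heavy workload: 1:1 executor-to-GC threads.
shuffle_heavy = plan_node_threads(cpus_per_node=16, threads_per_cpu=2)
# Tungsten sort: less GC pressure, so fewer GC threads (ratio is illustrative).
tungsten = plan_node_threads(cpus_per_node=16, threads_per_cpu=2,
                             gc_threads_per_executor_thread=0.25)
print(shuffle_heavy, tungsten)
```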


Big data processing with Hadoop and Spark, the Uber way
https://cdn.oreillystatic.com/en/assets/1/event/160/Big%20data%20processing%20with%20Hadoop%20and%20Spark,%20the%20Uber%20way%20Presentation.pdf

They have a tool which tells the user approximately how much money their query cost
Transitioned from an EMR/S3 setup to using HDFS (not sure what Hadoop distro)
Ingest data via streams only (using Streamific)
Geospatial is of great importance - they added support to Presto and created UDFs which can take advantage of geospatial indexes
Employ strict schema management
If data does not comply, it’s not allowed into the data lake
Producers must inform the schema registry, and consumers read from it
The centralized schema registry is versioned and allows for schema evolution
Need to make the process easy for producers or else they won’t participate
Use Parquet for most data on HDFS


Historical reasons and better handling of nested data


Use Presto for interactive queries, but wrapped in an abstraction layer to allow choosing the correct YARN queue etc.
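
The strict schema management described above (a versioned registry, rejection of non-compliant data, schema evolution) can be sketched in a few lines. This is a toy in-memory model, not Uber's implementation: the `SchemaRegistry` class, the "trips" topic, and the compatibility rule (new versions may only add fields) are all assumptions made for illustration.

```python
from typing import Any, Dict, List

Schema = Dict[str, type]


class SchemaRegistry:
    """In-memory, versioned schema registry (hypothetical sketch)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Schema]] = {}

    def register(self, topic: str, schema: Schema) -> int:
        """Store a new schema version and return its 1-based version number."""
        versions = self._versions.setdefault(topic, [])
        if versions:
            # Assumed evolution rule: existing fields may not be dropped or retyped.
            for name, field_type in versions[-1].items():
                if schema.get(name) is not field_type:
                    raise ValueError(f"incompatible change to field {name!r}")
        versions.append(dict(schema))
        return len(versions)

    def validate(self, topic: str, record: Dict[str, Any], version: int) -> bool:
        """True if the record has exactly the fields and types of that version."""
        schema = self._versions[topic][version - 1]
        return record.keys() == schema.keys() and all(
            isinstance(record[name], field_type)
            for name, field_type in schema.items()
        )


registry = SchemaRegistry()
v1 = registry.register("trips", {"trip_id": str, "fare": float})
v2 = registry.register("trips", {"trip_id": str, "fare": float, "city": str})

ok = registry.validate("trips", {"trip_id": "t1", "fare": 12.5}, v1)
bad = registry.validate("trips", {"trip_id": "t1", "fare": "12.5"}, v1)
print(v1, v2, ok, bad)
```

A producer would call `register` before publishing, and an ingestion step would call `validate` to keep non-compliant records out of the data lake.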

The Netflix data platform: Now and in the future
https://drive.google.com/file/d/0B72kok3RkcZDLTl5aXpadkpNdEE/view?usp=sharing

All ETL-type jobs are submitted to Genie using the same REST call (job types include Spark, Presto, Pig, etc). Genie knows what environment the job is running in.
All metadata is searchable in a central repository
One portal includes metadata, DDL, job dependencies, data, etc.
Support use of both Jupyter and Zeppelin

Caravel: An open source data exploration and visualization platform

Data analysis platform based around visualization, easy to bring in existing JS viz
Very similar to Re:dash in architecture but less mature as a query editor
Good tool if you want to build very specific visualizations

Ask me anything with Cloudera engineers

When using S3 as a data store, compress as much as possible to reduce network traffic
Python vs. Scala for Spark:


Strongly recommend Scala if it’s feasible, but there are certain situations when Python will be better (if you have Python-only engineers, or existing Python code being ported to run on Spark).


PySpark is retaining the name “DataFrame” in 2.0 while Scala is moving to the more general “Dataset”.


How to approach multi-tenancy in the cloud (one big cluster vs. many small)


Think about who is sharing a resource. If developers are going to fight over prioritization, multiple clusters make it easier


Not much experience sharing between Presto and YARN


Notes on AWS


Strongly recommended use of reserved instances


Common configuration is to have a Core group made of on-demand/reserved, supplemented with a Task group of spot instances


Data modeling for microservices with Cassandra and Spark
https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20modeling%20for%20microservices%20with%20Cassandra%20and%20Spark%20Presentation.pptx

Standardize the way different services represent similar data (e.g. addresses, geo, financial, time)
Data model for a service has several physical representations (Java API, REST API, JSON, C*). Pick one as the primary (they chose JSON)
When modeling for C*, create different representations for the most common ways of querying, but not for every possible way


C* 3.0 is bringing materialized views which will help this a lot.
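
The query-driven modeling advice above can be sketched without a cluster by treating each dict as one table keyed by the lookup a common query needs. The order records and the table names `orders_by_customer` and `orders_by_city` are hypothetical; the point is only that each write fans out to one representation per common query.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List

Order = Dict[str, str]

# One "table" per common query, each keyed by that query's lookup key.
orders_by_customer: DefaultDict[str, List[Order]] = defaultdict(list)
orders_by_city: DefaultDict[str, List[Order]] = defaultdict(list)


def save_order(order: Order) -> None:
    """Write the same order into every query-specific table."""
    orders_by_customer[order["customer_id"]].append(order)
    orders_by_city[order["city"]].append(order)


save_order({"order_id": "o1", "customer_id": "c1", "city": "Boston"})
save_order({"order_id": "o2", "customer_id": "c2", "city": "Boston"})
save_order({"order_id": "o3", "customer_id": "c1", "city": "Austin"})

# Each common query becomes a single key lookup instead of a scan.
boston_ids = [o["order_id"] for o in orders_by_city["Boston"]]
c1_ids = [o["order_id"] for o in orders_by_customer["c1"]]
print(boston_ids, c1_ids)
```

With materialized views, the server maintains the extra representations, so the application no longer has to perform this fan-out on every write.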