File format benchmark: Avro, JSON, ORC, and Parquet (slides: https://cdn.oreillystatic.com/en/assets/1/event/160/File%20format%20benchmark_%20Avro,%20JSON,%20ORC,%20and%20Parquet%20Presentation%201.pptx)
- ORC has some built-in tuning for better performance with double and timestamp types
- Both ORC and Parquet support predicate pushdown
- Avro was a good choice for very wide tables with lots of text fields
- For future investigation: look into “schema evolution” for both columnar formats
- Snappy is faster than Zlib at the cost of more disk space
Data science at eHarmony: A generalized framework for personalization https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20science%20at%20eHarmony_%20A%20generalized%20framework%20for%20personalization%20Presentation.pdf
- Aloha DSL gives a high-level abstraction layer for data scientists
- E-Harmony makes use of features extracted from member photos
- Metric for evaluating success called “affinity”. Was previously defined as P(communication | match). Revised to P( (communication OR view) AND NOT close | match) which was a more specific and useful measure.
- Biggest days for their business are valentine’s, new year, and Christmas
- Using “contextual bandits”
Tuning Spark machine-learning workloads https://cdn.oreillystatic.com/en/assets/1/event/160/Tuning%20Spark%20machine-learning%20workloads%20Presentation.pdf
- Methodology v1: Brute-force through every possibly combination of options (executor-memory, executor-cores, GC-cores, etc.)
- Methodology v2: intelligently pick which options to use by iteratively profiling the job
- Capture metrics with default settings to find a base line
- Understand the cluster “floorplan” - memory, CPUs/cores/threads, network
- Start changing settings to optimize the easiest metric. Then keep those settings and change others to optimize other metrics
- Goal should be to make the workload compute-bound because once you are up against the compute limit, network and memory are probably well-utilized
- Specific learnings
-
- Num-executors should be based on # threads * # CPU per node
-
- Reserve some threads for garbage collection, they found 1:1 executor:GC worked well in shuffle-heavy workloads
-
- When using Tungsten sort, no need to make so many GC threads available as more work happens “off-heap” which relieves GC pressure
Big data processing with Hadoop and Spark, the Uber way https://cdn.oreillystatic.com/en/assets/1/event/160/Big%20data%20processing%20with%20Hadoop%20and%20Spark,%20the%20Uber%20way%20Presentation.pdf
- They have a tool which tells the user approximately how much money their query cost
- Transitioned from an EMR/S3 setup to using HDFS (not sure what Hadoop distro)
- Ingest data via streams only (using Streamific)
- Geospatial of great importance - added support to Presto and created UDF’s which can take advantage of geospatial indexes
- Employ strict schema management
- If data does not comply, it’s not allowed into the data lake
- Producers must inform the schema registry, and consumer read from it
- The centralized schema registry is versions and allows for schema evolution
- Need to make the process easy for producers or else they won’t participate
- Use Parquet for most data on HDFS
-
- Historical reasons and better handling of nested data
- Use Presto for interactive queries, but wrapped in an abstraction layer to allow choosing the correct YARN queue etc.
The Netflix data platform: Now and in the future https://drive.google.com/file/d/0B72kok3RkcZDLTl5aXpadkpNdEE/view?usp=sharing
- All etl-type jobs are submitted to Genie using the same REST call (job types include Spark, Presto, Pig, etc). Genie know what environment the job is running in.
- All metadata is searchable in a central repository
- One portal includes metadata, DDL, job dependencies, data, etc.
- Support use of both Jupyter and Zeppelin
Caravel: An open source data exploration and visualization platform
- Data analysis platform based around visualization, easy to bring in existing JS viz
- Very similar to Re:dash in architecture but less mature as a query editor
- Good tool if you want to build very specific visualizations
Ask me anything with Cloudera engineers
- When using S3 as a data store, compress as much as possible to reduce network traffic
- Python vs.Scala for Spark:
-
- Strongly recommend Scala if it’s feasible, but there are certain situations when Python will be better (if you have Python-only engineers, or existing Python code being ported to run on Spark).
-
- PySpark is retaining the name “DataFrame” in 2.0 while Scala is moving to the more general “DataSet”.
- How to approach multi-tenancy in the cloud (one big cluster vs. many small)
-
- Think about who is sharing a resource. If developers are going to fight over prioritization multiple clusters makes it easier
-
- Not much experience sharing between Presto and YARN
- Notes on AWS
-
- Strongly recommended use of reserved instances
-
- Common configuration is to have a Core group made of on-demand/reserved, supplemented with a Task group of spot instances
Data modeling for microservices with Cassandra and Spark https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20modeling%20for%20microservices%20with%20Cassandra%20and%20Spark%20Presentation.pptx
- Standardize the way different services represent similar data (e.g. addresses, geo, financial, time)
- Data model for a services has several physical representations (Java API, REST API, JSON, C*). Pick one as the primary (they chose JSON)
- When modeling for C*, create different representations for the most common ways of querying, but not for every possible way
-
- C* 3.0 is bringing materialized views which will help this a lot.