Skip to content

Instantly share code, notes, and snippets.

@aslotnick
Last active November 16, 2016 21:34
Show Gist options
  • Save aslotnick/42c9a8ebdad26b7093e9443c6be07c9b to your computer and use it in GitHub Desktop.
Save aslotnick/42c9a8ebdad26b7093e9443c6be07c9b to your computer and use it in GitHub Desktop.
Strata+Hadoop World 2016 Notes

File format benchmark: Avro, JSON, ORC, and Parquet (slides: https://cdn.oreillystatic.com/en/assets/1/event/160/File%20format%20benchmark_%20Avro,%20JSON,%20ORC,%20and%20Parquet%20Presentation%201.pptx)

  • ORC has some built-in tuning for better performance with double and timestamp types
  • Both ORC and Parquet support predicate pushdown
  • Avro was a good choice for very wide tables with lots of text fields
  • For future investigation: look into “schema evolution” for both columnar formats
  • Snappy is faster than Zlib at the cost of more disk space

Data science at eHarmony: A generalized framework for personalization https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20science%20at%20eHarmony_%20A%20generalized%20framework%20for%20personalization%20Presentation.pdf

  • Aloha DSL gives a high-level abstraction layer for data scientists
  • E-Harmony makes use of features extracted from member photos
  • Metric for evaluating success called “affinity”. Was previously defined as P(communication | match). Revised to P( (communication OR view) AND NOT close | match) which was a more specific and useful measure.
  • Biggest days for their business are valentine’s, new year, and Christmas
  • Using “contextual bandits”

Tuning Spark machine-learning workloads https://cdn.oreillystatic.com/en/assets/1/event/160/Tuning%20Spark%20machine-learning%20workloads%20Presentation.pdf

  • Methodology v1: Brute-force through every possibly combination of options (executor-memory, executor-cores, GC-cores, etc.)
  • Methodology v2: intelligently pick which options to use by iteratively profiling the job
  • Capture metrics with default settings to find a base line
  • Understand the cluster “floorplan” - memory, CPUs/cores/threads, network
  • Start changing settings to optimize the easiest metric. Then keep those settings and change others to optimize other metrics
  • Goal should be to make the workload compute-bound because once you are up against the compute limit, network and memory are probably well-utilized
  • Specific learnings
    • Num-executors should be based on # threads * # CPU per node
    • Reserve some threads for garbage collection, they found 1:1 executor:GC worked well in shuffle-heavy workloads
    • When using Tungsten sort, no need to make so many GC threads available as more work happens “off-heap” which relieves GC pressure

Big data processing with Hadoop and Spark, the Uber way https://cdn.oreillystatic.com/en/assets/1/event/160/Big%20data%20processing%20with%20Hadoop%20and%20Spark,%20the%20Uber%20way%20Presentation.pdf

  • They have a tool which tells the user approximately how much money their query cost
  • Transitioned from an EMR/S3 setup to using HDFS (not sure what Hadoop distro)
  • Ingest data via streams only (using Streamific)
  • Geospatial of great importance - added support to Presto and created UDF’s which can take advantage of geospatial indexes
  • Employ strict schema management
  • If data does not comply, it’s not allowed into the data lake
  • Producers must inform the schema registry, and consumer read from it
  • The centralized schema registry is versions and allows for schema evolution
  • Need to make the process easy for producers or else they won’t participate
  • Use Parquet for most data on HDFS
    • Historical reasons and better handling of nested data
  • Use Presto for interactive queries, but wrapped in an abstraction layer to allow choosing the correct YARN queue etc.

The Netflix data platform: Now and in the future https://drive.google.com/file/d/0B72kok3RkcZDLTl5aXpadkpNdEE/view?usp=sharing

  • All etl-type jobs are submitted to Genie using the same REST call (job types include Spark, Presto, Pig, etc). Genie know what environment the job is running in.
  • All metadata is searchable in a central repository
  • One portal includes metadata, DDL, job dependencies, data, etc.
  • Support use of both Jupyter and Zeppelin

Caravel: An open source data exploration and visualization platform

  • Data analysis platform based around visualization, easy to bring in existing JS viz
  • Very similar to Re:dash in architecture but less mature as a query editor
  • Good tool if you want to build very specific visualizations

Ask me anything with Cloudera engineers

  • When using S3 as a data store, compress as much as possible to reduce network traffic
  • Python vs.Scala for Spark:
    • Strongly recommend Scala if it’s feasible, but there are certain situations when Python will be better (if you have Python-only engineers, or existing Python code being ported to run on Spark).
    • PySpark is retaining the name “DataFrame” in 2.0 while Scala is moving to the more general “DataSet”.
  • How to approach multi-tenancy in the cloud (one big cluster vs. many small)
    • Think about who is sharing a resource. If developers are going to fight over prioritization multiple clusters makes it easier
    • Not much experience sharing between Presto and YARN
  • Notes on AWS
    • Strongly recommended use of reserved instances
    • Common configuration is to have a Core group made of on-demand/reserved, supplemented with a Task group of spot instances

Data modeling for microservices with Cassandra and Spark https://cdn.oreillystatic.com/en/assets/1/event/160/Data%20modeling%20for%20microservices%20with%20Cassandra%20and%20Spark%20Presentation.pptx

  • Standardize the way different services represent similar data (e.g. addresses, geo, financial, time)
  • Data model for a services has several physical representations (Java API, REST API, JSON, C*). Pick one as the primary (they chose JSON)
  • When modeling for C*, create different representations for the most common ways of querying, but not for every possible way
    • C* 3.0 is bringing materialized views which will help this a lot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment