The Problem

The update process in hive that joins the product catalog data to f_product_performance has been causing out-of-memory failures, along with many other errors that are nearly impossible to diagnose.

Proposal 1

The first proposal involves sending aggregated data to a key/value store for hydra to consume during the selection process.

  • Create an aggregation script at the end of the stats pipeline in PySpark covering the last 60 days of product performance
  • Send the aggregated data to an in-memory key/value store for O(1) lookup in hydra's selection process (see the sketch after this list)
  • Update hydra's filter to retrieve performance data from the key/value store
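
A minimal sketch of what the aggregation step and key/value write could look like, assuming Redis as the in-memory store and illustrative table/column names (f_product_performance with product_id, clicks, conversions, event_date); the real schema, store, and key layout may differ.

```python
# Sketch of the end-of-pipeline aggregation job (Proposal 1).
# Assumptions: f_product_performance is registered as a Spark table with
# product_id, clicks, conversions, and event_date columns, and Redis is the
# in-memory key/value store; adjust names to the real schema.
import json
from datetime import date, timedelta

import redis
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("product-performance-agg").getOrCreate()

cutoff = date.today() - timedelta(days=60)

agg = (
    spark.table("f_product_performance")
    .where(F.col("event_date") >= F.lit(str(cutoff)))
    .groupBy("product_id")
    .agg(
        F.sum("clicks").alias("clicks"),
        F.sum("conversions").alias("conversions"),
    )
)

def write_partition(rows):
    # One Redis connection per partition; each product becomes a single key
    # so hydra's filter can do an O(1) GET during selection.
    client = redis.Redis(host="stats-kv", port=6379)  # hypothetical host
    pipe = client.pipeline()
    for row in rows:
        pipe.set(
            f"perf:{row['product_id']}",
            json.dumps({"clicks": row["clicks"], "conversions": row["conversions"]}),
            ex=60 * 60 * 24,  # expire after a day; the next run refreshes it
        )
    pipe.execute()

agg.foreachPartition(write_partition)
```

hydra's filter would then issue a single GET on a key like perf:&lt;product_id&gt; per candidate product during selection, which is what keeps the selection path fast.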

Pros: speeds up the update process and improves reliability in hive, stateless, preserves hydra's speed.
Cons: the product_performance aggregation will take a long time, and we still have to run it for every site regardless of whether it needs it.

Proposal 2

The second proposal involves extending hydra to send products that need to be filtered by stats to an intermediary service for additional filtering/processing.

  • Remove the aggregation from the hive update script to improve its speed and reliability
  • Create an additional hydra service bound to a new queue (hydra-stats) that processes stats filters only and forwards results to generation
  • Update medusa to route accordingly (if stats filters are present, send to hydra-stats; otherwise send to generation), as in the sketch after this list
  • Implement a handler in go-stats that returns aggregated performance data for a set of products
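
To make the routing and filtering flow concrete, here is a rough sketch in Python; medusa, hydra-stats, and go-stats may well be written in other languages, and the queue names, go-stats endpoint, and payload shapes below are assumptions rather than the actual interfaces.

```python
# Rough sketch of Proposal 2's routing and filtering flow (Python used purely
# for illustration). Assumptions: a publish(queue, message) helper exists, and
# go-stats exposes an HTTP endpoint that returns aggregated performance for a
# list of product ids.
import requests

GO_STATS_URL = "http://go-stats.internal/v1/performance"  # hypothetical endpoint


def route_selection(message, publish):
    """In medusa: send jobs that carry stats filters to hydra-stats,
    everything else straight to generation."""
    has_stats_filters = any(f.get("type") == "stats" for f in message["filters"])
    publish("hydra-stats" if has_stats_filters else "generation", message)


def apply_stats_filters(message):
    """In hydra-stats: fetch aggregated performance for the candidate products
    from go-stats, drop the ones that miss the threshold, and forward the rest
    to generation."""
    product_ids = [p["id"] for p in message["products"]]
    resp = requests.post(GO_STATS_URL, json={"product_ids": product_ids}, timeout=5)
    resp.raise_for_status()
    # Assumed response shape: {product_id: {"clicks": ..., "conversions": ...}}
    performance = resp.json()

    min_clicks = message.get("min_clicks", 0)
    message["products"] = [
        p for p in message["products"]
        if performance.get(str(p["id"]), {}).get("clicks", 0) >= min_clicks
    ]
    return message
```

The point of the split is that only messages carrying stats filters ever touch hydra-stats and go-stats; everything else keeps the current direct path to generation.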

Pros: speeds up the update process and improves reliability in hive; only clients that have stats filters do the additional processing.
Cons: increases the architectural complexity of hydra and adds complexity to go-stats.
