delta lake

I messed up explaining the value prop of Delta Lake, including the value it provides to customers. Here is a better version:

  • Delta Lake lets you build a dataset incrementally over time in S3, storing only the incremental changes (e.g. per day), and then look up the state of the dataset as it existed at a past point in time (“time travel”).
  • In our case, with a X00GB graph (e.g. GraphFrames of nodes + edges), we would not need to duplicate the entire dataset every day; we would only store the changes per day, which is a large space/cost savings over writing out full copies of the dataset each day.
  • This works for storing any dataset incrementally; in our case that covers both graph representations and feature-value representations (from Delta Lake's perspective, it is simply storing a data table as Parquet in S3, plus transaction-log metadata).
  • Several of our processes require historical state, e.g. graph traversals at a previous point in time, and looking up feature values at previous points in time for ML model training. Instead of our current process of manually reconstructing graph state for every day in the past year, Delta Lake lets us reconstruct the dataset as of a specified point in time (“time travel”) with no additional compute or engineering work, because it knows which delta change sets fall inside the requested time window (see the sketch after this list).
  • We should strongly consider not reinventing functionality like time travel when it already exists in a standard form that customers can reuse/adopt.
  • All of this provides value to the end customer: we can deliver graphs and feature values in Delta Lake format, which gives them a space/cost-effective storage format, built-in point-in-time lookup over the past (which they need for model training), and an industry-standard way to consume the data.
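For concreteness, here is a minimal sketch of the incremental-write + time-travel pattern, assuming a Spark session with the open-source delta-spark package configured; the bucket paths and DataFrame names below are hypothetical, not our actual layout:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-time-travel-sketch")
    # Standard configuration for using Delta Lake with open-source Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3://our-bucket/graph/edges_delta"  # hypothetical location

# Each day, append only that day's new/changed edges instead of rewriting
# the whole X00GB graph; Delta records the change set in its transaction log.
daily_edges_df = spark.read.parquet("s3://our-bucket/graph/edges_2021-08-26/")  # hypothetical
daily_edges_df.write.format("delta").mode("append").save(table_path)

# Later, read the dataset as it existed at an earlier point in time
# ("time travel") without reconstructing it ourselves.
edges_last_month = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-07-26")
    .load(table_path)
)

# Version-based lookup works too, e.g. the table state after a specific commit.
edges_v10 = (
    spark.read.format("delta")
    .option("versionAsOf", 10)
    .load(table_path)
)
```

If a day's changes are upserts rather than pure appends (e.g. edges whose properties change), the same table can be updated with a merge via the DeltaTable API, and the per-day change sets still stay small relative to the full graph.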

Hope that helps clarify. Really recommend familiarizing yourself with the paper: https://cs.stanford.edu/people/matei/papers/2020/vldb_delta_lake.pdf

This is already A Thing: the creators of Spark (Databricks) view Delta Lake as the foundational storage layer for the Spark compute layer, and it is already built into tools like Tecton.

https://docs.databricks.com/delta/delta-intro.html
