I messed up explaining the value prop of Delta Lake, including the value prop for customers. Here is a better version:
- Delta Lake allows you to build a dataset incrementally over time in S3, storing only the incremental changes (e.g. per-day), and then allowing you to look up the previous state of the dataset at a past point in time (“time travel”)
- In our case, with a X00GB graph (e.g. GraphFrames of nodes + edges), we would not need to duplicate the entire dataset every day, we would only need to store the changes per day, which is a large space/cost savings over making duplicate datasets every day.
- This functionality works for storing any dataset incrementally, which in our case would work for both graph representations, and feature value representations (from its perspective, it is simply storing a data table in parquet in s3, with metadata)
- Multiple of our processes require historical state, e.g. graph traversals at a previous point in time, and looking up feature values at previous points in ti