@dataders
Created January 24, 2023 18:31
thinking out loud about substrait

working title

Codd, Chomsky, McKinney, and Wickham walk into a bar… (maybe Chamberlin and Wittgenstein should also be included?)

background

  • the dream of substrait is true separation b/w query engines and transformation APIs
  • previously, particular APIs would give better performance due to their inextricable link to the architecture of the underlying compute engine
  • if the above benefit is removed, folks could use the API with which they are most familiar
  • given this, we could see an industry consolidation around the “best” transformation API.
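
a toy sketch of the separation described above, to make it concrete. This is NOT the Substrait format (real Substrait plans are protobuf messages); the node classes `Scan`, `Filter`, `Project` and the `run` function are hypothetical names invented here. The point is only that the plan is plain data, so whatever API builds it and whatever engine runs it can vary independently.

```python
from dataclasses import dataclass

# Toy plan nodes: plain data with no engine logic attached.
# Hypothetical stand-ins, not real Substrait relations.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    source: object
    column: str
    min_value: int

@dataclass(frozen=True)
class Project:
    source: object
    columns: tuple

def run(plan, tables):
    """A minimal 'engine': interprets a plan over in-memory lists of dicts."""
    if isinstance(plan, Scan):
        return list(tables[plan.table])
    if isinstance(plan, Filter):
        rows = run(plan.source, tables)
        return [r for r in rows if r[plan.column] >= plan.min_value]
    if isinstance(plan, Project):
        rows = run(plan.source, tables)
        return [{c: r[c] for c in plan.columns} for r in rows]
    raise TypeError(f"unknown plan node: {plan!r}")

# Any frontend API could emit this plan; any engine could consume it.
tables = {"orders": [{"id": 1, "amount": 5}, {"id": 2, "amount": 50}]}
plan = Project(Filter(Scan("orders"), "amount", 10), ("id",))
result = run(plan, tables)
```

swapping `run` for a different interpreter would not require touching the code that built `plan`, which is the property the bullets above are after.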

hypothesis

while a substrait-adopted data industry would see less churn from query-engine-specific ports of existing transformation APIs, I believe the debate over the tradeoffs of one API versus another will become more heated, though considerably more esoteric. The example from SWE that comes to mind is the age-old debate of OOP vs functional programming.

research questions

  • if compute performance is comparable b/w APIs, what tradeoffs between options remain on the API layer?
  • what categories and paradigms of APIs might exist?
  • is there formerly unmeasured utility in a transformation API's design? for example, in how it handles:
    • encouraging better reasoning about distributed transformations (e.g. pyspark)
    • method chaining vs nesting
    • debugging experience
    • abstraction laddering
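
to illustrate the "method chaining vs nesting" item, here is a minimal sketch in which two API styles build the identical plan. Everything here is hypothetical: the tuple-based plan, the `scan`/`where`/`select` functions, and the `Query` class are made up for illustration and do not come from any real library. If engine performance is equal, differences like this one (reading order, how easy it is to inspect an intermediate step) are what is left to argue about.

```python
# Toy plan representation: nested tuples,
# e.g. ("filter", ("scan", "orders"), "amount > 10")

# Style 1: nested function calls (the expression reads inside-out)
def scan(table):
    return ("scan", table)

def where(source, predicate):
    return ("filter", source, predicate)

def select(source, *columns):
    return ("project", source, columns)

nested_plan = select(where(scan("orders"), "amount > 10"), "id")

# Style 2: method chaining (the expression reads left-to-right)
class Query:
    def __init__(self, plan):
        self.plan = plan

    @classmethod
    def table(cls, name):
        return cls(scan(name))

    def where(self, predicate):
        # inside a method body, `where` resolves to the module-level function
        return Query(where(self.plan, predicate))

    def select(self, *columns):
        return Query(select(self.plan, *columns))

chained_plan = Query.table("orders").where("amount > 10").select("id").plan

# Both styles yield the same plan, so an engine could not tell them apart.
same = nested_plan == chained_plan
```

one design note: with chaining, each intermediate `Query` can be assigned to a variable and its `.plan` inspected, which bears on the "debugging experience" item above; the nested style needs temporary variables to get the same visibility.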