Codd, Chomsky, McKinney, and Wickham walk into a bar… (maybe Chamberlain and Wittgenstein should also be included?)
- the dream of substrait is true separation b/w query engines and transformation APIs
- previously, particular APIs would give better performance due to their inextricable link to the architecture of the underlying compute engine
- if the above benefit is removed, folks could use the API with which they are most familiar
- given this, we could see an industry consolidation around the “best” transformation API.
while a substrait-adopted data industry would result in less churn of query-engine-specific ports of existing transformation APIs, I believe that the tradeoffs of using one API over another will become more heated, though considerably more esoteric. The example from SWE that comes to mind if the age-old debate of OOP vs functional programming.
- if compute performance is comparable b/w APIs, what tradeoffs between options remain on the API layer?
- what categorization and paradigm of APIs might exits?
- is there formerly unmeasured utility in a transformation APIs design? in that they might:
- encourage better reasoning about distributed transformations (e.g. pyspark)
- method chaining vs nesting?
- debugging experience?
- abstraction laddering?