dataders/substrait_is_now.md

## substrait_is_now.md

      
    working title


Codd, Chomsky, McKinney, and Wickham walk into a bar…
(maybe Chamberlin and Wittgenstein should also be included?)

background


the dream of substrait is true separation b/w query engines and transformation APIs
previously, particular APIs would give better performance due to their inextricable link to the architecture of the underlying compute engine
if the above benefit is removed, folks could use the API with which they are most familiar
given this, we could see an industry consolidation around the “best” transformation API.
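
a toy sketch of the separation described above, in plain Python. everything here is hypothetical: the `Plan` class, its tuple-based steps, and the two "engines" are made up for illustration and are not the Substrait format or any real library's API. the point is only that one API can emit a plan that different engines execute with the same result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical engine-neutral plan: an ordered tuple of steps."""
    steps: tuple = ()

    def filter(self, column, minimum):
        # Record the step instead of executing it.
        return Plan(self.steps + (("filter", column, minimum),))

    def select(self, *columns):
        return Plan(self.steps + (("select", columns),))


def run_eager(plan, rows):
    """Hypothetical engine A: builds a full list after every step."""
    for step in plan.steps:
        if step[0] == "filter":
            _, column, minimum = step
            rows = [r for r in rows if r[column] >= minimum]
        elif step[0] == "select":
            _, columns = step
            rows = [{c: r[c] for c in columns} for r in rows]
    return list(rows)


def _filter_rows(rows, column, minimum):
    for r in rows:
        if r[column] >= minimum:
            yield r


def _select_columns(rows, columns):
    for r in rows:
        yield {c: r[c] for c in columns}


def run_lazy(plan, rows):
    """Hypothetical engine B: streams rows through chained generators."""
    stream = iter(rows)
    for step in plan.steps:
        if step[0] == "filter":
            stream = _filter_rows(stream, step[1], step[2])
        elif step[0] == "select":
            stream = _select_columns(stream, step[1])
    return list(stream)


rows = [{"name": "a", "score": 1}, {"name": "b", "score": 5}]
plan = Plan().filter("score", 3).select("name")

eager_result = run_eager(plan, rows)
lazy_result = run_lazy(plan, rows)
```

the user-facing code (`Plan().filter(...).select(...)`) never mentions an engine, so swapping `run_eager` for `run_lazy` changes how the work is done but not what is asked for.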

hypothesis

while a substrait-adopted data industry would result in less churn of query-engine-specific ports of existing transformation APIs, I believe that the tradeoffs of using one API over another will become more heated, though considerably more esoteric. The example from SWE that comes to mind is the age-old debate of OOP vs functional programming.

research questions


if compute performance is comparable b/w APIs, what tradeoffs between options remain on the API layer?
what categorizations and paradigms of APIs might exist?
is there formerly unmeasured utility in a transformation API's design? in that it might:

encourage better reasoning about distributed transformations (e.g. pyspark)
method chaining vs nesting?
debugging experience?
abstraction laddering?
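
to make the "method chaining vs nesting" question concrete, here is a minimal sketch of the same transformation written both ways. the `Table` class and the free functions `where`, `sort_by`, and `head` are hypothetical stand-ins, not taken from pandas, dplyr, PySpark, or any other real API.

```python
class Table:
    """Hypothetical chainable wrapper around a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, predicate):
        return Table(r for r in self.rows if predicate(r))

    def sort_by(self, key):
        return Table(sorted(self.rows, key=lambda r: r[key]))

    def head(self, n):
        return Table(self.rows[:n])


# Hypothetical free-function versions of the same operations.
def where(rows, predicate):
    return [r for r in rows if predicate(r)]


def sort_by(rows, key):
    return sorted(rows, key=lambda r: r[key])


def head(rows, n):
    return rows[:n]


orders = [
    {"id": 1, "amount": 5},
    {"id": 2, "amount": 30},
    {"id": 3, "amount": 20},
    {"id": 4, "amount": 40},
]

# Chaining: reads top to bottom in execution order.
chained = (
    Table(orders)
    .where(lambda r: r["amount"] > 10)
    .sort_by("amount")
    .head(2)
    .rows
)

# Nesting: the first step to run is the innermost call.
nested = head(sort_by(where(orders, lambda r: r["amount"] > 10), "amount"), 2)
```

both forms compute the same rows, so any difference between them is about readability, debugging, and how easily a step can be inserted or removed, which is the kind of API-layer tradeoff the questions above are asking about.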