@jrevels
Last active October 20, 2022 20:45

Comprehensive row validation requires row construction

If I am given a row value x of unknown provenance that I'd like to comprehensively validate (i.e. validate field values as well as field types) against schema s, then my only option is to construct a new row y = Legolas.row(s, x), at which point I can use y as my "s-validated" row. There is no mechanism to fully validate x "in-place".
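
For concreteness, here's a minimal sketch of that pattern (assuming s is an already-declared schema, x is any Tables.jl-compliant row value, and consume stands in for whatever downstream code does with the result):

y = Legolas.row(s, x) # construct a new row, validating `x`'s fields against `s` in the process
consume(y)            # downstream code uses the `s`-validated `y`, never the original `x`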

The primary reason that this is the case is that field assignments may introduce implicit value-level constraints on a schema's input/output domain, and there is no enforceable mechanism by which schema authors are made to explicitly surface independently computable predicates for these implicit constraints.

For example, take the schema:

@schema("foo@1",
        x::Int = clamp(x, 1, 100))

In this example, the value-level constraint that 1 <= x <= 100 is implicit in the RHS of the field assignment statement. You could imagine a world where we enable/encourage schema authors to explicitly surface such implicit constraints via some mechanism like the one below (xref beacon-biosignals/Legolas.jl#20):

@schema("foo@1",
        @field(x::Int = clamp(x, 1, 100)),
        @check(1 <= x <= 100))

However, even if Legolas did have such functionality, Legolas still could not actually enforce that every constraint implied by a field assignment is surfaced as an explicit constraint; this is an inescapable consequence of allowing arbitrary Julia code on the RHS of field assignments. Thus, the only practical way for a downstream data consumer to comprehensively check all of a given schema's constraints against a given row is via Legolas' validation-by-construction approach.

Legolas-aware code can't be simultaneously generic, validated, and cost-free

As a result of the previous section, it's generally impossible to write a function that achieves all three of these properties simultaneously (modulo artificial cop-outs like leveraging special knowledge about a given schema, performing some weird global state trickery, etc.):

  1. The function is generic w.r.t. its arguments' types, as long as they are Tables.jl-compliant row values.
  2. The function guarantees that its arguments comply with a given schema before actually using the arguments.
  3. The function avoids computing "validated rows" from its arguments.

If two of the above hold for a given function, then the third is de facto ruled out:

  • If 1 & 2 hold, then the function must perform its own validation, and thus must validate by construction (as discussed in the previous section).
  • If 1 & 3 hold, then the function has no guarantee that its arguments are valid w.r.t. a given schema.
  • If 2 & 3 hold, then the function must restrict its argument types to guarantee (at least nominally) that its arguments were created via validated construction. Note this restriction can manifest itself as an explicit dispatch constraint or some other implicit constraint (e.g. a trait-based constraint), but regardless, it'll have to pop up somehow within the function for 2 & 3 to hold.

Here's a maneuver that takes the above into account. Consider the current Legolas implementation on main and the statement r = Legolas.row(s::S, x). Here, let's say Legolas.row is implemented such that r's type (let's call it R) indicates that it was constructed in an s-validated manner. Though it's not currently the behavior on main, we could justifiably introduce a method like the following that bakes in Legolas' idempotency requirement (which, in this PR, has already been elevated from a mere recommendation to a requirement):

Legolas.row(s::S, r::R) = r

With this method in place, I could write a function like this:

function f(x)
    r = Legolas.row(s, x)
    return _f(r) # perform the actual computation `f` intends to perform
end

Here f fulfills 2 & 3 if x has already been validated (since it hits the no-op fallback), and fulfills 1 & 2 otherwise (note that it can't actually simultaneously fulfill 1 & 2 & 3, though).
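
To spell that out (a quick sketch; the NamedTuple below stands in for an arbitrary Tables.jl-compliant row carrying whatever fields s requires):

nt = (; x = 5)        # an arbitrary unvalidated row value
f(nt)                 # validates by construction inside `f`, fulfilling 1 & 2
f(Legolas.row(s, nt)) # the argument is already an `R`, so the no-op method kicks in, fulfilling 2 & 3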

The funny thing here is that if I encountered this code as a reviewer, I'd do two things:

  • First, I'd really try to confirm that full validation of x is "reasonably necessary" for _f. I don't have a concrete mental algorithm for this - it's kind of fuzzy, so maybe it's subjective - but I'd ask questions like "How many s-required fields does it really expect to be present/accessible?" If I find that validation seems reasonably unnecessary, I'd recommend removing the validation requirement entirely. In other words, I'd suggest that the code be changed to fit squarely in the 1 & 3 camp.

  • If it is found that full validation of x is, in fact, reasonably necessary for _f, then I'd recommend punting the fulfillment of that expectation to the caller on the basis of the single-responsibility principle. I'd suggest setting the dispatch constraint x::R in the first place, such that f and _f become one and the same (see the sketch below). In other words, I'd suggest that the code be changed to fit squarely in the 2 & 3 camp. In this case, the relevant code truly is non-generic, so feigning genericness is a bad idea, and thus the use of a Legolas.Row-like type really is warranted.

Either way, though, I'd suggest a refactor that punts dynamism downstream to callers.
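
To illustrate the second bullet's suggestion, the refactored code might look roughly like this (a sketch, where R is the validated row type produced by Legolas.row(s, _) as above):

function f(r::R)
    # ...the actual computation formerly performed by `_f`...
end

f(Legolas.row(s, x)) # callers holding only an unvalidated `x` perform the construction themselves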

Another possible pattern that "feigns genericness" (and thus is an antipattern) is the use of a validated construction method in a context where the relevant schema isn't actually known a priori (i.e. is passed into the invocation context). For example:

function g(s, stuff...)
    r = Legolas.row(s, some_transform(stuff...))
    return _g(r) # perform the actual computation `g` intends to perform
end

My hunch is that if you're in a situation where you don't know a priori what the schema might be, then you're in a situation where a validated constructor isn't actually useful to you at all anyway; you should refactor the code to fit more squarely into the 1 & 3 camp or the 2 & 3 camp.

...but I still want to avoid redundant validation/conversion and enable nice interop w/ schema-aware systems.

Consider this common data production scenario:

A data producer constructs table t via schema-aware validated construction methods (e.g. Legolas.row(s, _)). The data producer then serializes t from its native Julia format to some standardized interchange format (JSON, Arrow, etc.) and uploads it to a schema-aware system for downstream consumption.

In this scenario, the producer did a lot of upfront work by constructing t via validated s-aware methods. As such, the serialization step should be able to avoid performing extra work to revalidate t and/or infer its schema.

Legolas on main is currently subpar on this front. Let's say I've defined const Foo = @row("foo@1", a::Int, b::Union{Float64,String}); I might use Foo[] to try to allocate an empty mutable valid-by-construction row table. However, this eltype restriction is actually too loose to provide the guarantees we desire to enable efficient serialization; since each Foo instance may contain different additional non-required fields, determining the validity/schema of Foo[] as a table requires inspecting every row. Annoying. You could use a more concrete eltype upfront, but currently actually expressing that eltype is cumbersome.
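
To make the looseness concrete (a sketch against main's behavior as described above; the field values are arbitrary):

rows = Foo[]                                # eltype is the non-concrete `Foo`
push!(rows, Foo(; a=1, b=2.0))              # this row carries only the required fields
push!(rows, Foo(; a=2, b="two", extra=:x))  # same schema, but with an extra non-required field
# the two elements have different concrete types, so determining the table-level
# schema of `rows` requires inspecting every element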

I see two ways to improve this situation:

  1. Define better utilities for dynamically constructing more concrete row types from extensions of known schemas, e.g. row_type(Foo; b=Float64, c=Symbol) returns a concrete subtype that corresponds to a row that has all required fields of Foo, but with b::Float64 and an additional non-required field c::Symbol.
  2. Disallow this kind of "dynamic extension" entirely. If you want Legolas-generated validated construction utilities for an extension to a schema, that means you've entered a regime where you should explicitly declare that subschema ahead-of-time.

I think we should go with option 2. It's more constrained than option 1 (i.e. we can introduce option 1 later if it's desirable w/o inducing a breaking change) and option 2 induces nice second-order effects: it motivates users to declare schemas in a manner that is registerable by downstream Legolas-aware systems, and thus downstream schema-aware systems can feasibly require ahead-of-time schema registration (and thus have stronger type-checking guarantees) w/o pushing their own idiosyncratic limitations onto schema-aware data producers (e.g. avoids the "Yes, Legolas does help you emit this kind of data, but this particular system doesn't support ingesting it" annoyance).

Furthermore, one might argue that option 2 nicely excises an extraneous concern. Why should a validated construction utility be concerned with "forwarding" fields that the utility isn't otherwise concerned with? If a caller wants to tack on additional fields to a constructed row, they probably should employ a more targeted method post-hoc to do just that (e.g. rowmerge).
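
For example, the post-hoc route could look like this (a sketch, assuming the Foo declaration from above; rowmerge yields a plain NamedTuple):

foo = Foo(; a=1, b=2.0)          # validated construction of the required fields only
extended = rowmerge(foo; c=:tag) # tack the extra field on afterwards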

Some conclusions I've drawn

(some of these follow from the above, some are other related observations)

  • Legolas' API should encourage Legolas-aware code to inhabit the most suitable edge of the "generic x validated x cost-free" triangle given the code's context
  • We do, in fact, want a "row type" for each declared schema version to facilitate reuse of prior validation and well-typed table construction
  • However, Legolas.Row has some warts. Ideally, our mechanism for generating row types should...
    • ...not enable users to express non-instantiable/incomplete/invalid types (e.g. indicated schema does not match indicated field structure)
    • ...convey the full available field structure of the instantiated row in an inferrable manner (whereas e.g. Legolas.Row{<:Schema} doesn't have an inferrable Tables.Schema)
    • ...discourage generic use where it's not necessary/applicable (i.e. avoid "feigning genericness")
    • ...encourage ahead-of-time (sub)schema declaration rather than "dynamic schema extension"
    • ...avoid the problems discussed in beacon-biosignals/Legolas.jl#49

I think to achieve the above, I want to change the API to look roughly like this:

# Declares the `example.foo` schema and specifies that `Foo` is to be used
# as the prefix for types generated via subsequent `@version` declarations
@schema("example.foo", Foo)

# Replaces `main`'s `@row` macro. Generates:
# - a struct definition for a `FooV1 <: Tables.AbstractRow` type
# - a type alias `FooSchemaV1 = Legolas.SchemaVersion{Symbol("example.foo"),1}`
@version("example.foo@1", ...)

# Similar to above; generates:
# - a struct definition for a `FooV2 <: Tables.AbstractRow` type
# - a type alias `FooSchemaV2 = Legolas.SchemaVersion{Symbol("example.foo"),2}`
@version("example.foo@2", ...)

where the generated row types (e.g. FooV1, FooV2) ONLY contain the corresponding schema version's required fields, including those inherited from parent schemas. In other words, these types do not propagate "extra" fields.
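
To sketch the intended usage (hypothetical; the field name below is borrowed from the earlier foo@1 example, and the exact constructor behavior is still up for discussion):

foo = FooV1(; x=1)         # constructs + validates only `example.foo@1`'s required fields
foo isa Tables.AbstractRow # true, per the generated struct definition above
FooSchemaV1 === Legolas.SchemaVersion{Symbol("example.foo"),1} # true, per the alias above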

This approach provides ergonomic types/constructors for declared schema versions while also...

  • ...more explicitly separating the notion of a "schema" from its "versions"
  • ...rendering it impossible to express an invalid/incomplete subtype of a row type
  • ...discouraging inappropriate attempts at genericness, i.e. avoids enabling row types to be dynamically derived from a schema type or vice versa
  • ...excising the notion of "dynamic schema extension" from our dedicated validated construction utilities
  • ...avoiding the problems discussed in beacon-biosignals/Legolas.jl#49

What type parameters should be exposed on generated row types?

Okay, so let's say we go with the aforementioned proposal. Should generated row types have type parameters, and if so, how should they be defined?

Take, for example:

@schema("example.record", Record)

@version("example.record@1",
         file_path::Any,
         file_id::UUID,
         file_metadata::AbstractDict{String,String})

What should the struct definition for RecordV1 actually look like?

Option 1 is to simply not expose any type parameters:

struct RecordV1
    file_path::Any
    file_id::UUID
    file_metadata::AbstractDict{String,String}
end

This option discourages use of non-concrete field types in general, to the point where I'd almost be tempted to enforce that each field have a concrete type constraint - which feels a bit extreme, but is perhaps not unreasonable if the intent is to encourage serializability. Such a rule would also induce heavier use of schema extensions (e.g. leave file_path out of the parent schema entirely, and instead only include it with a concrete type constraint in extension schemas), which might be a good idea sometimes, but might also be a nightmare in some circumstances.

Option 2 is to uniformly expose a type parameter for every field:

struct RecordV1{F1<:Any,F2<:UUID,F3<:AbstractDict{String,String}}
    file_path::F1
    file_id::F2
    file_metadata::F3
end

This option is nice in that parameters positionally align with fields, but annoying if a field already has a concrete type constraint.

Option 3 is to only expose parameters for fields w/o concrete type constraints:

struct RecordV1{F1<:Any,F3<:AbstractDict{String,String}}
    file_path::F1
    file_id::UUID
    file_metadata::F3
end

This option maps more closely to how one might often implement such a struct outside of Legolas-land. But the mapping of parameter position to applicable field requires users to understand just a tad more about what's going on under the hood than I'd normally like. Kinda makes me wish Julia had kwarg-based type parameters for structs, a la RecordV1{file_path::F1}.

Option 4 is to let the schema author decide. For example, if you declare something like this:

@version("example.record@1",
         file_path::@parameter(Any),
         file_id::UUID,
         file_metadata::AbstractDict{String,String})

Your generated type will be:

struct RecordV1{T<:Any}
    file_path::T
    file_id::UUID
    file_metadata::AbstractDict{String,String}
end

Schema authors can then decide if/how to include these parameters in their schema's documented "API".

Option 4 seems like the safest bet to me, so let's go with this one.

Note that I think this means that authors of child schemas would have to explicitly re-declare any of the parent's @parameter declarations that they'd like to preserve, e.g.:

@schema("example.child-record", ChildRecord)

@version("example.child-record@1 > example.record@1",
         file_path::@parameter(Any), # without this redeclaration, the `file_path` field
                                     # would still be inherited, but its field type would
                                     # not be exposed as a `ChildRecordV1` type parameter
         another_thing::Int,
         some_other_thing::@parameter(AbstractString))

What if my generated row type's field types are more tightly constrained than required column types for schema validation purposes?

Imagine the following:

@schema("example.baz", Baz)

@version("example.baz@1", id::UUID = UUID(id))

Here's a case where the author clearly wants BazV1(fields).id to be a UUID. However, what if we tried to validate this schema version against an Arrow table serialized from another language? In this case, the table wouldn't have the Julia-relevant metadata to associate the id column with the UUID Julia type; it'd probably have a type that maps to Julia's UInt128 type instead.

How should Legolas handle this? We could just force the schema author to change id's type constraint to Union{UInt128,UUID}, but this is annoying if the generated row type's field really is expected to be UUID.

Instead, Legolas should define a hook like:

accepted_field_type(::SchemaVersion, T::DataType) = T

...that is invoked by our validation code to preprocess fields' declared types before comparing them against provided types.

The author of example.baz@1 could then add an overload of accepted_field_type:

accepted_field_type(::BazSchemaV1, ::Type{UUID}) = Union{UUID,UInt128}

We could define sane default accepted_field_type methods to enable portability for common types, and authors can define their own methods to override those defaults or expand validation behavior for custom types.
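
For illustration, the Tables.Schema-level check could boil down to something like the following (a made-up sketch, not the actual implementation; validate_column is a hypothetical helper name):

using UUIDs: UUID

function validate_column(sv, declared::Type, provided::Type)
    provided <: accepted_field_type(sv, declared) || error("invalid column type: $provided")
    return nothing
end

# with the `example.baz@1` overload above, both of these would pass
# (assuming `SchemaVersion` instances are constructible like this):
validate_column(BazSchemaV1(), UUID, UUID)
validate_column(BazSchemaV1(), UUID, UInt128)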

@ericphanson

Kinda makes me wish Julia had kwarg-based type parameters for structs, a la RecordV1{file_path::F1}.

BTW you might know this, but there's an @NamedTuple{file_path::String} macro for creating NamedTuple types easily.
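
For reference, that macro just expands to the positional form:

@NamedTuple{file_path::String} === NamedTuple{(:file_path,),Tuple{String}} # true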

file_path::@parameter(Any),

I think it would be nicer if the syntax was file_path::@parameter(<:Any) instead. To me file_path::@parameter(Any) looks/feels more like ::Any, whereas adding the subtype symbol feels more like a concrete subtype of Any.

accepted_field_type(::BazV1Schema, ::Type{UUID}) = Union{UUID,UInt128}

Should accepted_field_type do the conversion too? That could be convenient for common cases, but maybe it's too hidden. Like

field_type_conversion(::SchemaVersion, ::Type{UUID}, instance::UInt128) = UUID(instance)

I think maybe the explicit way of accepted_field_type is better though.

@ericphanson

where the generated row types (e.g. FooV1, FooV2) ONLY contain the corresponding schema version's required fields, including those inherited from parent schemas. In other words, these types do not propagate "extra" fields.

So to make sure I understand, before, an example.record@1 would be a Legolas.Row{} with a type parameter indicating a schema, and wrapping a NamedTuple to hold the content. Now, an example.record@1 would be its own struct, Example.RecordV1 (living in the package, not in Legolas?), and would not capture the "extra" columns. In particular, passing to the Example.RecordV1 constructor would drop extra columns (or error if they are present?). This would encourage folks to use schema extensions more, if they need to keep those extra columns around through serialization or something else that involves creating the Example.RecordV1 struct, and make a stronger separation between a constructed-and-therefore-validated Example.RecordV1 row and just a Tables.jl-compatible row with some columns.

I think this sounds reasonable. I think it would also perhaps improve serialization of nested schemas. E.g. one could make the schema-to-be-nested fully concrete, and have far fewer issues serializing a row or table of them as a field in another schema, since the layout should be fixed.


Going one step further, to me this almost feels like just a "better struct" rather than specifically a row. We were already starting to blur that boundary using Legolas.@row to define configuration objects (which could be put in a homogeneous table if you need to loop over configs for some reason, but mostly were kept flat and just used for convenience and serialization help). And by "better struct" I don't mean better for every use, but better for a lot of the common uses, where the struct is semantically a collection of fields (as opposed to a struct defining a dictionary, which is semantically a collection of key-value pairs, and whose fields are very internal).

For example, there are already macros like AutoHashEquals (which I think is somewhat weird in that it lowers == to recursive isequal instead of recursive ==) to help with defining "a struct which is defined by its fields", since that implies a particular notion of equality, hash, etc., as well as Base.@kwdef to help with construction. I feel like with this, Legolas is kind of taking the idea further, by providing a much more comprehensive construction system, a notion of extension, versioning, and some help with serialization. In other words, I feel like rather than necessarily being directly about "rows", this is more generally about "record types" (which can be rows or can be configuration or something else). I like it!

@jrevels (Author) commented Oct 17, 2022

BTW you might know this, but there's an @NamedTuple{field_path::String} constructor for creating NamedTuple types easily.

oh actually i didn't - dope!

I think it would be nicer if the syntax was file_path::@parameter(<:Any) instead. To me file_path::@parameter(Any) looks/feels more like ::Any, whereas adding the subtype symbol feels more like a concrete subtype of Any.

Hmm but actually I think that ::Any feeling is fine? Unless I'm misunderstanding. The field's type constraint is indeed ::Any, its type parameter constraint is <:Any. Note that this proposal does not force instantiators to pick concrete subtypes for exposed type parameters (and neither does it enforce the flip side: forcing schema authors to pick concrete subtypes for unparameterized field types)

Should accepted_field_type do the conversion too?

The callers I have in mind for accepted_field_type (e.g. Legolas' Tables.Schema validation methods) are only handling types, not values, so I think if we did something like this it'd be via a separate new "hook". Worth consideration, though.

So to make sure I understand, before, an example.record@1 would be a Legolas.Row{} with a type parameter indicating a schema, and wrapping a NamedTuple to hold the content. Now, an example.record@1 would be its own struct, Example.RecordV1 (living in the package, not in Legolas?), and would not capture the "extra" columns.

That's correct.

In particular, passing to the Example.RecordV1 constructor would drop extra columns (or error if they are present?).

IMO we should have them drop instead of error. That bakes in slightly more "responsibility", but the convenience is worth it, and I wouldn't necessarily want to provide a separate "unvalidated" mechanism for extracting a schema version's required fields from a given row value.

This would encourage folks to use schema extensions more, if they need to keep those extra columns around through serialization or something else that involves creating the Example.RecordV1 struct, and make a stronger separation between a constructed-and-therefore-validated Example.RecordV1 row and just a Tables.jl-compatible row with some columns.

Yup. And for clarity, I don't think of it as "I can only consider a row to be example.record@1-compliant if it's an Example.RecordV1", but rather "I can construct and pass around a validated bundle of required fields via Example.RecordV1 type".

I guess in some abstract sense it strengthens the notion that the point of declaring a schema version is to facilitate interactions with those required fields; additional non-required fields aren't disallowed, but they're ignorable
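
Concretely, I'm imagining something like this (a hypothetical sketch; the exact constructor semantics are TBD):

using UUIDs: uuid4

row = (; file_path="data/a.txt", file_id=uuid4(),
         file_metadata=Dict("k" => "v"), extra_column=1)
record = RecordV1(row) # `extra_column` is silently dropped; the required fields are
                       # validated as declared by `example.record@1`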

I think this sounds reasonable. I think it would also perhaps improve serialization of nested schemas. E.g. one could make the schema-to-be-nested fully concrete, and have far fewer issues serializing a row or table of them as a field in another schema, since the layout should be fixed.

Going one step further, to me this almost feels like just a "better struct" rather than specifically a row. We were already starting to blur that boundary using Legolas.@row to define configuration objects (which could be put in a homogeneous table if you need to loop over configs for some reason, but mostly were kept flat and just used for convenience and serialization help). And by "better struct" I don't mean better for every use, but better for a lot of the common uses, where the struct is semantically a collection of fields (as opposed to a struct defining a dictionary, which is semantically a collection of key-value pairs, and whose fields are very internal).

For example, there are already macros like AutoHashEquals (which I think is somewhat weird in that it lowers == to recursive isequal instead of recursive ==) to help with defining "a struct which is defined by its fields", since that implies a particular notion of equality, hash, etc., as well as Base.@kwdef to help with construction. I feel like with this, Legolas is kind of taking the idea further, by providing a much more comprehensive construction system, a notion of extension, versioning, and some help with serialization. In other words, I feel like rather than necessarily being directly about "rows", this is more generally about "record types" (which can be rows or can be configuration or something else). I like it!

That's the dream! 😁 I agree about the "record type" terminology, I'll start using that more. I often wish I had a language like Julia but with more structural subtyping / data-oriented programming features, so in some ways Legolas is my outlet for that lol. I guess the weird part about it is the interplay between the "embedded (but hopefully somewhat portable/mappable?) IDL for (de)serializable tabular data" goal vs. the "expressive Julia-centric record type" goal. my hope/bet is that the best ergonomics lie at the intersection of the two

@jrevels (Author) commented Oct 17, 2022

Hmm but actually I think that ::Any feeling is fine? Unless I'm misunderstanding. The field's type constraint is indeed ::Any, its type parameter constraint is <:Any.

oh maybe it's the choice of the word parameter for the macro that's bothersome - maybe @parameterize would be clearer, though a bit long. could also do something like @expose (i.e. "expose a parameter for this constraint")

@ericphanson

I might’ve misunderstood but to me it sounds like the difference between

struct A{T<:Any}
    x::T
end

and

struct A
    x::Any
end

So @expose(Any) looks more like the second than the first.

@jrevels (Author) commented Oct 18, 2022

in that case, I wonder if this syntax is too esoteric:

@version("example.child-record@1 > example.record@1",
         file_path::(<:Any)
         another_thing::Int
         some_other_thing::(<:AbstractString))

i kinda like that; it's at least no worse than @expose IMO

I thought about actually going full-on e.g. file_path::(T where T<:Any) but it's kinda verbose and I don't think I want folks to be able to reference these parameters on the RHS of field assignments (mainly because i don't want to care about generating the code to make that work correctly lol)

@ericphanson

I like that! To me it is fairly clear.

The lack of commas also matches a struct definition, although with the use of parens, I'd kind of expect them. E.g. in terms of commas, I think I would prefer

@version("example.child-record@1 > example.record@1",
         file_path::(<:Any),
         another_thing::Int,
         some_other_thing::(<:AbstractString))

or

@version "example.child-record@1 > example.record@1" begin
     file_path::(<:Any)
     another_thing::Int
     some_other_thing::(<:AbstractString)
end

@jrevels (Author) commented Oct 18, 2022

oh the lack of commas was just a typo lol. but i actually like the begin end version you proposed better anyway so now I'm glad I made the mistake 😂
