@jrevels
Last active October 20, 2022 20:45

Comprehensive row validation requires row construction

If I am given a row value x of unknown provenance that I'd like to comprehensively validate (i.e. validate field values as well as field types) against schema s, then my only option is to construct a new row y = Legolas.row(s, x), at which point I can use y as my "s-validated" row. There is no mechanism to fully validate x "in-place".
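
For concreteness, here's a minimal sketch of that pattern (assuming s is an already-declared schema, x is any Tables.jl-compliant row value, and consume stands in for whatever downstream code does with the result):

y = Legolas.row(s, x) # construct a new row, validating `x`'s fields against `s` in the process
consume(y)            # downstream code uses the `s`-validated `y`, never the original `x`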

The primary reason that this is the case is that field assignments may introduce implicit value-level constraints on a schema's input/output domain, and there is no enforceable mechanism by which schema authors are made to explicitly surface independently computable predicates for these implicit constraints.

For example, take the schema:

@schema("foo@1",
        x::Int = clamp(x, 1, 100))

In this example, the value-level constraint that 1 <= x <= 100 is implicit in the RHS of the field assignment statement. You could imagine a world where we enable/encourage schema authors to explicitly surface such implicit constraints via some mechanism like the one below (xref beacon-biosignals/Legolas.jl#20):

@schema("foo@1",
        @field(x::Int = clamp(x, 1, 100)),
        @check(1 <= x <= 100))

However, even if Legolas did have such functionality, Legolas still could not actually enforce that every constraint implied by a field assignment is surfaced as an explicit constraint; this is an inescapable consequence of allowing arbitrary Julia code on the RHS of field assignments. Thus, the only practical way for a downstream data consumer to comprehensively check all of a given schema's constraints against a given row is via Legolas' validation-by-construction approach.

Legolas-aware code can't be simultaneously generic, validated, and cost-free

As a result of the previous section, it's generally impossible to write a function that achieves all three of these properties simultaneously (modulo artificial cop-outs like leveraging special knowledge about a given schema, performing some weird global state trickery, etc.):

  1. The function is generic w.r.t. its arguments' types, as long as they are Tables.jl-compliant row values.
  2. The function guarantees that its arguments comply with a given schema before actually using the arguments.
  3. The function avoids computing "validated rows" from its arguments.

If two of the above hold for a given function, then the third is de facto ruled out:

  • If 1 & 2 hold, then the function must perform its own validation, and thus must validate by construction (as discussed in the previous section).
  • If 1 & 3 hold, then the function has no guarantee that its arguments are valid w.r.t. a given schema.
  • If 2 & 3 hold, then the function must restrict its argument types to guarantee (at least nominally) that its arguments were created via validated construction. Note this restriction can manifest itself as an explicit dispatch constraint or some other implicit constraint (e.g. a trait-based constraint), but regardless, it'll have to pop up somehow within the function for 2 & 3 to hold.

Here's a maneuver that takes the above into account. Consider the current Legolas implementation on main and the statement r = Legolas.row(s::S, x). Here, let's say Legolas.row is implemented such that r's type (let's call it R) indicates that it was constructed in an s-validated manner. Though it's not currently the behavior on main, we could justifiably introduce a method like the following that bakes in Legolas' idempotency requirement (which, in this PR, has already been elevated from a mere recommendation to a requirement):

Legolas.row(s::S, r::R) = r

With this method in place, I could write a function like this:

function f(x)
    r = Legolas.row(s, x)
    return _f(r) # perform the actual computation `f` intends to perform
end

Here f fulfills 2 & 3 if x has already been validated (since it hits the no-op fallback), and fulfills 1 & 2 otherwise (note that it can't actually simultaneously fulfill 1 & 2 & 3, though).
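
To spell that out (a quick sketch; the NamedTuple below stands in for an arbitrary Tables.jl-compliant row carrying whatever fields s requires):

nt = (; x = 5)        # an arbitrary unvalidated row value
f(nt)                 # validates by construction inside `f`, fulfilling 1 & 2
f(Legolas.row(s, nt)) # the argument is already an `R`, so the no-op method kicks in, fulfilling 2 & 3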

The funny thing here is that if I encountered this code as a reviewer, I'd do two things:

  • First, I'd really try to confirm that full validation of x is "reasonably necessary" for _f. I don't have a concrete mental algorithm for this - it's kind of fuzzy, so maybe it's subjective - but I'd ask questions like "How many s-required fields does it really expect to be present/accessible?" If I find that validation seems reasonably unnecessary, I'd recommend removing the validation requirement entirely. In other words, I'd suggest that the code be changed to fit squarely in the 1 & 3 camp.

  • If it is found that full validation of x is, in fact, reasonably necessary for _f, then I'd recommend punting the fulfillment of that expectation to the caller on the basis of the single-responsibility principle. I'd suggest setting the dispatch constraint x::R in the first place, such that f and _f become one and the same (see the sketch below). In other words, I'd suggest that the code be changed to fit squarely in the 2 & 3 camp. In this case, the relevant code truly is non-generic, so feigning genericness is a bad idea, and thus the use of a Legolas.Row-like type really is warranted.

Either way, though, I'd suggest a refactor that punts dynamism downstream to callers.
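
To illustrate the second bullet's suggestion, the refactored code might look roughly like this (a sketch, where R is the validated row type produced by Legolas.row(s, _) as above):

function f(r::R)
    # ...the actual computation formerly performed by `_f`...
end

f(Legolas.row(s, x)) # callers holding only an unvalidated `x` perform the construction themselves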

Another possible pattern that "feigns genericness" (and thus is an antipattern) is the use of a validated construction method in a context where the relevant schema isn't actually known a priori (i.e. is passed into the invocation context). For example:

function g(s, stuff...)
    r = Legolas.row(s, some_transform(stuff...))
    return _g(r) # perform the actual computation `g` intends to perform
end

My hunch is that if you're in a situation where you don't know a priori what the schema might be, then you're in a situation where a validated constructor isn't actually useful to you at all anyway; you should refactor the code to fit more squarely into the 1 & 3 camp or the 2 & 3 camp.

...but I still want to avoid redundant validation/conversion and enable nice interop w/ schema-aware systems.

Consider this common data production scenario:

A data producer constructs table t via schema-aware validated construction methods (e.g. Legolas.row(s, _)). The data producer then serializes t from its native Julia format to some standardized interchange format (JSON, Arrow, etc.) and uploads it to a schema-aware system for downstream consumption.

In this scenario, the producer did a lot of upfront work by constructing t via validated s-aware methods. As such, the serialization step should be able to avoid performing extra work to revalidate t and/or infer its schema.

Legolas on main is currently subpar on this front. Let's say I've defined const Foo = @row("foo@1", a::Int, b::Union{Float64,String}); I might use Foo[] to try to allocate an empty mutable valid-by-construction row table. However, this eltype restriction is actually too loose to provide the guarantees we desire to enable efficient serialization; since each Foo instance may contain different additional non-required fields, determining the validity/schema of Foo[] as a table requires inspecting every row. Annoying. You could use a more concrete eltype upfront, but currently actually expressing that eltype is cumbersome.
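
To make the looseness concrete (a sketch against main's behavior as described above; the field values are arbitrary):

rows = Foo[]                                # eltype is the non-concrete `Foo`
push!(rows, Foo(; a=1, b=2.0))              # this row carries only the required fields
push!(rows, Foo(; a=2, b="two", extra=:x))  # same schema, but with an extra non-required field
# the two elements have different concrete types, so determining the table-level
# schema of `rows` requires inspecting every element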

I see two ways to improve this situation:

  1. Define better utilities for dynamically constructing more concrete row types from extensions of known schemas, e.g. row_type(Foo; b=Float64, c=Symbol) returns a concrete subtype that corresponds to a row that has all required fields of Foo, but with b::Float64 and an additional non-required field c::Symbol.
  2. Disallow this kind of "dynamic extension" entirely. If you want Legolas-generated validated construction utilities for an extension to a schema, that means you've entered a regime where you should explicitly declare that subschema ahead-of-time.

I think we should go with option 2. It's more constrained than option 1 (i.e. we can introduce option 1 later if it's desirable w/o inducing a breaking change) and option 2 induces nice second-order effects: it motivates users to declare schemas in a manner that is registerable by downstream Legolas-aware systems, and thus downstream schema-aware systems can feasibly require ahead-of-time schema registration (and thus have stronger type-checking guarantees) w/o pushing their own idiosyncratic limitations onto schema-aware data producers (e.g. avoids the "Yes, Legolas does help you emit this kind of data, but this particular system doesn't support ingesting it" annoyance).

Furthermore, one might argue that option 2 nicely excises an extraneous concern. Why should a validated construction utility be concerned with "forwarding" fields that the utility isn't otherwise concerned with? If a caller wants to tack on additional fields to a constructed row, they probably should employ a more targeted method post-hoc to do just that (e.g. rowmerge).
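
For example, the post-hoc route could look like this (a sketch, assuming the Foo declaration from above; rowmerge yields a plain NamedTuple):

foo = Foo(; a=1, b=2.0)          # validated construction of the required fields only
extended = rowmerge(foo; c=:tag) # tack the extra field on afterwards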

Some conclusions I've drawn

(some of these follow from the above, some are other related observations)

  • Legolas' API should encourage Legolas-aware code to inhabit the most suitable edge of the "generic x validated x cost-free" triangle given the code's context
  • We do, in fact, want a "row type" for each declared schema version to facilitate reuse of prior validation and well-typed table construction
  • However, Legolas.Row has some warts. Ideally, our mechanism for generating row types should...
    • ...not enable users to express non-instantiable/incomplete/invalid types (e.g. indicated schema does not match indicated field structure)
    • ...convey the full available field structure of the instantiated row in an inferrable manner (whereas e.g. Legolas.Row{<:Schema} doesn't have an inferrable Tables.Schema)
    • ...discourage generic use where it's not necessary/applicable (i.e. avoid "feigning genericness")
    • ...encourage ahead-of-time (sub)schema declaration rather than "dynamic schema extension"
    • ...avoid the problems discussed in beacon-biosignals/Legolas.jl#49

I think to achieve the above, I want to change the API to look roughly like this:

# Declares the `example.foo` schema and specifies that `Foo` is to be used
# as the prefix for types generated via subsequent `@version` declarations
@schema("example.foo", Foo)

# Replaces `main`'s `@row` macro. Generates:
# - a struct definition for a `FooV1 <: Tables.AbstractRow` type
# - a type alias `FooSchemaV1 = Legolas.SchemaVersion{Symbol("example.foo"),1}`
@version("example.foo@1", ...)

# Similar to above; generates:
# - a struct definition for a `FooV2 <: Tables.AbstractRow` type
# - a type alias `FooSchemaV2 = Legolas.SchemaVersion{Symbol("example.foo"),2}`
@version("example.foo@2", ...)

where the generated row types (e.g. FooV1, FooV2) ONLY contain the corresponding schema version's required fields, including those inherited from parent schemas. In other words, these types do not propagate "extra" fields.
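
To sketch the intended usage (hypothetical; the field name below is borrowed from the earlier foo@1 example, and the exact constructor behavior is still up for discussion):

foo = FooV1(; x=1)         # constructs + validates only `example.foo@1`'s required fields
foo isa Tables.AbstractRow # true, per the generated struct definition above
FooSchemaV1 === Legolas.SchemaVersion{Symbol("example.foo"),1} # true, per the alias above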

This approach provides ergonomic types/constructors for declared schema versions while also...

  • ...more explicitly separating the notion of a "schema" from its "versions"
  • ...rendering it impossible to express an invalid/incomplete subtype of a row type
  • ...discouraging inappropriate attempts at genericness, i.e. avoids enabling row types to be dynamically derived from a schema type or vice versa
  • ...excising the notion of "dynamic schema extension" from our dedicated validated construction utilities
  • ...avoiding the problems discussed in beacon-biosignals/Legolas.jl#49

What type parameters should be exposed on generated row types?

Okay, so let's say we go with the aforementioned proposal. Should generated row types have type parameters, and if so, how should they be defined?

Take, for example:

@schema("example.record", Record)

@version("example.record@1",
         file_path::Any,
         file_id::UUID,
         file_metadata::AbstractDict{String,String})

What should the struct definition for RecordV1 actually look like?

Option 1 is to simply not expose any type parameters:

struct RecordV1
    file_path::Any
    file_id::UUID
    file_metadata::AbstractDict{String,String}
end

This option discourages use of non-concrete field types in general, to the point where I'd almost be tempted to enforce that each field have a concrete type constraint - which feels a bit extreme, but is perhaps not unreasonable if the intent is to encourage serializability. Such a rule would also induce heavier use of schema extensions (e.g. leave file_path out of the parent schema entirely, and instead only include it with a concrete type constraint in extension schemas), which might be a good idea sometimes, but might also be a nightmare in some circumstances.

Option 2 is to uniformly expose a type parameter for every field:

struct RecordV1{F1<:Any,F2<:UUID,F3<:AbstractDict{String,String}}
    file_path::F1
    file_id::F2
    file_metadata::F3
end

This option is nice in that parameters positionally align with fields, but annoying if a field already has a concrete type constraint.

Option 3 is to only expose parameters for fields w/o concrete type constraints:

struct RecordV1{F1<:Any,F3<:AbstractDict{String,String}}
    file_path::F1
    file_id::UUID
    file_metadata::F3
end

This option maps more closely to how one might often implement such a struct outside of Legolas-land. But the mapping of parameter position to applicable field requires users to understand just a tad more about what's going on under the hood than I'd normally like. Kinda makes me wish Julia had kwarg-based type parameters for structs, a la RecordV1{file_path::F1}.

Option 4 is to let the schema author decide. For example, if you declare something like this:

@version("example.record@1",
         file_path::@parameter(Any),
         file_id::UUID,
         file_metadata::AbstractDict{String,String})

Your generated type will be:

struct RecordV1{T<:Any}
    file_path::T
    file_id::UUID
    file_metadata::AbstractDict{String,String}
end

Schema authors can then decide if/how to include these parameters in their schema's documented "API".

Option 4 seems like the safest bet to me, so let's go with this one.

Note that I think this means that authors of child schemas would have to explicitly re-declare any of the parent's @parameter declarations that they'd like to preserve, e.g.:

@schema("example.child-record", ChildRecord)

@version("example.child-record@1 > example.record@1",
         file_path::@parameter(Any), # without this redeclaration, the `file_path` field
                                     # would still be inherited, but its field type would
                                     # not be exposed as a `ChildRecordV1` type parameter
         another_thing::Int,
         some_other_thing::@parameter(AbstractString))

What if my generated row type's field types are more tightly constrained than required column types for schema validation purposes?

Imagine the following:

@schema("example.baz", Baz)

@version("example.baz@1", id::UUID = UUID(id))

Here's a case where the author clearly wants BazV1(fields).id to be a UUID. However, what if we tried to validate this schema version against an Arrow table serialized from another language? In this case, the table wouldn't have the Julia-relevant metadata to associate the id column with the UUID Julia type; it'd probably have a type that maps to Julia's UInt128 type instead.

How should Legolas handle this? We could just force the schema author to change id's type constraint to Union{UInt128,UUID}, but this is annoying if the generated row type's field really is expected to be UUID.

Instead, Legolas should define a hook like:

accepted_field_type(::SchemaVersion, T::DataType) = T

...that is invoked by our validation code to preprocess fields' declared types before comparing them against provided types.

The author of example.baz@1 could then add an overload of accepted_field_type:

accepted_field_type(::BazSchemaV1, ::Type{UUID}) = Union{UUID,UInt128}

We could define sane default accepted_field_type methods to enable portability for common types, and authors can define their own methods to override those defaults or expand validation behavior for custom types.
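
For illustration, the Tables.Schema-level check could boil down to something like the following (a made-up sketch, not the actual implementation; validate_column is a hypothetical helper name):

using UUIDs: UUID

function validate_column(sv, declared::Type, provided::Type)
    provided <: accepted_field_type(sv, declared) || error("invalid column type: $provided")
    return nothing
end

# with the `example.baz@1` overload above, both of these would pass
# (assuming `SchemaVersion` instances are constructible like this):
validate_column(BazSchemaV1(), UUID, UUID)
validate_column(BazSchemaV1(), UUID, UInt128)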

@ericphanson

Kinda makes me wish Julia had kwarg-based type parameters for structs, a la RecordV1{file_path::F1}.

BTW you might know this, but there's an @NamedTuple{file_path::String} macro for creating NamedTuple types easily.
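
For reference, that macro just expands to the positional form:

@NamedTuple{file_path::String} === NamedTuple{(:file_path,),Tuple{String}} # true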

file_path::@parameter(Any),

I think it would be nicer if the syntax was file_path::@parameter(<:Any) instead. To me file_path::@parameter(Any) looks/feels more like ::Any, whereas adding the subtype symbol feels more like a concrete subtype of Any.

accepted_field_type(::BazV1Schema, ::Type{UUID}) = Union{UUID,UInt128}

Should accepted_field_type do the conversion too? That could be convenient for common cases, but maybe it's too hidden. Like

field_type_conversion(::SchemaVersion, ::Type{UUID}, instance::UInt128) = UUID(instance)

I think maybe the explicit way of accepted_field_type is better though.

@ericphanson

where the generated row types (e.g. FooV1, FooV2) ONLY contain the corresponding schema version's required fields, including those inherited from parent schemas. In other words, these types do not propagate "extra" fields.

So to make sure I understand, before, an example.record@1 would be a Legolas.Row{} with a type parameter indicating a schema, and wrapping a NamedTuple to hold the content. Now, an example.record@1 would be its own struct, Example.RecordV1 (living in the package, not in Legolas?), and would not capture the "extra" columns. In particular, passing to the Example.RecordV1 constructor would drop extra columns (or error if they are present?). This would encourage folks to use schema extensions more, if they need to keep those extra columns around through serialization or something else that involves creating the Example.RecordV1 struct, and make a stronger separation between a constructed-and-therefore-validated Example.RecordV1 row and just a Tables.jl-compatible row with some columns.

I think this sounds reasonable. I think it would also perhaps improve serialization of nested schemas. E.g. one could make the schema-to-be-nested fully concrete, and have far fewer issues serializing a row or table of them as a field in another schema, since the layout should be fixed.


Going one step further, to me this almost feels like just a "better struct" rather than specifically a row. We were already starting to blur that boundary using Legolas.@row to define configuration objects (which could be put in a homogeneous table if you need to loop over configs for some reason, but mostly were kept flat and just used for convenience and serialization help). And by "better struct" I don't mean better for every use, but better for a lot of the common uses, where the struct is semantically a collection of fields (as opposed to a struct defining a dictionary, which is semantically a collection of key-value pairs, and whose fields are very internal).

For example, there are already macros like AutoHashEquals (which I think is somewhat weird in that it lowers == to recursive isequal instead of recursive ==) to help with defining "a struct which is defined by its fields", since that implies a particular notion of equality, hash, etc., as well as Base.@kwdef to help with construction. I feel like with this, Legolas is kind of taking the idea further, by providing a much more comprehensive construction system, a notion of extension, versioning, and some help with serialization. In other words, I feel like rather than necessarily being directly about "rows", this is more generally about "record types" (which can be rows or can be configuration or something else). I like it!

@jrevels (Author) commented Oct 17, 2022

BTW you might know this, but there's an @NamedTuple{field_path::String} constructor for creating NamedTuple types easily.

oh actually i didn't - dope!

I think it would be nicer if the syntax was file_path::@parameter(<:Any) instead. To me file_path::@parameter(Any) looks/feels more like ::Any, whereas adding the subtype symbol feels more like a concrete subtype of Any.

Hmm but actually I think that ::Any feeling is fine? Unless I'm misunderstanding. The field's type constraint is indeed ::Any, its type parameter constraint is <:Any. Note that this proposal does not force instantiators to pick concrete subtypes for exposed type parameters (and neither does it enforce the flip side: forcing schema authors to pick concrete subtypes for unparameterized field types)

Should accepted_field_type do the conversion too?

The callers I have in mind for accepted_field_type (e.g. Legolas' Tables.Schema validation methods) are only handling types, not values, so I think if we did something like this it'd be via a separate new "hook". Worth consideration, though.

So to make sure I understand, before, an example.record@1 would be a Legolas.Row{} with a type parameter indicating a schema, and wrapping a NamedTuple to hold the content. Now, an example.record@1 would be its own struct, Example.RecordV1 (living in the package, not in Legolas?), and would not capture the "extra" columns.

That's correct.

In particular, passing to the Example.RecordV1 constructor would drop extra columns (or error if they are present?).

IMO we should have them drop instead of error. That bakes in slightly more "responsibility", but the convenience is worth it, and I wouldn't necessarily want to provide a separate "unvalidated" mechanism for extracting a schema version's required fields from a given row value.

This would encourage folks to use schema extensions more, if they need to keep those extra columns around through serialization or something else that involves creating the Example.RecordV1 struct, and make a stronger separation between a constructed-and-therefore-validated Example.RecordV1 row and just a Tables.jl-compatible row with some columns.

Yup. And for clarity, I don't think of it as "I can only consider a row to be example.record@1-compliant if it's an Example.RecordV1", but rather "I can construct and pass around a validated bundle of required fields via Example.RecordV1 type".

I guess in some abstract sense it strengthens the notion that the point of declaring a schema version is to facilitate interactions with those required fields; additional non-required fields aren't disallowed, but they're ignorable
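
Concretely, I'm imagining something like this (a hypothetical sketch; the exact constructor semantics are TBD):

using UUIDs: uuid4

row = (; file_path="data/a.txt", file_id=uuid4(),
         file_metadata=Dict("k" => "v"), extra_column=1)
record = RecordV1(row) # `extra_column` is silently dropped; the required fields are
                       # validated as declared by `example.record@1`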

I think this sounds reasonable. I think it would also perhaps improve serialization of nested schemas. E.g. one could make the schema-to-be-nested fully concrete, and have far fewer issues serializing a row or table of them as a field in another schema, since the layout should be fixed.

Going one step further, to me this almost feels like just a "better struct" rather than specifically a row. We were already starting to blur that boundary using Legolas.@row to define configuration objects (which could be put in a homogeneous table if you need to loop over configs for some reason, but mostly were kept flat and just used for convenience and serialization help). And by "better struct" I don't mean better for every use, but better for a lot of the common uses, where the struct is semantically a collection of fields (as opposed to a struct defining a dictionary, which is semantically a collection of key-value pairs, and whose fields are very internal).

For example, there are already macros like AutoHashEquals (which I think is somewhat weird in that it lowers == to recursive isequal instead of recursive ==) to help with defining "a struct which is defined by its fields", since that implies a particular notion of equality, hash, etc., as well as Base.@kwdef to help with construction. I feel like with this, Legolas is kind of taking the idea further, by providing a much more comprehensive construction system, a notion of extension, versioning, and some help with serialization. In other words, I feel like rather than necessarily being directly about "rows", this is more generally about "record types" (which can be rows or can be configuration or something else). I like it!

That's the dream! 😁 I agree about the "record type" terminology, I'll start using that more. I often wish I had a language like Julia but with more structural subtyping / data-oriented programming features, so in some ways Legolas is my outlet for that lol. I guess the weird part about it is the interplay between the "embedded (but hopefully somewhat portable/mappable?) IDL for (de)serializable tabular data" goal vs. the "expressive Julia-centric record type" goal. my hope/bet is that the best ergonomics lie at the intersection of the two

@jrevels (Author) commented Oct 17, 2022

Hmm but actually I think that ::Any feeling is fine? Unless I'm misunderstanding. The field's type constraint is indeed ::Any, its type parameter constraint is <:Any.

oh maybe it's the choice of the word parameter for the macro that's bothersome - maybe @parameterize would be clearer, though a bit long. could also do something like @expose (i.e. "expose a parameter for this constraint")

@ericphanson

I might’ve misunderstood but to me it sounds like the difference between

struct A{T<:Any}
    x::T
end

and

struct A
    x::Any
end

So @expose(Any) looks more like the second than the first.

@jrevels (Author) commented Oct 18, 2022

in that case, I wonder if this syntax is too esoteric:

@version("example.child-record@1 > example.record@1",
         file_path::(<:Any)
         another_thing::Int
         some_other_thing::(<:AbstractString))

i kinda like that; it's at least no worse than @expose IMO

I thought about actually going full-on e.g. file_path::(T where T<:Any) but it's kinda verbose and I don't think I want folks to be able to reference these parameters on the RHS of field assignments (mainly because i don't want to care about generating the code to make that work correctly lol)

@ericphanson

I like that! To me it is fairly clear.

The lack of commas also matches a struct definition, although with the use of parens, I'd kind of expect them. E.g. in terms of commas, I think I would prefer

@version("example.child-record@1 > example.record@1",
         file_path::(<:Any),
         another_thing::Int,
         some_other_thing::(<:AbstractString))

or

@version "example.child-record@1 > example.record@1" begin
     file_path::(<:Any)
     another_thing::Int
     some_other_thing::(<:AbstractString)
end

@jrevels (Author) commented Oct 18, 2022

oh the lack of commas was just a typo lol. but i actually like the begin end version you proposed better anyway so now I'm glad I made the mistake 😂
