Schema

An example of a storage-agnostic schema layer that can drive queries, GraphQL, privacy, mutations, and other concerns. We'll call it Ent (short for entity).

A Facebook team from Tel Aviv open-sourced an Ent framework for Go. It is heavily inspired by the framework used by the rest of the company, except that EntGo:

  1. Only supports Go as a target language
  2. Only supports SQL and Gremlin as backends
  3. Has schema definitions written in Go rather than in a DSL or something like YAML.

I never got around to asking them why they didn't use the existing framework.

In their docs, the section on graph traversals gives a solid example of:

  1. How schemas could be defined
  2. The mutators generated from schemas that would be used to create data
  3. The query layer generated from schemas that is used to traverse the graph of entities.

We'll use this TS pseudo-schema for reference in the rest of the document:

// Schema.ts
ent('Person').fields({
  id: primaryKey(),
  name: string(),
}).edges({
  friends: junctionEdge.to('Person'),
  pets: foreignKeyEdge.to('Pet').through('ownerId'),
}).storage({
  engine: 'sql',
  db: 'persondb',
  table: 'person'
});

ent('Pet').fields({
  id: fromFields('ownerId', 'name'),
  name: string(),
  ownerId: id.of('Person')
}).edges({
  owner: fieldEdge.to('Person').through('ownerId')
}).storage({
  engine: 'zippy',
  db: 'petdb',
  usecase: 'pet'
});

// Interfaces for generated queries (in reality, implementations are generated)
interface PersonQuery extends Query<Person> {
  // Nit: in reality `edge queries` are generated which allow fetching
  // of data stored on an edge
  queryFriends(): PersonQuery;
  queryPets(): PetQuery;

  // Predicates can be "hoisted" to the backend (e.g., turned into where
  // clauses) in a query optimization step -or-
  // applied in code after the data is loaded.
  // This latter point is important for later.
  whereName(p: Predicate<string>): PersonQuery;
  whereId(p: Predicate<ID<Person>>): PersonQuery;
}

interface PetQuery extends Query<Pet> {
  queryOwner(): PersonQuery;

  whereName(p: Predicate<string>): PetQuery;
  whereOwner(p: Predicate<Person>): PetQuery;
  whereId(p: Predicate<ID<Pet>>): PetQuery;
}

interface Query<T> {
  take(n: number): this;
  union(...others: this[]): this;
  intersect(...others: this[]): this;
  concat(...others: this[]): this;
  after(cursor: Cursor): this;
  map<R>(fn: (e: T) => R): Query<R>;
  ids(): Query<ID<T>>;
  // ... other common things we can do with all query types
  
  gen(): Promise<T[]>;
}

// Interfaces for generated entities
interface Person {
  readonly id: ID<Person>;
  readonly name: string;

  queryFriends(): PersonQuery;
  queryPets(): PetQuery;
}

interface Pet {
  readonly id: ID<Pet>;
  readonly name: string;

  queryOwner(): PersonQuery;
  genOwner(): Promise<Person>; 
}
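
To make the generated API concrete, here's a hedged usage sketch. `Person.genById` is an assumed loading helper; everything else comes from the interfaces above.

// Hypothetical usage of the generated query & entity API
const person = await Person.genById(someId);
const susans = await person
  .queryFriends()
  .whereName(P.equals('Susan'))
  .gen();                                            // Promise<Person[]>
const petIds = await person.queryPets().ids().gen(); // Promise<ID<Pet>[]>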

Aside - Traversing Across Storage Backends

A natural question is how an ORM that supports multiple backends handles edges from an entity stored in one backend (e.g., SQL) to an entity stored in another (e.g., ZippyDB). The same problem exists between entity types on the same backend if they live on two different physical databases.

A few things:

  1. Applying queries generates a list of queries and derived queries
  2. An optimization step rolls this list together into expressions (e.g., SQLExpression, ZippyExpression)
  3. If everything can't be rolled into a single expression, expressions are chained through a concept of "chunked iteration"

Example

person.queryFriends().queryFriends().whereName(P.equals("Susan")).gen();

The above would generate a sequence of queries to find friends of friends named Susan. Since that sequence only touches the Person type, and that type is stored in a single DB (see the schema definition), the optimization step would roll it into a single SQL expression.
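
Speculatively, inspecting the optimized plan might look like the following; the `plan()` API is an assumption, and the SQL in the comment is only a rough illustration of the rolled-up expression.

// Hypothetical plan inspection (plan() is assumed, not part of Query above)
const q = person.queryFriends().queryFriends().whereName(P.equals('Susan'));
const [expr] = q.plan(); // a single SQLExpression, corresponding roughly to:
//   SELECT p2.*
//   FROM friends f1
//   JOIN friends f2 ON f2.source = f1.target
//   JOIN person p2 ON p2.id = f2.target
//   WHERE f1.source = ? AND p2.name = 'Susan'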

For a different example:

person.queryFriends().queryPets().take(5).gen();

Here Person and Pet live on two different storage systems, so we can't roll this into a single SQL or ZippyDB expression.

At this point we use a concept of "chunked iteration." The expression for the first query is executed and returns its results in chunks; the second query is chained after the first, kicking off and processing each chunk the first query sends it.

We use chunks because we want to operate on `chunkSize` items in parallel rather than on everything at once; everything at once could be too large. Also, if hops have limits and filters, we might fulfill the query before processing all of the data from the first hop, so we might as well work in batches. A sketch of this chaining is below.
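
A minimal sketch of that chaining, assuming each backend expression exposes its results as an async iterable of chunks (the names here are illustrative, not a real API):

// Chain a second hop after the first, chunk by chunk, stopping early
// once a take(n) limit is satisfied.
async function* chain<A, B>(
  source: AsyncIterable<A[]>,          // e.g., SQL chunks of Persons
  next: (chunk: A[]) => Promise<B[]>,  // e.g., ZippyDB lookups of their Pets
  limit: number,                       // from take(n)
): AsyncIterable<B[]> {
  let produced = 0;
  for await (const chunk of source) {
    const out = await next(chunk);
    yield out;
    produced += out.length;
    if (produced >= limit) return; // fulfilled before draining the first hop
  }
}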

Lastly, cross-storage indexing services do exist. If a type indexes its edges in one of these services, it can say so in its schema and the query layer can use that index instead.
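
Speculatively, declaring such an index might look like this; the `indexedBy` API and the index-service name are assumptions:

.edges({
  friends: junctionEdge.to('Person').indexedBy('globalEdgeIndex'),
})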

Mutations & GraphQL

Besides just encoding fields & edges, schemas can encode "actions" or "mutations."

ent('Person').fields(...).edges(...)
.actions({
  addFriend: args('Person'),
  buyPet: args('Pet'),
})

Given that the actions against a type are encoded declaratively, we can generate the GraphQL input args, the mutation calls, and the boilerplate function definitions on the backend. `args(EntityType)` would generally encode to requiring the ID of the entity for the GraphQL mutation call.
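
A sketch of what the generated backend boilerplate might look like, expressed as TS types; the names and shapes here (including Viewer) are assumptions, not real codegen output:

interface PersonMutations {
  // GraphQL: addFriend(personId: ID!, friendId: ID!): Person
  addFriend(viewer: Viewer, args: { friendId: ID<Person> }): Promise<Person>;
  // GraphQL: buyPet(personId: ID!, petId: ID!): Person
  buyPet(viewer: Viewer, args: { petId: ID<Pet> }): Promise<Person>;
}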

We could theoretically encode more information into the action schema, like which edges are changed. Maybe use this to drive subscriptions and re-fetch fragments?
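
For instance (purely speculative, the `touchesEdges` API is an assumption):

.actions({
  // which edges an action touches, so subscriptions / re-fetch fragments
  // know what to invalidate
  addFriend: args('Person').touchesEdges('friends'),
})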

Edges

We touched on the query layer earlier. A nice thing about this layer is that all queries implement the Query interface (defined at the top of the doc). Whenever an exposed field returns a subclass of Query, we can generate the corresponding GraphQL connection definition.

interface Person {
  @GraphQLField('friends')
  queryFriends(): PersonQuery;
}

which would create a GraphQL connection that looks something like:

type Person {
  friends(first: Int, last: Int, before: String, after: String): FriendsConnection
}

type FriendsConnection {
  edges: [FriendsEdge]
  pageInfo: PageInfo
}

type FriendsEdge {
  cursor: String
  node: Person
}

type PageInfo {
  hasNextPage: Boolean
  hasPreviousPage: Boolean
  endCursor: String
}

Note: people can also create their own query classes on the backend that subclass Query and get this goodness, rather than everything having to be ORM-driven. For example:
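
A sketch of such a hand-written query class; `BaseQuery` and the exposure mechanics are assumptions:

class TrendingPetsQuery extends BaseQuery<Pet> {
  // custom predicate specific to this query
  whereSpecies(p: Predicate<string>): this { /* ... */ return this; }
}

interface Person {
  @GraphQLField('trendingPets')
  queryTrendingPets(): TrendingPetsQuery; // gets a connection for free
}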

Limiting Exposure

Of course we don't want to expose everything in our schemas to GraphQL. To this end, we can declare which fields & edges are exposed and to which GraphQL schemas. The latter is probably an edge case in the wild? Or a design smell if a single entity type exposes fields to different GraphQL schemas.

ent('Person').fields(...).edges(...).actions(...)
.integrations({
  graphql: expose('name', 'friends', 'pets').to('ProdSchema')
})

Privacy

This is all well and good, but if you're directly exposing entities (people, pets), how do you ensure they're not loaded by the wrong individuals?

The schema layer can also encode privacy. Ideally people encode this declaratively, but that isn't possible in all cases. These privacy rules are checked every time an entity is loaded.

The EntGo docs have decent examples of this; a more condensed example is below.

ent('Person').fields(...).edges(...).actions(...).integrations(...)
.privacy({
  read: [
    AllowIf((viewer, person) => person.id === viewer.id),
    AllowIf(
      (viewer, person) =>
        person.queryFriends().whereId(P.equals(viewer.id)).exists()
    ),
    AlwaysDeny(),
  ],
  write: [
    AllowIf((viewer, person) => person.id === viewer.id),
    AlwaysDeny(),
  ]
})

This would enforce that any read of a person must be done by:

  • A viewer who is that person
  • A viewer who is a friend of that person

And writes can only be done by the person themselves.

The ORM checks these privacy rules upon loading the entities it receives back from the storage layer or before committing a write to the storage layer.
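
A minimal sketch of read-rule evaluation at load time; the Verdict and Viewer shapes are assumptions, with rules short-circuiting in order and defaulting to deny:

type Verdict = 'allow' | 'deny' | 'skip';
interface Viewer { id: string }
interface PrivacyRule<T> {
  evaluate(viewer: Viewer, ent: T): Promise<Verdict>;
}

async function canRead<T>(viewer: Viewer, ent: T, rules: PrivacyRule<T>[]): Promise<boolean> {
  for (const rule of rules) {
    const v = await rule.evaluate(viewer, ent); // e.g., an AllowIf predicate
    if (v !== 'skip') return v === 'allow';     // first allow/deny wins
  }
  return false; // nothing matched: default deny
}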

Policies / Purpose Use

Note: this is likely a foreign concept to most people.

In the face of GDPR and other such regulations that seek to enforce "purpose limitation" on data, privacy rules based on the current viewer aren't always enough.

Purpose limitation is the commitment that data x is only used for purposes y. An example of how this differs from viewer-based privacy:

Product X stores the phone number I gave it for two-factor authentication (2fac). I (the logged-in presence) can see my phone number in Product X. But although the logged-in presence can load the phone number in the product, that does not mean the product can use it for anything it wants; e.g., it can't use it for suggesting friends. The product needs to know that this phone number, even though it could be loaded, should only be passed to 2fac code.

Schemas can encode purpose limitation of their fields and/or types. The code that then loads data through the entity layer needs to supply information about its execution context (sketched after the examples below).

  • Is the execution context "friend suggestions"? Then phone numbers can't be read off the Person entity.
  • Is the execution context "2fac"? Then phone numbers can be read off the Person entity.
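
A speculative sketch of how this could look; the `allowedPurposes` API and the purpose-carrying `gen` overload are assumptions:

ent('Person').fields({
  phoneNumber: string().allowedPurposes('2fac'),
});

const person = await personQuery.gen({ purpose: '2fac' });
person.phoneNumber; // readable: the context matches the declared purpose
// gen({ purpose: 'friend-suggestions' }) would redact or refuse the field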

Given this is probably an edge case for most groups, I won't go into much more detail other than to link to this set of experiments for my future self.

Arbitrary Backends

Note -- we can even encode Thrift services or arbitrary code into our schema layer. They're just another thing that has a custom SourceExpression (e.g., SQLExpression, ZippyExpression, ThriftServiceExpression) for fetching data.
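
A hypothetical shape for a source expression; chunked output is what lets expressions chain across backends (the names here are assumptions):

interface SourceExpression<T> {
  chunks(chunkSize: number): AsyncIterable<T[]>;
}

class ThriftServiceExpression<T> implements SourceExpression<T> {
  constructor(
    private service: { fetch(ids: string[]): Promise<T[]> },
    private ids: string[],
  ) {}

  async *chunks(chunkSize: number): AsyncIterable<T[]> {
    for (let i = 0; i < this.ids.length; i += chunkSize) {
      yield await this.service.fetch(this.ids.slice(i, i + chunkSize));
    }
  }
}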

Random Thoughts

Do we need two schema layers? The one described here and GraphQL? Can they just converge to the same thing?

The one here is strictly more powerful and exposes things you would never want to expose via GraphQL. GraphQL would need some concept of "internal" (server-side?) operations and "external" operations.

GraphQL also currently has no way of encoding privacy rules on reads and writes.

To make one thing the "be-all and end-all" -- is that just going to make it too complicated to fit in someone's head? And be a turnoff? Should we take a more Python-style philosophy and have many small and focused tools?

I'd lean toward yes -- we need two layers, with GraphQL there to serve the single purpose of transport between server and client.

"Ent" as the full data description with integrations for extra concerns around that data. E.g., privacy, storage, transport (GraphQL), semantic type integrations.

tantaman commented Apr 1, 2022

Other things in the schema layer:

  • abuse protection (e.g., spam, account takeover) system integration
  • feature extraction (e.g., labeling what fields should be pulled for features for ML pipelines)
  • human content review (e.g., put semantic information on fields to derive meaningful visualizations)
  • rules / policy engine integration (similar to feature extraction)
  • ...

tantaman commented Apr 7, 2022

Database migrations can also be described and automated at the schema level.

You indicate which field is new and which field is being deprecated.
The schema layer then:

  1. generates mutators which perform dual-writes to the new and old fields from the application layer
  2. generates an offline migration job to back-fill the new field

Once all rows have the new field filled in, a new schema update is made that drops the old field from the schema.
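
Speculatively, declaring that in the schema might look like this (the `deprecatedInFavorOf` API is an assumption):

ent('Person').fields({
  displayName: string(),                              // the new field
  name: string().deprecatedInFavorOf('displayName'),  // dual-write + back-fill
});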
