Metabase Whitepaper
A database is a persistent store of data. A metabase is a persistent store of databases. You can think of a metabase as a higher-order database, one which drops certain constraints around data integrity to create an ecosystem optimized for more powerful kinds of data operations.
Let's say you're modeling note-taking. In a database, you may add a new row for each note -- something like this, maybe:
ID | TIMESTAMP | SUBJECT_ID | THOUGHTS |
---|---|---|---|
1 | 10:30pm | baseball | god I find baseball boring |
2 | 11:21am | coffee | coffee is so good! |
3 | 11:45am | coffee | ugh I need to drink less coffee. |
I've made one leap here, which is assuming that each of our notes can have a specific subject ID. The benefit is that future queries can get all notes about a given topic, right? But here's a question: let's say I decide that actually I love baseball. Do I write a new note, or do I go back and edit the old one? This is the eternal question. Am I creating a log of opinions over time, or am I creating an up-to-date artifact that reflects my current understanding? A database forces you to make that decision up-front.
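To make the dilemma concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names mirror the example table above; the function names, the "9:00am" timestamp and the wording of the new note are made up for illustration.

```python
import sqlite3

def make_db():
    """Build an in-memory copy of the example notes table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes ("
        "id INTEGER PRIMARY KEY, timestamp TEXT, subject_id TEXT, thoughts TEXT)"
    )
    conn.executemany(
        "INSERT INTO notes (timestamp, subject_id, thoughts) VALUES (?, ?, ?)",
        [
            ("10:30pm", "baseball", "god I find baseball boring"),
            ("11:21am", "coffee", "coffee is so good!"),
            ("11:45am", "coffee", "ugh I need to drink less coffee."),
        ],
    )
    return conn

def log_new_opinion(conn):
    """Option A: write a new note. History survives, but 'current' is ambiguous."""
    conn.execute(
        "INSERT INTO notes (timestamp, subject_id, thoughts) VALUES (?, ?, ?)",
        ("9:00am", "baseball", "actually I love baseball"),  # hypothetical new note
    )

def overwrite_opinion(conn):
    """Option B: edit the old note. The table is current, but the old opinion is gone."""
    conn.execute(
        "UPDATE notes SET thoughts = ? WHERE subject_id = ?",
        ("actually I love baseball", "baseball"),
    )

def baseball_notes(conn):
    """Return every note about baseball, oldest row first."""
    rows = conn.execute(
        "SELECT thoughts FROM notes WHERE subject_id = ? ORDER BY id", ("baseball",)
    )
    return [thoughts for (thoughts,) in rows]
```

Whichever function you call, the other view of the data is no longer available from this table alone, which is exactly the up-front decision described above.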
A metabase says: sometimes you need a database with a single updating value, and sometimes you need a database with a record of values over time. Our metabase tracks all metadata from every operation so that it can generate a completely lossless database with fully historical records, or it can generate a lossy snapshot that ignores history in favor of an efficient summary.
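As a rough sketch of that idea (not the actual metabase implementation), both views can be derived from one log of operations. The `Operation` type, the integer timestamps and the fourth log entry are assumptions made up for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operation:
    timestamp: int   # assumed: a simple increasing integer
    subject: str
    note: str

# The three notes from the table, plus a hypothetical change of heart.
LOG = (
    Operation(1, "baseball", "god I find baseball boring"),
    Operation(2, "coffee", "coffee is so good!"),
    Operation(3, "coffee", "ugh I need to drink less coffee."),
    Operation(4, "baseball", "actually I love baseball"),
)

def history(log, subject):
    """Lossless view: every note about a subject, oldest first."""
    ordered = sorted(log, key=lambda op: op.timestamp)
    return [op.note for op in ordered if op.subject == subject]

def snapshot(log):
    """Lossy view: only the most recent note per subject."""
    latest = {}
    for op in sorted(log, key=lambda op: op.timestamp):
        latest[op.subject] = op.note  # later operations overwrite earlier ones
    return latest
```

The log itself never changes; `history` and `snapshot` are just two different readings of it, so neither choice has to be made when the note is written.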
We use the following techniques to achieve this capability:
- The metabase is append-only. Once data has been written it may never be removed.
- The metabase is immutable. Once a record has been written it may never be changed.
- Every record in the metabase shares the same physical schema.
- Every query against the metabase includes a potentially unique logical schema.
- The relation between the physical schema of the underlying data and the logical schema of the queries is automatically resolved by following edges across a graph.
- Every record in the metabase is a complete and valid unit of information that can be integrated in query results, negated in a subsequent update or shared with other metabases.
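A few of these properties can be sketched in a handful of lines. This is a toy model under stated assumptions, not the real storage layer: the `Record` and `Metabase` names, the `negates` field and the integer record IDs are all hypothetical, and negating a negation is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)          # frozen: fields cannot be reassigned after creation
class Record:
    record_id: int
    claimant: str
    details: tuple               # (key, value) pairs, stored as an immutable tuple
    negates: Optional[int] = None  # assumed: ID of an earlier record this one negates

class Metabase:
    def __init__(self):
        self._records = []       # only ever appended to

    def append(self, claimant, details, negates=None):
        """Write a new record; nothing already written is touched."""
        record = Record(
            record_id=len(self._records) + 1,
            claimant=claimant,
            details=tuple(sorted(details.items())),
            negates=negates,
        )
        self._records.append(record)
        return record

    def all_records(self):
        """Lossless view: everything ever written, negations included."""
        return tuple(self._records)

    def active_records(self):
        """Current view: skip negation records and the records they negate."""
        negated = {r.negates for r in self._records if r.negates is not None}
        return [
            r for r in self._records
            if r.negates is None and r.record_id not in negated
        ]
```

Note that "changing your mind" is itself a write: the old record stays in `all_records()` and a later record cancels it, so removal and mutation are never needed.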
Here are the three notes from the table above, written as metabase claims:
Claim 1
timestamp: t
system_id: myk's metabase
claimant: myk
claim_details:
  person_with_opinion: myk
    on_property: opinions
  subject: baseball
    on_property: opinions_people_have_about_this
  opinion: "baseball is boring"
    on_property: null
Claim 2
timestamp: t2
system_id: myk's metabase
claimant: myk
claim_details:
  person_with_opinion: myk
    on_property: opinions
  subject: coffee
    on_property: popularity
  opinion: "coffee is so good!"
    on_property: null
Claim 3
timestamp: t3
system_id: myk's metabase
claimant: myk
claim_details:
  person_with_opinion: myk
    on_property: concerns
  subject: coffee
    on_property: concerns_people_have
  opinion: "I should drink less coffee"
    on_property: people_who_want_to_drink_less_of_something
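One way to read these claims is as records that all share a single shape, where each detail pairs a value with the property it attaches to. The sketch below models Claim 3 that way. The `Detail` and `Claim` types, the `role` field name and the `edges` helper are assumptions for illustration, not the metabase's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Detail:
    role: str                    # e.g. "subject"
    value: str                   # e.g. "coffee"
    on_property: Optional[str]   # e.g. "concerns_people_have"; None stands in for null

@dataclass(frozen=True)
class Claim:
    timestamp: str
    system_id: str
    claimant: str
    claim_details: Tuple[Detail, ...]

claim_3 = Claim(
    timestamp="t3",
    system_id="myk's metabase",
    claimant="myk",
    claim_details=(
        Detail("person_with_opinion", "myk", "concerns"),
        Detail("subject", "coffee", "concerns_people_have"),
        Detail(
            "opinion",
            "I should drink less coffee",
            "people_who_want_to_drink_less_of_something",
        ),
    ),
)

def edges(claim):
    """List the (value, on_property) pairs a claim contributes, skipping nulls."""
    return [
        (d.value, d.on_property)
        for d in claim.claim_details
        if d.on_property is not None
    ]
```

Each pair returned by `edges` is a candidate edge in the graph that links the physical records to whatever logical schema a query asks for.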
The big benefit here is at query time. All of the following are valid queries that we can now answer:
- does myk have any opinions about coffee?
- does myk have any concerns about coffee?
- does myk have any thoughts about coffee? (where thoughts has been defined elsewhere as opinions + concerns)
- has anyone registered the concern that they'd like to drink less of anything?
- how do people with (sentiment > .5) opinions about coffee feel about baseball?
- how consistent are people's opinions about coffee?
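The first three questions can be sketched as follows. The claims are flattened to tuples for brevity, and the `LOGICAL_SCHEMA` mapping, which defines thoughts as opinions + concerns, is a hypothetical stand-in for the graph resolution described earlier.

```python
# Each claim reduced to (person, person_property, subject, text),
# with values copied from Claims 1 to 3.
CLAIMS = [
    ("myk", "opinions", "baseball", "baseball is boring"),
    ("myk", "opinions", "coffee", "coffee is so good!"),
    ("myk", "concerns", "coffee", "I should drink less coffee"),
]

# Assumed logical schema: a query-time name mapped to stored properties.
LOGICAL_SCHEMA = {
    "opinions": {"opinions"},
    "concerns": {"concerns"},
    "thoughts": {"opinions", "concerns"},  # defined "elsewhere", at query time
}

def query(person, kind, subject):
    """Return the text of every claim by `person` of the given kind about `subject`."""
    properties = LOGICAL_SCHEMA[kind]
    return [
        text
        for who, prop, subj, text in CLAIMS
        if who == person and prop in properties and subj == subject
    ]
```

For example, `query("myk", "thoughts", "coffee")` returns both coffee claims even though nothing was ever stored as a "thought": the logical name only exists in the query.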
By actively seeking out specific metadata at write time and keeping it instead of discarding it, we can perform all kinds of fancy queries!