mbilokonsky/intro.md Secret

# Metabase Whitepaper
A database is a persistent store of data. A metabase is a persistent store of databases. You can think of a metabase as a higher-order database, one that drops certain constraints around data integrity to create an ecosystem optimized for more powerful kinds of data operations.

Let's say you're modeling note-taking. In a database, you might add a new row for each note, something like this:


| ID | TIMESTAMP | SUBJECT_ID | THOUGHTS |
|----|-----------|------------|----------|
| 1 | 10:30pm | baseball | god I find baseball boring |
| 2 | 11:21am | coffee | coffee is so good! |
| 3 | 11:45am | coffee | ugh I need to drink less coffee. |


I've made one leap here, which is assuming that each of our notes has a specific subject ID. The benefit is that future queries can fetch all notes about a given topic, right? But here's a question: let's say I decide that actually I love baseball. Do I write a new note, or do I go back and edit the old one? This is the eternal question. Am I creating a log of opinions over time, or am I creating an up-to-date artifact that reflects my current understanding? A database forces you to make that decision up front.

A metabase says: sometimes you need a database with a single updating value, and sometimes you need a database with a record of values over time. Our metabase tracks all metadata from every operation so that it can generate either a completely lossless database with full historical records or a lossy snapshot that ignores history in favor of an efficient summary.

We use the following techniques to achieve this capability:

- The metabase is append-only. Once data has been written, it may never be removed.
- The metabase is immutable. Once a record has been written, it may never be changed.
- Every record in the metabase shares the same physical schema.
- Every query against the metabase includes a potentially unique logical schema.
- The relation between the physical schema of the underlying data and the logical schema of the queries is automatically resolved by following edges across a graph.
- Every record in the metabase is a complete and valid unit of information that can be integrated in query results, negated in a subsequent update, or shared with other metabases.
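
To make the append-only, immutable, and negation techniques concrete, here is a minimal Python sketch. It is not the metabase's actual implementation: the names `Claim`, `Metabase`, `append`, `history`, and `snapshot`, and the `negates` field, are all hypothetical. Negation is assumed to be just another appended record that points at an earlier claim, and the snapshot is assumed to keep the latest surviving opinion per claimant and subject.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# One pointer inside a claim: (role, target, on_property),
# modeled on the claim records in this paper.
Pointer = Tuple[str, str, Optional[str]]


@dataclass(frozen=True)  # frozen: assigning to a field raises an error
class Claim:
    claim_id: int
    timestamp: int
    claimant: str
    details: Tuple[Pointer, ...]
    negates: Optional[int] = None  # hypothetical field: id of an earlier claim


class Metabase:
    """Append-only log: claims are added, never edited or removed."""

    def __init__(self) -> None:
        self._log: list = []

    def append(self, timestamp, claimant, details=(), negates=None) -> Claim:
        claim = Claim(len(self._log) + 1, timestamp, claimant, tuple(details), negates)
        self._log.append(claim)
        return claim

    def history(self) -> tuple:
        """Lossless view: every claim ever written, negations included."""
        return tuple(self._log)

    def snapshot(self) -> dict:
        """Lossy view: the latest non-negated opinion per (claimant, subject)."""
        negated = {c.negates for c in self._log if c.negates is not None}
        latest = {}
        for c in self._log:  # log order is write order, so later claims win
            if c.negates is not None or c.claim_id in negated:
                continue
            fields = {role: target for role, target, _ in c.details}
            if "subject" in fields and "opinion" in fields:
                latest[(c.claimant, fields["subject"])] = fields["opinion"]
        return latest


mb = Metabase()
first = mb.append(1, "myk", [("subject", "baseball", "opinions_people_have_about_this"),
                             ("opinion", "baseball is boring", None)])
mb.append(2, "myk", negates=first.claim_id)  # retract by appending, not deleting
mb.append(3, "myk", [("subject", "baseball", "opinions_people_have_about_this"),
                     ("opinion", "actually I love baseball", None)])

print(len(mb.history()))  # the full log, including the negated claim
print(mb.snapshot())      # only the current opinion survives
```

The same log answers both halves of the "new note or edit the old one?" question: `history()` is the record of opinions over time, and `snapshot()` is the up-to-date artifact derived from it.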

Under this model, the three notes above might be stored as claims like these:

Claim 1
  timestamp: t
  system_id: myk's metabase
  claimant: myk
  claim_details:
    person_with_opinion: myk
      on_property: opinions
    subject: baseball
      on_property: opinions_people_have_about_this
    opinion: "baseball is boring"
      on_property: null
    
Claim 2
  timestamp: t2
  system_id: myk's metabase
  claimant: myk
  claim_details:
    person_with_opinion: myk
      on_property: opinions
    subject: coffee
      on_property: popularity
    opinion: "coffee is so good!"
      on_property: null
    
Claim 3
  timestamp: t3
  system_id: myk's metabase
  claimant: myk
  claim_details:
    person_with_opinion: myk
      on_property: concerns
    subject: coffee
      on_property: concerns_people_have
    opinion: "I should drink less coffee"
      on_property: people_who_want_to_drink_less_of_something

The big benefit here is at query time. All of the following are valid queries that we can now answer:

- Does myk have any opinions about coffee?
- Does myk have any concerns about coffee?
- Does myk have any thoughts about coffee? (where thoughts has been defined elsewhere as opinions + concerns)
- Has anyone registered the concern that they'd like to drink less of anything?
- How do people with (sentiment > 0.5) opinions about coffee feel about baseball?
- How consistent are people's opinions about coffee?

By actively seeking out specific metadata at write time and maintaining it instead of discarding it, we can perform all kinds of fancy queries!
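
As a rough illustration, the first four queries can be answered by following the `on_property` edges recorded in each claim. The Python sketch below is one possible reading, not the metabase's query engine: the `edges` index, the helper functions, and the `THOUGHTS` definition are hypothetical names, and the sentiment and consistency queries are left out because they would need a scoring step this paper does not describe.

```python
from collections import defaultdict

# The three claims from this paper, each in the same physical shape:
# a list of (role, target, on_property) pointers.
claims = {
    1: [("person_with_opinion", "myk", "opinions"),
        ("subject", "baseball", "opinions_people_have_about_this"),
        ("opinion", "baseball is boring", None)],
    2: [("person_with_opinion", "myk", "opinions"),
        ("subject", "coffee", "popularity"),
        ("opinion", "coffee is so good!", None)],
    3: [("person_with_opinion", "myk", "concerns"),
        ("subject", "coffee", "concerns_people_have"),
        ("opinion", "I should drink less coffee",
         "people_who_want_to_drink_less_of_something")],
}

# Edge index: (target, on_property) -> ids of the claims that point there.
edges = defaultdict(set)
for claim_id, pointers in claims.items():
    for _role, target, on_property in pointers:
        if on_property is not None:
            edges[(target, on_property)].add(claim_id)

# A logical schema defined at query time: "thoughts" = opinions + concerns.
THOUGHTS = ("opinions", "concerns")


def claims_by(person, properties):
    """Claims attached to `person` under any of the given properties."""
    found = set()
    for prop in properties:
        found |= edges.get((person, prop), set())
    return found


def claims_about(subject):
    """Claims whose subject pointer targets `subject`."""
    return {cid for cid, ptrs in claims.items()
            if any(role == "subject" and target == subject
                   for role, target, _ in ptrs)}


def claims_with_property(prop):
    """Claims that attach anything at all to the given property."""
    return {cid for (_, p), ids in edges.items() if p == prop for cid in ids}


opinions_on_coffee = claims_by("myk", ["opinions"]) & claims_about("coffee")
concerns_on_coffee = claims_by("myk", ["concerns"]) & claims_about("coffee")
thoughts_on_coffee = claims_by("myk", THOUGHTS) & claims_about("coffee")
drink_less = claims_with_property("people_who_want_to_drink_less_of_something")

print(opinions_on_coffee, concerns_on_coffee, thoughts_on_coffee, drink_less)
```

Note that `THOUGHTS` is declared after the claims were written. Nothing in the stored records had to change for the new logical schema to work, which is the point of keeping the metadata around.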