@weavejester
Created March 6, 2012 01:47
Initial thoughts on Datomic


Rich Hickey (of Clojure fame) has released a cloud-based database called Datomic that has some interesting properties.

Datomic is a log of assertions and retractions of "facts", much as a DVCS like Git is a log of code diffs. The state of the database at any one time is the sum of all the assertions and retractions up to that date.
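To make the "sum of all the assertions and retractions" idea concrete, here is a toy model in Python. It is only a sketch of the concept, not Datomic's API or storage format: the `Datom` tuple, the `state_as_of` function and the attribute names are all made up for illustration.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Datom(NamedTuple):
    """One log entry in this toy model (hypothetical shape)."""
    entity: int
    attribute: str
    value: object
    tx: int       # transaction number, increasing over time
    added: bool   # True = assertion, False = retraction


def state_as_of(log: List[Datom], tx: Optional[int] = None) -> Set[Tuple]:
    """Fold the log into the set of facts that hold as of transaction `tx`.

    With tx=None the whole log is applied, giving the current state.
    """
    facts: Set[Tuple] = set()
    # sorted() is stable, so entries within one transaction keep their order.
    for datom in sorted(log, key=lambda d: d.tx):
        if tx is not None and datom.tx > tx:
            break
        fact = (datom.entity, datom.attribute, datom.value)
        if datom.added:
            facts.add(fact)
        else:
            facts.discard(fact)
    return facts


log = [
    Datom(1, "user/name", "alice", tx=100, added=True),
    Datom(1, "user/email", "alice@example.com", tx=100, added=True),
    # Transaction 101 changes the email: retract the old value, assert the new.
    Datom(1, "user/email", "alice@example.com", tx=101, added=False),
    Datom(1, "user/email", "a@example.org", tx=101, added=True),
]

current = state_as_of(log)
before_change = state_as_of(log, tx=100)
```

Nothing is ever deleted from `log`; the old email is still there, and `state_as_of(log, tx=100)` brings it back.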

Unlike Git, you cannot remove an assertion or retraction from the database. The log of changes is persistent, and always accessible. Because historical data cannot currently be removed from Datomic, using it in jurisdictions with strong privacy laws (like the EU) might be problematic. The only way around this at present is to instruct an attribute to throw away all history, via the :db/noHistory schema option.

Each assertion and retraction is bound to a transaction, and transactions are applied to the database via a transactor, a server that ensures transactions are atomic and do not conflict. This means you don't get the consistency problems of many NoSQL databases, but it does mean all writes must pass through a single machine.
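As a rough picture of what "all writes pass through a single machine" buys you, here is a toy single-writer in Python. It is a sketch under my own assumptions (the `Transactor` class, its `transact` method and the op format are invented here) and says nothing about how the real transactor is built.

```python
import threading


class Transactor:
    """Toy single writer: each transaction is applied whole, one at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_tx = 0
        self.log = []  # (entity, attribute, value, tx, added) tuples

    def transact(self, ops):
        """Apply ops, a list of ("add" | "retract", entity, attribute, value).

        Returns the transaction number. Either every op lands in the log
        or none does.
        """
        with self._lock:  # serializes writers, so transactions never interleave
            tx = self._next_tx
            staged = []
            for op, entity, attribute, value in ops:
                if op not in ("add", "retract"):
                    # Nothing has been appended yet, so the whole
                    # transaction is rejected.
                    raise ValueError("unknown op: %r" % (op,))
                staged.append((entity, attribute, value, tx, op == "add"))
            self.log.extend(staged)
            self._next_tx += 1
            return tx


transactor = Transactor()
first_tx = transactor.transact([
    ("add", 1, "user/name", "alice"),
    ("add", 1, "user/email", "alice@example.com"),
])
```

The lock is the bottleneck and the guarantee at the same time: writers queue up, but no reader of `log` can ever see half a transaction.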

If writes are somewhat bottlenecked, reads are anything but. Data is stored in the cloud in Amazon's DynamoDB, but is also aggressively cached by Datomic clients, known as Peers. This has the interesting result that Peers can run queries against Datomic entirely within their local memory.

So Datomic has atomic writes (with all their associated advantages and disadvantages), and what looks like super-fast cached reads.

Datomic's query language is Datalog, which should be familiar to many Clojure users, as it is also used in Clojure's core.logic library, and for writing Hadoop queries via Cascalog. I won't say more about Datalog, as there's a wealth of information on it online.

The queries themselves are constructed as simple data structures, which makes much more sense than parsing a string, as in SQL databases. It is also the approach taken by several NoSQL databases, such as MongoDB.
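To show why queries-as-data are pleasant to work with, here is a tiny pattern matcher in Python over (entity, attribute, value) facts. This is a loose imitation of the idea, not Datomic's query engine or syntax: the `query` function, the `"?name"` variable convention and the dictionary layout are my own assumptions.

```python
def is_var(term):
    """Variables are strings starting with '?' (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Unify one (e, a, v) pattern with one fact.

    Returns the extended bindings, or None if they do not fit.
    """
    bindings = dict(bindings)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if term in bindings and bindings[term] != value:
                return None
            bindings[term] = value
        elif term != value:
            return None
    return bindings


def query(find, where, facts):
    """Return the set of tuples of `find` variables satisfying every clause."""
    results = [{}]
    for pattern in where:
        extended = []
        for bindings in results:
            for fact in facts:
                new_bindings = match(pattern, fact, bindings)
                if new_bindings is not None:
                    extended.append(new_bindings)
        results = extended
    return {tuple(b[var] for var in find) for b in results}


facts = [
    (1, "user/name", "alice"),
    (1, "user/age", 30),
    (2, "user/name", "bob"),
    (2, "user/age", 17),
]

# The query is plain data, so it can be built and modified like any other value.
name_query = {"find": ["?name"], "where": [("?e", "user/name", "?name")]}
thirty_query = dict(name_query,
                    where=name_query["where"] + [("?e", "user/age", 30)])

everyone = query(name_query["find"], name_query["where"], facts)
thirty_year_olds = query(thirty_query["find"], thirty_query["where"], facts)
```

Adding a clause is just list concatenation; there is no string to splice together and nothing to escape.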

Datomic is not schemaless, which sets it apart from many NoSQL databases, but nor does it group entities into fixed tables, as in SQL. An entity consists of one or more attributes, and each attribute must be defined in the database schema. In this sense, schema attributes in Datomic have more in common with type definitions in a statically typed programming language.
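A small Python sketch of what "schema per attribute, not per table" could look like. The `SCHEMA` layout, its key names and the `validate` function are hypothetical and chosen only for illustration; they are not Datomic's schema format.

```python
# Each attribute is declared once; entities are free to use any mix of them.
SCHEMA = {
    "user/name": {"value_type": str, "cardinality": "one"},
    "user/age": {"value_type": int, "cardinality": "one"},
    "user/tags": {"value_type": str, "cardinality": "many"},
}


def validate(entity, schema):
    """Raise if an entity uses an undeclared attribute or a wrongly typed value."""
    for attribute, value in entity.items():
        if attribute not in schema:
            raise KeyError("unknown attribute: %s" % attribute)
        spec = schema[attribute]
        if spec["cardinality"] == "many":
            if not isinstance(value, (list, set, tuple)):
                raise TypeError("%s expects a collection of values" % attribute)
            values = value
        else:
            values = [value]
        for item in values:
            if not isinstance(item, spec["value_type"]):
                raise TypeError("%s expects %s, got %r"
                                % (attribute, spec["value_type"].__name__, item))


# Two entities with different attribute sets; there is no table either must fit.
validate({"user/name": "alice", "user/tags": ["admin", "staff"]}, SCHEMA)
validate({"user/age": 30}, SCHEMA)
```

The check is per attribute, much like a type annotation on a single field, which is why the comparison with a statically typed language seems apt.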

Like many databases, Datomic also supports indexes and uniqueness constraints via the schema. Since these indexes will be served either from DynamoDB or from local memory, querying data in Datomic seems like it should be very quick indeed.

Finally, Datomic has partitions, which are a way of grouping entities in the database. These differ from tables in SQL or collections in MongoDB in that Datomic partitions appear to only affect performance. Queries act on all partitions, but work faster across entities stored in the same partition.

So do I like it?

Actually, I really do. I suspect the use of a single transactor server to guarantee atomicity is going to take a lot of flak, but it seems like a reasonable compromise to me. Because all it's doing is managing transactions, rather than persisting data, the transactor should be more performant than a single-server database, and because it doesn't store any data, it'll matter less if it goes down. It's probably the optimum solution for maintaining atomicity.

I like the idea of a persistent transaction log (even though we'll definitely need a way to 'forget' data in future), and being able to retrieve snapshots of the database at any point in time. Querying using Datalog seems extremely powerful, and Datalog, like the relational model, is based on first-order logic: a solid theoretical base. I also really like the idea of running queries against an in-memory cache.

@andershessellund

I wonder if the partitions are going to be used to support concurrent write transactions, such that if two transactions access different partitions only, then they can run in parallel?

@stuartsierra

clojure.core.logic is not Datalog, but rather a dialect of Prolog. Strictly speaking, Datalog is a subset of Prolog, but the big differences are 1) order of expressions doesn't matter in Datalog; and 2) Datalog programs always terminate.

@maacl

maacl commented Mar 7, 2012

Nice writeup. Actually, Git stores snapshots of the entire source tree. Of course, Packfiles (http://book.git-scm.com/7_the_packfile.html) are used as an optimisation, but I don't think it is correct to say that Git works on the basis of a log of diffs.

@DanielJomphe

Nor would I say Datomic works on the basis of a log of fact diffs. In the end, though, I see the original intention of @weavejester as meaning both Datomic and Git work as a log of facts/contents, and thus, diffs are naturally available in between two given instants in their timeline.

@hozumi

hozumi commented Mar 9, 2012

Because all it's doing is managing transactions, rather than persisting data,
the transactor should be more performant than a single-server database,
and because it doesn't store any data, it'll matter less if it goes down.

According to this illustration, I think the transactor does store data, in memory or on disk and in index form, in order to forward it to the storage service.

@luposlip

The transactor is not a single point of failure.

This is from the Datomic FAQ at http://datomic.com/company/resources/faq:

Is the transactor a single point of failure?

No. Multiple transactors can be run in standby mode in a single auto-scaling group, each ready to take over in the case of a failing transactor. At no point can data be lost, as no transactions are acknowledged unless written through to the storage service.
