Pseudocode for new Interfaces

Cascading -> Persistence

The implementer uses Cascading to generate "Document" objects; these are kryo-serializable packets that each contain an entire database record. For example, for key->val and search:

  • KeyValDocument(byte[] k, byte[]v)
  • SearchDocument(LuceneDoc d)

And for key->set:

  • KeySetDocument(byte[] setKey, byte[] setVal)
  • KeySetDocument(byte[] setKey, int size)

Using key->set as an example, an Updater (see below) converts the KeySetDocument into the proper key and value byte arrays for BerkeleyDB.
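
As a rough sketch, the document classes might look like the following; the Document marker interface and the no-arg constructor are illustrative assumptions, not part of the spec above:

interface Document {}  // marker for kryo-serializable record packets

class KeyValDocument implements Document {
    public byte[] key;
    public byte[] value;

    public KeyValDocument() {}  // no-arg constructor for kryo deserialization

    public KeyValDocument(byte[] key, byte[] value) {
        this.key = key;
        this.value = value;
    }
}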

The user generates documents and sinks them into the ElephantTap. The "tail assembly" can be sugar over this process; the key-value tail assembly might accept key-value pairs and do the document conversion implicitly.

ElephantTap:

Constructor: ElephantTap(Args args)

The user sinks Documents into this tap; EDB takes care of the rest.

Args specify (see the sketch after this list):

  • PersistenceCoordinator<S implements ShardScheme> (required)
  • Number of Shards (required)
  • List of Kryo Serializations (optional) // register all with kryo
  • Target Filesystem Credentials (optional)
  • Cleanup parameters for versioned store (versions to keep? see DomainStore below for more thoughts.)
  • Updater<P implements LocalPersistence, D implements Document> (required)
    • bool loadOldVersion() // returns false by default; if true, EDB calls pickVersion()
    • string pickVersion() // returns the version the updater should load; defaults to LATEST
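
A minimal sketch of Args, assuming plain public fields; the field names are illustrative and the generics are elided:

import java.util.List;
import java.util.Map;

class Args {
    public PersistenceCoordinator coordinator;  // required
    public int numShards;                       // required
    public List<Class> serializations;          // optional; registered with kryo
    public Map<String, String> fsCredentials;   // optional target-FS credentials
    public int versionsToKeep;                  // cleanup for the versioned store
    public Updater updater;                     // required
}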

Kryo Serializations

Take the list of serializations from Args and register them with Kryo, in addition to the defaults (Clojure data structures, etc.).
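
For instance, the registration step might look like this; buildKryo is a hypothetical helper, while kryo.register with a bare Class is standard Kryo API:

import java.util.List;

import com.esotericsoftware.kryo.Kryo;

class KryoFactory {
    static Kryo buildKryo(List<Class> userSerializations) {
        Kryo kryo = new Kryo();
        // the defaults (clojure collections, etc.) would be registered here first
        for (Class klass : userSerializations) {
            kryo.register(klass);  // kryo picks a default serializer per class
        }
        return kryo;
    }
}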

PersistenceCoordinator<S implements ShardScheme, D implements Document>

  • openPersistenceForRead
  • openPersistenceForWrite
  • createPersistence
  • byte[] shardDocument(D document) // returns a proper key.

TODO: This class also needs to be able to shard NON-document objects, so that the client can implement both of the following (sketched after this list):

  • int getShard(D document)
  • int getShard(...anything that can build a document internally...)
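
Putting those pieces together, the interface might read roughly as follows; the persistence-method argument lists and the LocalPersistence stub are guesses, with ShardScheme as sketched in the next section:

interface LocalPersistence {}  // stub; the real interface lives elsewhere in EDB

interface PersistenceCoordinator<S extends ShardScheme, D extends Document> {
    LocalPersistence openPersistenceForRead(String root);
    LocalPersistence openPersistenceForWrite(String root);
    LocalPersistence createPersistence(String root);
    byte[] shardDocument(D document);  // returns a proper sharding/sorting key
    int getShard(D document);          // from the TODO above
}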

ShardScheme

  • Can be either HashMod or Range (see the sketch after this list).
    • HashMod keys just have to be serializable.
    • Range keys have to be serializable AND comparable.
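
A hash-mod scheme could be as small as the following sketch; shardIndex is a hypothetical method name:

import java.util.Arrays;

interface ShardScheme {
    int shardIndex(byte[] shardKey, int shardCount);
}

class HashModScheme implements ShardScheme {
    public int shardIndex(byte[] shardKey, int shardCount) {
        // mask the sign bit so the modulus stays non-negative
        return (Arrays.hashCode(shardKey) & Integer.MAX_VALUE) % shardCount;
    }
}

A Range scheme would additionally need the deserialized keys to be comparable, so that each shard can own a contiguous key range.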

Nathan, I want to go over the interface here one more time. Does the above shardDocument method work? Is the user responsible for the conversion from document to byte array for sharding and sorting, or should the user return some kryo-serializable thing?

Updater<P implements LocalPersistence, D implements Document>

The only required method is:

  • index(P persistence, D document) // is this the right name?

This is enough; the persistence and document logic are application-specific, and the EDB library need not be involved.
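
As a concrete illustration for key->set, an Updater might look like the following; BerkeleyDBPersistence and its add method are hypothetical application-side names, and KeySetDocument is assumed to carry the setKey/setVal fields listed earlier:

interface Updater<P extends LocalPersistence, D extends Document> {
    void index(P persistence, D document);  // name still under discussion above
}

class KeySetUpdater implements Updater<BerkeleyDBPersistence, KeySetDocument> {
    public void index(BerkeleyDBPersistence db, KeySetDocument doc) {
        // application-specific: convert the document into BDB key/value byte arrays
        db.add(doc.setKey, doc.setVal);
    }
}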

Persistence -> Cascading

A cascading tap should be able to source all records from a given DomainStore; not much else is required. The customizations above configure serialization properly, so the only remaining task is iteration over all Documents in a given LocalPersistence.

ElephantTap:

Same tap as above. When sourcing, this tap produces Documents (type determined by the PersistenceCoordinator's iterator). Again, the user can provide sugar over this to return key-value pairs directly.

Additional args necessary for sourcing:

  • domainVersion: an explicit version number, an offset like LATEST~1, or LATEST (the default). A resolution sketch follows.
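
A sketch of how the tap might resolve that string, assuming version identifiers are longs sorted newest-first; resolveVersion is a hypothetical helper:

import java.util.List;

class VersionResolver {
    static long resolveVersion(String spec, List<Long> versionsNewestFirst) {
        if (spec.equals("LATEST"))
            return versionsNewestFirst.get(0);
        if (spec.startsWith("LATEST~")) {
            // LATEST~1 means "one version older than the latest"
            int offset = Integer.parseInt(spec.substring("LATEST~".length()));
            return versionsNewestFirst.get(offset);
        }
        return Long.parseLong(spec);  // an explicit version number
    }
}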

ClosableIterator extends Iterator

ElephantTap depends on this iterator to retrieve all documents. One problem with this approach is that it requires a single split for every shard. This is noted below under PROBLEMS.
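
The interface itself is tiny; emitAll below is a sketch of how the tap might drive it, with Document as sketched earlier:

import java.util.Iterator;

interface ClosableIterator<T> extends Iterator<T> {
    void close();  // release persistence handles once iteration finishes
}

class SourceSketch {
    static void emitAll(ClosableIterator<Document> docs) {
        try {
            while (docs.hasNext()) {
                Document d = docs.next();
                // hand d to cascading's output collector here
            }
        } finally {
            docs.close();  // always close, even on failure mid-iteration
        }
    }
}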

Persistence -> Thrift

This section concerns the interaction between a given thrift service (or any other front-end) and the domain store. Once the user configures a ClientService, she should be able to hold an atomic reference to a given collection of shards (LocalPersistence instances).

The domain store needs to allow a bit more control than it currently does over HOW updating actually happens, and how file syncing between various domains occurs.

ClientService

Accepts a configuration map and configures the local DomainStore and the Updater; a sketch of such a map follows the list below.

  • Configure updater.
    • How often to check for new versions?
    • Accepts pre- and post-update hooks.
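
The configuration map might carry entries like the sketch below; every key name here is invented for illustration:

import java.util.HashMap;
import java.util.Map;

class ClientServiceConfig {
    static Map<String, Object> example() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put("update.interval.secs", 60);  // how often to check for new versions
        conf.put("pre.update.hook", new Runnable() {
            public void run() { /* e.g. pause serving before the swap */ }
        });
        conf.put("post.update.hook", new Runnable() {
            public void run() { /* e.g. notify a web interface or send email */ }
        });
        return conf;
    }
}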

DomainStore Extensions

Some of these functions would benefit from accepting callbacks. This would make it easy to hook in a web interface or email updates. Callbacks would also make swapping persistences more intuitive: pull-version with a callback that swaps the domains.

(defprotocol IDomainStore
  (remote-versions [_] "Returns a sequence of available remote versions.")
  (local-versions [_] "Returns a sequence of available local version strings.")
  (pull-version [_ other-store other-version] "Transfers the supplied version of other-store to this domain store.")
  (push-version [_ other-store this-version] "Transfers the supplied version to the remote domain store.")
  (sync [_ other-store & opts] "Syncs versions; opts can specify whether to sync two-way or one-way (which way?).")
  (compact [_ keep-fn] "keep-fn receives a sequence of versions and returns a sequence of versions to keep.")
  (cleanup [_ keep-n] "Destroys all versions but the last keep-n.")
  (domain-spec [_] "Returns an instance of DomainSpec."))

DomainSpec

A given service instantiates a DomainSpec from domain-spec.yaml, downloaded from the distributed filesystem (its location lives in the global and local configurations). DomainSpec allows access to:

  • The PersistenceCoordinator
  • The sharding scheme
  • The shard count

The PersistenceCoordinator is discussed above; it exposes:

openPersistenceForRead
close
---app-specific methods: search, multiGet, etc.---
int getShard(int numShards, D document)
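
A plausible Java-side shape for DomainSpec, with fields mirroring the list above; the field names are guesses:

class DomainSpec {
    public PersistenceCoordinator coordinator;  // deserialized from domain-spec.yaml
    public ShardScheme shardScheme;             // HashMod or Range
    public int numShards;
}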

PROBLEMS

  • Sourcing shards is really expensive in Cascading, since we can't provide more than one split per shard. How can we mitigate this?