The implementer uses Cascading to generate "Document" objects; these are kryo-serializable packets that contain the entire database record. For example, for key->val and search:
KeyValDocument(byte[] key, byte[] val)
And for key->set:
KeySetDocument(byte[] setKey, byte[] setVal)
KeySetDocument(byte[] setKey, int size)
Taking KeySet as an example, an Updater (see below) will convert the
KeySetDocument into the proper key and value byte arrays for BerkeleyDB.
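To make the shape of these packets concrete, here is a hedged Java sketch of the two Document types above. The `Document` marker interface, the field names, and the size-sentinel convention are all assumptions, not the actual EDB classes:

```java
// Hypothetical sketch of the Document packets described above.
// The Document marker interface and field names are assumptions.
public class Documents {
    public interface Document {}

    // key->val records: one serialized key, one serialized value.
    public static class KeyValDocument implements Document {
        public final byte[] key;
        public final byte[] val;
        public KeyValDocument(byte[] key, byte[] val) {
            this.key = key;
            this.val = val;
        }
    }

    // key->set records: either a set member, or the set's size.
    public static class KeySetDocument implements Document {
        public final byte[] setKey;
        public final byte[] setVal; // null when this document carries a size
        public final int size;      // -1 when this document carries a value
        public KeySetDocument(byte[] setKey, byte[] setVal) {
            this.setKey = setKey; this.setVal = setVal; this.size = -1;
        }
        public KeySetDocument(byte[] setKey, int size) {
            this.setKey = setKey; this.setVal = null; this.size = size;
        }
    }
}
```

An Updater would then read these fields when producing the BerkeleyDB key and value byte arrays.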
The user generates documents and sinks them into the ElephantTap. The "tail assembly" can be sugar over this process; the key-value tail assembly might accept key-value pairs and do the document conversion implicitly.
The user sinks Documents into this tap, and EDB takes care of the rest.
PersistenceCoordinator<S extends ShardScheme> (required)
- Number of Shards (required)
- List of Kryo Serializations (optional) // register all with kryo
- Target Filesystem Credentials (optional)
- Cleanup parameters for the versioned store (versions to keep? See
DomainStore below for more thoughts.)
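The argument list above could be bundled as a single configuration object. The following is a minimal sketch under that assumption; every name here (`CoordinatorArgs`, the field names, the defaults) is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bundle of the PersistenceCoordinator arguments listed above.
// Only numShards is required; the rest are optional.
public class CoordinatorArgs {
    public final int numShards;                     // required
    public final List<Class<?>> kryoSerializations; // optional; registered with Kryo
    public final String fsCredentials;              // optional target-filesystem credentials
    public final int versionsToKeep;                // cleanup parameter for the versioned store

    public CoordinatorArgs(int numShards) {
        this(numShards, new ArrayList<Class<?>>(), null, 1);
    }

    public CoordinatorArgs(int numShards, List<Class<?>> serializations,
                           String fsCredentials, int versionsToKeep) {
        if (numShards <= 0) throw new IllegalArgumentException("numShards is required");
        this.numShards = numShards;
        this.kryoSerializations = serializations;
        this.fsCredentials = fsCredentials;
        this.versionsToKeep = versionsToKeep;
    }
}
```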
Updater<P extends LocalPersistence, D extends Document> (required)
boolean loadOldVersion() // returns false by default. If true, EDB calls pickVersion().
String pickVersion() // returns the version to be loaded by the updater. Defaults to LATEST.
Take the ArrayList of serializations from Args and register them with Kryo (in addition to the defaults: Clojure, etc.).
byte[] shardDocument(D document) // returns a proper key.
TODO: This class also needs to be able to shard NON-document objects, so that the client can implement
int getShard(D document)
int getShard(...anything that can build a document internally...)
- Can be either HashMod or Range.
- HashMod keys just have to be serializable.
- Range keys have to be serializable AND comparable.
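As one illustration of the HashMod case, here is a sketch that shards on the serialized key bytes. The class name, the use of `Arrays.hashCode`, and the constructor shape are assumptions; the only property drawn from the text is that HashMod keys need only be serializable, while Range sharding would additionally require comparability:

```java
import java.util.Arrays;

// Hypothetical HashMod sharding scheme: hash the serialized key, mod the shard count.
public class HashModScheme {
    private final int numShards;

    public HashModScheme(int numShards) { this.numShards = numShards; }

    // Any serializable key works: we only need its byte representation.
    public int getShard(byte[] serializedKey) {
        int h = Arrays.hashCode(serializedKey);
        return Math.floorMod(h, numShards); // always in [0, numShards)
    }
}
```

`Math.floorMod` avoids negative shard indices when the hash is negative, which a plain `%` would not.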
Nathan, I want to go over the interface here one more time. Does the above shardDocument method work? Should the user be responsible for converting the document to a byte array for sharding and sorting, or should the user return some kryo-serializable object?
The only required method is
index(P persistence, D document) // is this the right name?
This is enough; the persistence and document logic are application-specific, and the EDB library need not be involved.
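A minimal sketch of what an implementation of that contract might look like, assuming a map-like LocalPersistence and the key-value document shape; all names besides `index` are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value Updater: index() writes one document into a persistence.
// The LocalPersistence interface and MapPersistence test double are assumptions.
public class KeyValUpdater {
    public interface LocalPersistence { void put(byte[] key, byte[] val); }

    // In-memory stand-in for a real persistence such as BerkeleyDB.
    public static class MapPersistence implements LocalPersistence {
        public final Map<String, byte[]> store = new HashMap<>();
        public void put(byte[] key, byte[] val) { store.put(new String(key), val); }
    }

    // Application-specific indexing logic; EDB just calls this per document.
    public void index(LocalPersistence persistence, byte[] key, byte[] val) {
        persistence.put(key, val);
    }
}
```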
A Cascading tap should be able to source all records from a given DomainStore. Not much else is required here; the customizations above configure serialization properly, so the only remaining task is iterating over all Documents in a given LocalPersistence.
Same tap as above. When sourcing, this tap produces Documents (the type is determined by the PersistenceCoordinator's iterator). Again, the user can provide sugar over this to return key-value pairs directly.
Additional args necessary for sourcing:
domainVersion: either the version number, some offset (LATEST~1) or LATEST. (defaults to LATEST).
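One way the domainVersion argument could be resolved is sketched below. The parsing rules (numeric string for an explicit version, `LATEST~n` counting back from the newest) and the class name are assumptions based on the examples above:

```java
import java.util.List;

// Hypothetical resolution of the domainVersion sourcing argument:
// an explicit version number, LATEST, or an offset like LATEST~1.
public class VersionArg {
    // versions is assumed to be sorted oldest-to-newest.
    public static long resolve(String arg, List<Long> versions) {
        if (arg.equals("LATEST")) return versions.get(versions.size() - 1);
        if (arg.startsWith("LATEST~")) {
            int back = Integer.parseInt(arg.substring("LATEST~".length()));
            return versions.get(versions.size() - 1 - back);
        }
        return Long.parseLong(arg); // explicit version number
    }
}
```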
ElephantTap depends on this iterator to retrieve all documents. One problem with this approach is that it requires a single split for every shard. This is noted below under PROBLEMS.
This section concerns the interaction between a given thrift service (or any other front-end) and the domain store. Once the user configures a given
ClientService, she should be able to hold an atomic reference to a given collection of shards (LocalPersistence instances).
The domain store needs to allow a bit more control than it currently does over HOW updating actually happens, and how file syncing between various domains occurs.
Accepts a configuration map, configures the local DomainStore and the Updater.
- Configure updater.
- How often to check for new versions?
- Accepts pre- and post-update hooks.
Some of these functions would benefit from accepting callbacks. This would make it easy to hook in a given web interface, or email updates. Callbacks would make swapping persistences more intuitive; pull-version, with a callback to swap the domains.
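The callback idea might look like the following sketch: pre- and post-update hooks that run around a version pull, with the persistence swap happening in the post-update callback. Every name here (`UpdateHooks`, `Hook`, the method names) is an assumption:

```java
// Hypothetical pre-/post-update hooks as callbacks, per the note above.
public class UpdateHooks {
    public interface Hook { void run(String version); }

    private Hook preUpdate = v -> {};
    private Hook postUpdate = v -> {};

    public void setPreUpdate(Hook h) { preUpdate = h; }
    public void setPostUpdate(Hook h) { postUpdate = h; }

    // Pull the version; the post-update hook is where a web interface,
    // email notification, or persistence swap would plug in.
    public void update(String version) {
        preUpdate.run(version);
        // ... pull the version from the remote store (omitted) ...
        postUpdate.run(version);
    }
}
```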
(defprotocol IDomainStore
  (remote-versions [_] "Returns a sequence of available remote versions.")
  (local-versions [_] "Returns a sequence of available local version strings.")
  (pull-version [_ other-store other-version] "Transfers other-version on other-store to this domain store.")
  (push-version [_ other-store this-version] "Transfers the supplied version to the remote domain store.")
  (sync [_ other-store & opts] "Syncs versions; opts can specify whether to sync two-way or one-way (which way?).")
  (compact [_ keep-fn] "keep-fn receives a sequence of versions and returns a sequence of versions to keep.")
  (cleanup [_ keep-n] "Destroys all versions but the last keep-n.")
  (domain-spec [_] "Returns an instance of DomainSpec."))
A given service will instantiate a DomainSpec from domain-spec.yaml, downloaded from the distributed filesystem (the relevant information lives in the global and local configurations). DomainSpec allows access to
- The persistence coordinator
- Sharding Scheme
- shard count
The PersistenceCoordinator is discussed above, and exposes
- openPersistenceForRead
- close
- app-specific methods (search, multiGet, etc.)
- int getShard(numShards, D document)
- Sourcing shards is really expensive in Cascading, since we can't provide extra splits. How can we mitigate this?