Dynotype

A write-efficient implementation of large (consistent) sets on DynamoDB.

Why?

DynamoDB has built-in support for sets, but if your sets are large (>1000 members) you quickly run into two issues:

  1. When calculating the cost of an item update, AWS charges based on the size of the item, not the size of the update. This makes even small additions to large sets expensive.

  2. AWS enforces a maximum size limit of 64KB per item. Once you pass that size, you'll have to develop a scheme to store a set across multiple items.

To avoid these two issues, a set is represented as an ordered series of items (a write-only log) containing addition and removal events for the set.
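
As a rough sketch, each set could be stored as a sequence of items keyed by the set's id and a monotonically increasing version, with each item recording the members added or removed at that version. The attribute names and layout below are assumptions for illustration, not Dynotype's actual schema.

require "set"

# Hypothetical layout of the event log for set 10 (illustrative only).
event_log = [
  { "set_id" => 10, "version" => 1, "add"    => [1, 2, 3] },
  { "set_id" => 10, "version" => 2, "remove" => [2] },
  { "set_id" => 10, "version" => 3, "add"    => [5] }
]

# Reading the set means replaying the log in version order.
members = event_log.inject(Set.new) do |set, event|
  set.merge(event["add"] || []).subtract(event["remove"] || [])
end
members.to_a # => [1, 3, 5]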

What do you mean by Consistent?

By consistent I mean that there is a strict ordering over all set operations and that every operation obeys the following two rules:

  • A member is only added to the set if it is not currently a member.
  • A member is only removed from a set if it is already a member.

Updates to a set will fail if either of those two rules would be broken by the update. Additionally, you may optionally request that an update fail if the set has been modified since you last read it.

How do I use it?

# Set up the DynamoDB table
table = DynoType::ConsistentSetTable.new("table_name",
                                        :key_type => :number) # or :string
table.create_unless_exists

# Create a new set
user_id = 10
s = table.set_for(user_id)
s.update(:add => [1,2,3]) # writes the changes to DynamoDB
s.include?(1) # true
s.include?(5) # false
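
The failure behaviour described above might look roughly like this in practice; the exception class and the conditional-update option named here are assumptions for illustration, since the gist does not spell out that part of the API.

# Adding an existing member (or removing a missing one) breaks the
# consistency rules, so the update is rejected.
begin
  s.update(:add => [1])             # 1 is already a member
rescue DynoType::ConsistencyError   # exception name assumed for illustration
  # handle the conflict, e.g. re-read the set and decide whether to retry
end

# Optionally fail if the set has been modified since it was last read
# (option name assumed for illustration).
s.update(:add => [7], :if_unmodified => true)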

Caveats

  • Before adding or removing members, you must load the entire set into memory from the store. This is required so that a write can be checked against the consistency rules above.

  • No single set modification can be larger than 64KB; in other words, atomic updates to a set are limited to 64KB.

  • This implementation may perform poorly if you have many writers. Dynotype uses a form of optimistic locking, and if many processes are trying to update the same set you may see many failures for each successful write (see the retry sketch after this list). Dynotype is likely not the best solution in this case.

  • Each set is represented by one hash key, so you may hit per-key throughput limits if you have very large sets or sets with many updates. Future versions of Dynotype may support a partitioned-set option that will allow you to spread a single set across many hash keys.
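
As a rough sketch of how a writer might cope with optimistic-locking conflicts, the loop below simply re-reads the set and retries; DynoType::ConsistencyError is the same assumed exception name used earlier and is not a documented part of Dynotype.

# Retry an update a few times when another writer wins the race.
def add_with_retry(table, user_id, members, attempts = 5)
  attempts.times do
    s = table.set_for(user_id)       # re-read the current state of the set
    begin
      missing = members.reject { |m| s.include?(m) }
      return s.update(:add => missing) unless missing.empty?
      return s                       # nothing to add; the members are already present
    rescue DynoType::ConsistencyError
      next                           # another writer got there first; try again
    end
  end
  raise "could not update set #{user_id} after #{attempts} attempts"
end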

Compaction Algorithm

Compaction is implemented in such a way that multiple processes can simultaneously attempt compaction and a new process can pick up a compaction attempt where another process left off.

  1. Write a compaction event to the event log.
  2. Start writing the new compacted snapshot of the data type to the event log.
  3. Delete all versions up to and including the compaction event.

There are a couple of constraints that ensure that all processes seeing the compaction will work towards the same final state:

  • There can only be one compaction event in a log at a given time.
  • Given an event log for a DataType, all processes will generate exactly the same snapshot.
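
To make the steps concrete, here is a rough Ruby sketch of the compaction procedure against a hypothetical event-log interface; write_event, replay, and delete_versions are assumed helper names, not Dynotype's actual API.

# Rough sketch of the three compaction steps; every method on `log`
# is a hypothetical helper, not part of Dynotype's documented API.
def compact(log)
  # 1. Write a compaction event to the event log.
  compaction_version = log.write_event(:type => :compaction)

  # 2. Write the compacted snapshot (the set replayed up to the
  #    compaction event) back into the log as fresh events.
  snapshot = log.replay(:up_to => compaction_version)
  log.write_event(:type => :snapshot, :members => snapshot.to_a)

  # 3. Delete all versions up to and including the compaction event.
  log.delete_versions(:up_to => compaction_version)
end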