@clayg
Last active August 29, 2015 14:02
tracking misplaced objects

Ultimately I decided to do the per-policy object rows and per-policy stats in the containers. It's a perfect representation of the state of the world: each container can have multiple instances of the same object name in different storage policies (that's just a fact, regardless of whether you adopt a schema in the container that can represent it). Per-policy stats tracking gives us the most flexibility to correctly deal with this reality - particularly when a container must change its policy.
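Roughly, the shape that takes in the container DB - a minimal sketch using sqlite3; the table and column names here are illustrative, not the exact DDL:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    -- one row per (name, storage_policy_index): the same object name can
    -- legitimately exist in more than one policy at the same time
    CREATE TABLE object (
        name TEXT,
        created_at TEXT,
        size INTEGER,
        deleted INTEGER DEFAULT 0,
        storage_policy_index INTEGER DEFAULT 0
    );

    -- per-policy object/byte counts instead of a single global counter
    CREATE TABLE policy_stat (
        storage_policy_index INTEGER PRIMARY KEY,
        object_count INTEGER DEFAULT 0,
        bytes_used INTEGER DEFAULT 0
    );
""")

# the same name landing in two policies is just two rows
conn.executemany(
    "INSERT INTO object (name, created_at, size, storage_policy_index) "
    "VALUES (?, ?, ?, ?)",
    [('o1', '0000001402.00001', 1024, 0),
     ('o1', '0000001403.00001', 2048, 1)])
```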

My first attempt to deal with the misplaced objects leaned heavily on the object server's async_update pattern. When a container server received an object update for a policy that did not match its policy_index it would respond 409 and the object server would drop an async pending. The object updater would try to update all primary nodes and enqueue the object for the reconciler when it encountered a 409 in the response set. This worked well enough, but it basically fell apart when dealing with split-brain containers that had the wrong storage policy index.

On one hand you could almost just delete these containers when you discovered during container replication that their policy_index was wrong, trusting that someone somewhere got a 409 and wrote an async pending, which should result in an enqueueing later. But since async pendings aren't replicated you can imagine a case where only one object server wrote an async, and if that disk dies you lose the record of the write unless you drain the rows in the container into the reconciler queue. If having the queue populated from TWO places wasn't bad enough, it turned out this was only really palatable in the container-replicator. When handling REPLICATE requests in the container server it seemed to make more sense to just copy the containers out to a staging area and have another daemon load them into the queue - so now we're at three, and I started to think the container-replicator is really the place where all of this is going to go down no matter what you do.
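The flow I'm describing, as a sketch - update_container, object_updater_pass, and the reconciler's enqueue are stand-ins for the real plumbing, not actual Swift APIs:

```python
def update_container(container_node, obj_update):
    """Container server side: reject updates whose policy doesn't match ours."""
    if obj_update['storage_policy_index'] != container_node.policy_index:
        # the object server sees the 409 and drops an async pending
        return 409
    container_node.put_object(obj_update)
    return 201


def object_updater_pass(async_pending, container_nodes, reconciler):
    """Object updater side: retry all primaries, enqueue on any 409."""
    statuses = [update_container(node, async_pending) for node in container_nodes]
    if 409 in statuses:
        # somebody thinks this object is in the wrong policy - let the
        # reconciler figure out which policy actually wins
        reconciler.enqueue(async_pending)
    # the async pending is only cleared once every primary has answered
    return all(status in (201, 409) for status in statuses)
```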

So, we let the rows in - add a storage_policy_index column to the objects table - and try to figure out what to do with listings.

  1. we let the name/timestamp in with its storage_policy_index but only keep the most recent row per name
  2. we let each name/timestamp in and display all the storage_policy_index rows in the listing
  3. we let each name/timestamp in with its storage_policy_index, but only display the rows matching the container's own storage_policy_index in listings

With #1 you have to do something on overwrites so you can still clean up dark data. Sam seemed pretty keen on a cleanups table that would copy rows out of the object table before overwriting them. He went to have a baby before he ever got it finished; I got it close to working but never much cared for it, because the row copying introduced a bunch of logic and new failure modes into merge_items, plus the cleanups table was not replicated. There was some logic to accept async updates from other nodes that could place a row in the cleanups table if it was older than the row in the object table, so you should eventually get the cleanup row in all the containers if async updates were working correctly - but it smelled like the same subtle problem as the previous async-based approach. Moreover, you had to reconcile from both the objects table (which might have objects in the wrong policy) and the cleanups table (which was tracking old writes into a different storage policy). I could never get my head fully wrapped around this approach, the code is here -> https://github.com/smerritt/swift/tree/keep-per-object-spi
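The shape of the cleanups idea, as I understood it - this is my own sketch of the row-copy-on-overwrite step, with made-up names, not the code from Sam's branch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE object (
        name TEXT, created_at TEXT, size INTEGER, storage_policy_index INTEGER);
    -- old rows get parked here before being overwritten so the dark data in
    -- the old policy can be cleaned up later; note this table was NOT replicated
    CREATE TABLE cleanups (
        name TEXT, created_at TEXT, size INTEGER, storage_policy_index INTEGER);
""")

def merge_item(conn, row):
    """Keep only the newest row per name; park the loser in cleanups."""
    existing = conn.execute(
        "SELECT name, created_at, size, storage_policy_index "
        "FROM object WHERE name = ?", (row['name'],)).fetchone()
    if existing:
        if existing[1] >= row['created_at']:
            return  # incoming row is older, nothing to do
        conn.execute("INSERT INTO cleanups VALUES (?, ?, ?, ?)", existing)
        conn.execute("DELETE FROM object WHERE name = ?", (row['name'],))
    conn.execute(
        "INSERT INTO object VALUES (:name, :created_at, :size, "
        ":storage_policy_index)", row)
```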

I actually started with #2 when I started thinking about how to do something besides the cleanups table. It suddenly hit me that the unique "constraint" on the object table was totally application enforced - so I played with the idea of letting everything into the object table - and it worked out GREAT for replication. But my initial experiment didn't filter the object table on storage_policy_index, so you'd get duplicate names in the listing. This introduced an API change that I wasn't ready for. Everyone expects names in containers to be unique, plus with marker queries, if you have a duplicate name on a boundary you may need to return extra rows to ensure every item is iterated. It did have the nice benefit of punting on stats tracking - since you were exposing the duplicate names, just tracking all bytes and counts for all policies in a single global counter meant the object and byte counts reflected what's in the listing. Alas, I never fully fleshed this idea out - it seemed almost entirely indefensible if someone called it out for what it was: a change in the default behavior of the API that is almost certainly NOT a simple additive change. OTOH, I think with an API version bump it could be a great feature... but how do you support backwards compatibility for the object and byte counts?
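The marker problem is easier to see with a concrete query - assuming a bare-bones object table, a name-only marker silently skips the second copy of a name that lands on a page boundary:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE object (name TEXT, storage_policy_index INTEGER)")
conn.executemany("INSERT INTO object VALUES (?, ?)",
                 [('a', 0), ('b', 0), ('b', 1), ('c', 0)])

limit = 2
page1 = conn.execute(
    "SELECT name, storage_policy_index FROM object "
    "ORDER BY name, storage_policy_index LIMIT ?", (limit,)).fetchall()
marker = page1[-1][0]   # clients only know about names, so the marker is 'b'

page2 = conn.execute(
    "SELECT name, storage_policy_index FROM object WHERE name > ? "
    "ORDER BY name, storage_policy_index LIMIT ?", (marker, limit)).fetchall()

print(page1)  # [('a', 0), ('b', 0)]
print(page2)  # [('c', 0)] -- the ('b', 1) row never shows up in any page
```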

With #3 we apply a WHERE clause and listings work as expected. It shares with #2 the simplicity of using existing replication to ensure a durable record of both misplaced objects and dark data. The great part about keeping it all in the object table is that the whole idea of "misplaced" doesn't even factor in yet - it's just another row. An object update comes in - you write it down. But I couldn't quite get over what to do with the container_stat entries. If a container says it has three objects, but the listing only has two rows - what's the third object? Am I paying for its bytes? How many objects in my container aren't in my listing? What if I have 10M rows - do I even know there's a discrepancy - does the container know? It's always easier to throw out data later - tracking per-policy stats gave me all of that knowledge readily available (particularly useful when a database needs to update its storage_policy_index due to merge conflict resolution), and the default stats can always match the listing.
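Option #3 in the same terms, continuing the sketch from the first snippet (assumed names, not the actual container broker code): the listing just grows a WHERE clause on the container's own policy, and the reported counts come from the matching policy_stat row, so they can never disagree with the listing:

```python
def list_objects(conn, policy_index, marker='', limit=10000):
    """Listings only show rows that match the container's own policy."""
    return conn.execute(
        "SELECT name, created_at, size FROM object "
        "WHERE storage_policy_index = ? AND name > ? AND deleted = 0 "
        "ORDER BY name LIMIT ?", (policy_index, marker, limit)).fetchall()


def get_info(conn, policy_index):
    """Object/byte counts come from the matching policy's stats row."""
    row = conn.execute(
        "SELECT object_count, bytes_used FROM policy_stat "
        "WHERE storage_policy_index = ?", (policy_index,)).fetchone()
    return row or (0, 0)
```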

Took me forever to get there - and I fought against it most of the way - but when I think about what folks are going to want to do in the future with containers that HAVE to be able to store multiple entries for the same name in different storage policies, it seems like the perfect representation of that reality is really the only choice; I do succumb to it begrudgingly, though.
