Ultimately I decided to do the per-policy object rows and per-policy stats in the containers. It's a perfect representation of the state of the world: each container can have multiple instances of the same object name in different storage policies (that's just a fact, regardless of whether you adopt a schema in the container that can represent it). The per-policy stats tracking gives us the most flexibility to correctly deal with this reality - particularly when a container must change its policy.
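To make that concrete, here's a rough sketch (not the actual Swift schema or migration, and missing plenty of columns) of what per-policy rows and per-policy stats look like at the schema level; the later snippets below reuse this little in-memory database.

    import sqlite3

    # Rough sketch only: the objects table carries a storage_policy_index on
    # every row, and stats roll up per policy instead of into one global
    # counter. The real container schema has more columns and is built by
    # migrations, not a script like this.
    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE objects (
            name TEXT,
            created_at TEXT,
            size INTEGER DEFAULT 0,
            deleted INTEGER DEFAULT 0,
            storage_policy_index INTEGER DEFAULT 0
        );
        -- one stats row per policy that has ever landed a row here
        CREATE TABLE policy_stat (
            storage_policy_index INTEGER PRIMARY KEY,
            object_count INTEGER DEFAULT 0,
            bytes_used INTEGER DEFAULT 0
        );
    """)

    # the same name can legitimately exist in two policies at once
    conn.executemany(
        "INSERT INTO objects (name, created_at, size, storage_policy_index) "
        "VALUES (?, ?, ?, ?)",
        [('photo.jpg', '0000000001.00000', 1024, 0),
         ('photo.jpg', '0000000002.00000', 2048, 1)])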
My first attempt to deal with the misplaced objects leaned heavily on the objects' async_update pattern. When a container server received an object update for a policy that did not match its policy_index it would respond 409, and the object server would drop an async pending. The object updater would try to update all primary nodes and enqueue the object for the reconciler when it encountered a 409 in the response set. This worked well enough, but it basically fell apart dealing with the split-brain containers that had the wrong storage policy index. On one hand you could almost just delete these containers when you discovered during container replication that their policy_index was wrong, trusting that someone somewhere got a 409 and wrote an async pending, which should result in an enqueueing later. But since async pendings aren't replicated you can imagine a case where only one object server wrote an async, and if that disk dies you would lose the record of the write unless you drain the rows in the container into the reconciler queue. If having the queue populated from TWO places wasn't bad enough, it turned out this was only really palatable in the container-replicator. When handling REPLICATE requests in the container server it seemed to make more sense to just copy the containers out to a staging area and have another daemon load them into the queue - so now we're at three, and starting to think the container-replicator is really the place where all this is going to go down no matter what you do.
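For illustration, this is roughly what that abandoned flow looked like from the updater's side; the function names and the list standing in for the reconciler's queue are made up here, not Swift's real internals.

    HTTP_CONFLICT = 409

    reconciler_queue = []  # stand-in for the real reconciler queue

    def enqueue_for_reconciler(name, policy_index, timestamp):
        # in the real system this would become a row in a special queue
        # container that the reconciler later walks
        reconciler_queue.append((policy_index, name, timestamp))

    def handle_container_update_responses(name, policy_index, timestamp, statuses):
        """Decide what to do after retrying every primary container node.

        statuses: HTTP codes from the container servers; a 409 means that
        server's container DB has a different storage_policy_index.
        """
        if any(status == HTTP_CONFLICT for status in statuses):
            # at least one primary thinks this object is in the wrong
            # policy - let the reconciler sort out which copy wins
            enqueue_for_reconciler(name, policy_index, timestamp)
        return all(200 <= status < 300 for status in statuses)

    # e.g. two primaries accepted the row, one said the policy didn't match
    handle_container_update_responses(
        'photo.jpg', 1, '0000000002.00000', [201, 201, 409])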
So, we let the rows in - add a storage_policy_index column to the objects table - and try to figure out what to do with listings. There were three options (sketched in code after the list):
1. we let the name/timestamp in with storage_policy_index but only keep the most recent one
2. we let each name/timestamp in and display all the storage_policy_index rows in the listing
3. we let each name/timestamp in with storage_policy_index - but only display the rows matching the container's storage_policy_index in listings
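To make the trade-offs concrete, here's how the three options fall out against the little in-memory database from the first sketch (the real listing SQL also deals with markers, prefixes, delimiters, deleted rows, and so on):

    # 1. the table only ever holds the newest row per name, so a plain
    #    listing needs no filter - but every overwrite has to do something
    #    with the row it displaces (the cleanups idea sketched below)

    # 2. let every row in and list them all - duplicate names show up
    all_rows = conn.execute(
        "SELECT name, created_at, storage_policy_index "
        "FROM objects ORDER BY name, created_at").fetchall()
    # -> both photo.jpg rows, one per policy

    # 3. let every row in, but only list the container's own policy
    container_policy = 0
    listing = conn.execute(
        "SELECT name, created_at FROM objects "
        "WHERE storage_policy_index = ? ORDER BY name",
        (container_policy,)).fetchall()
    # -> just the policy-0 photo.jpg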
With #1 you have to do something with overwrites so you can be sure to clean up dark data. Sam seemed pretty keen on a cleanups table that would copy rows out of the objects table before overwriting them. He went to have a baby before he ever got it finished; I got it close to working but never much cared for it, because the row copying introduced a bunch of logic and new failure modes into merge_items, plus the cleanups table was not replicated. There was some logic to accept async updates from nodes that could place a row in the cleanups table if it was older than the row in the objects table, so you should eventually get the cleanup row in all the containers if async updates were working correctly - but it smelled like the same subtle problem as the previous async-based approach. Moreover, you had to reconcile from both the objects table (which might have objects in the wrong policy) and the cleanups table (which was tracking old writes into a different storage policy). I could never get my head fully wrapped around this approach; the code is here -> https://github.com/smerritt/swift/tree/keep-per-object-spi
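Here's my loose reconstruction of that idea, just to show the shape of it - the real code lives in the branch above and does considerably more. In this world the objects table stays unique per name, and every overwrite has to remember the row it displaces:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.executescript("""
        CREATE TABLE objects (
            name TEXT PRIMARY KEY,
            created_at TEXT,
            size INTEGER,
            storage_policy_index INTEGER
        );
        CREATE TABLE cleanups (
            name TEXT,
            created_at TEXT,
            storage_policy_index INTEGER
        );
    """)

    def merge_item(name, created_at, size, policy_index):
        row = db.execute(
            "SELECT created_at, storage_policy_index FROM objects "
            "WHERE name = ?", (name,)).fetchone()
        if row and row[0] >= created_at:
            return  # incoming row is older; ignore it
        if row:
            # the displaced row may point at data in another policy -
            # remember it in cleanups or it becomes dark data that
            # nobody will ever delete
            db.execute("INSERT INTO cleanups VALUES (?, ?, ?)",
                       (name, row[0], row[1]))
            db.execute("DELETE FROM objects WHERE name = ?", (name,))
        db.execute("INSERT INTO objects VALUES (?, ?, ?, ?)",
                   (name, created_at, size, policy_index))

    merge_item('photo.jpg', '0000000001.00000', 1024, 0)  # original write
    merge_item('photo.jpg', '0000000002.00000', 2048, 1)  # overwrite, new policy
    # cleanups now holds the policy-0 row that still needs cleaning up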
I actually started with #2 when I began thinking about how to do something besides the cleanups table. It suddenly hit me that the unique "constraint" on the objects table was totally application enforced - so I played with the idea of letting everything into the objects table - and it worked out GREAT for replication. But my initial experiment didn't filter the objects table on storage_policy_index, so you'd get duplicate names in the listing. This introduced an API change that I wasn't ready for. Everyone expects names in containers to be unique, plus with marker queries, if you have a duplicate name on a page boundary you may need to return extra rows to ensure every item is iterated. It did have the nice benefit of punting on stats tracking - since you were exposing the duplicate names, just tracking all bytes and counts for all policies in the single global counter meant object and byte counts reflected what's in the listing. Alas, I never fully fleshed this idea out - it seemed almost entirely indefensible if someone called it out for what it was: a change in the default behavior of the API that is almost certainly NOT a simple additive change. OTOH, I think with an API version bump it could be a great feature... but how do you support backwards compatibility for the object and byte counts?
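A quick sketch of that marker problem, against the earlier in-memory database where photo.jpg exists in two policies - the usual "next page" query keys on name alone, so whichever duplicate didn't fit on the previous page never gets listed:

    def list_page(marker, limit):
        return conn.execute(
            "SELECT name, storage_policy_index FROM objects "
            "WHERE name > ? ORDER BY name LIMIT ?",
            (marker, limit)).fetchall()

    page1 = list_page('', 1)       # one of the two photo.jpg rows
    marker = page1[-1][0]          # client comes back with marker='photo.jpg'
    page2 = list_page(marker, 1)   # -> [] ... the other photo.jpg row is skipped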
With #3 we apply a WHERE clause and listings work as expected. It shared with #2 the simplicity of using existing replication to ensure a durable record of both misplaced objects and dark data. The great part about keeping it in the objects table was that the whole idea of "misplaced" doesn't even factor in yet - it's just another row. Object update comes in - you write it down. But I couldn't quite get over what to do with the container_stat entries. If a container says I have three objects, but the listing only has two rows - what's the third object? Am I paying for its bytes? How many objects in my container aren't in my listing? What if I have 10M rows - do I even know there's a discrepancy - does the container know? It's always easier to throw out data later - tracking per-policy stats allowed me to have all the knowledge readily available (particularly useful when a database needs to update its storage_policy_index due to merge conflict resolution) and the default data can always match the listing.
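A sketch of what the per-policy stats buy you, again against the in-memory example (in the real thing the counters would be maintained as rows are merged, not recomputed like this): the counts reported to the client come from the container's own policy row, so they always match the filtered listing, while the other rows tell you exactly how much misplaced data is sitting around for the reconciler.

    # recompute policy_stat from the objects table, just for illustration
    conn.execute("""
        INSERT OR REPLACE INTO policy_stat
            (storage_policy_index, object_count, bytes_used)
        SELECT storage_policy_index, COUNT(*), SUM(size)
        FROM objects WHERE deleted = 0
        GROUP BY storage_policy_index
    """)

    container_policy = 0
    reported = conn.execute(
        "SELECT object_count, bytes_used FROM policy_stat "
        "WHERE storage_policy_index = ?", (container_policy,)).fetchone()
    # -> (1, 1024): matches the filtered listing the client sees

    misplaced = conn.execute(
        "SELECT COALESCE(SUM(object_count), 0), COALESCE(SUM(bytes_used), 0) "
        "FROM policy_stat WHERE storage_policy_index != ?",
        (container_policy,)).fetchone()
    # -> (1, 2048): the discrepancy is knowable instead of invisible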
It took me forever to get there - and I fought against it most of the way - but when I think about what folks are going to want to do in the future with containers that HAVE to be able to store multiple entries for the same name in different storage policies, it seems like the perfect representation of that reality is really the only choice; I succumb to it, but begrudgingly.