Skip to content

Instantly share code, notes, and snippets.

@smerritt
Last active August 29, 2015 13:56
container reconciler design notes
Overview A
==========
Each container DB knows the storage policy index for all its objects.
The container updater copies misplaced objects to the right place and
removes them from the wrong one. All the bookkeeping happens in the
container DBs.
Noticing Misplaced Objects
==========================
Misplaced objects can be located by looking at a container DB. Each
row in the 'object' table has a 'storage_policy_index' column
representing the policy index where the object lives. There is also a
'storage_policy_index' column in the 'container_stat' table
representing where all the objects in that container *should* live.
When they differ, that object is said to be misplaced.
In order to find such container DBs, we'll re-use the container
updater, as it already walks the filesystem looking at all the
container DBs. Upon finding a misplaced object, it will move it to the
right place (i.e. the right storage policy index).
Optimization note: the 'container_stat' table contains a
'misplaced_object_count' column which is kept up-to-date by triggers
on the 'object' table. When there are no misplaced objects (which will
be the vast majority of the time), there will be no need to query the
(sizable) 'object' table; a single query to the (1-row)
'container_stat' table will suffice.
Optimization note 2: if there is already an entry for this in the
'object_cleanup' table (explained later), then the container
reconciler can skip over the object since it knows the object has
already moved into the right storage policy.
Handling a Misplaced Object
===========================
When the container reconciler notices a misplaced object, it will move
it to the proper place. This happens by making a GET request to the
object servers in the wrong policy and three PUT requests to the
object servers in the right policy and streaming the object from one
to the other. The content-type, etag, and other metadata will be
preserved. X-Container-* headers will be sent so that the container
will be informed of the move via the normal async-update mechanism.
The async-update path already passes the SPI to the container server;
that's how the container learns where objects are in the first place.
The one piece of metadata that will be altered is the timestamp. The
timestamp of the moved object will be the original object's timestamp
plus one microsecond. This ensures that the container listings get
properly updated; if the timestamp were equal, the update would be
ignored.
Note that this may, in very rare cases, produce a client-visible
change: if you have an object in the wrong policy and its timestamp
ends with .99999, then the Last-Modified timestamp of the object will
be one second greater than it was at upload. Since the probability of
a .99999 timestamp is literally one in a million, and this only
affects misplaced objects (which should be extremely rare), this seems
like a reasonably small impact.
Tracking Misplaced Leftovers
============================
The object servers tell the container servers about the object move
via the normal async-update mechanism. When the container server
notices that an object has changed policy from incorrect to correct,
it creates a record in the 'object_cleanup' table with the moved
object's name, (incremented) timestamp, and original (wrong) SPI.
The container server will be smart and not make a duplicate
object_cleanup record.
Cleaning Up Leftovers
=====================
The container reconciler will read entries out of the 'object_cleanup'
table and will issue DELETE requests with the appropriate X-Timestamp
header. This will remove the leftover object data in the wrong policy
index. Even if the relevant object nodes are down, a tombstone will be
created and replication will propagate the deletion. After the DELETEs
are issued, the row is removed from the 'object_cleanup' table.
These DELETEs will not contain any X-Container-* headers, so there's
no worrying about the order in which the PUTs and DELETEs hit the
container.
Interesting Failure Cases
=========================
Crash while creating cleanup entry
----------------------------------
If the container server crashes after the update with the new SPI
comes in but before it can write to the 'object_cleanup' table, we'd
leave dark-matter objects in the wrong policy. To combat this, the
whole thing will be wrapped in a sqlite transaction.
Only Let The Right Objects In [rejected]
========================================
Each container server only accepts object PUTs with the right storage
policy index (409 otherwise). If container replication finds a
policy-index mismatch, all objects in the incorrect container are
enqueued for moving. If the object updater receives all 409 responses
when processing an async_pending, the object is enqueued for moving.
The queue is a set of containers under a .misplaced-objects account.
One instance of the container reconciler runs in the cluster; it
periodically scans the .misplaced-objects account and moves misplaced
objects.
Noticing Misplaced Objects
==========================
There are two places at which misplaced objects are noticed: in
the container replicator, and in the object updater.
The container replicator notices misplaced objects when the two
container DBs have different storage policy indices. In that case,
every object in the container with the wrong policy index gets put
into the work queue. The 'object' rows are then copied as normal.
If a synchronous object update fails, then it may be possible that the
container DBs all get the right policy index before the object updater
runs. Thus, the object updater will enqueue the object for moving if
it receives a full set of 409 responses, after which it will delete
the async_pending.
Handling a Misplaced Object
===========================
The container reconciler daemon periodically lists containers in the
.misplaced-objects account, and processes each one. The container
listing contains the name, timestamp, and etag for each object, and
the container name is "%(hash)-%(spi)", where "hash" is the md5 of
"/account/container", and "spi" is the storage policy index.
The original container name is stored in the container metadata. When
the container reconciler notices a misplaced object, it will move it
to the proper place. This happens by making a GET request to the
object servers in the wrong policy and three PUT requests to the
object servers in the right policy and streaming the object from one to
the other. The content-type, etag, and other metadata will be
preserved.
Cleaning Up Leftovers
=====================
After a successful PUT to the object servers in the right policy, the
original object is deleted.
Interesting Failure Cases
=========================
Crash mid-enqueue in replicator
-------------------------------
The replicator may have to enqueue multiple objects for moving; if it
crashes halfway through, the work queue will contain some entries but
not others. This can result in a single object being enqueued,
processed, and then enqueued again.
Also note that the replicator cannot fix the container's SPI until all
the rows have been successfully enqueued, otherwise objects may never
get enqueued and hence will be lost forever.
Crash after obj PUT but before DELETE
-------------------------------------
On the next run, the container reconciler will re-issue the PUT
requests. Since the timestamp will be the same, it'll get back 409
responses from the object servers, and it will recognize that as
indicating that the object already exists. Then it'll retry the DELETE
requests.
Double Enqueue
--------------
An object can be enqueued for moving, processed, and then enqueued
again later. There are a few ways this can happen; multiple container
DBs may have the wrong SPI so each one enqueues, or multiple
async_pendings may cause multiple enqueues.
If an object gets enqueued a second time post-move, then we have a
problem: we get a 404 from the object servers in the wrong policy.
Now, we can't remove the work queue entry because the object might
just be temporarily unavailable. However, we can't keep it because
then the queue grows without bound.
One might be tempted to check and see if the object has appeared in
the right policy, and if so, just issue a DELETE to the wrong one and
move on. However, if the user has deleted the object in the interim,
it won't be there, and you can't tell the difference between that and
a move that hasn't happened yet.
Keep timestamp, use X-Tiebreaker header [rejected]
==================================================
Instead of incrementing the moved object's X-Timestamp, keep it the
same. To ensure that the object row is updated, we pass along some
X-Tiebreaker header to the container server, and that changes its behavior
so that the new row is merged even if its timestamp is equal to the
row already there (change a < to a <=).
This has the advantage of not modifying the X-Timestamp value, which
means clients never see any change in Last-Modified. However, if you
get into a situation where two container DBs disagree about the SPI,
replication may never fix it. The container hash (container_stat.hash)
is based on the names and timestamps of everything in the objects
table (grep for "chexor"), so those two container DBs have the same
hash value, and so replication will consider them identical.
Also, when deleting the misplaced object from the old policy, we'd
have to pass in X-Tiebreaker to make sure the deletion took effect,
and then we'd have to persist that in the tombstone somehow so that
object replication could look for it to make sure the delete
propagates. Yikes.
Only store SPI per-container
============================
Instead of keeping per-object things, only store SPI per-container.
When a new object is PUT to the container, if the SPI is wrong, write
it to a "work queue" (set of containers, like .expiring_objects) for a
daemon to consume.
With this design, we have no way to make an object available while it
is misplaced. If the SPI is stored per-object, the proxy can get an
object 404 and ask the container server what SPI the object has.
Also, what if two containers have different ideas of the right SPI,
and updates only hit one of them so that the work queue does not get
updated? Replication will propagate the object listing to both
containers, but there will never have been a PUT that saw the SPI
mismatch, so no queue entry will be created, and the object will be
lost.
@clayg
Copy link

clayg commented Feb 22, 2014

I don't think it's quite fair to say "we can't tell the difference" between a object deleted after move and move hasn't happened yet. You know when the object to be moved was created and swift damn well knows when the tombstone was created if there is one:

https://gist.github.com/clayg/9148820

Just tell swift to cough up the goods, there's a comment right below that where X-Timestamp is added to HEAD responses for container sync, and something similar in DELETE for expiring objects.

More to the point I think LBYL is the better strategy generally for reconciliation since it's likely multiple actors will be at play and cheaper to see if someone already did the work (or if the work is negated by a newer PUT/DELETE) before even checking you can still get to the allegedly misplaced object. If the thing you have that is supposed to be in sp-RIGHT is already there then your only remaining job in life is to get some tombstones down in sp-WRONG - the more the better! (you could even send the .misplaced-objects container headers along with the DELETES?)

I think the cleanup table is risky if it's not going to be replicated. Quite possible that one of the three container's reconcilers sends out the new PUT with the modified timestamp and it's the first the other two containers are hearing of it (i.e. the original PUTs aysnc haven't come in yet, so there's nothing to move into cleanup, and when they do come in they'll be too late). So now we have one and only one piece of spinning rust that knows it needs to clean up the dark matter that's fully replicated.

Also I don't think you should stick any moving objects around inside the main container-updater loop. Those account stats are fully async so his cycle time is very sensitive to your clusters eventual consistency window for things like quotas etc.

Also for the spi in the container approach it's not clear to me what the strategy is when the container-replicator realizes it's stats table has the wrong spi? Obviously it would still want to replicate it's rows - but first it has to fix it misplaced object counts stats? Would his reconcilers be fighting the other guys in the meantime? The async-reject-queue strategy of dump your objects and shoot yourself in the head (maybe not even in that order) seems pretty safe - at least you can hope somewhere an object server (or two) was trying to talk to a primary that knew etter and ended up with an async that will eventually get enqueued...

Anyway if that "just change the spi on the container" thing "works" - you've sorta almost got transparent container storage policy updates/migrations!?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment