container reconciler design notes
=================================

Overview
========

Each container DB knows the storage policy index for all its objects.
The container updater copies misplaced objects to the right place and
removes them from the wrong one. All the bookkeeping happens in the
container DBs.

Noticing Misplaced Objects
==========================

Misplaced objects can be located by looking at a container DB. Each
row in the 'object' table has a 'storage_policy_index' column
representing the policy index where the object lives. There is also a
'storage_policy_index' column in the 'container_stat' table
representing where all the objects in that container *should* live.
When they differ, that object is said to be misplaced.
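
As a minimal sketch, assuming direct sqlite3 access to a container DB
(Swift proper would go through the ContainerBroker rather than raw
SQL), the check looks roughly like this; the table and column names
come straight from the description above:

    import sqlite3

    def find_misplaced(db_path):
        """Return (name, storage_policy_index) for each misplaced object."""
        with sqlite3.connect(db_path) as conn:
            return conn.execute("""
                SELECT o.name, o.storage_policy_index
                  FROM object o, container_stat c
                 WHERE o.storage_policy_index != c.storage_policy_index
            """).fetchall()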

In order to find such container DBs, we'll re-use the container
updater, as it already walks the filesystem looking at all the
container DBs. Upon finding a misplaced object, it will move it to the
right place (i.e. the right storage policy index).

Optimization note: the 'container_stat' table contains a
'misplaced_object_count' column which is kept up-to-date by triggers
on the 'object' table. When there are no misplaced objects (which will
be the vast majority of the time), there will be no need to query the
(sizable) 'object' table; a single query to the (1-row)
'container_stat' table will suffice.
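
As an illustrative sketch (the design only names the column and the
fact that triggers maintain it), the insert-side trigger might look
like the following; a matching trigger would be needed for deletes and
updates as well:

    # Assumed trigger DDL; only the table/column names above are given
    # by the design, the rest is a sketch.
    MISPLACED_COUNT_TRIGGER = """
    CREATE TRIGGER object_insert_misplaced AFTER INSERT ON object
    BEGIN
        UPDATE container_stat
           SET misplaced_object_count = misplaced_object_count +
               (new.storage_policy_index != storage_policy_index);
    END;
    """

    # The fast path is then a single-row read:
    HAS_MISPLACED_SQL = "SELECT misplaced_object_count FROM container_stat"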

Optimization note 2: if there is already an entry for this in the
'object_cleanup' table (explained later), then the container
reconciler can skip over the object since it knows the object has
already moved into the right storage policy.

Handling a Misplaced Object
===========================

When the container reconciler notices a misplaced object, it will move
it to the proper place. This happens by making a GET request to the
object servers in the wrong policy and three PUT requests to the
object servers in the right policy and streaming the object from one
to the other. The content-type, etag, and other metadata will be
preserved. X-Container-* headers will be sent so that the container
will be informed of the move via the normal async-update mechanism.
The async-update path already passes the SPI to the container server;
that's how the container learns where objects are in the first place.

The one piece of metadata that will be altered is the timestamp. The
timestamp of the moved object will be the original object's timestamp
plus one microsecond. This ensures that the container listings get
properly updated; if the timestamp were equal, the update would be
ignored.
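
A sketch of the bump, assuming Swift's normalized 16-character
timestamp format (which has five decimal places, so the smallest
representable bump stands in for the "one microsecond" above):

    def bump_timestamp(orig_timestamp):
        # smallest increment representable in the normalized format
        return '%016.5f' % (float(orig_timestamp) + 0.00001)

    # e.g. bump_timestamp('1392334682.12345') -> '1392334682.12346'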

Note that this may, in very rare cases, produce a client-visible
change: if you have an object in the wrong policy and its timestamp
ends with .99999, then the Last-Modified timestamp of the object will
be one second greater than it was at upload. Since the probability of
a .99999 timestamp is literally one in a million, and this only
affects misplaced objects (which should be extremely rare), this seems
like a reasonably small impact.

Tracking Misplaced Leftovers
============================

The object servers tell the container servers about the object move
via the normal async-update mechanism. When the container server
notices that an object has changed policy from incorrect to correct,
it creates a record in the 'object_cleanup' table with the moved
object's name, (incremented) timestamp, and original (wrong) SPI.
The container server will be smart and not make a duplicate
object_cleanup record.
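
A hypothetical shape for that table and its insert; the design only
pins down the stored fields and the no-duplicates behavior, so the
exact DDL and the unique constraint below are assumptions:

    OBJECT_CLEANUP_DDL = """
    CREATE TABLE IF NOT EXISTS object_cleanup (
        name TEXT NOT NULL,
        created_at TEXT NOT NULL,               -- the incremented timestamp
        storage_policy_index INTEGER NOT NULL,  -- the wrong (original) SPI
        UNIQUE (name, created_at, storage_policy_index)
    );
    """

    # INSERT OR IGNORE plus the UNIQUE constraint gives the
    # "no duplicate records" behavior.
    ENQUEUE_CLEANUP_SQL = """
    INSERT OR IGNORE INTO object_cleanup
        (name, created_at, storage_policy_index)
    VALUES (?, ?, ?)
    """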

Cleaning Up Leftovers
=====================

The container reconciler will read entries out of the 'object_cleanup'
table and will issue DELETE requests with the appropriate X-Timestamp
header. This will remove the leftover object data in the wrong policy
index. Even if the relevant object nodes are down, a tombstone will be
created and replication will propagate the deletion. After the DELETEs
are issued, the row is removed from the 'object_cleanup' table.

These DELETEs will not contain any X-Container-* headers, so there's
no worrying about the order in which the PUTs and DELETEs hit the
container.
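
A sketch of that cleanup pass, assuming swift.common.direct_client and
the hypothetical object_cleanup schema above; 'wrong_ring' is the
object ring for the wrong (original) policy, and note that no
X-Container-* headers are sent:

    from swift.common import direct_client

    def clean_up_row(wrong_ring, wrong_spi, account, container, row):
        part, nodes = wrong_ring.get_nodes(account, container, row['name'])
        headers = {
            'X-Timestamp': row['created_at'],
            # point the object server at the wrong policy's on-disk data
            'X-Backend-Storage-Policy-Index': str(wrong_spi),
        }
        for node in nodes:
            direct_client.direct_delete_object(
                node, part, account, container, row['name'],
                headers=headers)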

Interesting Failure Cases
=========================

Crash while creating cleanup entry
----------------------------------

If the container server crashes after the update with the new SPI
comes in but before it can write to the 'object_cleanup' table, we'd
leave dark-matter objects in the wrong policy. To combat this, the
whole thing will be wrapped in a sqlite transaction.
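
A sketch of that transaction, with merge_object_row standing in as an
illustrative placeholder for the container server's normal row-merge
step (not Swift's actual API):

    import sqlite3

    def apply_update_with_cleanup(db_path, obj_row, old_spi):
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # one transaction: commit on success, else rollback
                merge_object_row(conn, obj_row)  # row now has the right SPI
                conn.execute(
                    "INSERT OR IGNORE INTO object_cleanup "
                    "(name, created_at, storage_policy_index) "
                    "VALUES (?, ?, ?)",
                    (obj_row['name'], obj_row['created_at'], old_spi))
        finally:
            conn.close()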

Only Let The Right Objects In [rejected]
========================================

Each container server only accepts object PUTs with the right storage
policy index (409 otherwise). If container replication finds a
policy-index mismatch, all objects in the incorrect container are
enqueued for moving. If the object updater receives all 409 responses
when processing an async_pending, the object is enqueued for moving.
The queue is a set of containers under a .misplaced-objects account.
One instance of the container reconciler runs in the cluster; it
periodically scans the .misplaced-objects account and moves misplaced
objects.

Noticing Misplaced Objects
==========================

There are two places at which misplaced objects are noticed: in
the container replicator, and in the object updater.

The container replicator notices misplaced objects when the two
container DBs have different storage policy indices. In that case,
every object in the container with the wrong policy index gets put
into the work queue. The 'object' rows are then copied as normal.

If a synchronous object update fails, it is possible that the
container DBs all get the right policy index before the object updater
runs. Thus, the object updater will enqueue the object for moving if
it receives a full set of 409 responses, after which it will delete
the async_pending.

Handling a Misplaced Object
===========================

The container reconciler daemon periodically lists containers in the
.misplaced-objects account, and processes each one. The container
listing contains the name, timestamp, and etag for each object, and
the container name is "%(hash)-%(spi)", where "hash" is the md5 of
"/account/container", and "spi" is the storage policy index. The
original container name is stored in the container metadata.
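
A sketch of that naming scheme; hex md5 output is assumed, since the
design doesn't say how the hash is encoded:

    from hashlib import md5

    def misplaced_container_name(account, container, storage_policy_index):
        h = md5(('/%s/%s' % (account, container)).encode('utf-8')).hexdigest()
        return '%s-%d' % (h, storage_policy_index)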

When the container reconciler notices a misplaced object, it will move
it to the proper place. This happens by making a GET request to the
object servers in the wrong policy and three PUT requests to the
object servers in the right policy and streaming the object from one
to the other. The content-type, etag, and other metadata will be
preserved.

Cleaning Up Leftovers
=====================

After a successful PUT to the object servers in the right policy, the
original object is deleted.

Interesting Failure Cases
=========================

Crash mid-enqueue in replicator
-------------------------------

The replicator may have to enqueue multiple objects for moving; if it
crashes halfway through, the work queue will contain some entries but
not others. This can result in a single object being enqueued,
processed, and then enqueued again.

Also note that the replicator cannot fix the container's SPI until all
the rows have been successfully enqueued, otherwise objects may never
get enqueued and hence will be lost forever.

Crash after obj PUT but before DELETE
-------------------------------------

On the next run, the container reconciler will re-issue the PUT
requests. Since the timestamp will be the same, it'll get back 409
responses from the object servers, and it will recognize that as
indicating that the object already exists. Then it'll retry the DELETE
requests.
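
A sketch of that retry behavior; put_in_right_policy and
delete_from_wrong_policy are illustrative placeholders, not actual
functions:

    def move_object(entry):
        status = put_in_right_policy(entry)  # re-issues the PUTs
        if status in (201, 409):             # 409: already there with this
            delete_from_wrong_policy(entry)  #      exact timestamp
            return True
        return False                         # leave queued; retry later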

Double Enqueue
--------------

An object can be enqueued for moving, processed, and then enqueued
again later. There are a few ways this can happen; multiple container
DBs may have the wrong SPI so each one enqueues, or multiple
async_pendings may cause multiple enqueues.

If an object gets enqueued a second time post-move, then we have a
problem: we get a 404 from the object servers in the wrong policy.
Now, we can't remove the work queue entry because the object might
just be temporarily unavailable. However, we can't keep it because
then the queue grows without bound.

One might be tempted to check and see if the object has appeared in
the right policy, and if so, just issue a DELETE to the wrong one and
move on. However, if the user has deleted the object in the interim,
it won't be there, and you can't tell the difference between that and
a move that hasn't happened yet.

Keep timestamp, use X-Tiebreaker header [rejected]
==================================================

Instead of incrementing the moved object's X-Timestamp, keep it the
same. To ensure that the object row is updated, we pass along some
X-Tiebreaker header to the container server, and that changes its
behavior so that the new row is merged even if its timestamp is equal
to the row already there (change a < to a <=).
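
An illustrative version of that relaxed comparison (the names here are
made up for the sketch, not Swift's actual merge code):

    def incoming_row_wins(existing_ts, incoming_ts, tiebreaker=False):
        if tiebreaker:
            return existing_ts <= incoming_ts  # '<' relaxed to '<='
        return existing_ts < incoming_ts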

This has the advantage of not modifying the X-Timestamp value, which
means clients never see any change in Last-Modified. However, if you
get into a situation where two container DBs disagree about the SPI,
replication may never fix it. The container hash (container_stat.hash)
is based on the names and timestamps of everything in the objects
table (grep for "chexor"), so those two container DBs have the same
hash value, and so replication will consider them identical.
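
A rough illustration of why the hashes match: the hash folds in only
each row's name and timestamp (Swift's chexor works roughly like
this), so the storage policy index never influences it:

    from hashlib import md5

    def container_hash(rows):
        # rows: iterable of (name, timestamp); SPI is deliberately absent
        h = 0
        for name, timestamp in rows:
            h ^= int(md5(('%s-%s' % (name, timestamp))
                         .encode('utf-8')).hexdigest(), 16)
        return '%032x' % h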

Also, when deleting the misplaced object from the old policy, we'd
have to pass in X-Tiebreaker to make sure the deletion took effect,
and then we'd have to persist that in the tombstone somehow so that
object replication could look for it to make sure the delete
propagates. Yikes.

Only store SPI per-container
============================

Instead of keeping a storage policy index per object, only store the
SPI per container. When a new object is PUT to the container, if the
SPI is wrong, write it to a "work queue" (set of containers, like
.expiring_objects) for a daemon to consume.

With this design, we have no way to make an object available while it
is misplaced. If the SPI is stored per-object, the proxy can get an
object 404 and ask the container server what SPI the object has.

Also, what if two containers have different ideas of the right SPI,
and updates only hit one of them so that the work queue does not get
updated? Replication will propagate the object listing to both
containers, but there will never have been a PUT that saw the SPI
mismatch, so no queue entry will be created, and the object will be
lost.

Comments
========

I don't think it's quite fair to say "we can't tell the difference"
between an object deleted after the move and a move that hasn't
happened yet. You know when the object to be moved was created, and
Swift damn well knows when the tombstone was created if there is one:

https://gist.github.com/clayg/9148820

Just tell Swift to cough up the goods; there's a comment right below
that where X-Timestamp is added to HEAD responses for container sync,
and something similar in DELETE for expiring objects.

More to the point, I think LBYL is the better strategy generally for
reconciliation, since it's likely multiple actors will be at play and
it's cheaper to see if someone already did the work (or if the work is
negated by a newer PUT/DELETE) before even checking that you can still
get to the allegedly misplaced object. If the thing that is supposed
to be in sp-RIGHT is already there, then your only remaining job in
life is to get some tombstones down in sp-WRONG - the more the better!
(You could even send the .misplaced-objects container headers along
with the DELETEs?)

I think the cleanup table is risky if it's not going to be replicated.
It's quite possible that one of the three containers' reconcilers
sends out the new PUT with the modified timestamp and that's the first
the other two containers are hearing of it (i.e. the original PUT's
asyncs haven't come in yet, so there's nothing to move into cleanup,
and when they do come in they'll be too late). So now we have one and
only one piece of spinning rust that knows it needs to clean up the
dark matter that's fully replicated.

Also, I don't think you should stick any object-moving work inside the
main container-updater loop. Those account stats are fully async, so
its cycle time is very sensitive to your cluster's eventual-consistency
window for things like quotas etc.

Also, for the SPI-in-the-container approach, it's not clear to me what
the strategy is when the container replicator realizes its stats table
has the wrong SPI. Obviously it would still want to replicate its rows,
but first it has to fix its misplaced-object-count stats? Would its
reconcilers be fighting the other guys in the meantime? The
async-reject-queue strategy of dump your objects and shoot yourself in
the head (maybe not even in that order) seems pretty safe; at least you
can hope somewhere an object server (or two) was trying to talk to a
primary that knew better and ended up with an async that will
eventually get enqueued...

Anyway, if that "just change the SPI on the container" thing "works",
you've sort of almost got transparent container storage policy
updates/migrations!?