@smerritt
Last active August 29, 2015 13:56
container reconciler design notes
Overview
========
Each container DB knows the storage policy index for all its objects.
The container updater copies misplaced objects to the right place and
removes them from the wrong one. All the bookkeeping happens in the
container DBs.
Noticing Misplaced Objects
==========================
Misplaced objects can be located by looking at a container DB. Each
row in the 'object' table has a 'storage_policy_index' column
representing the policy index where the object lives. There is also a
'storage_policy_index' column in the 'container_stat' table
representing where all the objects in that container *should* live.
When they differ, that object is said to be misplaced.
In order to find such container DBs, we'll re-use the container
updater, as it already walks the filesystem looking at all the
container DBs. Upon finding a misplaced object, it will move it to the
right place (i.e. the right storage policy index).
Optimization note: the 'container_stat' table contains a
'misplaced_object_count' column which is kept up-to-date by triggers
on the 'object' table. When there are no misplaced objects (which will
be the vast majority of the time), there will be no need to query the
(sizable) 'object' table; a single query to the (1-row)
'container_stat' table will suffice.
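A minimal sketch of that fast path, assuming the table and column
names described above (the helper function and connection handling are
hypothetical):

    import sqlite3

    def find_misplaced(db_path):
        conn = sqlite3.connect(db_path)
        # Cheap check against the 1-row container_stat table first.
        count, policy = conn.execute(
            "SELECT misplaced_object_count, storage_policy_index"
            " FROM container_stat").fetchone()
        if count == 0:
            return []  # the common case: the object table is never touched
        # Only now scan the (sizable) object table.
        return conn.execute(
            "SELECT name, created_at, storage_policy_index FROM object"
            " WHERE deleted = 0 AND storage_policy_index != ?",
            (policy,)).fetchall()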
Optimization note 2: if there is already an entry for this object in
the 'object_cleanup' table (explained later), then the container
reconciler can skip over the object, since the entry means the object
has already been moved into the right storage policy.
Handling a Misplaced Object
===========================
When the container reconciler notices a misplaced object, it will move
it to the proper place: it makes a GET request to an object server in
the wrong policy, makes PUT requests to the (typically three) object
servers in the right policy, and streams the object from the former to
the latter. The content-type, etag, and other metadata will be
preserved. X-Container-* headers will be sent so that the container
will be informed of the move via the normal async-update mechanism.
The async-update path already passes the SPI to the container server;
that's how the container learns where objects are in the first place.
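A rough sketch of the copy, with ring lookups, streaming, and error
handling elided; the node arguments and header filtering here are
simplified assumptions:

    import http.client

    def move_object(wrong_node, right_nodes, path, container_headers):
        # GET from an object server in the wrong policy.
        src = http.client.HTTPConnection(wrong_node)
        src.request('GET', path)
        resp = src.getresponse()
        body = resp.read()  # a real daemon would stream, not buffer
        headers = {k: v for k, v in resp.getheaders()
                   if k.lower() in ('content-type', 'etag')}
        headers.update(container_headers)  # X-Container-* headers
        # The X-Timestamp bump is described below.
        # PUT to each object server in the right policy.
        for node in right_nodes:
            dst = http.client.HTTPConnection(node)
            dst.request('PUT', path, body=body, headers=headers)
            dst.getresponse().read()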
The one piece of metadata that will be altered is the timestamp. The
timestamp of the moved object will be the original object's timestamp
plus one tick of Swift's normalized timestamp format (0.00001
seconds). This ensures that the container listings get properly
updated; if the timestamp were equal, the update would be ignored.
Note that this may, in very rare cases, produce a client-visible
change: if you have an object in the wrong policy and its timestamp
ends with .99999, then the Last-Modified timestamp of the object will
be one second greater than it was at upload. Since the probability of
a .99999 timestamp is one in a hundred thousand, and this only affects
misplaced objects (which should be extremely rare), this seems like a
reasonably small impact.
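A small worked example of the rollover case, assuming Swift's
five-decimal normalized timestamp format:

    import time

    TICK = 0.00001  # one tick of a %016.5f normalized timestamp

    orig = 1393000000.99999
    moved = round(orig + TICK, 5)  # rolls over to 1393000001.00000

    # Last-Modified comes from the whole-second part of X-Timestamp,
    # so only the rollover case is visible to clients:
    assert int(moved) - int(orig) == 1
    print(time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(orig)))
    print(time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(moved)))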
Tracking Misplaced Leftovers
============================
The object servers tell the container servers about the object move
via the normal async-update mechanism. When the container server
notices that an object has changed policy from incorrect to correct,
it creates a record in the 'object_cleanup' table with the moved
object's name, (incremented) timestamp, and original (wrong) SPI.
The container server will be smart enough not to create duplicate
object_cleanup records for the same object.
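A hypothetical sketch of the record creation, assuming the
object_cleanup table has a uniqueness constraint on
(name, storage_policy_index):

    def record_cleanup(conn, name, timestamp, old_spi):
        # INSERT OR IGNORE makes a repeated update for the same move a
        # no-op, so no duplicate cleanup records are created.
        conn.execute(
            "INSERT OR IGNORE INTO object_cleanup"
            " (name, created_at, storage_policy_index) VALUES (?, ?, ?)",
            (name, timestamp, old_spi))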
Cleaning Up Leftovers
=====================
The container reconciler will read entries out of the 'object_cleanup'
table and will issue DELETE requests with the appropriate X-Timestamp
header. This will remove the leftover object data in the wrong policy
index. Even if the relevant object nodes are down, a tombstone will be
created and replication will propagate the deletion. After the DELETEs
are issued, the row is removed from the 'object_cleanup' table.
These DELETEs will not contain any X-Container-* headers, so there is
no need to worry about the order in which the PUTs and DELETEs hit the
container.
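A sketch of that cleanup pass; nodes_for and path_for stand in for
ring lookups and are hypothetical:

    import http.client

    def clean_up_leftovers(conn, nodes_for, path_for):
        rows = conn.execute(
            "SELECT name, created_at, storage_policy_index"
            " FROM object_cleanup").fetchall()
        for name, timestamp, spi in rows:
            for node in nodes_for(name, spi):
                c = http.client.HTTPConnection(node)
                # Deliberately no X-Container-* headers on these.
                c.request('DELETE', path_for(name, spi),
                          headers={'X-Timestamp': timestamp})
                c.getresponse().read()
            conn.execute(
                "DELETE FROM object_cleanup WHERE name = ?"
                " AND storage_policy_index = ?", (name, spi))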
Interesting Failure Cases
=========================
Crash while creating cleanup entry
----------------------------------
If the container server crashes after the update with the new SPI
comes in but before it can write to the 'object_cleanup' table, we'd
be left with dark-matter objects in the wrong policy. To prevent this,
the row merge and the cleanup-record insert will be wrapped in a
single sqlite transaction.
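With Python's sqlite3 module, a connection used as a context manager
gives exactly that atomicity (merge_row and record_cleanup are
hypothetical helpers):

    def merge_update(conn, merge_row, record_cleanup):
        # Both writes commit together or roll back together; there is
        # no window where the row is merged but the cleanup record is
        # missing.
        with conn:
            merge_row(conn)
            record_cleanup(conn)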
Only Let The Right Objects In [rejected]
========================================
Each container server only accepts object PUTs with the right storage
policy index (409 otherwise). If container replication finds a
policy-index mismatch, all objects in the incorrect container are
enqueued for moving. If the object updater receives all 409 responses
when processing an async_pending, the object is enqueued for moving.
The queue is a set of containers under a .misplaced-objects account.
One instance of the container reconciler runs in the cluster; it
periodically scans the .misplaced-objects account and moves misplaced
objects.
Noticing Misplaced Objects
==========================
There are two places at which misplaced objects are noticed: in
the container replicator, and in the object updater.
The container replicator notices misplaced objects when the two
container DBs have different storage policy indices. In that case,
every object in the container with the wrong policy index gets put
into the work queue. The 'object' rows are then copied as normal.
If a synchronous object update fails, it is possible that the
container DBs will all have the right policy index by the time the
object updater runs. Thus, the object updater will enqueue the object
for moving if it receives a full set of 409 responses, after which it
will delete the async_pending.
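A hypothetical sketch of that updater logic:

    def process_async_pending(update, statuses, enqueue, unlink):
        # A full set of 409s means every container server already has
        # the right policy index, so the object itself must move.
        if statuses and all(s == 409 for s in statuses):
            enqueue(update)  # add a row to the .misplaced-objects queue
            unlink(update)   # retire the async_pending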
Handling a Misplaced Object
===========================
The container reconciler daemon periodically lists containers in the
.misplaced-objects account, and processes each one. The container
listing contains the name, timestamp, and etag for each object, and
the container name is "%(hash)-%(spi)", where "hash" is the md5 of
"/account/container", and "spi" is the storage policy index.
The original container name is stored in the container metadata. When
the container reconciler notices a misplaced object, it will move it
to the proper place: it makes a GET request to an object server in the
wrong policy, makes PUT requests to the (typically three) object
servers in the right policy, and streams the object from the former to
the latter. The content-type, etag, and other metadata will be
preserved.
Cleaning Up Leftovers
=====================
After a successful PUT to the object servers in the right policy, the
original object is deleted.
Interesting Failure Cases
=========================
Crash mid-enqueue in replicator
-------------------------------
The replicator may have to enqueue multiple objects for moving; if it
crashes halfway through, the work queue will contain some entries but
not others. This can result in a single object being enqueued,
processed, and then enqueued again.
Also note that the replicator cannot fix the container's SPI until all
the rows have been successfully enqueued; otherwise, some objects may
never get enqueued and hence will be lost forever.
Crash after obj PUT but before DELETE
-------------------------------------
On the next run, the container reconciler will re-issue the PUT
requests. Since the timestamp will be the same, it'll get back 409
responses from the object servers, and it will recognize that as
indicating that the object already exists. Then it'll retry the DELETE
requests.
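In code, the retry loop just treats a 409 like a success (hypothetical
sketch):

    def reprocess(entry, do_puts, do_deletes):
        # 201: the PUT landed fresh; 409 with an equal timestamp: a
        # prior run already moved the object. Either way, finish up
        # with the DELETEs.
        statuses = do_puts(entry)
        if all(s in (201, 409) for s in statuses):
            do_deletes(entry)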
Double Enqueue
--------------
An object can be enqueued for moving, processed, and then enqueued
again later. There are a few ways this can happen; multiple container
DBs may have the wrong SPI so each one enqueues, or multiple
async_pendings may cause multiple enqueues.
If an object gets enqueued a second time post-move, then we have a
problem: we get a 404 from the object servers in the wrong policy.
Now, we can't remove the work queue entry because the object might
just be temporarily unavailable. However, we can't keep it because
then the queue grows without bound.
One might be tempted to check and see if the object has appeared in
the right policy, and if so, just issue a DELETE to the wrong one and
move on. However, if the user has deleted the object in the interim,
it won't be there, and you can't tell the difference between that and
a move that hasn't happened yet.
Keep timestamp, use X-Tiebreaker header [rejected]
==================================================
Instead of incrementing the moved object's X-Timestamp, keep it the
same. To ensure that the object row is updated, we pass along some
X-Tiebreaker header to the container server, and that changes its behavior
so that the new row is merged even if its timestamp is equal to the
row already there (change a < to a <=).
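The merge-rule change amounts to a single comparison (sketch; the
names are made up):

    def row_wins(new_ts, existing_ts, tiebreaker=False):
        # Normal rule: strictly newer wins. With X-Tiebreaker set, an
        # equal timestamp also wins (the "<" becomes "<=").
        if tiebreaker:
            return existing_ts <= new_ts
        return existing_ts < new_ts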
This has the advantage of not modifying the X-Timestamp value, which
means clients never see any change in Last-Modified. However, if you
get into a situation where two container DBs disagree about the SPI,
replication may never fix it. The container hash (container_stat.hash)
is based on the names and timestamps of everything in the 'object'
table (grep for "chexor"), so those two container DBs would have the
same hash value, and replication would consider them identical.
Also, when deleting the misplaced object from the old policy, we'd
have to pass in X-Tiebreaker to make sure the deletion took effect,
and then we'd have to persist that in the tombstone somehow so that
object replication could look for it to make sure the delete
propagates. Yikes.
Only store SPI per-container [rejected]
========================================
Instead of storing a per-object SPI, only store the SPI per-container.
When a new object is PUT to the container, if the SPI is wrong, write
it to a "work queue" (set of containers, like .expiring_objects) for a
daemon to consume.
With this design, we have no way to make an object available while it
is misplaced. If the SPI is stored per-object, the proxy can get an
object 404 and ask the container server what SPI the object has.
Also, what if two containers have different ideas of the right SPI,
and updates only hit one of them so that the work queue does not get
updated? Replication will propagate the object listing to both
containers, but there will never have been a PUT that saw the SPI
mismatch, so no queue entry will be created, and the object will be
lost.
clayg commented Feb 22, 2014

I don't think it's quite fair to say "we can't tell the difference" between an object deleted after the move and a move that hasn't happened yet. You know when the object to be moved was created, and swift damn well knows when the tombstone was created if there is one:

https://gist.github.com/clayg/9148820

Just tell swift to cough up the goods, there's a comment right below that where X-Timestamp is added to HEAD responses for container sync, and something similar in DELETE for expiring objects.

More to the point, I think LBYL is the better strategy generally for reconciliation, since it's likely multiple actors will be at play and it's cheaper to see if someone already did the work (or if the work is negated by a newer PUT/DELETE) before even checking that you can still get to the allegedly misplaced object. If the thing you have that is supposed to be in sp-RIGHT is already there, then your only remaining job in life is to get some tombstones down in sp-WRONG - the more the better! (you could even send the .misplaced-objects container headers along with the DELETEs?)

I think the cleanup table is risky if it's not going to be replicated. Quite possible that one of the three containers' reconcilers sends out the new PUT with the modified timestamp and it's the first the other two containers are hearing of it (i.e. the original PUTs' asyncs haven't come in yet, so there's nothing to move into cleanup, and when they do come in they'll be too late). So now we have one and only one piece of spinning rust that knows it needs to clean up the dark matter that's fully replicated.

Also I don't think you should stick any object-moving work inside the main container-updater loop. Those account stats are fully async, so its cycle time is very sensitive to your cluster's eventual-consistency window for things like quotas etc.

Also, for the spi-in-the-container approach, it's not clear to me what the strategy is when the container-replicator realizes its stats table has the wrong spi. Obviously it would still want to replicate its rows - but first it has to fix its misplaced object count stats? Would its reconcilers be fighting the other guys in the meantime? The async-reject-queue strategy of dump your objects and shoot yourself in the head (maybe not even in that order) seems pretty safe - at least you can hope somewhere an object server (or two) was trying to talk to a primary that knew better and ended up with an async that will eventually get enqueued...

Anyway if that "just change the spi on the container" thing "works" - you've sorta almost got transparent container storage policy updates/migrations!?
