@angrycub
Created February 8, 2013 18:49
Mailing List Post from Jon Meredith about Delete Mode.

Summary/tl;dr - Riak 1.0.0 has introduced more control over deletion with the delete_mode setting.

If you plan to delete and recreate objects under the same key rapidly, and there is enough disk available to store tombstones, it is safest to set delete_mode to keep.

The default three-second delay before removing tombstones is a compromise: it keeps the tombstone around long enough to cover rapid delete/recreate cycles but, unlike keep mode, it does eventually remove the data.

Riak keeps your objects available during failures by storing multiple copies of the data. This redundancy makes deletion more complex than in a single-node database. For example, Riak needs to ensure that deletes issued while nodes are down get applied when the nodes recover, and it has to resolve what happens if the network is partitioned and an object is deleted on one side but updated on the other.

Deletes in Riak are a two-step process. First, Riak writes a tombstone object to the N replicas; only once all replicas have stored the tombstone is it removed. If fallback nodes are in use, the tombstone will not be removed. Instead, Riak waits until the next time the object is accessed and the same tombstone object is successfully read from all primaries. This scheme works well most of the time, but it is not perfect and Basho still has some improvements planned. As part of Riak 1.0 we've added a delete_mode setting to give more control over the deletion process while we complete that work.

Explaining what delete_mode does requires more gory details. When a client requests deletion of an object, Riak internally performs a get/put against the object to write the tombstone, then acknowledges the delete to the client so it can continue. In the background, Riak issues a second get operation to trigger tombstone removal.

Whenever Riak gets an object, as part of a delete or not, it requests all replicas of the object from the current preference list (made up of the owner nodes responsible for storing replicas of the object, plus fallback nodes if any owners are unavailable). If any vnodes return out-of-date or conflicting objects, it will issue read repairs and stop. If all replicas have the same object, that object is a tombstone, and there are no fallbacks in the preference list, then the get FSM issues the request to remove the tombstone permanently.
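The decision described above can be sketched in a few lines of Python. This is only an illustration, not Riak's actual get FSM: the Reply record, the after_get function and the action names are all hypothetical, and real sibling/vclock comparison is reduced to a plain equality check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reply:
    """One vnode's answer to a get (hypothetical shape, for illustration only)."""
    vclock: Optional[tuple]   # None means the vnode had no object (notfound)
    tombstone: bool = False
    fallback: bool = False    # reply came from a fallback vnode, not an owner


def after_get(replies):
    """Return the follow-up action once every replica has answered."""
    # Simplification: any disagreement between replicas counts as
    # "out of date or conflicting" and triggers read repair.
    distinct = {(r.vclock, r.tombstone) for r in replies}
    if len(distinct) > 1:
        return "read_repair"
    # Fallbacks in the preference list block tombstone removal.
    if any(r.fallback for r in replies):
        return "none"
    # All replicas agree, all are owners, and the object is a tombstone.
    if replies and all(r.tombstone for r in replies):
        return "remove_tombstone"
    return "none"
```

With three identical tombstone replies from owner vnodes the sketch returns "remove_tombstone"; swapping one owner for a fallback, or one tombstone for a notfound, changes the outcome to "none" or "read_repair" respectively.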

Removing the tombstone object is the hard part of deletion. To maintain its availability properties, Riak relies on eventual consistency and deliberately does not synchronize updates to objects across replicas, as one or more could fail during a request. This means that there can be small variations in when the remove-tombstone request is processed on each replica. If a client updates the object (a get/put) during the window between the tombstone-checking get deciding to remove the tombstones and the tombstones actually being removed on the nodes, then one of two things will happen. Get uses R to decide when it has enough responses, so if any of the first R responses include the tombstone object, the response can include a vector clock to base the new object on, and the new object will supersede the tombstones. If none of the first R of N responses contain the tombstone object, there is no versioning information to return to the client, so a new object will be created with empty versioning information.
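To make the two outcomes concrete, here is a toy model of what the client gets back. It is an assumption-laden sketch: vclock_for_client is a hypothetical name, each reply is reduced to either a vclock or None (notfound), and the merging of vclocks that a real get performs is left out.

```python
def vclock_for_client(replies, r):
    """Pick the versioning info a get would hand back to the client.

    replies: vnode answers in arrival order; each entry is a vclock
             (any value) or None when that vnode no longer has the object.
    r:       the R value, i.e. how many answers the get waits for.
    Returns a vclock, or None when there is nothing to base a put on.
    """
    for vclock in replies[:r]:
        if vclock is not None:
            # At least one of the first R answers still held the tombstone,
            # so the client's put can descend from it.
            return vclock
    # Every one of the first R answers was notfound: the put starts
    # from empty versioning information.
    return None
```

For example, with N=3 and R=2, the arrival order [None, tombstone_vclock, None] yields the tombstone's vclock, while [None, None, tombstone_vclock] yields None even though one replica still holds the tombstone.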

The new delete_mode setting in the riak_kv section of app.config controls what happens during the window between the get FSM deciding it can remove the object and the removal taking place. If a delay is set, when the tombstone removal request is received, the tombstone is hashed and then checked after the delay to see if it has changed. If it has not, the tombstone is removed. If the hash has changed due to an update, the new object is left alone. This is the default delete_mode, with a delay of 3000 (three seconds). Delayed deletes are implemented using timers on the vnodes, so setting long delays on systems with heavy delete activity will increase the memory footprint.
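The hash-then-recheck idea can be sketched as follows. Everything here is hypothetical: VnodeStore is a toy in-memory stand-in for a vnode, SHA-1 is only a placeholder since the post does not say which hash Riak uses, and time is passed in explicitly instead of using real timers.

```python
import hashlib


class VnodeStore:
    """Toy in-memory stand-in for one vnode's storage (illustration only)."""

    def __init__(self):
        self.data = {}      # key -> stored bytes
        self.pending = {}   # key -> (due time in ms, digest at request time)

    def request_reap(self, key, now_ms, delay_ms=3000):
        """Handle a tombstone removal request: hash now, decide later."""
        digest = hashlib.sha1(self.data[key]).hexdigest()
        self.pending[key] = (now_ms + delay_ms, digest)

    def tick(self, now_ms):
        """Fire every timer that is due; return the keys actually removed."""
        removed = []
        for key, (due, digest) in list(self.pending.items()):
            if now_ms < due:
                continue
            del self.pending[key]
            current = self.data.get(key)
            # Remove only if the stored object is still the same tombstone.
            if current is not None and hashlib.sha1(current).hexdigest() == digest:
                del self.data[key]
                removed.append(key)
        return removed
```

Each outstanding request occupies an entry in pending until its timer fires, which is the toy equivalent of the memory cost of long delays under heavy delete load.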

Setting delete_mode to keep disables tombstone removal. This is useful for applications where clients may be disconnected for extended periods and keep local copies of objects that they will update on reconnection (for example, mobile clients, or multi-data center replication when the two sites will be disconnected for long periods of time). It also protects against an edge case where an object is deleted and recreated on the owning nodes while a fallback is either down or awaiting handoff.

There is also an immediate delete_mode, which preserves the old 0.14.2 behavior of removing the tombstone as soon as the request is received. Some unit tests for clients (e.g. the Python client) rely on the old behavior and expect the versioning information to be reset after the delete. The 1.0 HTTP, PBC and riak_client interfaces can now provide vclocks when tombstone objects are present so that a put will supersede them.

Wrapping it up - the default delete_mode will work for most use cases, with the option to override it when needed. To change the setting, add {delete_mode, keep}, {delete_mode, immediate} or {delete_mode, DelayMsecs} to the riak_kv section of app.config (in /etc/riak, /opt/riak or etc/ depending on your platform).
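As a compact recap of the three forms, the following Python sketch maps a delete_mode value to the delay before a tombstone is removed. The function name is hypothetical, and treating immediate as a zero delay is a simplification for illustration; it is not a description of how Riak dispatches the modes internally.

```python
def reap_delay_ms(delete_mode):
    """Map a delete_mode value to a removal delay in milliseconds.

    Returns None when tombstones are never removed (keep).
    """
    if delete_mode == "keep":
        return None            # tombstones stay on disk
    if delete_mode == "immediate":
        return 0               # old 0.14.2 behavior: remove on request
    if (isinstance(delete_mode, int)
            and not isinstance(delete_mode, bool)
            and delete_mode >= 0):
        return delete_mode     # DelayMsecs; the default is 3000
    raise ValueError(f"unsupported delete_mode: {delete_mode!r}")
```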

Jon Meredith
Senior Software Engineer
Basho Technologies
