Skip to content

Instantly share code, notes, and snippets.

@zaitcev
Forked from tipabu/ring-formats.md
Last active October 19, 2021 23:30
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zaitcev/d0c900975b496d4b7f2ddb3181a0837d to your computer and use it in GitHub Desktop.
Save zaitcev/d0c900975b496d4b7f2ddb3181a0837d to your computer and use it in GitHub Desktop.

Ring File Formats

Ring files contain three key pieces of information:

  • the part_power value (often stored as part_shift := 32 - part_power), which determines how many partitions are in the ring,
  • the device list, which includes all the disks participating in the ring, and
  • the replica-to-part-to-device table, which has all replica_count * (2 ** part_power) partition assignments.

Ring files have always been gzipped when serialized, though the inner, raw format has evolved over the years.

Ring v0

Initially, rings were simply pickle dumps of the RingData object. With Swift 1.3.0, this changed to pickling a pure-stdlib data structure, but the core concept was the same.

Ring v1

Pickle presented some problems, however. While there are security concerns around unpickling untrusted data, security boundaries are generally drawn such that rings are assumed to be trusted. Ultimately, what pushed us to a new format were performance considerations.

Starting in Swift 1.7.0, Swift began using a new format (while still being willing to read the old one). The new format starts with some magic so we may identify it as such:

+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where <vrs> is a network-order two-byte version number (which is always 1). After that, a JSON object is serialized as

+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+

where <data-length> is the network-order four-byte length (in bytes) of <data>, which is the ASCII-encoded JSON-serialized object. This object has at minimum three keys:

  • devs for the device list
  • part_shift (i.e., 32 - part_power)
  • replica_count for the integer number of part-to-device rows to read

The replica-to-part-to-device table then follows:

+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
|                ...                |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+

Each <dev> is a host-order two-byte index into the devs list. Every row except the last has exactly 2 ** part_power entries; the last row may have the same or fewer.

The metadata object has proven quite versatile: new keys have been added to provide additional information while remaining backwards-compatible. In order, the following new fields have been added:

  • byteorder specifies whether the host-order for the replica-to-part-to-device table is "big" or "little" endian. Added in Swift 2.12.0, this allows rings written on big-endian machines to be read on little-endian machines and vice-versa.
  • next_part_power indicates whether a partition-power increase is in progress. Added in Swift 2.15.0, this will have one of two values, if present: the ring's current part_power, indicating that there may be hardlinks to clean up, or part_power + 1 indicating that hardlinks may need to be created. See the documentation for more information.
  • version specifies the version number of the ring-builder that was used to write this ring. Added in Swift 2.24.0, this allows the comparing of rings from different machines to determine which is newer.

Ring v2

The way that v1 rings dealt with fractional replicas made it impossible to reliably serialize additional large data structures after the replica-to-part-to-device table.

The new format starts with magic similar to v1:

+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where is a network-order two-byte version number (which is now 2). After that, a series of BLOBs are serialized, each as

+-------------------------------+-------...---+
| <data-length>                 | <data ... > |
+-------------------------------+-------...---+

where <data-length> is the network-order eight-byte length (in bytes) of <data>.

The first such BLOB is an ASCII-encoded JSON object full of metadata, similar to v1 rings. It has the following required keys:

  • devs
  • part_shift
  • dev_id_bytes specifies the number of bytes used for each <dev> in the replica-to-part-to-device table; will be one of 1, 2, 4, or 8

Additionally, there are several optional keys which may be present:

  • next_part_power
  • version

Notice that two keys are no longer present: replica_count is no longer needed as the size of the replica-to-part-to-device table is explicit, and byteorder is not needed as all data in v2 rings should be written using network-order.

The second BLOB is the replica-to-part-to-device table. It's length will be replicas * (2 ** part_power) * dev_id_bytes, where replicas is the exact (potentially fractional) replica count for the ring. Unlike in v1, each <dev> is written using network-order.

Note that this is why we increased the size of <data-length> as compared to the v1 format -- otherwise, we may not be able to represent rings with both high replica_count and high part_power.

A third BLOB may be present: the last-primaries table. Its structure is similar to the replica-to-part-to-device table, but <dev> may be (2 ** dev_id_bytes) - 1 to indicate the absence of a value.

Future Ideas

The v2 ring format seems flexible enough that we may be able to consolidate the ring and builder formats. Several currently-missing items could be trivially added to the metadata object:

  • min_part_hours (float?)
  • devs_changed (bool)
  • overload (float)
  • _last_part_moves_epoch (int)
  • _last_part_gather_start (int)
  • id (str)
  • dispersion (float, cached)
  • _remove_devs (list of dicts, but no larger than devs)

Others are large enough that they should likely be serialized as further BLOBs:

  • _last_part_moves - array of 2 ** part_power one-byte number of hours since a partition last moved
  • _dispersion_graph - dict of ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment