zaitcev/ring-formats.md Secret

## ring-formats.md

      
    Raw
  

              ring-formats.md
            
          
    Ring File Formats

Ring files contain three key pieces of information:

the part_power value (often stored as part_shift := 32 - part_power),
which determines how many partitions are in the ring,
the device list, which includes all the disks participating in the ring, and
the replica-to-part-to-device table, which has all
replica_count * (2 ** part_power) partition assignments.

Ring files have always been gzipped when serialized, though the inner,
raw format has evolved over the years.
Ring v0

Initially, rings were simply pickle dumps of the RingData object. With
Swift 1.3.0, this
changed to pickling a pure-stdlib data structure, but the core concept
was the same.
Ring v1

Pickle presented some problems, however. While there are security
concerns around unpickling
untrusted data, security boundaries are generally drawn such that rings are
assumed to be trusted. Ultimately, what pushed us to a new format were
performance considerations.
Starting in Swift 1.7.0,
Swift began using a new format (while still being willing to read the old one).
The new format starts with some magic so we may identify it as such:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where <vrs> is a network-order two-byte version number (which is always 1).
After that, a JSON object is serialized as
+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+

where <data-length> is the network-order four-byte length (in bytes) of
<data>, which is the ASCII-encoded JSON-serialized object. This object
has at minimum three keys:

devs for the device list
part_shift (i.e., 32 - part_power)
replica_count for the integer number of part-to-device rows to read

The replica-to-part-to-device table then follows:
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
|                ...                |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+

Each <dev> is a host-order two-byte index into the devs list. Every row
except the last has exactly 2 ** part_power entries; the last row may
have the same or fewer.
The metadata object has proven quite versatile: new keys have been added
to provide additional information while remaining backwards-compatible.
In order, the following new fields have been added:

byteorder specifies whether the host-order for the
replica-to-part-to-device table is "big" or "little" endian. Added in
Swift 2.12.0,
this allows rings written on big-endian machines to be read on
little-endian machines and vice-versa.
next_part_power indicates whether a partition-power increase is in
progress. Added in Swift 2.15.0,
this will have one of two values, if present: the ring's current
part_power, indicating that there may be hardlinks to clean up,
or part_power + 1 indicating that hardlinks may need to be created.
See the documentation
for more information.
version specifies the version number of the ring-builder that was used
to write this ring. Added in Swift 2.24.0,
this allows the comparing of rings from different machines to determine
which is newer.

Ring v2

The way that v1 rings dealt with fractional replicas made it impossible
to reliably serialize additional large data structures after the
replica-to-part-to-device table.
The new format starts with magic similar to v1:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where  is a network-order two-byte version number (which is now 2).
After that, a series of BLOBs are serialized, each as
+-------------------------------+-------...---+
| <data-length>                 | <data ... > |
+-------------------------------+-------...---+

where <data-length> is the network-order eight-byte length (in bytes) of
<data>.
The first such BLOB is an ASCII-encoded JSON object full of metadata, similar
to v1 rings. It has the following required keys:

devs
part_shift
dev_id_bytes specifies the number of bytes used for each <dev> in the
replica-to-part-to-device table; will be one of 1, 2, 4, or 8

Additionally, there are several optional keys which may be present:

next_part_power
version

Notice that two keys are no longer present: replica_count is no longer
needed as the size of the replica-to-part-to-device table is explicit, and
byteorder is not needed as all data in v2 rings should be written using
network-order.
The second BLOB is the replica-to-part-to-device table. It's length will be
replicas * (2 ** part_power) * dev_id_bytes, where replicas is the exact
(potentially fractional) replica count for the ring. Unlike in v1, each
<dev> is written using network-order.
Note that this is why we increased the size of <data-length> as compared to
the v1 format -- otherwise, we may not be able to represent rings with both
high replica_count and high part_power.
A third BLOB may be present: the last-primaries table. Its structure
is similar to the replica-to-part-to-device table, but <dev> may be
(2 ** dev_id_bytes) - 1 to indicate the absence of a value.
Future Ideas

The v2 ring format seems flexible enough that we may be able to consolidate
the ring and builder formats. Several currently-missing items could be
trivially added to the metadata object:

min_part_hours (float?)
devs_changed (bool)
overload (float)
_last_part_moves_epoch (int)
_last_part_gather_start (int)
id (str)
dispersion (float, cached)
_remove_devs (list of dicts, but no larger than devs)

Others are large enough that they should likely be serialized as further BLOBs:

_last_part_moves - array of 2 ** part_power one-byte number of hours
since a partition last moved
_dispersion_graph - dict of ...