Ring files contain three key pieces of information:
- the part_power value (often stored as
part_shift := 32 - part_power
), which determines how many partitions are in the ring, - the device list, which includes all the disks participating in the ring, and
- the replica-to-part-to-device table, which has all
replica_count * (2 ** part_power)
partition assignments.
Ring files have always been gzipped when serialized, though the inner, raw format has evolved over the years.
Initially, rings were simply pickle dumps of the RingData object. With Swift 1.3.0, this changed to pickling a pure-stdlib data structure, but the core concept was the same.
Pickle presented some problems, however. While there are security concerns around unpickling untrusted data, security boundaries are generally drawn such that rings are assumed to be trusted. Ultimately, what pushed us to a new format were performance considerations.
Starting in Swift 1.7.0, Swift began using a new format (while still being willing to read the old one). The new format starts with some magic so we may identify it as such:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where <vrs>
is a network-order two-byte version number (which is always 1).
After that, a JSON object is serialized as
+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+
where <data-length>
is the network-order four-byte length (in bytes) of
<data>
, which is the ASCII-encoded JSON-serialized object. This object
has at minimum three keys:
devs
for the device listpart_shift
(i.e.,32 - part_power
)replica_count
for the integer number of part-to-device rows to read
The replica-to-part-to-device table then follows:
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| ... |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+
Each <dev>
is a host-order two-byte index into the devs
list. Every row
except the last has exactly 2 ** part_power
entries; the last row may
have the same or fewer.
The metadata object has proven quite versatile: new keys have been added to provide additional information while remaining backwards-compatible. In order, the following new fields have been added:
byteorder
specifies whether the host-order for the replica-to-part-to-device table is "big" or "little" endian. Added in Swift 2.12.0, this allows rings written on big-endian machines to be read on little-endian machines and vice-versa.next_part_power
indicates whether a partition-power increase is in progress. Added in Swift 2.15.0, this will have one of two values, if present: the ring's currentpart_power
, indicating that there may be hardlinks to clean up, orpart_power + 1
indicating that hardlinks may need to be created. See the documentation for more information.version
specifies the version number of the ring-builder that was used to write this ring. Added in Swift 2.24.0, this allows the comparing of rings from different machines to determine which is newer.
The way that v1 rings dealt with fractional replicas made it impossible to reliably serialize additional large data structures after the replica-to-part-to-device table.
The new format starts with magic similar to v1:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where is a network-order two-byte version number (which is now 2). After that, a series of BLOBs are serialized, each as
+-------------------------------+-------...---+
| <data-length> | <data ... > |
+-------------------------------+-------...---+
where <data-length>
is the network-order eight-byte length (in bytes) of
<data>
.
The first such BLOB is an ASCII-encoded JSON object full of metadata, similar to v1 rings. It has the following required keys:
devs
part_shift
dev_id_bytes
specifies the number of bytes used for each<dev>
in the replica-to-part-to-device table; will be one of 1, 2, 4, or 8
Additionally, there are several optional keys which may be present:
next_part_power
version
Notice that two keys are no longer present: replica_count
is no longer
needed as the size of the replica-to-part-to-device table is explicit, and
byteorder
is not needed as all data in v2 rings should be written using
network-order.
The second BLOB is the replica-to-part-to-device table. It's length will be
replicas * (2 ** part_power) * dev_id_bytes
, where replicas
is the exact
(potentially fractional) replica count for the ring. Unlike in v1, each
<dev>
is written using network-order.
Note that this is why we increased the size of <data-length>
as compared to
the v1 format -- otherwise, we may not be able to represent rings with both
high replica_count
and high part_power
.
A third BLOB may be present: the last-primaries table. Its structure
is similar to the replica-to-part-to-device table, but <dev>
may be
(2 ** dev_id_bytes) - 1
to indicate the absence of a value.
The v2 ring format seems flexible enough that we may be able to consolidate the ring and builder formats. Several currently-missing items could be trivially added to the metadata object:
min_part_hours
(float?)devs_changed
(bool)overload
(float)_last_part_moves_epoch
(int)_last_part_gather_start
(int)id
(str)dispersion
(float, cached)_remove_devs
(list of dicts, but no larger thandevs
)
Others are large enough that they should likely be serialized as further BLOBs:
_last_part_moves
- array of2 ** part_power
one-byte number of hours since a partition last moved_dispersion_graph
- dict of ...