The goal of the implementation is to overcome the limits of the current Redis replication. PSYNC is currently able to avoid a full resync only when a slave reconnects to the same instance of the master (not just the same server address, but the same run), assuming there is enough backlog available.
The new PSYNC fixes this problem by using what was previously called runid, and is now called replid, to identify a specific history of the data set, so that if the history is the same, PSYNC can work regardless of the actual instance / execution of Redis acting as a master.
The obvious case is a slave that is promoted to master: technically it contains the same history as its master, up to a given time. The time is measured in terms of replication offset, which is incremented at every byte produced by the replication stream.
However, the feature is not limited to this use case. For example, after a master restart, it could load its replication ID and offset from the RDB file (and potentially some backlog as well), and continue to be able to PSYNC with the slaves.
The new logical objects (moving parts) in the PSYNC are:
- The primary replication ID, called ID1 in this document.
- The secondary replication ID, called ID2 in this document: an ID that the instance acting as a master also recognizes as a valid replication history, but only up to a given offset.
To understand why there is an ID1 and an ID2, it is worth considering what happens to a slave which is turned into a master:
T1: Instance is a slave.
T2: Instance is turned into a master at replication offset 10000.
T3: Instance continues to receive writes, up to replication offset 15000.
T4: Slave A connects to instance to PSYNC, using the previous master ID.
T5: Slave B connects to instance to PSYNC, using the new master ID.
The instance knows its history is coherent with the history of its previous master up to replication offset 10000. For later offsets, it cannot reply to requests referencing its master ID, since starting from T2 a new history is created that may diverge from the one of the actual master (it is worth noting that other slaves may lose the connection with the previous master after the slave that was promoted).
So at T2, the slave promoted to master should switch its ID1 to a random new one, and use its master ID as ID2. Moreover, the ID2-max-offset should be set to 10000.
Additionally there are other logical objects:
- ID2-max-offset, the max offset we can accept PSYNC requests for ID2.
- The replication backlog.
- The chained slaves that may be connected to a slave. Sometimes we need to disconnect them.
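The way these objects combine when a PSYNC request arrives can be sketched as follows. This is only an illustration, not the Redis implementation: the names ReplState and can_continue are hypothetical, the all-zero ID is assumed to mean "no ID2", the requested offset is assumed to be the first offset the slave wants to receive, and the exact boundary conventions (inclusive or exclusive) are assumptions as well.

```python
from dataclasses import dataclass

NO_ID = "0" * 40  # assumption: an all-zero ID means "no secondary ID"


@dataclass
class ReplState:
    id1: str             # primary replication ID
    id2: str             # secondary replication ID (previous history)
    id2_max_offset: int  # highest offset accepted for requests using id2
    offset: int          # current replication offset
    backlog_start: int   # first offset still available in the backlog


def can_continue(state: ReplState, req_id: str, req_offset: int) -> bool:
    """Return True if the request can be served without a full resync."""
    if req_id == state.id1:
        known_history = True
    elif req_id == state.id2 and state.id2 != NO_ID:
        # The old history is only valid up to the point of the ID switch.
        known_history = req_offset <= state.id2_max_offset
    else:
        known_history = False
    # The requested offset must still be covered by the backlog.
    in_backlog = state.backlog_start <= req_offset <= state.offset
    return known_history and in_backlog
```

With the timeline above (promotion at offset 10000, current offset 15000), a request using the previous master ID is accepted at offset 9000 and refused at offset 12000, while a request using the new ID1 is accepted at offset 12000, provided the backlog still covers those offsets.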
The CONTINUE reply needs to be extended so that it can correctly acknowledge a PSYNC request while also informing the slave that the master ID1 has changed. This way, after a later disconnection, the slave can successfully perform a PSYNC again for offsets greater than ID2-max-offset.
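A minimal sketch of how a slave might read such an extended reply is shown below. The wire format used here, "+CONTINUE" optionally followed by a space and the new ID, is an assumption made for illustration, since this document does not fix the exact syntax, and parse_continue is a hypothetical helper name.

```python
from typing import Optional, Tuple


def parse_continue(reply: str) -> Tuple[bool, Optional[str]]:
    """Return (is_continue, new_id).

    new_id is None when the reply carries no ID, meaning the master
    ID1 is unchanged.
    """
    parts = reply.strip().split()
    if not parts or parts[0] != "+CONTINUE":
        return (False, None)
    new_id = parts[1] if len(parts) > 1 else None
    return (True, new_id)
```

A reply without an ID keeps the old behavior, so a slave handling the extended form also handles the plain one.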
Depending on the following events, an instance should act as follows:
PSYNC replies +FULLSYNC or is not supported and a full synchronization is required
- Free backlog.
- Set ID1 and offset as Master's ID1 and offset.
- Clear ID2 (setting it to all zeroes will do).
- Disconnect all slaves, they must FULLSYNC as well now.
PSYNC replies +CONTINUE with same ID as before
- Nothing to do. The master will feed us with the missing bytes in the replication stream.
PSYNC replies +CONTINUE but changes ID
The master changed its replication ID, even though it can provide the correct incremental diff for the history/offset we asked for. We need to change our main replication ID to reflect the one of the master, and use the old one as ID2 so that our sub-slaves will be able to PSYNC with us correctly. The steps to perform are:
- Set ID2 as our previous ID1.
- Set ID2 max offset as current offset.
- Set ID1 as master new ID.
- Disconnect all slaves (they must be informed of ID switch, but will be able to partially resynchronize).
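The three PSYNC reply cases described so far can be sketched as state changes on the slave. This is an illustration under assumptions, not the actual code: the Instance class, its field names, the use of -1 as a "cleared" max offset, and the representation of connected slaves as a plain list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

NO_ID = "0" * 40  # assumption: all zeroes means "ID2 is cleared"


@dataclass
class Instance:
    id1: str
    offset: int
    id2: str = NO_ID
    id2_max_offset: int = -1
    backlog: bytearray = field(default_factory=bytearray)
    slaves: List[str] = field(default_factory=list)

    def on_full_sync(self, master_id: str, master_offset: int) -> None:
        """A full synchronization is required."""
        self.backlog.clear()              # free backlog
        self.id1 = master_id              # adopt the master's ID1 and offset
        self.offset = master_offset
        self.id2 = NO_ID                  # clear ID2
        self.id2_max_offset = -1
        self.slaves.clear()               # sub-slaves must full sync as well

    def on_continue(self, master_id: str) -> None:
        """Partial resync accepted; the master may have switched ID."""
        if master_id == self.id1:
            return                        # same ID as before: nothing to do
        self.id2 = self.id1               # old ID stays valid for sub-slaves
        self.id2_max_offset = self.offset
        self.id1 = master_id
        self.slaves.clear()               # they reconnect and partially resync
```

Note that only the full synchronization case discards the backlog; on an ID switch the backlog is kept, which is what lets the disconnected sub-slaves resynchronize partially.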
Instance changes master address (SLAVEOF or API call)
- Nothing to do, the PSYNC reply will take care of it.
Slave is turned into master (SLAVEOF NO ONE or API call)
- Set ID2 to ID1 (ID1 is always our last master ID).
- Set ID2 max offset to current offset.
- Set ID1 to random new value.
- Disconnect all slaves (they must be informed of ID switch, but will be able to partially resynchronize).
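The promotion steps can be sketched in the same style. As before, this is only a sketch under assumptions: ReplIds and promote_to_master are hypothetical names, and generating the new ID as 40 random hexadecimal characters is an assumption about the ID format.

```python
import secrets
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplIds:
    id1: str                     # while a slave, this is our master's ID
    offset: int
    id2: str = "0" * 40
    id2_max_offset: int = -1
    slaves: List[str] = field(default_factory=list)


def promote_to_master(ids: ReplIds) -> None:
    """Apply the ID switch performed when a slave becomes a master."""
    ids.id2 = ids.id1                # remember the previous master's history
    ids.id2_max_offset = ids.offset  # valid only up to the switch point
    ids.id1 = secrets.token_hex(20)  # assumption: random 40 hex character ID
    ids.slaves.clear()               # slaves reconnect and learn the new ID
```

Applied to the earlier timeline, promoting an instance at offset 10000 leaves it with ID2 equal to the previous master ID and an ID2 max offset of 10000.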
Master is turned into slave
- Nothing to do, the PSYNC reply will take care of it. However, note that the master should create a fake cached master object, so that it will later be able to PSYNC with the slave that performed a failover.
It is possible to interface the PSYNC semantics with RDB and AOF persistence in different ways. We can persist:
- The slave's master ID and master offset.
- The master ID1, offset, and some backlog.
Both things can be done with AOF as well, but only on SHUTDOWN, so that we are sure they are the last part of the AOF (otherwise the new replication offset would be undefined).
However, an important change to the way we process the replication offset will simplify persisting the slave state to RDB files. Currently we increment the replication offset when we receive new bytes, not when we process commands.
When saving to RDB, we should subtract from the current offset the length of the current (or cached) master c->querybuf buffer, so that we don't have to persist it. The next PSYNC will ask for the part we are discarding.
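The arithmetic is small enough to show directly. In this sketch, rdb_saved_offset is a hypothetical helper, and querybuf stands for the bytes received from the master but not yet processed as commands.

```python
def rdb_saved_offset(repl_offset: int, querybuf: bytes) -> int:
    """Offset to store in the RDB file.

    repl_offset counts every byte received from the master, including
    the unprocessed bytes still sitting in querybuf, so those bytes are
    subtracted instead of being persisted.
    """
    return repl_offset - len(querybuf)
```

For example, with a replication offset of 15000 and 120 unprocessed bytes in the buffer, the saved offset is 14880, and the next PSYNC asks the master again for those 120 bytes.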
With the new design, slaves often publish IDs of masters. This means that we must be able to ensure that the replication stream from a master to a slave is identical to the one from the slave to other sub-slaves.
Guaranteeing this with the normal master replication mechanism, where writes trigger the creation of the stream, is difficult because of many implementation details, including the fact that the replication stream does not just include raw data, but also pings (in order to discover timeout conditions), commands to implement synchronous replication (requests for the current offset), and so forth.
For these reasons, a much simpler approach is to handle sub-slaves (slaves of slaves) in a different way, just writing to them, when they are online, exactly what we receive from our master.
So writing directly what we receive from the master is used in the following three steps of replicating with a sub-slave (or chained slave):
- The population of the replication backlog in the slave.
- The accumulation of the differences during full synchronizations.
- As said, sending the normal replication stream to slaves.
Conceptually it is as if all the slaves and chained slaves were served by the top-level master. Slaves in the middle just consume the replication stream and act as proxies for the next-level slaves.
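The proxy behavior can be sketched as below. This is a simplified illustration under assumptions: ChainedSlave and its method names are hypothetical, each sub-slave is represented only by an output buffer, and the backlog is an unbounded byte array rather than a fixed-size buffer.

```python
from typing import List


class ChainedSlave:
    """A slave in the middle of a chain, forwarding the stream verbatim."""

    def __init__(self) -> None:
        self.offset = 0
        self.backlog = bytearray()
        self.sub_slave_buffers: List[bytearray] = []

    def attach_sub_slave(self) -> bytearray:
        """Register a sub-slave and return its output buffer."""
        buf = bytearray()
        self.sub_slave_buffers.append(buf)
        return buf

    def feed_from_master(self, data: bytes) -> None:
        """Handle raw bytes received from our master."""
        self.offset += len(data)        # offset advances per byte received
        self.backlog.extend(data)       # backlog holds the exact same bytes
        for buf in self.sub_slave_buffers:
            buf.extend(data)            # sub-slaves get them unchanged
```

Because the same bytes go to the backlog and to every sub-slave, offsets mean the same thing at every level of the chain, which is what allows a slave to publish its master's ID.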
That's great! May I ask when it will be released, or which branch you are working on?