
@antirez /newpsync.md Secret
Last active Oct 31, 2018


New PSYNC

Motivations

The goal of the implementation is to go beyond the limits of the current Redis replication. PSYNC is currently able to avoid a full resync only when a slave reconnects with the same instance of the master (not just the same server address, but the same run), assuming there is enough backlog available.

The new PSYNC fixes this problem by using what was previously called runid, and is now called replid, to identify a specific history of the data set, so that if the history is the same, PSYNC can work regardless of the actual instance / execution of Redis acting as a master.

The obvious case is a slave that is promoted to master: technically it contains the same history as its master, up to a given time. The time is measured in terms of replication offset, which is incremented at every byte produced by the replication stream.

However the feature is not limited to this use case. For example, after a restart a master could load its replication ID and offset (and potentially some backlog as well) from the RDB file, and continue to be able to PSYNC with its slaves.

Objects

The new logical objects (moving parts) in the PSYNC are:

  1. The primary replication ID, called ID1 in this document.
  2. The secondary replication ID, called ID2, which is an ID the instance acting as a master also recognizes as a valid replication history, but only up to a given offset.

To understand why there are an ID1 and an ID2, it's worth considering what happens to a slave which is turned into a master:

T1: Instance is a slave.

T2: Instance is turned into a master at replication offset 10000.

T3: Instance continues to receive writes, up to replication offset 15000.

T4: Slave A connects to the instance and sends PSYNC using the previous master ID.

T5: Slave B connects to the instance and sends PSYNC using the new master ID.

The instance knows its history is coherent with the history of its previous master up to replication offset 10000. For greater offsets, it cannot reply to requests referencing its old master ID, since starting from T2 a new history is created, which may diverge from the one of the actual master (it is worth noting that other slaves may have lost the connection with the previous master after the promoted slave did).

So at T2, the slave promoted to master should switch its ID1 to a new random one, and use its master's ID as ID2. Moreover, the ID2-max-offset should be set to 10000.
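The acceptance rule this implies can be sketched as follows. This is a minimal illustration, not actual Redis internals: the function name and the `state` dictionary fields are hypothetical, standing in for ID1, ID2, ID2-max-offset, and the backlog bookkeeping described above.

```python
def can_partial_sync(state, req_id, req_offset):
    """Return True if a PSYNC request can be served with +CONTINUE."""
    # The requested history must match: either our current ID1, or our
    # previous history (ID2), but only up to the point where we diverged.
    if req_id == state["id1"]:
        pass  # same history: any offset we still hold is fine
    elif req_id == state["id2"] and req_offset <= state["id2_max_offset"]:
        pass  # old master's history, before the switch point
    else:
        return False  # unknown history: full resync required
    # The requested bytes must still be inside the replication backlog.
    backlog_start = state["offset"] - state["backlog_len"]
    return backlog_start <= req_offset <= state["offset"]
```

With the timeline above (promotion at offset 10000, writes up to 15000), Slave A asking with the old master ID at offset 9000 gets +CONTINUE, while the same ID at offset 12000 forces a full resync, since offsets past 10000 belong to the new history only.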

Additionally there are other logical objects:

  • ID2-max-offset, the max offset for which we can accept PSYNC requests referencing ID2.
  • The replication backlog.
  • The chained slaves that may be connected to a slave. Sometimes we need to disconnect them.

Changes to the CONTINUE reply of PSYNC

The CONTINUE reply must be extended so that it can correctly acknowledge a PSYNC request while informing the slave that the master's ID1 has changed. This way, after a later disconnection, the slave can successfully perform a PSYNC again for offsets greater than ID2-max-offset.

Invalidations depending on events

Depending on the following events, an instance should act as follows:

PSYNC replies +FULLRESYNC or is not supported and a full synchronization is required

  1. Free backlog.
  2. Set ID1 and offset as Master's ID1 and offset.
  3. Clear ID2 (setting it all zeroes will do).
  4. Disconnect all slaves, they must perform a full resync as well now.
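These invalidations can be sketched as a single state transition (illustrative names only; `slaves` stands for any iterable of connections with a `close()` method):

```python
def on_full_resync(state, master_id1, master_offset, slaves):
    """Hypothetical sketch of the invalidations after a full resync."""
    state["backlog"] = b""           # 1. free the backlog
    state["id1"] = master_id1        # 2. adopt the master's ID1 and offset
    state["offset"] = master_offset
    state["id2"] = "0" * 40          # 3. clear ID2 (all zeroes will do)
    state["id2_max_offset"] = 0
    for s in slaves:                 # 4. our own slaves must full resync too
        s.close()
```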

PSYNC replies +CONTINUE with same ID as before

  1. Nothing to do. The master will feed us with the missing bytes in the replication stream.

PSYNC replies +CONTINUE but changes ID

The master changed its replication ID, even if it can provide the correct incremental diff for the history/offset we asked for. We need to change our main replication ID to reflect the one of the master, and use the old one as ID2, so that our sub-slaves will be able to PSYNC with us correctly. The steps to perform are:

  1. Set ID2 to our previous ID1.
  2. Set ID2 max offset as current offset.
  3. Set ID1 as master new ID.
  4. Disconnect all slaves (they must be informed of ID switch, but will be able to partially resynchronize).

Instance changes master address (SLAVEOF or API call)

  1. Nothing to do, the PSYNC reply will take care of it.

Slave is turned into master (SLAVEOF NO ONE or API call)

  1. Set ID2 to ID1 (ID1 is always our last master ID).
  2. Set ID2 max offset to current offset.
  3. Set ID1 to random new value.
  4. Disconnect all slaves (they must be informed of ID switch, but will be able to partially resynchronize).
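The promotion steps above can be sketched as follows (again, names are illustrative, not Redis internals; a random hex string stands in for the new replication ID):

```python
import uuid

def on_promotion(state, slaves):
    """Sketch of the ID switch when a slave becomes a master
    (SLAVEOF NO ONE)."""
    state["id2"] = state["id1"]                # 1. ID1 is our last master's ID
    state["id2_max_offset"] = state["offset"]  # 2. shared history ends here
    state["id1"] = uuid.uuid4().hex            # 3. start a new history
    for s in slaves:                           # 4. slaves must learn the new
        s.close()                              #    ID; they can partial resync
```

The "+CONTINUE but changes ID" case above is the symmetric transformation: the only difference is that ID1 is set to the ID supplied by the master instead of a fresh random value.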

Master is turned into slave

  1. Nothing to do, the PSYNC reply will take care of it. However note that the master should create a fake cached master object, so that it will later be able to PSYNC with the slave that performed the failover.

Persistence and PSYNC

It is possible to interface the PSYNC semantics with RDB and AOF persistence in different ways. We can persist:

  1. The slave's master ID and master offset.
  2. The master's ID1, offset, and some backlog.

Both things can be done with AOF as well, but only on SHUTDOWN, so that we are sure they are the last part of the AOF (otherwise the new replication offset would be undefined).

However, an important change to the way we process the replication offset will allow us to simplify persisting the slave state to RDB files. Currently we increment the replication offset when we receive new bytes, not when we process commands.

When saving to RDB, we should subtract from the current offset the length of the current (or cached) master c->querybuf buffer, so that we don't have to persist it. The next PSYNC will ask for the part we are discarding.
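The adjustment amounts to simple arithmetic; a sketch under the assumption (from the text) that received-but-unprocessed bytes sit in the master client's query buffer (function and parameter names are hypothetical):

```python
def offset_to_persist(current_offset, pending_querybuf_len):
    """Offset to store in the RDB file: subtract the bytes already
    received from the master but not yet processed (the query buffer),
    so the buffer itself need not be persisted. The next PSYNC will
    simply re-request those bytes from the master."""
    return current_offset - pending_querybuf_len
```

For example, a slave at offset 15000 with 120 unprocessed bytes buffered would persist offset 14880, and after a restart PSYNC would ask for the stream starting there.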

Implementation details

With the new design, slaves often publish IDs of masters. This means that we must be able to ensure that the replication stream from a master to a slave is identical to the one from the slave to other sub-slaves.

Guaranteeing this using the normal master replication mechanism, where writes trigger the creation of the stream, is difficult because of many implementation details, including the fact that the replication stream does not just include raw data, but also pings (in order to detect timeout conditions), commands to implement synchronous replication (requests for the current offset), and so forth.

For these reasons, a much simpler approach is to handle sub-slaves (slaves of slaves) in a different way: when they are online, we just write to them exactly what we receive from our master.

Writing directly what we receive from the master is used in the following three places when replicating with a sub-slave (or chained slave):

  1. The population of the replication backlog in the slave.
  2. The accumulation of the differences during full synchronizations.
  3. As said, sending the normal replication stream to slaves.

Conceptually it is as if all the slaves and chained slaves were served by the top-level master. Slaves in the middle just consume the replication stream and act as proxies for the next-level slaves.
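The relay step can be sketched as follows (illustrative names, not Redis internals): a slave advances its own offset and backlog, then writes the very same bytes, pings included, to every sub-slave, which keeps the stream byte-identical at every level of the chain.

```python
def feed_from_master(state, data, sub_slaves):
    """Sketch: relay to sub-slaves exactly the bytes received from the
    master, so offsets and replication IDs stay valid down the chain."""
    state["offset"] += len(data)   # advance the replication offset
    # Keep the last backlog_size bytes for serving partial resyncs.
    state["backlog"] = (state["backlog"] + data)[-state["backlog_size"]:]
    for s in sub_slaves:
        s.write(data)              # verbatim: data, pings, offset requests
```

Because nothing is re-generated at the intermediate level, a sub-slave that later PSYNCs directly with the top-level master sees the same history and offsets.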

@wenchaomeng


wenchaomeng commented Apr 5, 2016

That's great! May I ask when it will be released, or which branch you are working on?

@xbsura


xbsura commented Jun 20, 2016

This feature is great, but using a history id is maybe not a beautiful solution.
Keep runid simple and clear: it just means the execution id. Add a replid for each running Redis, and when using psync always use PSYNC/SYNC replid offset, not runid.

replid is generated as the following rule:

  1. first start and no rdb/aof: same as runid
  2. start from rdb/aof: it can be loaded from that file
  3. becomes slave of another Redis: use the replid of the new master

Maybe this is simpler and clearer compared with a history.

thanks

@antirez


antirez commented Jun 24, 2016

@xbsura: the above proposal already uses a distinct replication ID that is not related to the instance runid. However this is not enough: multiple histories are possible with the same runid, so the slave cannot inherit the runid of the master (this is clearly explained in the document above). For this reason it must use a different runid, but remember the old master runid and its "switch offset".

@yossigo


yossigo commented Jun 26, 2016

@antirez I think this will be a great improvement to Redis replication, especially when coupled with replication state persistence. The idea of treating intermediate slaves as proxy also sounds very good to me. I have a few notes/questions:

  1. What happens if multiple master/slave failovers occur in a relatively short time window before other slaves are able to reconnect? Unless I miss something, a slave will succeed in PSYNC only if it has either the current or the previous replid. Assuming this is the case, it may be useful to store a configurable number of replid+offset history pairs.
  2. I think it may be useful to do (configurable) periodic checkpointing of the replication state into the AOF. This way, on a slave-side crash (obviously no SHUTDOWN) it will be possible to read the AOF up to the last checkpoint entry and attempt replication from there. As an optimization, it may also be useful to have this entry written in some other easy-to-access location, so it's possible to verify with the master that the offset is still valid before reading the entire AOF.

While discussing replication improvements, I think there is another issue that deserves some attention. The Redis replication process requires additional memory resources on the master side: slave output buffers and copy-on-write. The amount of additional memory depends on the data set contents, access pattern, traffic volume, replication speed, etc. While there is a cap on the COW memory required (100% of the original), there is no limit on the slave buffers. This can result in situations in which Redis servers grow beyond the point at which they can be replicated, which can be a real problem in production systems.

An interesting (although very hard to implement) solution can be to move the slave output buffer from the master side to the slave side. Imagine a replication link that can handle multiplexing of both RDB frames and the command stream simultaneously, and a slave that does the demux while waiting for the RDB transfer to complete. Such a mechanism doesn't necessarily replace the slave buffer in the master entirely, but at least makes it possible to limit this buffer (i.e. for sharing between slaves connecting at the same time) and offers a good alternative once the limit is reached (instead of just dropping the slave). What do you think?

@antirez


antirez commented Oct 24, 2016

Hello @yossigo, I can finally reply to your message, given that I'm now working full time on this project.

  1. You are right that if there are two sequential changes of master in the time needed for slaves to reconnect, a full resynchronization is triggered. The reason why this scenario is not covered is that I think it makes things more complex without providing a serious real-world advantage: note that in order for a second failover to happen, at least one slave should already be synchronized, and additionally there is the time needed to detect the new master as failing again. The slaves that may not reconnect in time are only the additional slaves that were not able to synchronize while at least one slave did (so there was enough time). While this scenario is relatively narrow, the complexity it introduces is larger than expected, since it's not just a matter of remembering a list of replication IDs / offsets: the same sets should be propagated to slaves after a successful PSYNC, since every slave should be able to PSYNC with sub-slaves using the same IDs / offsets.
  2. Yes, I agree, but I'm taking it out of scope for now in order to address the problem in a more approachable way. This work should certainly be extended to persistence, in multiple ways: slaves should be able to PSYNC after a restart if cleanly stopped (with SHUTDOWN). Similarly, slaves with AOF should be able to do the AOF checkpointing, if enabled.

About used memory, I think your approach is interesting and worth exploring. A similar idea is used in AOF in order to move the accumulated "additional buffer" of pending writes from the parent to the child process, in order to make a smaller final write in the master. Your idea seems a similar concept but extended to replication. To me it looks promising, especially since we now even have the mixed RDB/AOF format, so the slave may receive the buffer of additional writes, store it in memory, and when the RDB transfer is finished, write it as an AOF tail of the RDB file and switch to loading the RDB+AOF combo as usual.

Thanks for your comment and sorry for the late reply; I was not able to digest these topics easily while not working on this stuff.

@yossigo


yossigo commented Oct 25, 2016

@antirez, no problem, happy to know you managed to make progress on this.

Regarding (1): an example topology I had in mind was master+slave in a data center/cluster, and one or more additional slaves elsewhere replicating from whoever is the current master. Local fail-over can be expected to be quicker and more frequent (e.g. the admin task of upgrading a master/slave pair and failing over from one to the other and back). But I understand the tradeoff, and replication is certainly one area where it makes sense to keep things simple.

@antirez


antirez commented Oct 29, 2016

@yossigo: Yes, in the context of your example topology it's worth adding that when new IDs are generated (slave turned into a master) after a failover, the system takes care of disconnecting the slaves so that they will PSYNC again (slaves should be able to do a partial sync when this happens) and of informing them of the ID switch (via the modified +CONTINUE reply). So at least there is no problem of the sub-slaves in the far data center still thinking the replication ID is the old one. However I bet you already know this, and you were just thinking about the far data center having slaves that can't resync fast enough after the first failover.

@antirez


antirez commented Oct 29, 2016

Note: this document was updated as I detected small errors implementing it as actual code.

@antirez


antirez commented Nov 8, 2016

Update: I finished turning this proposal into code. I hope to release everything Friday; then a rigorous testing phase will start.

@ravilr


ravilr commented Nov 21, 2016

@antirez this is great news. This will greatly enhance Redis replication operability. Does this also handle antirez/redis#1702? Also, let me know how I can help test this feature. Thanks.

@v6


v6 commented Jan 19, 2017

// , We shall test this.

@ravilr, could you just download and use https://raw.githubusercontent.com/antirez/redis/4.0/00-RELEASENOTES on a master and slave with some data (more than 0.25G, perhaps?), and check whether partial resync occurs upon manual failover while under load?
