Vector Clocks are used in a distributed systems to determine whether pairs of events are causally related. Using Vector Clocks, timestamps are generated for each event in the system, and their causal relationship is determined by comparing those timestamps.
- Use in key-value store like riak, Dynamo
- Enables causality to be captured
- With vector clocks, we assume we know about each participating process
- Suppose there are N processes in the group [1…N]
- Each vector has N elements
- Instead of sending a single timestamp, send a vector of timestamps for each process (Update pairwise)
- Initially, all the clocks are set to zero.
- Each time a process experiences an internal event, it increments its own logical clock in the vector by one.
- Each time for a process to send a message, it must increment its own clock and then send a copy of its own vector.
- Each time a process receives a message,
- It increments its own logical clock in the vector by one
- Updates each element in its vector by taking the maximum value in its own vector clock
Let's think about versioning, ie multiple people editing the same document over time. Your responsibility is to keep the editing history consistent. So for example, if person A edits the doc, and then B edit's A's version, we can say that B's version of the doc is the most up to date since it contains all of A's edits. But if A and B grab the document simultaneously, and then A submits his edit, then B submits his, you'd like to be able to tell that their versions are conflicts (also known as siblings). You would not want to automatically call B's the most up to date, since that would wipe out A's changes, even though B's edit was submitted after A's.
How do we detect conflicts? One method of doing this is to keep parent/child relationships over time. Suppose we start with a document at version 1, and then A makes an edit and submits a new version 1a that contains a reference to 1 as its parent. Then B edits and submits a new version 1b that contains a reference to 1 as its parent. We can tell that 1a and 1b are siblings by walking up the tree and finding a common parent.
Can we do better? Vector clocks approach the problem by keeping around version history as a function of editors rather than documents. So an initial document might start with version { A: 0, B: 0 }, meaning that person A has made no edits, and person B has made no edits. Then, when a person edits a document, they take the vector clock and increment just the counter for their own entry.
For example, A grabs the doc with version { A: 0, B: 0 }, edits and submits a new document with version { A: 1, B: 0}. B grabs { A: 0, B: 0}, edits and submits a new doc with version { A: 0, B: 1 }.
You can tell that a document X is a direct ancestor (ie, not a sibling) of document Y if the counters for each participant in the vector clock of X is greater than or equal to the vector clock of Y. The vector clock is explicitly tracking the history of people who have edited the document.
You can tell that a document X is a sibling of Y if there exist any participant in X's vector clock who has a counter that is less than his corresponding counter in Y.
The space requirement under this scheme is a function of the number of participants, not the length of the history. That's why you might adopt this strategy over the more obvious one.
Consider two vectors, X and Y:
- If each element of X is <= Y: X causally precedes Y
- If each timestamp in X is >= Y: Y causally precedes X
- Else: X and Y are concurrent
- The main disadvantage of vector clock is that they are not being constant in size.
- The implementation of vector clock need an entry for each site
this page is helpful, but the following statement is incorrect:
X is an ancestor of Y if every element of X is <= that of Y
X is sibling of Y, if there exists an element such that X>Y and an element such that X<Y.