Version Vectors: Resolving Concurrent Updates
Version Vectors also maintain a vector per node, but the entries don’t represent the logical times anymore.
Prerequisites:
In my last article, we talked about Vector Clocks. We learned that Vector Clocks could help establish the ordering of operations, including identifying whether operations happened concurrently or were causally related. A very similar mechanism is used by Version Vector, but it’s used for a slightly different purpose.
Version Vector
Version Vectors are generally leveraged in distributed data-driven applications, where each data record is tied to a Version Vector. Since it’s a distributed system, a data record can be updated concurrently by multiple nodes. Thus, we can leverage version vectors, to identify if a data record can be reconciled immediately or if it requires a conflict resolution(remember that we can identify concurrent updates, & if an update to a data record is concurrent, it needs a conflict resolution).
Just like Vector Clocks, Version Vectors also maintain a vector per node, but the entries don’t represent the logical times anymore.
Before going further, let’s specify the criteria for identifying a concurrent update. Similar to Vector clocks, we compare two Version Vectors to understand the relationship between them -
A version vector is considered higher than the other if both of the version vectors have version numbers for the same nodes and each version number is higher than the one in the other vector and vice versa.
If the version vectors cover different nodes or if not all the version numbers are higher, they are considered concurrent.
Examples(A-D are actors, while the number represents event count just like in Vector Clocks)-
{A: 1, B: 1} is greater than {A: 1, B: 0}
{A: 2, B: 1} is concurrent with {A: 1, B: 2}
{A: 2, B: 1, C: 1} is greater than {A: 2, B: 1}
{A: 2, B: 1, C: 1} is concurrent with {A: 2, B: 1, D:1}
Now, it’s extremely critical to identify where things are different from Vector Clocks. Always remember that Version Vectors exist for each data item, & hence the comparison of vectors in the above example is a comparison of vectors on different Actors for the same data record/item.
Following are the steps followed by the algorithm -
Each actor reads the initial state of the version vector for each key & the value associated with the key. Let’s start with just 1 key for simplicity & the algo can easily be extended to multiple keys.
The actor pushes an update to the value to the data store(which internally implements Version Vectors). The update message includes the updated value, an identifier for the actor & the version vector present with the client.
Two nodes can synchronize their version vector, allowing them to reconcile the version vector or detect concurrent updates. In case there is a concurrent update, a conflict is detected. But how can we represent conflicts in the system?
We’ll talk about two approaches that we can leverage to identify conflicts, & the advantage/pitfalls with both.
Client As An Actor
As we read in the algorithm above, each write is associated with the Actor, and since we’re taking each client to be an Actor, the client responsible for the write will send its identifier along with the write. So, if in case a conflict is detected, a sibling is added to the version vector corresponding to the write from that particular client.
Let’s try to understand what’s happening in the above diagram -
Let’s assume we have a key K, with value U. We’re assuming that we have an empty version vector to begin with. Client’s C2 and C3 sync the same state from the system that’s implementing Version Vectors.
C3 updates the value to V & sends a PUT command, with the local state of the Version Vector it has(empty VV).
The System on receiving the Version Vector, compares it with its local state(again empty). So it creates a new Version Vector for this client(C3) and updates the value of key K to V.
C3 syncs the state from the server and again updates the value to A from V and sends a PUT command.
The System on receiving the Version Vector, compares it with its local state. This time the local state has an entry for C3, so it increments the counter to 2 & updates the value to A.
C2 updates the value to W & sends a PUT command, with the local state of the Version Vector it has(empty VV).
The System on receiving the Version Vector, compares it with its local state((C3, 2)). Since there is no local state for C2, it creates a new state of (C2, 1). If we compare the 2 Version Vectors, {C3, 2} and {C2, 1}, based on what we’ve discussed before, they are concurrent. Since there is a conflict, the System stores both version vectors(1 per client) as siblings.
Advantage -
Leveraging Client IDs to identify conflicts, we could easily handle concurrent updates to the same key.
Concerns -
Scalability — This approach suffers from something called Actors Explosion. If your system has a lot of clients, the corresponding version vectors for each data item could potentially grow to the size of the client. Since this will be stored per data item, we’ll have a lot of data to be stored. Also, as the size of the vector grows, the cost of comparing grows along with it.
Needs a unique Id for each client. If there is any overlap in client IDs, we could have potential data loss.
This brings us to the end of this article. We learned about Version Vectors, where they differ from Vector Clocks. We looked at how Version Vectors can be used by distributed data stores, relying on eventual consistency to identify concurrent updates. We saw the advantages and pitfalls of using Client as an Actor, and in our next article, we’ll look at other approaches to identify concurrent updates and more!