Leases & Fences: Architecture Pattern
Leases are time-bound locks, where the locks are automatically released by the system managing them after a timeout.
Locking is one way to ensure the consistency of critical data that needs exclusive access. However, in a distributed setup, a client holding a lock can become unreachable because of a network partition, a process pause, or a node going down. If that client never releases the lock, the resource stays locked indefinitely!
Leases
To overcome the above problem, we make use of leases.
Leases are time-bound locks, where the locks are automatically released by the system managing them after a timeout.
When a client obtains a lease on a particular resource, the distributed system guarantees that during the tenure of the lease, no other client will be able to make modifications to the resource. The client holding the lease to a resource can also choose to renew the lease on the resource, thereby extending the timeout on the lease.
The timeout on the lease is generally tracked as a TTL (Time to Live), which is set when the lease is first obtained on a particular resource.
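To make the acquire/renew/expire cycle concrete, here is a minimal in-memory sketch. The `LeaseManager` class, its method names, and the injectable `clock` parameter are all hypothetical and only for illustration; a real lease service would also need persistence and replication.

```python
import time


class LeaseManager:
    """Hypothetical in-memory lease table: resource -> (holder, expiry time)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so that expiry can be simulated
        self.leases = {}

    def acquire(self, resource, client):
        """Grant the lease unless another client holds an unexpired one."""
        now = self.clock()
        current = self.leases.get(resource)
        if current is not None and current[0] != client and current[1] > now:
            return False
        self.leases[resource] = (client, now + self.ttl)
        return True

    def renew(self, resource, client):
        """Extend the lease, but only for its current, unexpired holder."""
        now = self.clock()
        current = self.leases.get(resource)
        if current is None or current[0] != client or current[1] <= now:
            return False
        self.leases[resource] = (client, now + self.ttl)
        return True


# Simulated timeline using a fake clock (a one-element list we can mutate).
now = [0.0]
lms = LeaseManager(ttl_seconds=10, clock=lambda: now[0])
print(lms.acquire("orders", "A"))  # A gets the lease
print(lms.acquire("orders", "B"))  # B is refused while A's lease is live
now[0] = 8
print(lms.renew("orders", "A"))    # A renews; expiry moves to t=18
now[0] = 30
print(lms.acquire("orders", "B"))  # A's lease has lapsed, so B gets it
```

Note that nobody has to release the lease explicitly: once the TTL passes without a renewal, the next `acquire` simply succeeds.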
The tradeoff with choosing the lease term
We know that we need to create time-bound leases, but what’s the right timeout that we should choose and what's the impact of setting a short timeout vs a long one?
Clocks & Leases
Do read my series of articles on Clocks!
Leases are time-bound and to track time, they rely on clocks. We have two actors in the system, the server that's providing the lease, and the client that’s obtaining the lease.
The Server’s clock is faster than the Client’s: In this scenario, the lease will time out on the server before it does on the client, so the server might allow another client to take the lease while the first client still believes its lease is active. This can lead to unexpected behaviour, as we’re no longer guaranteeing exclusive access.
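A quick bit of arithmetic shows how large the unsafe window can get. This is only a sketch: the helper name and the exaggerated clock rates (a server clock ticking 25% fast) are made up to keep the numbers easy to read.

```python
def real_seconds_until_expiry(lease_seconds, clock_rate):
    """Real elapsed seconds before a clock ticking at `clock_rate`
    (1.0 = perfect) has counted `lease_seconds`."""
    return lease_seconds / clock_rate


LEASE_SECONDS = 10
server_expiry = real_seconds_until_expiry(LEASE_SECONDS, clock_rate=1.25)
client_expiry = real_seconds_until_expiry(LEASE_SECONDS, clock_rate=1.0)

# During this window the server may hand the lease to someone else,
# while the first client still believes it holds the lease.
unsafe_window = client_expiry - server_expiry
print(server_expiry, client_expiry, unsafe_window)  # 8.0 10.0 2.0
```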
To handle the above scenario, we make the client conservative and ensure that it sends multiple communications (heartbeats) to the server within the timeout window. It’s reasonable to send a renewal request after about half of the lease time has elapsed. This results in up to two refresh requests within the lease time.
E.g., Kafka has a lease expiration time of 18 seconds, while the client sends a heartbeat every 3 seconds.
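One way to picture a "conservative" client is sketched below. The `ClientLeaseView` class and the `safety_margin_seconds` parameter are assumptions made for illustration, not part of any particular library. The client counts the lease from the moment it sent the request (the server cannot have started its timer earlier than that), treats the lease as expired a little early, and schedules a renewal at the halfway point.

```python
import time


class ClientLeaseView:
    """Hypothetical client-side bookkeeping for a lease it holds."""

    def __init__(self, lease_seconds, safety_margin_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.margin = safety_margin_seconds
        self.clock = clock
        self.valid_until = None
        self.next_renewal = None

    def record_grant(self, request_sent_at):
        """Call after a successful acquire or renew, passing the time the
        request was sent (not the time the response arrived)."""
        # Stop trusting the lease slightly before the nominal expiry.
        self.valid_until = request_sent_at + self.lease_seconds - self.margin
        # Renew after about half of the lease time has elapsed.
        self.next_renewal = request_sent_at + self.lease_seconds / 2

    def is_valid(self):
        return self.valid_until is not None and self.clock() < self.valid_until

    def should_renew(self):
        return self.next_renewal is not None and self.clock() >= self.next_renewal


# Simulated timeline with a fake clock.
now = [0.0]
view = ClientLeaseView(lease_seconds=18, safety_margin_seconds=2, clock=lambda: now[0])
view.record_grant(request_sent_at=0.0)
now[0] = 5
print(view.is_valid(), view.should_renew())  # True False
now[0] = 9
print(view.is_valid(), view.should_renew())  # True True
now[0] = 16
print(view.is_valid())                       # False
```

The safety margin trades a little availability (the client gives up early) for a lower chance that it acts on a lease the server has already handed to someone else.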
Fences
Leases alone can still go wrong when one system provides the leases (the Single Source of Truth for Leases) and a separate system lets any client holding a lease read or write data.
Let’s assume Client A requested a lease (used to read/write to the DB) from the Lease Management System (LMS) and got it. However, there was a GC pause in Client A. The lease expired in the LMS, and Client B requested another lease and got it. Client B sent a request to the DB and updated some records. Client A (after the GC pause) used its lease (expired, but only the LMS knows about the expiry) and modified some records of its own. This leads to an unexpected state and unauthorized access to your system.
Fencing prevents operations that could leave the system inconsistent, using a very simple concept. We’ve already discussed it, as Generation/Epoch, in the article on High Watermark.
Every time a lease is obtained from the LMS, the LMS attaches a generation/epoch to the lease, and this number increases with every lease it grants. While accessing any system using the lease, the client presents this generation/epoch; if it is lower than the last generation/epoch the system has seen, the system fences the request.
In the example we discussed, Client A gets a lease with Generation 1. Client B gets a lease with Generation 2. When Client B accesses the DB, the DB records the last generation it has seen as 2. When the request from Client A arrives (after the GC pause), the DB sees that its generation is 1, which is < 2. Hence it fences the request and prevents it from going through.
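The Client A / Client B walkthrough can be sketched in a few lines. `LeaseIssuer`, `FencedStore` and `StaleLeaseError` are hypothetical names chosen for this illustration; the point is only that the data store remembers the highest generation it has seen and rejects anything lower.

```python
import itertools


class StaleLeaseError(Exception):
    """Raised when a request carries an older generation than one already seen."""


class LeaseIssuer:
    """Hypothetical LMS that stamps every lease with an increasing generation."""

    def __init__(self):
        self._generations = itertools.count(1)

    def grant(self, client):
        return next(self._generations)


class FencedStore:
    """Hypothetical DB front end that fences requests with a stale generation."""

    def __init__(self):
        self.highest_generation = 0
        self.data = {}

    def write(self, key, value, generation):
        if generation < self.highest_generation:
            raise StaleLeaseError(
                f"generation {generation} < {self.highest_generation}"
            )
        self.highest_generation = generation
        self.data[key] = value


lms = LeaseIssuer()
db = FencedStore()

gen_a = lms.grant("A")  # Generation 1; Client A then stalls in a GC pause
gen_b = lms.grant("B")  # Generation 2, granted after A's lease expired

db.write("record", "written by B", gen_b)  # DB now remembers generation 2
try:
    db.write("record", "written by A", gen_a)  # A wakes up with a stale lease
except StaleLeaseError as err:
    print("fenced:", err)
```

Notice that the DB never needs to ask the LMS whether a lease is still valid; comparing two integers is enough to reject Client A's late write.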
This brings us to the end of this article. Leases and Fences are very commonly used across distributed systems, and it’s important to understand these basic concepts. Please post any doubts you might have in the comments, and I’ll be happy to discuss them!