[Feature] Persist etcd lease TTLs by enabling checkpointing and analyse its effect on write throughput, disk usage, etc. #733
Labels
kind/enhancement
Enhancement, improvement, extension
lifecycle/stale
Nobody worked on this for 6 months (will further age)
Feature (What you would like to be added):
It has been observed that a change of leadership in an etcd cluster (or a restart of etcd in a single-node cluster) resets/renews the TTL (time-to-live) of etcd leases, because etcd does not persist leases by default.

If the etcd configuration sets the experimental flag `--experimental-enable-lease-checkpoint` to `true`, the lessor (i.e. the etcd leader) persists the lease by writing a checkpoint to disk every 5 minutes, so that a change of leadership (or a restart of etcd in a single-node cluster) no longer resets/renews the lease TTL if TTL > 5 minutes. This prevents indefinite auto-renewal of lease TTLs: etcd-io/etcd#9888

But before we persist leases by setting `--experimental-enable-lease-checkpoint` to `true`, we should analyse the effect on write throughput, disk usage, etc. Persisting leases should not put extra load on our etcd, which already has a quota limit of 8Gi, and etcd is not a write-optimised database.

Pre-requisites for persisting leases by enabling `--experimental-enable-lease-checkpoint`:
`--experimental-enable-lease-checkpoint`: for example, using lease checkpoints adds a new raft log entry to the etcd cluster, which can cause a panic in the cluster if for some reason we want to downgrade it to an older etcd version: "*: enable lease checkpoint via experimental flag" etcd-io/etcd#10797

Finally, once all aspects have been cleared, we can proceed with enabling `--experimental-enable-lease-checkpoint` in our etcds.

Note: this is intended for `etcd-events`, not for our `etcd-main`.
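The behaviour described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not etcd's implementation: without checkpointing, a leadership change restarts the lease at its full TTL; with checkpointing, the new leader resumes from the remaining TTL recorded by the last 5-minute checkpoint. The function name `lease_expiry_time` and its parameters are hypothetical.

```python
CHECKPOINT_INTERVAL_S = 5 * 60  # checkpoint period mentioned in this issue


def lease_expiry_time(ttl_s, leader_changes_s, checkpointing, horizon_s):
    """Return the time (in seconds) at which a lease granted at t=0 expires,
    or None if it is still alive at horizon_s. Simplified model only."""
    remaining = ttl_s  # remaining TTL as seen by the current leader
    now = 0            # time at which the current leader took over
    for change in sorted(leader_changes_s):
        if change >= horizon_s:
            break
        elapsed = change - now
        if elapsed >= remaining:
            # The lease expired before this leadership change happened.
            return now + remaining
        if checkpointing and ttl_s > CHECKPOINT_INTERVAL_S:
            # Assumption: the new leader resumes from the last checkpoint
            # written before the leadership change.
            remaining -= (elapsed // CHECKPOINT_INTERVAL_S) * CHECKPOINT_INTERVAL_S
        else:
            # Without checkpointing the TTL is reset to its full value.
            remaining = ttl_s
        now = change
    expiry = now + remaining
    return expiry if expiry <= horizon_s else None


if __name__ == "__main__":
    day = 24 * 3600
    # Hypothetical scenario: leadership changes every 20 hours for ~10 days.
    changes = [20 * 3600 * k for k in range(1, 12)]
    print(lease_expiry_time(day, changes, checkpointing=False, horizon_s=10 * day))
    print(lease_expiry_time(day, changes, checkpointing=True, horizon_s=10 * day))
```

In this model, a 24-hour lease never expires within the 10-day window when leadership changes every 20 hours without checkpointing, whereas with checkpointing it expires close to its original 24-hour deadline (at most one checkpoint interval late per leadership change).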
Motivation (Why is this needed?):
It has been observed in one of our live landscape clusters that events were created with etcd leases with a TTL of 24 hours, but for some reason leadership changed within those 24 hours. Whenever the leadership changed, the new leader reset/renewed the TTL (time-to-live) of the etcd leases. The old leases were therefore not revoked, the total number of events kept accumulating, and etcd's performance degraded.

In such a scenario we depend on restarts and leadership changes being infrequent. If leadership keeps changing within 24 hours, the lease TTLs are auto-renewed indefinitely, which can lead to an accumulation of events.
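To get a rough feel for the accumulation, here is a back-of-the-envelope sketch. It assumes a constant event creation rate and treats the "leadership keeps changing within 24 hours" case as "leases never expire". The function name `stored_events` and the rate of 1000 events per hour are made-up illustration values, not measurements from our landscape.

```python
def stored_events(rate_per_hour, ttl_hours, hours_elapsed, leases_expire):
    """Approximate number of events still stored after hours_elapsed."""
    if leases_expire:
        # Steady state: only events younger than the TTL are kept.
        return rate_per_hour * min(hours_elapsed, ttl_hours)
    # Leases are renewed on every leadership change and never expire,
    # so every event created so far is still stored.
    return rate_per_hour * hours_elapsed


if __name__ == "__main__":
    for days in (1, 5, 10):
        hours = days * 24
        bounded = stored_events(1000, 24, hours, leases_expire=True)
        unbounded = stored_events(1000, 24, hours, leases_expire=False)
        print(days, bounded, unbounded)
```

With expiring leases the stored event count plateaus at roughly rate × TTL, while with indefinitely renewed leases it grows linearly with time.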
cc @istvanballok
Approach/Hint to the implement solution (optional):