[Feature] Persist etcd lease TTLs by enabling checkpointing, and analyse its effect on write throughput, disk usage, etc. #733

Open
3 tasks
ishan16696 opened this issue Dec 8, 2023 · 2 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age)


ishan16696 commented Dec 8, 2023

Feature (What you would like to be added):
It has been observed that a change in leadership of the etcd cluster (or a restart of etcd in a single-node cluster) resets/renews the TTL (time-to-live) of etcd leases, as etcd doesn't persist the remaining lease TTL by default.
If the experimental flag --experimental-enable-lease-checkpoint is set to true in the etcd configuration, then the lessor, i.e. the etcd leader, persists the remaining TTL of each lease by writing a checkpoint to disk every 5 minutes. A change in leadership (or a restart of etcd in a single-node cluster) then won't reset/renew the TTL of leases with TTL > 5 minutes, which prevents indefinite auto-renewal of lease TTLs: etcd-io/etcd#9888
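To make the difference concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not etcd's actual lessor code: the function name `lease_expiry` and its parameters are made up for illustration, and it assumes that a new leader restarts the countdown either from the full TTL (no checkpointing) or from the last 5-minute checkpoint (checkpointing enabled).

```python
CHECKPOINT_INTERVAL_S = 5 * 60  # checkpoint interval described above (5 minutes)


def lease_expiry(ttl_s, leader_changes_s, checkpointing):
    """Return the time (seconds after grant) at which a lease expires.

    Simplified model (assumption, not etcd's real implementation):
    - ttl_s: TTL the lease was granted with, in seconds
    - leader_changes_s: sorted times of leadership changes, in seconds
    - checkpointing: whether the remaining TTL is checkpointed periodically
    """
    start = 0          # time at which the current countdown began
    remaining = ttl_s  # remaining TTL at `start`
    for t in leader_changes_s:
        if t >= start + remaining:
            break  # lease already expired before this leadership change
        if checkpointing:
            # The new leader only knows the most recent checkpoint, so the
            # elapsed time is rounded down to a multiple of the interval.
            elapsed = t - start
            known_elapsed = (elapsed // CHECKPOINT_INTERVAL_S) * CHECKPOINT_INTERVAL_S
            remaining -= known_elapsed
        else:
            # Without checkpointing the countdown restarts from the full TTL.
            remaining = ttl_s
        start = t
    return start + remaining


# A 24h lease with a leadership change every 12h:
day = 24 * 3600
changes = [12 * 3600, 24 * 3600, 36 * 3600]
print(lease_expiry(day, changes, checkpointing=False) / 3600)  # 60.0 hours
print(lease_expiry(day, changes, checkpointing=True) / 3600)   # 24.0 hours
```

In this model, the lease without checkpointing always lives for a full TTL after the last leadership change, while the checkpointed lease drifts by at most one checkpoint interval per leadership change.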

However, before persisting leases by setting --experimental-enable-lease-checkpoint to true, we should analyse the impact on write throughput, disk usage, etc. Persisting leases shouldn't put extra load on our etcd, which already has a quota limit of 8Gi, and etcd is not a write-optimized database.

Prerequisites for persisting the leases by enabling the flag --experimental-enable-lease-checkpoint:

  • Analyse etcd's performance, e.g. write throughput, disk usage, etc.
  • Analyse other aspects of enabling --experimental-enable-lease-checkpoint. For example, lease checkpointing adds a new type of raft log entry to the etcd cluster, which can cause a panic if for some reason we need to downgrade the cluster to an older etcd version: "*: enable lease checkpoint via experimental flag" etcd-io/etcd#10797
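As a starting point for the write-throughput analysis, a back-of-the-envelope estimate might look like the sketch below. The function name `estimate_checkpoint_updates_per_hour` is hypothetical, and the model is an assumption: every lease with TTL > 5 minutes gets one TTL update per checkpoint interval and stays alive for the whole hour. It gives a rough upper bound on per-lease updates only; it says nothing about how etcd groups them into raft entries, and it does not replace measuring on a real cluster.

```python
def estimate_checkpoint_updates_per_hour(lease_ttls_s, interval_s=300):
    """Rough upper bound on per-lease TTL checkpoint updates per hour.

    Assumptions (illustrative, not taken from etcd's source):
    - only leases with TTL > interval_s are checkpointed
    - each such lease is checkpointed once per interval
    - each such lease stays alive for the whole hour
    """
    eligible = [ttl for ttl in lease_ttls_s if ttl > interval_s]
    return len(eligible) * (3600 / interval_s)


# Example: 10,000 leases with a 24h TTL.
print(estimate_checkpoint_updates_per_hour([24 * 3600] * 10_000))  # 120000.0
```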

Finally, once all of these aspects have been clarified, we can proceed with enabling this flag in our etcds.

  • Persist the leases by enabling the flag: --experimental-enable-lease-checkpoint.
    Note:
    • IMO, we should set this flag only for etcd-events, not for our etcd-main.

Motivation (Why is this needed?):
It has been observed in one of our live landscape clusters that events were created with etcd leases with a TTL of 24 hours, but for some reason leadership changed within those 24 hours. On each leadership change, the new leader reset/renewed the lease TTL values, so the old leases were not revoked. As a result, the total number of events kept accumulating, which degraded etcd's performance.
In such a scenario, we depend on restarts and leadership changes being infrequent. If leadership keeps changing within 24 hours, this leads to indefinite auto-renewal of the lease TTLs, which can lead to accumulation of events.

cc @istvanballok

Approach/Hint to implement the solution (optional):

@ishan16696 ishan16696 added the kind/enhancement Enhancement, improvement, extension label Dec 8, 2023

ishan16696 commented Jan 2, 2024

I discovered that merely setting the --experimental-enable-lease-checkpoint flag to true is not sufficient: the lease TTL can still be reset even after the lease has been checkpointed. For more details, please refer to issue etcd-io/etcd#17132

To address this, it's necessary to also set the --experimental-enable-lease-checkpoint-persist flag to true, in conjunction with the flag mentioned in this issue, i.e. --experimental-enable-lease-checkpoint.
Interestingly, the --experimental-enable-lease-checkpoint-persist flag is not listed in the output of etcd --help for etcd v3.4.26 or etcd v3.5.9.

~ > etcd --help | grep lease                                                                                                                                            
  --experimental-enable-lease-checkpoint 'false'
    ExperimentalEnableLeaseCheckpoint enables primary lessor to persist lease remainingTTL to prevent indefinite auto-renewal of long lived leases.
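The pairing of the two flags described above could be captured in a small helper that builds the etcd command-line arguments. This is only a sketch: the helper name `lease_checkpoint_args` is hypothetical, and the check that rejects the persist flag without the checkpoint flag is this helper's own policy, based on the comment above, not a statement about how etcd itself validates its configuration.

```python
CHECKPOINT_FLAG = "--experimental-enable-lease-checkpoint"
PERSIST_FLAG = "--experimental-enable-lease-checkpoint-persist"


def lease_checkpoint_args(enable_checkpoint, enable_persist):
    """Build etcd arguments for lease checkpointing (hypothetical helper).

    The persist flag is only emitted together with the checkpoint flag,
    since it is meant to be used in conjunction with it.
    """
    if enable_persist and not enable_checkpoint:
        raise ValueError(
            f"{PERSIST_FLAG} should only be set together with {CHECKPOINT_FLAG}"
        )
    args = []
    if enable_checkpoint:
        args.append(f"{CHECKPOINT_FLAG}=true")
    if enable_persist:
        args.append(f"{PERSIST_FLAG}=true")
    return args


# Arguments that would be appended to the etcd command line:
print(lease_checkpoint_args(enable_checkpoint=True, enable_persist=True))
```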

@ishan16696

As I mentioned, the flag --experimental-enable-lease-checkpoint-persist is missing from the etcd --help output in etcd versions 3.4.x and 3.5.x.
I have opened PRs to add this flag to the help output of the respective etcd versions: etcd-io/etcd#17189 and etcd-io/etcd#17190

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Sep 11, 2024