Include and stabilize experimental-compaction-sleep-interval flag in releases #18481

Closed
JalinWang opened this issue Aug 21, 2024 · 8 comments


@JalinWang
Contributor

JalinWang commented Aug 21, 2024

What would you like to be added?

Two parameters govern the auto compaction process: experimental-compaction-batch-limit and experimental-compaction-sleep-interval. Although it was added three years ago (in this PR commit), the sleep interval flag has not yet been included in any release. Meanwhile, the batch limit flag is being considered for stabilization (in issue), so I propose stabilizing experimental-compaction-sleep-interval as well.

Why is this needed?

Compaction significantly affects service response time. Spreading the compaction load more evenly is desirable, and that is what these two parameters are for. Workarounds exist today, but adjusting the retention window size offers limited flexibility, and using the built-in mechanism is better than maintaining separate, independent scripts.
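
For readers unfamiliar with how the two flags interact, here is a rough Python sketch of a batched compaction loop. etcd itself is written in Go, so this is only an illustration and not its actual code. It assumes that compaction removes at most `batch_limit` revisions per batch and then pauses for `sleep_interval` before starting the next batch; `compact_in_batches` and its parameters are hypothetical names.

```python
import time
from typing import Callable, List


def compact_in_batches(
    revisions: List[int],
    batch_limit: int,
    sleep_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Remove `revisions` in batches, pausing between batches.

    Hypothetical model of the two flags (not etcd's real implementation):
      batch_limit    ~ --experimental-compaction-batch-limit
      sleep_interval ~ --experimental-compaction-sleep-interval (in seconds here)
    Returns the number of batches processed.
    """
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    batches = 0
    while revisions:
        # Remove at most batch_limit revisions in one burst of work.
        del revisions[:batch_limit]
        batches += 1
        if revisions:
            # Pause so foreground requests can be served before the next batch.
            sleep(sleep_interval)
    return batches


if __name__ == "__main__":
    pauses: List[float] = []
    # Record the pauses instead of really sleeping.
    n = compact_in_batches(list(range(2500)), batch_limit=1000,
                           sleep_interval=0.01, sleep=pauses.append)
    print(n, len(pauses))  # prints: 3 2
```

In this model, a smaller batch limit means shorter bursts of work, and a longer sleep interval means more room for normal traffic between bursts, at the cost of a longer overall compaction.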


@ivanvc
Member

ivanvc commented Aug 29, 2024

Discussed during the fortnightly triage meeting. I'll review the PR.

@JalinWang
Contributor Author

JalinWang commented Aug 30, 2024

> Discussed during the fortnightly triage meeting. I'll review the PR.

Thanks for the update! I'm looking forward to your feedback~

Also, the following bbolt PR can greatly improve etcd performance in our scenario, where the free space (dbSize - dbSizeInUse) is considerable at times. If possible, could you also mention the release plan for 1.4.0? alpha.0 was released in January and alpha.1 in May, so it seems the next version could be expected in September. That would be a great step toward a stable 1.4.0. (Although we'll still need to wait for etcd 3.6 😫)

## v1.4.0-alpha.0(2024-01-12) change log
- [Record the count of free page to improve the performance of hashmapFreeCount](https://github.com/etcd-io/bbolt/pull/585).

Attachment: our pprof result screenshot (dbSize ~11GB, dbSizeInUse ~6GB)
[image]

@ivanvc
Member

ivanvc commented Sep 3, 2024

@JalinWang, can you help with the CHANGELOG pull request to mention #18514?

Regarding the bbolt change, I'd suggest opening an issue on its repository.

Thanks!

@JalinWang
Contributor Author

> @JalinWang, can you help with the CHANGELOG pull request to mention #18514?

Sorry for the late PR. Please review: #18556 :)

> Regarding the bbolt change, I'd suggest opening an issue on its repository.

okkkkk~

@elias-dbx

Hello, is there any guidance on how to tweak --experimental-compaction-batch-limit and --experimental-compaction-sleep-interval for large clusters?

We have ~40GB etcd databases which create around 2000 new revisions per second. We run compaction once every 30 minutes but see availability drops due to pauses during compaction time.

@JalinWang
Contributor Author

JalinWang commented Oct 16, 2024

> Hello, is there any guidance on how to tweak --experimental-compaction-batch-limit and --experimental-compaction-sleep-interval for large clusters?

Hi~
Personally, I raised --experimental-compaction-sleep-interval and lowered --experimental-compaction-batch-limit to spread the compaction load evenly across the whole auto compaction interval (typically 1h). This should minimize response time (RT) spikes during compaction.
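
As a back-of-the-envelope aid for this kind of tuning, the Python sketch below estimates a sleep interval that would stretch one compaction run across a chosen time window. It is a simplification under stated assumptions: every batch removes exactly `batch_limit` revisions, the time spent inside a batch is negligible, and the flag accepts a Go-style duration string such as `167ms`. `suggest_sleep_interval` is a hypothetical helper, not part of etcd. The workload numbers (2000 revisions per second, compaction every 30 minutes) are taken from the question above; the batch limit of 1000 and the 10-minute window are illustrative assumptions only.

```python
import math


def suggest_sleep_interval(total_revisions: int, batch_limit: int,
                           target_seconds: float) -> float:
    """Seconds to sleep between batches so that compacting
    `total_revisions` spans roughly `target_seconds`.

    Idealized: assumes the work inside each batch takes no time.
    """
    if total_revisions <= 0 or batch_limit <= 0:
        raise ValueError("total_revisions and batch_limit must be positive")
    batches = math.ceil(total_revisions / batch_limit)
    pauses = batches - 1  # no pause is needed after the final batch
    if pauses == 0:
        return 0.0
    return target_seconds / pauses


if __name__ == "__main__":
    revisions = 2000 * 30 * 60   # revisions written between two compactions
    batch_limit = 1000           # assumed value, for illustration only
    window = 10 * 60             # assumed: spread the work over 10 minutes
    interval = suggest_sleep_interval(revisions, batch_limit, window)
    print(f"--experimental-compaction-batch-limit={batch_limit}")
    print(f"--experimental-compaction-sleep-interval={round(interval * 1000)}ms")
```

The chosen window should stay well below the compaction period so that one run finishes before the next begins. Real batches also take time, so the result is an upper bound on the sleep interval rather than a recommendation.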

I found an article online (link; it is in Chinese, so Google Translate may help) about optimizing etcd for large clusters (~10k nodes), which mentions the "compaction-sleep-interval" parameter. However, it doesn't provide any specific guidance on tuning these two parameters. If you come across any other resources, please share them with me :)

@elias-dbx

Once we upgrade to 3.5.16 I will try tweaking the compaction sleep interval and report back. We run up to 15k nodes in our k8s clusters.

@ivanvc
Member

ivanvc commented Oct 18, 2024

I'll close this issue as the backport is complete and is already part of the 3.5.16 release. Please reopen if you feel there's more work to do.

Thanks, @JalinWang, for your contribution.

@ivanvc ivanvc closed this as completed Oct 18, 2024