Reduce memory usage of etcd member catchup mechanism #17098
cc @ahrtr |
I am interested in helping. But I am not sure I know exactly what needs to be done. Could I shadow someone or get some guidance if I were to attempt this? @serathius |
There are 4 different changes proposed. Looking at the first one
|
/assign |
@serathius this https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1168-L1185 |
@tangwz Are you still planning to work on this? |
Hi @serathius, I could give this a shot, but I would like to understand the proposed changes a little better. Are all 4 changes necessary to address the problem? It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower (see etcd/server/etcdserver/server.go line 1191 at commit 557e7f0).
It sounds like change 2 by itself would address the problem in the common case where followers are able to keep up. Together, changes 3 and 4 sound like they would also address the problem in a different way. If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind? |
I don't have full context since some time has passed since I created the issue, but I think we need all the changes: the first to make compaction more frequent, the second to improve cases where all members are healthy, the third to pick the best memory / time-to-recovery tradeoff, and the fourth to better handle cases where one member is fully down. Feel free to add your own suggestions or ask more questions. The best way to reach me is on K8s slack https://github.com/etcd-io/etcd?tab=readme-ov-file#contact
The first case is not about having it lower, it's about making InMemoryStorage compaction independent from snapshots.
Yes |
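To illustrate the first change (making InMemoryStorage compaction independent from snapshots), here is a rough sketch, not etcd's actual code: it compacts the raft MemoryStorage on its own cadence, keeping only a fixed catch-up window behind the applied index. The helper maybeCompact and its parameters are made up for this example.

package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// maybeCompact trims the in-memory raft log down to catchUpEntries entries
// behind the applied index, regardless of when (or whether) a snapshot is
// taken. Hypothetical helper, for illustration only.
func maybeCompact(ms *raft.MemoryStorage, applied, catchUpEntries uint64) error {
	if applied <= catchUpEntries {
		return nil // not enough applied entries to keep a full catch-up window yet
	}
	compactIndex := applied - catchUpEntries
	first, err := ms.FirstIndex()
	if err != nil {
		return err
	}
	if compactIndex < first {
		return nil // already compacted past this point
	}
	return ms.Compact(compactIndex)
}

func main() {
	ms := raft.NewMemoryStorage()
	// Fill the storage with 100 dummy entries so the example runs end to end.
	var ents []pb.Entry
	for i := uint64(1); i <= 100; i++ {
		ents = append(ents, pb.Entry{Index: i, Term: 1})
	}
	if err := ms.Append(ents); err != nil {
		panic(err)
	}
	// Pretend index 100 is applied and keep a 10-entry catch-up window.
	if err := maybeCompact(ms, 100, 10); err != nil {
		panic(err)
	}
	first, _ := ms.FirstIndex()
	last, _ := ms.LastIndex()
	fmt.Printf("entries kept in memory: [%d, %d]\n", first, last)
}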
/assign @clement2026 |
@serathius: GitHub didn't allow me to assign the following users: clement2026. Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hey @serathius, I'm still on it and will be for a bit. Could you reassign this to me. Thanks! |
/assign |
I can assign myself. Awesome😎 |
#18382 raises the issue of high memory usage related to the etcd member catch-up mechanism. I've been working on it for a while and have some findings to share.

Experiments

I've run a few experiments to observe the heap size of an etcd instance. Below is a table I put together from my observations, showing how the heap size changes when benchmarking etcd.
v3.5.16 (f20bbad)

putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof (https://github.com/moderato-app/live-pprof)
live-pprof 2379

putSize 10 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 MB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 10 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 MB, snapshot-count 500, experimental-snapshot-catchup-entries 500

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=500 \
--experimental-snapshot-catchup-entries=500
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000
# Monitor heap size using live-pprof
live-pprof 2379

v3.6.0-alpha.0 (981061a)

putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

How to estimate the heap size of etcd

The etcd member catch-up mechanism maintains a list of entries to keep the leader and followers in sync. Once we know the average size of each entry (roughly putSize), the memory held by these retained entries ranges from:

experimental-snapshot-catchup-entries * putSize

to:

(experimental-snapshot-catchup-entries + snapshot-count) * putSize

The heap size of these entries, plus some overhead, is roughly the heap size and RSS of etcd. With this in mind, we can try to answer some questions.

Q1: Do I need to worry about the heap size of etcd?

If …

If …

Q2: Is it okay to set a really low value for …
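To make the arithmetic above concrete, here is a small self-contained sketch; the numbers plugged in below (1 MB puts, snapshot-count 10,000, catch-up entries 5,000) are assumptions for the example, not measurements.

package main

import "fmt"

// heapEstimate follows the reasoning above: the retained raft entries hold
// between catchUpEntries*putSize and (catchUpEntries+snapshotCount)*putSize
// bytes, plus some overhead.
func heapEstimate(putSize, snapshotCount, catchUpEntries uint64) (low, high uint64) {
	low = catchUpEntries * putSize
	high = (catchUpEntries + snapshotCount) * putSize
	return low, high
}

func main() {
	// putSize 1 MB, --snapshot-count=10000, --experimental-snapshot-catchup-entries=5000
	low, high := heapEstimate(1_000_000, 10_000, 5_000)
	fmt.Printf("entries held in memory: roughly %d MB to %d MB\n", low/1_000_000, high/1_000_000)
}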
|
Thank you @clement2026 for the analysis, which makes sense. Great work! A couple of thoughts/points:
In the long run, we don't actually need the v2 snapshot since it only contains membership data. However, removing it would have a significant impact, so let's hold off until we've thoroughly discussed and understood the implications to ensure we're confident in the decision. |
@ahrtr Thanks for sharing your thoughts, it’s really helpful!
// Status contains information about this Raft peer and its view of the system.
// The Progress is only populated on the leader.
type Status struct {
	BasicStatus
	Config   tracker.Config
	Progress map[uint64]tracker.Progress
}

I checked out
I can start a PR to identify any risks and discuss it further. |
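As a rough illustration (not etcd code) of how the Progress map above could feed into the catch-up mechanism: the leader could look at the lowest Match index across its peers to learn how far every member has already replicated, and only compact entries below that point.

package example

import (
	"math"

	"go.etcd.io/raft/v3"
)

// minMatchIndex returns the lowest Match index across the peers in the
// leader's Progress map, i.e. the highest log index that every member has
// already replicated. Entries at or below this index could in principle be
// compacted without forcing any follower onto a snapshot. Sketch only.
func minMatchIndex(node raft.Node) (uint64, bool) {
	st := node.Status()
	if len(st.Progress) == 0 {
		// Progress is only populated on the leader.
		return 0, false
	}
	min := uint64(math.MaxUint64)
	for _, pr := range st.Progress {
		if pr.Match < min {
			min = pr.Match
		}
	}
	return min, true
}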
I ran some benchmarks for PR #18589, which changes … The results should be reliable, as I ran the benchmark twice on the …

etcd-benchmark-20240917-07-58-13.zip

What causes the throughput increase?

I analyzed the pprof profile data. It appears that …

release-3.5 vs release-3.5
go tool pprof -http=: -diff_base release-3.5.pb.gz release-3.5-again.pb.gz

pprof profile data and benchmark script.zip

pprof profiling was run several times with different …

Based on the benchmarks from #18589 and #18459, we can see that smaller raft log entries lead to lower heap usage and higher throughput. I'm sharing the benchmark results here, hoping it boosts our confidence and motivation to keep pushing forward. |
Wait whaaat, @clement2026 did you just automate the etcd benchmarking in https://github.com/clement2026/etcd-benchmark-action? cc @jmhbnz. I'm really impressed, could you take a look at #16467? |
@serathius, thanks! It seems like #16467 has some big plans that will require a lot of effort. What https://github.com/clement2026/etcd-benchmark-action does is pretty simple: it runs rw-benchmark on each branch and outputs CSVs and heatmap images. It's inspired by https://github.com/etcd-io/bbolt/actions/runs/9087376452. If https://github.com/clement2026/etcd-benchmark-action can benefit etcd, I'd be happy to adjust it and add it to etcd. |
I agree it can greatly reduce memory usage, but it performs compaction more frequently, so we need to evaluate the impact on CPU usage and throughput. Note that even if we just compact one entry, it will move all the remaining entries: https://github.com/etcd-io/raft/blob/5d6eb55c4e6929e461997c9113aba99a5148e921/storage.go#L266-L269 |
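For context, this is roughly what that compaction does, paraphrased from the MemoryStorage code linked above (not a verbatim copy): even dropping a single entry allocates a fresh slice and copies every remaining entry into it.

package example

import pb "go.etcd.io/raft/v3/raftpb"

// compactEntries mimics what MemoryStorage.Compact does with its entries
// slice: keep a dummy entry at position 0 that records the compacted
// index and term, then copy everything after position i into a new slice.
// The copy happens no matter how few entries are actually being dropped.
func compactEntries(ents []pb.Entry, i uint64) []pb.Entry {
	out := make([]pb.Entry, 1, 1+uint64(len(ents))-i)
	out[0].Index = ents[i].Index
	out[0].Term = ents[i].Term
	out = append(out, ents[i+1:]...)
	return out
}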
I didn’t notice it before. Now I share your concern. It makes sense that performing compaction frequently can reduce the throughput.
@ahrtr I've done some benchmarks before, and they showed high throughput with more frequent compaction: #18459 (comment). However, that PR has not been fully reviewed because it has too many changes. Also, the reasons for the increased throughput weren't fully investigated. It seems I'll need some time to run some solid benchmarks (with minimal code changes) to evaluate the impact on throughput and figure out which parts of the code are responsible for it. For the CPU usage, I need your advice. When I ran the rw-benchmark script, I saw that etcd uses over 95% CPU when it's busy. How can we get meaningful information by comparing the CPU usage of a PR and the main branch when both hit 95%? Maybe I misunderstood what you meant by … I would appreciate more suggestions on how to do the evaluation and what data would be most helpful. |
I don't have a magic solution. It makes sense to compare usage only when both handle the same traffic (like clients, connections, and value size) and don’t max out the CPU. |
The discussion is spread across multiple places. Can we consolidate the motivation, design, plan and evaluation (performance test) result into one spot, like a Google doc or just by updating the first comment on this issue? |
I'm pretty sure the whole motivation, design and plan is in the top comment of this issue. |
What would you like to be added?
All requests made to etcd are serialized into raft entry protos and persisted in the WAL on disk. That's good, but to allow slow or disconnected members to catch up, etcd also stores the last 10,000 entries in raft.InMemoryStorage, all loaded into memory. In some cases this can cause huge memory bloat in etcd. Imagine you have a sequence of large put requests (for example, 1 MB ConfigMaps in Kubernetes): etcd will keep all 10 GB in memory, doing nothing with it.
This can be reproduced by running
./bin/tools/benchmark put --total=1000 --val-size=1000000
and collecting an inuse_space heap profile. The mechanism is really dumb and could benefit from the following improvements:
Why is this needed?
Prevent etcd memory bloating and make memory usage more predictable.