Reduce memory usage of etcd member catchup mechanism #17098
cc @ahrtr |
I am interested in helping. But I am not sure I know exactly what needs to be done. Could I shadow someone or get some guidance if I were to attempt this? @serathius |
There are 4 different changes proposed. Looking at the first one
|
/assign |
@serathius this https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1168-L1185 |
@tangwz Are you still planning to work on this? |
Hi @serathius, I could give this a shot, but I would like to understand the proposed changes a little better. Are all 4 changes necessary to address the problem? It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower (see etcd/server/etcdserver/server.go line 1191 at commit 557e7f0).
It sounds like change 2 by itself would address the problem in the common case where followers are able to keep up. Together, changes 3 and 4 sound like they would also address the problem in a different way. If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind? |
I don't have full context since some time has passed since I created the issue, but I think we need all the changes: the first to make compaction more frequent, the second to improve cases where all members are healthy, the third to pick the best memory / time-to-recovery tradeoff, and the fourth to better handle cases where one member is fully down. Feel free to add your own suggestions or ask more questions. The best way to reach me is on K8s slack https://github.com/etcd-io/etcd?tab=readme-ov-file#contact
The first case is not about having it lower, it's about making InMemoryStorage compaction independent from snapshots.
Yes |
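To illustrate the first change (making InMemoryStorage compaction independent from snapshots), here is a rough sketch, not etcd's actual code: it compacts the raft MemoryStorage on its own cadence, keeping only a fixed catch-up window behind the applied index. The helper maybeCompact and its parameters are made up for this example.

package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// maybeCompact trims the in-memory raft log down to catchUpEntries entries
// behind the applied index, regardless of when (or whether) a snapshot is
// taken. Hypothetical helper, for illustration only.
func maybeCompact(ms *raft.MemoryStorage, applied, catchUpEntries uint64) error {
	if applied <= catchUpEntries {
		return nil // not enough applied entries to keep a full catch-up window yet
	}
	compactIndex := applied - catchUpEntries
	first, err := ms.FirstIndex()
	if err != nil {
		return err
	}
	if compactIndex < first {
		return nil // already compacted past this point
	}
	return ms.Compact(compactIndex)
}

func main() {
	ms := raft.NewMemoryStorage()
	// Fill the storage with 100 dummy entries so the example runs end to end.
	var ents []pb.Entry
	for i := uint64(1); i <= 100; i++ {
		ents = append(ents, pb.Entry{Index: i, Term: 1})
	}
	if err := ms.Append(ents); err != nil {
		panic(err)
	}
	// Pretend index 100 is applied and keep a 10-entry catch-up window.
	if err := maybeCompact(ms, 100, 10); err != nil {
		panic(err)
	}
	first, _ := ms.FirstIndex()
	last, _ := ms.LastIndex()
	fmt.Printf("entries kept in memory: [%d, %d]\n", first, last)
}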
/assign @clement2026 |
@serathius: GitHub didn't allow me to assign the following users: clement2026. Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hey @serathius, I'm still on it and will be for a bit. Could you reassign this to me. Thanks! |
/assign |
I can assign myself. Awesome😎 |
#18382 raises the issue of high memory usage related to the etcd member catch-up mechanism. I've been working on it for a while and have some findings to share.

Experiments

I've run a few experiments to observe the heap size of an etcd instance. Below is a table I put together from my observations, showing how the heap size changes when benchmarking etcd.
v3.5.16 (f20bbad)

putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof (https://github.com/moderato-app/live-pprof)
live-pprof 2379

putSize 10 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 MB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 10 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 MB, snapshot-count 500, experimental-snapshot-catchup-entries 500

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=500 \
--experimental-snapshot-catchup-entries=500
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000
# Monitor heap size using live-pprof
live-pprof 2379

v3.6.0-alpha.0 (981061a)

putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=10000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000
# Monitor heap size using live-pprof
live-pprof 2379

putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5000

# Run etcd
rm -rf tmp.etcd;
etcd --data-dir tmp.etcd \
--enable-pprof=true \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--snapshot-count=100000 \
--experimental-snapshot-catchup-entries=5000
# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000
# Monitor heap size using live-pprof
live-pprof 2379

How to estimate the heap size of etcd

The etcd member catch-up mechanism maintains a list of entries to keep the leader and followers in sync. Once we know the average size of each entry (roughly putSize), the memory held by these retained entries ranges from:

experimental-snapshot-catchup-entries * putSize

to:

(experimental-snapshot-catchup-entries + snapshot-count) * putSize

The heap size of these entries, plus some overhead, is roughly the heap size and RSS of etcd. With this in mind, we can try to answer some questions.

Q1: Do I need to worry about the heap size of etcd?

If …

If …

Q2: Is it okay to set a really low value for …
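To make the arithmetic above concrete, here is a small self-contained sketch; the numbers plugged in below (1 MB puts, snapshot-count 10,000, catch-up entries 5,000) are assumptions for the example, not measurements.

package main

import "fmt"

// heapEstimate follows the reasoning above: the retained raft entries hold
// between catchUpEntries*putSize and (catchUpEntries+snapshotCount)*putSize
// bytes, plus some overhead.
func heapEstimate(putSize, snapshotCount, catchUpEntries uint64) (low, high uint64) {
	low = catchUpEntries * putSize
	high = (catchUpEntries + snapshotCount) * putSize
	return low, high
}

func main() {
	// putSize 1 MB, --snapshot-count=10000, --experimental-snapshot-catchup-entries=5000
	low, high := heapEstimate(1_000_000, 10_000, 5_000)
	fmt.Printf("entries held in memory: roughly %d MB to %d MB\n", low/1_000_000, high/1_000_000)
}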
|
Thank you @clement2026 for the analysis, which makes sense. Great work! A couple of thoughts/points:
In the long run, we don't actually need the v2 snapshot since it only contains membership data. However, removing it would have a significant impact, so let's hold off until we've thoroughly discussed and understood the implications to ensure we're confident in the decision. |
@ahrtr Thanks for sharing your thoughts, it’s really helpful!
// Status contains information about this Raft peer and its view of the system.
// The Progress is only populated on the leader.
type Status struct {
	BasicStatus
	Config   tracker.Config
	Progress map[uint64]tracker.Progress
}

I checked out
I can start a PR to identify any risks and discuss it further. |
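As a rough illustration (not etcd code) of how the Progress map above could feed into the catch-up mechanism: the leader could look at the lowest Match index across its peers to learn how far every member has already replicated, and only compact entries below that point.

package example

import (
	"math"

	"go.etcd.io/raft/v3"
)

// minMatchIndex returns the lowest Match index across the peers in the
// leader's Progress map, i.e. the highest log index that every member has
// already replicated. Entries at or below this index could in principle be
// compacted without forcing any follower onto a snapshot. Sketch only.
func minMatchIndex(node raft.Node) (uint64, bool) {
	st := node.Status()
	if len(st.Progress) == 0 {
		// Progress is only populated on the leader.
		return 0, false
	}
	min := uint64(math.MaxUint64)
	for _, pr := range st.Progress {
		if pr.Match < min {
			min = pr.Match
		}
	}
	return min, true
}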
I ran some benchmarks for PR #18589, which changes … The results should be reliable, as I ran the benchmark twice on the …

etcd-benchmark-20240917-07-58-13.zip

What causes the throughput increase?

I analyzed the pprof profile data. It appears that …

release-3.5 vs release-3.5
go tool pprof -http=: -diff_base release-3.5.pb.gz release-3.5-again.pb.gz

pprof profile data and benchmark script.zip

pprof profiling was run several times with different …

Based on the benchmarks from #18589 and #18459, we can see that smaller raft log entries lead to lower heap usage and higher throughput. I'm sharing the benchmark results here, hoping it boosts our confidence and motivation to keep pushing forward. |
Wait whaaat, @clement2026 did you just automate the etcd benchmarking in https://github.com/clement2026/etcd-benchmark-action? cc @jmhbnz. I'm really impressed, could you take a look at #16467? |
@serathius, thanks! It seems like #16467 has some big plans that will require a lot of effort. What https://github.com/clement2026/etcd-benchmark-action does is pretty simple: it runs rw-benchmark on each branch and outputs CSVs and heatmap images. It's inspired by https://github.com/etcd-io/bbolt/actions/runs/9087376452. If https://github.com/clement2026/etcd-benchmark-action can benefit etcd, I'd be happy to adjust it and add it to etcd. |
I agree it can greatly reduce memory usage, but it performs compaction more frequently, so we need to evaluate the impact on CPU usage and throughput. Note that even if we just compact one entry, it will move all the remaining entries: https://github.com/etcd-io/raft/blob/5d6eb55c4e6929e461997c9113aba99a5148e921/storage.go#L266-L269 |
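For context, this is roughly what that compaction does, paraphrased from the MemoryStorage code linked above (not a verbatim copy): even dropping a single entry allocates a fresh slice and copies every remaining entry into it.

package example

import pb "go.etcd.io/raft/v3/raftpb"

// compactEntries mimics what MemoryStorage.Compact does with its entries
// slice: keep a dummy entry at position 0 that records the compacted
// index and term, then copy everything after position i into a new slice.
// The copy happens no matter how few entries are actually being dropped.
func compactEntries(ents []pb.Entry, i uint64) []pb.Entry {
	out := make([]pb.Entry, 1, 1+uint64(len(ents))-i)
	out[0].Index = ents[i].Index
	out[0].Term = ents[i].Term
	out = append(out, ents[i+1:]...)
	return out
}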
I didn’t notice it before. Now I share your concern. It makes sense that performing compaction frequently can reduce the throughput.
@ahrtr I've done some benchmarks before, and they showed high throughput with more frequent compaction: #18459 (comment). However, that PR has not been fully reviewed because it has too many changes. Also, the reasons for the increased throughput weren't fully investigated. It seems I'll need some time to run some solid benchmarks (with minimal code changes) to evaluate the impact on throughput and figure out which parts of the code are responsible for it. For the CPU usage, I need your advice. When I ran the rw-benchmark script, I saw that etcd uses over 95% CPU when it's busy. How can we get meaningful information by comparing the CPU usage of a PR and the main branch when both hit 95%? Maybe I misunderstood what you meant by … I would appreciate more suggestions on how to do the evaluation and what data would be most helpful. |
I don't have a magic solution. It makes sense to compare usage only when both handle the same traffic (like clients, connections, and value size) and don’t max out the CPU. |
The discussion is spread across multiple places. Can we consolidate the motivation, design, plan and evaluation (performance test) result into one spot, like a Google doc or just by updating the first comment on this issue? |
I'm pretty sure the whole motivation, design and plan is in the top comment of this issue. |
What would you like to be added?
All requests made to etcd are serialized into raft entry protos and persisted in the WAL on disk. That's good, but to allow slow or disconnected members to catch up, etcd also stores the last 10,000 entries in raft.InMemoryStorage, all loaded into memory. In some cases this can cause huge memory bloat in etcd. Imagine you have a sequence of large put requests (for example, 1 MB ConfigMaps in Kubernetes): etcd will keep all 10 GB in memory, doing nothing with it.
This can be reproduced by running
./bin/tools/benchmark put --total=1000 --val-size=1000000
and collecting an inuse_space heap profile. The mechanism is really dumb and could benefit from the following improvements:
Why is this needed?
Prevent etcd memory bloating and make memory usage more predictable.