raft learner limit increase proposal #13148

hexfusion · 2021-06-25T17:56:54Z

Today we limit the raft learner count to 1[1] which rightfully was set in 3.4[2] as a gate against leader stress of log replication. I think the time has come to consider lifting this limit. Perhaps as a gate to additional learners, we can scale learners only if existing learners are promotable (in sync with the leader). Open to other ideas but I think for the feature to evolve user needs to be able to explore learners further even if this is under an experimental flag.

cc @jingyih @gyuho @xiang90 @ptabor

[1]

etcd/server/etcdserver/api/membership/cluster.go

Lines 339 to 340 in 46b49a6

    
           if numLearners+1 > maxLearners { 
        
           	return ErrTooManyLearners

[2] #10730 (comment)

hexfusion · 2021-06-25T18:04:05Z

cc @marun @lilic

anguslees · 2021-07-08T05:11:57Z

Perhaps as a gate to additional learners, we can scale learners only if existing learners are promotable

I'm ignorant - can you help me learn why is this necessary?

Other than the additional replication log load, I feel like the whole point of learners is that they don't affect anything important. So it should be safe to have lots of them (on an unloaded cluster, and for small values of 'lots'), added/removed fairly arbitrarily afaics? .. Do we care how the existing learners are doing before adding/removing other learners?

hexfusion · 2021-07-08T12:20:24Z

The concern is stress on leader to provide log replication to learners. If you add them unbounded leader will take a performance hit. Think of 6GB state file replicated N times. So one approach I see is ensuring current learners are promotable before scaling another.

If you wanted to add learner unbounded for experimental purposes it should probably be under an unsafe flag.

hexfusion · 2021-07-08T12:27:07Z

This all needs testing if your interested in doing a performance review that would help with validation.

stale · 2021-10-07T02:01:55Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

hexfusion · 2021-10-14T11:05:06Z

I have been working on the performance review will update when completed.

hexfusion · 2021-10-28T13:00:58Z

In the initial learner implementation, @jingyih had noted a need for an addition of an API for this to be configurable. While this would be ideal we have many stateless runtime configurations which can affect the way the cluster works. Can we rely on the admin to ensure cluster-wide configuration at runtime?

My feeling is that a dynamic change of this configuration is probably not needed at this time. Thoughts?

cc @jingyih @ptabor @hasbro17 @serathius @chaochn47

[1] #10730 (comment)

hasbro17 · 2021-10-28T22:30:20Z

@hexfusion by dynamic API, do you mean something like etcdctl learner size ...? Yeah that would be nice, but having a config/flag/env would probably do just as well I think.

Just wondering what happens when the etcd instances in a cluster have differing learner limit configurations. Not sure if there is already something that already handles a difference in some cluster-wide configuration flag for a new member.

hexfusion · 2021-10-29T12:24:11Z

If learners exist in the membership and the count is greater than the limit defined at runtime the member will panic.

etcd-2_1          | panic: failed to nodeToMember
etcd-2_1          | 
etcd-2_1          | goroutine 198 [running]:
etcd-2_1          | go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000ae0c0, {0xc0003033c0, 0x1, 0x1})
etcd-2_1          | 	/home/remote/sbatsche/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x499
etcd-2_1          | go.uber.org/zap.(*Logger).Panic(0xedc5e0, {0x1046bac, 0x23ea4ab19f7e0a41}, {0xc0003033c0, 0x1, 0x1})
etcd-2_1          | 	/home/remote/sbatsche/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x59
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver/api/membership.membersFromStore(0x11aed80, {0x11ec9d8, 0xc0003120e0})
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/api/membership/cluster.go:681 +0x467
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).ValidateConfigurationChange(0xc0000d8000, {0x3, 0xbd265dc7727bdff9, {0xc0000bac00, 0x54, 0x60}, 0xa417cc8a8cb7011})
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/api/membership/cluster.go:305 +0x73
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc000238600, {0x3, 0xbd265dc7727bdff9, {0xc0000bac00, 0x54, 0x60}, 0xa417cc8a8cb7011}, 0xc00026a480, 0x0)
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/server.go:1962 +0xa5
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc000238600, {0xc00002c830, 0x28, 0xc00065c380}, 0x7f663a00ffff)
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/server.go:1850 +0x54c
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc000238600, 0xc00026a480, 0xc000228790)
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/server.go:1078 +0x27d
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc000238600, 0xc00026a480, 0xc000228790)
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/server.go:900 +0x65
etcd-2_1          | go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8({0x0, 0x0})
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/server/etcdserver/server.go:832 +0x25
etcd-2_1          | go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc0001fc660)
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/pkg/schedule/schedule.go:157 +0x119
etcd-2_1          | created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
etcd-2_1          | 	/home/remote/sbatsche/projects/etcd.io/etcd/pkg/schedule/schedule.go:70 +0x15c

stale · 2022-01-29T02:45:33Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale · 2022-06-12T23:15:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

hexfusion self-assigned this Jun 25, 2021

stale bot added the stale label Oct 7, 2021

stale bot removed the stale label Oct 14, 2021

stale bot added the stale label Jan 29, 2022

stale bot closed this as completed Feb 19, 2022

serathius reopened this Feb 20, 2022

stale bot removed the stale label Feb 20, 2022

stale bot added the stale label Jun 12, 2022

stale bot closed this as completed Jul 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raft learner limit increase proposal #13148

raft learner limit increase proposal #13148

hexfusion commented Jun 25, 2021

hexfusion commented Jun 25, 2021

anguslees commented Jul 8, 2021 •

edited

Loading

hexfusion commented Jul 8, 2021

hexfusion commented Jul 8, 2021

stale bot commented Oct 7, 2021

hexfusion commented Oct 14, 2021

hexfusion commented Oct 28, 2021

hasbro17 commented Oct 28, 2021

hexfusion commented Oct 29, 2021

stale bot commented Jan 29, 2022

stale bot commented Jun 12, 2022

raft learner limit increase proposal #13148

raft learner limit increase proposal #13148

Comments

hexfusion commented Jun 25, 2021

hexfusion commented Jun 25, 2021

anguslees commented Jul 8, 2021 • edited Loading

hexfusion commented Jul 8, 2021

hexfusion commented Jul 8, 2021

stale bot commented Oct 7, 2021

hexfusion commented Oct 14, 2021

hexfusion commented Oct 28, 2021

hasbro17 commented Oct 28, 2021

hexfusion commented Oct 29, 2021

stale bot commented Jan 29, 2022

stale bot commented Jun 12, 2022

anguslees commented Jul 8, 2021 •

edited

Loading