*: support raft learner in etcd - part 3 #10730
Conversation
(see [error cases when promoting a member] section for more details).
In this case, the user should wait and retry later.

In v3.4, etcd server limits the number of learners that a cluster can have to one. The main consideration is to limit the ...
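The "wait and retry" guidance above maps to a small client-side loop. Below is a minimal sketch using the clientv3 MemberPromote API, assuming the 3.4-era import path; the retry count, timeout, and backoff are arbitrary choices, not values from this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// promoteWithRetry retries MemberPromote until the learner has caught up
// with the leader, as the docs above advise. Limits here are arbitrary.
func promoteWithRetry(cli *clientv3.Client, id uint64) error {
	for attempt := 0; attempt < 10; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := cli.Cluster.MemberPromote(ctx, id)
		cancel()
		if err == nil {
			return nil // the learner is now a voting member
		}
		fmt.Printf("promote attempt %d failed (%v); retrying\n", attempt, err)
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("member %x was not promoted after retries", id)
}
```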
I feel we should make this configurable. A limit of one seems small.
Some people want to increase read performance, so they want more learners, for example.
cc @gyuho
I kind of agree that 1 is not enough for some use cases, such as upsizing a cluster from 1 to 3 or from 3 to 5, or live-migrating a 3-node cluster to a new 3-node cluster. On the other hand, we do not want users to add too many learners at the same time, which might result in too much overhead for the leader. One goal of using a learner is to make adding members safer - that goal is defeated if it creates so much overhead on the leader that the cluster fails. I think we can hard-code the limit to a small number, and make it configurable as an experimental feature in 3.4.
Configuring this limit is a cluster-wide reconfiguration (very similar to a member change), which means we would need to add an API, and maybe a new config change type.
@xiang90 Do we keep the limit of 1? I do not have strong opinions on this.
Let us keep this as it is for now. We can remove the limit later.
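For context on what a hard-coded limit means in the membership layer, here is a hedged sketch of the check; the Member type is reduced to a single field and every identifier is illustrative, not necessarily what etcd uses.

```go
package membership

import "errors"

// maxLearners mirrors the hard-coded limit agreed on above.
const maxLearners = 1

// ErrTooManyLearners would reject a config change that adds a learner
// beyond the limit.
var ErrTooManyLearners = errors.New("membership: too many learner members in cluster")

// Member is reduced to the one field this check needs.
type Member struct {
	IsLearner bool
}

// checkLearnerLimit counts existing learners plus the one about to be
// added, and rejects the change if the total exceeds maxLearners.
func checkLearnerLimit(members []*Member) error {
	learners := 1 // the learner about to be added
	for _, m := range members {
		if m.IsLearner {
			learners++
		}
	}
	if learners > maxLearners {
		return ErrTooManyLearners
	}
	return nil
}
```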
ping @xiang90
CI has been broken. It should be fixed now. Could you rebase?
@@ -1635,6 +1635,38 @@ func (s *EtcdServer) RemoveMember(ctx context.Context, id uint64) ([]*membership

// PromoteMember promotes a learner node to a voting node.
func (s *EtcdServer) PromoteMember(ctx context.Context, id uint64) ([]*membership.Member, error) {
	resp, err := s.promoteMember(ctx, id)
Won't this go through raft? Why do we need to forward the promote request to the leader in a side channel after it goes through raft?
We need to decide whether the learner is ready to be promoted before sending the request to raft. Only the etcd server whose local raft node is the leader has the information on the learner's progress, so the request is first routed to the leader.
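To make that routing concrete, here is a minimal sketch of the leader-or-forward pattern being described; the Server type and its helper methods are stubs invented for illustration, not the code in this diff.

```go
package etcdsketch

import (
	"context"
	"errors"
)

// ErrLearnerNotReady stands in for the sentinel error this PR introduces;
// the message here is illustrative.
var ErrLearnerNotReady = errors.New("etcdserver: learner is not in sync with leader")

// Server stubs out just enough of EtcdServer for the sketch.
type Server struct {
	leader   bool
	progress map[uint64]bool // learner ID -> caught up with leader's log?
}

func (s *Server) isLeader() bool                 { return s.leader }
func (s *Server) learnerCaughtUp(id uint64) bool { return s.progress[id] }

func (s *Server) proposePromote(ctx context.Context, id uint64) error {
	return nil // would propose the promotion as a conf change through raft
}

func (s *Server) forwardToLeader(ctx context.Context, id uint64) error {
	return nil // would relay the request to the leader's peer URLs
}

// promoteMember routes the request: only the leader can judge the
// learner's progress, so a follower forwards instead of deciding locally.
func (s *Server) promoteMember(ctx context.Context, id uint64) error {
	if s.isLeader() {
		if !s.learnerCaughtUp(id) {
			return ErrLearnerNotReady
		}
		return s.proposePromote(ctx, id)
	}
	return s.forwardToLeader(ctx, id)
}
```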
OK, now I understand - we actually check whether the local node is the leader inside the s.promoteMember func. I guess we should just move that checking code out of s.promoteMember so it would be clearer.
Instead, I added comments in 23511d2 to explain that checking whether the local node is the leader is part of the function.
cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
defer cancel()
// forward to leader
Probably add a comment on why we need to forward to the leader.
Added.
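For readers following the hunk above, a hedged sketch of what the follower-side forwarding could look like end to end; the URL path and function shape are assumptions for illustration, not etcd's actual wire format.

```go
package etcdsketch

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// forwardPromote bounds the call with the request timeout (the cctx in the
// hunk above), then tries each of the leader's peer URLs in turn.
func forwardPromote(ctx context.Context, reqTimeout time.Duration, leaderURLs []string, id uint64) error {
	cctx, cancel := context.WithTimeout(ctx, reqTimeout)
	defer cancel()
	for _, u := range leaderURLs {
		// The path below is hypothetical; the real endpoint may differ.
		req, err := http.NewRequest(http.MethodPost, fmt.Sprintf("%s/members/promote/%d", u, id), nil)
		if err != nil {
			continue
		}
		resp, err := http.DefaultClient.Do(req.WithContext(cctx))
		if err != nil {
			continue // this peer URL is unreachable; try the next one
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return nil
		}
	}
	return errors.New("etcdsketch: could not reach the leader to promote member")
}
```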
LGTM after resolving the merge conflicts and the nits.
Adjust StrictReconfigCheck logic to accommodate learner members in the cluster.
Make the learner return codes.Unavailable when a request is not supported by the learner; the client balancer will then retry a different endpoint.
If the member does not exist in the cluster, IsLearner will panic.
Hard-coded the maximum number of learners to 1.
Use ReadyNotify instead of time.Sleep to wait for the server to be ready (see the sketch after this list).
If the learner is not ready to be promoted, use etcdserver.ErrLearnerNotReady instead of membership.ErrLearnerNotReady.
Check the HTTP StatusCode. Only unmarshal the body if the StatusCode is http.StatusOK.
Updated TestMemberPromote to include both learner-not-ready and learner-ready test cases. Removed the unit test TestPromoteMember because it requires the underlying raft node to be started and running; member promote is covered by the integration test.
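The ReadyNotify item above replaces a fixed sleep with an explicit readiness signal. A minimal sketch using the embed package; the data directory and timeout are arbitrary.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd" // arbitrary data directory for the sketch
	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()
	// Block on ReadyNotify instead of sleeping a fixed duration.
	select {
	case <-e.Server.ReadyNotify():
		log.Println("server is ready")
	case <-time.After(60 * time.Second):
		log.Fatal("server took too long to start")
	}
}
```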
Rebased.
@jingyih can you check why the CI failed?
Travis failed with a known flaky test. Semaphore failed with a 20m timeout, which is also a known issue.
Last part of #10645. Continuation of #10727.
This PR includes the last 12 commits in #10645. Some of the commits were modified during rebasing onto #10727.
cc @xiang90