raft: postpone MsgReadIndex until first commit in the term #12762
Conversation
/cc @xiang90
Thank you. That looks good to me.
Could you please also update the comment at:
Lines 182 to 186 in b9226d0
// ReadIndex request a read state. The read state will be set in the ready.
// Read state has a read index. Once the application advances further than the read
// index, any linearizable read requests issued before the read request can be
// processed safely. The read state will have the same rctx attached.
ReadIndex(ctx context.Context, rctx []byte) error
It should state:
Note that request can be lost without notice, therefore it is user's job to ensure read index retries.
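For concreteness, a sketch of how the amended godoc might read, assuming the sentence above is adopted essentially verbatim; the Node interface is abbreviated to the single method under discussion and this is not the final wording committed to the repository:

```go
package raft

import "context"

// Node is abbreviated here to the one method being discussed.
type Node interface {
	// ReadIndex request a read state. The read state will be set in the ready.
	// Read state has a read index. Once the application advances further than the read
	// index, any linearizable read requests issued before the read request can be
	// processed safely. The read state will have the same rctx attached.
	// Note that request can be lost without notice, therefore it is user's job
	// to ensure read index retries.
	ReadIndex(ctx context.Context, rctx []byte) error
}
```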
I think we should eventually audit all etcd calls of 'Propose' and 'ReadIndex' for retries, but retries would still add unnecessary latency. I think we should merge this PR as well.
One thing that might be worth considering is that in the current setup there is no limit to the number of postponed MsgReadIndex messages the leader holds. That would mean we have a risk of unbounded memory growth on the leader.
It is the second approach (the first being etcd-io#12762) to solve etcd-io#12680.
I've never used ReadIndex so I'm not an expert here, but allowing a leader to OOM seems like a pretty bad outcome (consider a process that is running multiple instances of raft, or maintaining a lot of state in addition to raft). I'd prefer to avoid this amount of buffering in the raft object. We have some similar use cases in CockroachDB where we just retry if the leader isn't ready. In order to avoid waiting for the full timeout (500ms or whatever), we retry immediately when we see the leader commit the first empty entry of its term.
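For reference, a rough sketch of that client-side retry pattern under assumptions of my own: errLeaderNotReady, readIndexOnce and firstCommitNotifier are hypothetical names, not CockroachDB or etcd APIs.

```go
package readretry

import (
	"context"
	"errors"
	"time"
)

// errLeaderNotReady stands in for the rejection a leader returns before it has
// committed the first (empty) entry of its term.
var errLeaderNotReady = errors.New("leader has not committed an entry in its term")

// readIndexFunc is a single ReadIndex attempt that may fail while the leader is not ready.
type readIndexFunc func(ctx context.Context) (uint64, error)

// readIndexWithRetry retries a rejected ReadIndex as soon as the caller is notified
// that the leader committed the first entry of its term, instead of always waiting
// out a full timeout; the timeout remains only as a fallback.
func readIndexWithRetry(ctx context.Context, readIndexOnce readIndexFunc, firstCommitNotifier <-chan struct{}) (uint64, error) {
	for {
		idx, err := readIndexOnce(ctx)
		if err == nil {
			return idx, nil
		}
		if !errors.Is(err, errLeaderNotReady) {
			return 0, err
		}
		select {
		case <-firstCommitNotifier:
			// Fast path: the leader just committed the first entry of its term, retry now.
		case <-time.After(500 * time.Millisecond):
			// Fallback so a missed notification cannot stall the read forever.
		case <-ctx.Done():
			return 0, ctx.Err()
		}
	}
}
```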
Is this something we can prevent at the etcd server layer, and just expose the progress from raft? Just like we do for "error too many requests"? And let etcd handle the retry? I agree with @bdarnell that we should never OOM the etcd server.
I was not precise. I don't think that the change actually increases the probability of OOM. See Line 42 in 63c5170
and Line 62 in 63c5170:
the size of that queue is already unbounded. As long as the leader is operating and able to broadcast messages, the queue keeps being truncated. BTW: interesting idea with resending all ReadIndex requests on leadership change as well (although it adds more 'internal dependencies' on the etcd side).
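To illustrate that point, a toy model of my own (not the raft code; all names are made up): the outgoing-message slice has no hard cap, but it is emptied every time the application drains it, so it only grows while the leader cannot broadcast.

```go
package outbox

// message is a stand-in for an outgoing raft message such as a ReadIndex response.
type message struct{ kind string }

// node models only the outgoing-message slice discussed above.
type node struct{ msgs []message }

// step appends an outgoing message; there is no cap, so the slice is unbounded in principle.
func (n *node) step(m message) { n.msgs = append(n.msgs, m) }

// ready hands the accumulated messages to the application and truncates the queue,
// which is why it stays small as long as the leader keeps broadcasting.
func (n *node) ready() []message {
	out := n.msgs
	n.msgs = nil
	return out
}
```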
Indeed. If we trace the execution of ReadIndex, we would find that all of its messages end up appended to that same unbounded queue.
@ptabor I believe that you meant retry on "first commit in current term", not "leadership change", as we currently retry on leadership change in etcd/server/etcdserver/v3_server.go Lines 717 to 718 in 3ead91c
and etcd/server/etcdserver/v3_server.go Lines 770 to 774 in 3ead91c
Correct. So potentially the trigger could come from: etcd/server/etcdserver/server.go Lines 2069 to 2074 in 8469108
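A sketch of how that server-side retry trigger could be broadened to cover both cases discussed above; both notifier channels here are hypothetical stand-ins, not the actual etcd fields or signatures.

```go
package readwait

import (
	"context"
	"time"
)

// waitForRetryTrigger blocks until any retry trigger fires: a leadership change
// (the existing trigger), the first commit in the current term (the proposed
// additional trigger), a fallback timeout, or cancellation.
func waitForRetryTrigger(
	ctx context.Context,
	leaderChangedNotifier <-chan struct{},
	firstCommitInTermNotifier <-chan struct{},
	fallback time.Duration,
) error {
	select {
	case <-leaderChangedNotifier:
	case <-firstCommitInTermNotifier:
	case <-time.After(fallback):
	case <-ctx.Done():
		return ctx.Err()
	}
	return nil
}
```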
If it's already going into a queue I'm not as worried. A second queue feels inelegant but I'm not sure how to restructure it, so LGTM.
// Reject read only request when this leader has not committed any log entry at its term
I assume this time window in practice is pretty small. So, LGTM. /cc @xiang90
Requesting some clarification and docs. Otherwise LGTM.
Codecov Report
@@            Coverage Diff             @@
##           master   #12762       +/-   ##
===========================================
- Coverage   67.68%   53.15%   -14.53%
===========================================
  Files         421      389       -32
  Lines       32973    31379     -1594
===========================================
- Hits        22318    16681     -5637
- Misses       8602    12965     +4363
+ Partials     2053     1733      -320
Thank you for the change. It's a big step towards seamless leadership change.
qq @ptabor: is there any PR that reduced leader election downtime that could be backported to v3.4?
Backport etcd-io#12762 to 3.4 Signed-off-by: Benjamin Wang <wachao@vmware.com>
This PR tries to address #12680. Instead of dropping requests while the leader has not committed the first log entry in its term, it records them and processes them after that commit (1).
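To make the idea behind (1) concrete, here is a minimal, self-contained sketch of the "record and release on first commit of the term" behaviour; the type and field names are illustrative, not the actual raft implementation:

```go
package main

import "fmt"

// readRequest is a stand-in for an incoming MsgReadIndex.
type readRequest struct{ ctx string }

// leaderState models only the pieces needed for the postponement idea.
type leaderState struct {
	term              uint64
	lastCommittedTerm uint64        // term of the most recently committed entry
	pendingReads      []readRequest // recorded (postponed) read-index requests
}

// committedEntryInCurrentTerm reports whether the leader has committed an entry
// that was proposed in its own term.
func (l *leaderState) committedEntryInCurrentTerm() bool {
	return l.lastCommittedTerm == l.term
}

// handleReadIndex postpones the request instead of dropping it while the leader
// has not yet committed anything in its term.
func (l *leaderState) handleReadIndex(r readRequest) {
	if !l.committedEntryInCurrentTerm() {
		l.pendingReads = append(l.pendingReads, r)
		return
	}
	l.serve(r)
}

// onCommit releases all postponed requests once the first entry of the leader's
// own term has been committed.
func (l *leaderState) onCommit(entryTerm uint64) {
	l.lastCommittedTerm = entryTerm
	if !l.committedEntryInCurrentTerm() {
		return
	}
	for _, r := range l.pendingReads {
		l.serve(r)
	}
	l.pendingReads = nil
}

func (l *leaderState) serve(r readRequest) { fmt.Println("serving ReadIndex for", r.ctx) }

func main() {
	l := &leaderState{term: 5, lastCommittedTerm: 4}
	l.handleReadIndex(readRequest{ctx: "req-1"}) // postponed: nothing committed in term 5 yet
	l.onCommit(5)                                // first commit in term 5 releases the request
}
```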
#12680 could also be fixed by another approach: a retry mechanism in the following code (2):
etcd/server/etcdserver/v3_server.go
Lines 752 to 783 in 3ead91c
While this PR is a proposal for (1), I'd like to introduce a PR for (2) as well. In my opinion, implementing (1) alone, (2) alone, or both seems reasonable, and I'd love to hear what you think.
Edit:
An implementation of (2) can be found in #12780.