raft: postpone MsgReadIndex until first commit in the term #12762
Conversation
/cc @xiang90
Thank you. That looks good to me.
Could you please also update the comment at:
Lines 182 to 186 in b9226d0
// ReadIndex request a read state. The read state will be set in the ready.
// Read state has a read index. Once the application advances further than the read
// index, any linearizable read requests issued before the read request can be
// processed safely. The read state will have the same rctx attached.
ReadIndex(ctx context.Context, rctx []byte) error
It should state:
Note that request can be lost without notice, therefore it is user's job to ensure read index retries.
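For concreteness, a sketch of how the amended godoc might read, assuming the sentence above is adopted essentially verbatim; the Node interface is abbreviated to the single method under discussion and this is not the final wording committed to the repository:

```go
package raft

import "context"

// Node is abbreviated here to the one method being discussed.
type Node interface {
	// ReadIndex request a read state. The read state will be set in the ready.
	// Read state has a read index. Once the application advances further than the read
	// index, any linearizable read requests issued before the read request can be
	// processed safely. The read state will have the same rctx attached.
	// Note that request can be lost without notice, therefore it is user's job
	// to ensure read index retries.
	ReadIndex(ctx context.Context, rctx []byte) error
}
```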
I think we should eventually audit all etcd calls of 'Propose' and 'ReadIndex' for retries, but retries would still add unnecessary latency. I think we should merge this PR as well.
One thing that might be worth considering is that in the current setup there is no limit to the number of postponed MsgReadIndex messages the leader holds. That would mean we have a risk of unbounded memory growth on the leader.
It is the second approach (the first being etcd-io#12762) to solve etcd-io#12680.
I've never used ReadIndex so I'm not an expert here, but allowing a leader to OOM seems like a pretty bad outcome (consider a process that is running multiple instances of raft, or maintaining a lot of state in addition to raft). I'd prefer to avoid this amount of buffering in the raft object. We have some similar use cases in CockroachDB where we just retry if the leader isn't ready. In order to avoid waiting for the full timeout (500ms or whatever), we retry immediately when we see the leader commit the first empty entry of its term.
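For reference, a rough sketch of that client-side retry pattern under assumptions of my own: errLeaderNotReady, readIndexOnce and firstCommitNotifier are hypothetical names, not CockroachDB or etcd APIs.

```go
package readretry

import (
	"context"
	"errors"
	"time"
)

// errLeaderNotReady stands in for the rejection a leader returns before it has
// committed the first (empty) entry of its term.
var errLeaderNotReady = errors.New("leader has not committed an entry in its term")

// readIndexFunc is a single ReadIndex attempt that may fail while the leader is not ready.
type readIndexFunc func(ctx context.Context) (uint64, error)

// readIndexWithRetry retries a rejected ReadIndex as soon as the caller is notified
// that the leader committed the first entry of its term, instead of always waiting
// out a full timeout; the timeout remains only as a fallback.
func readIndexWithRetry(ctx context.Context, readIndexOnce readIndexFunc, firstCommitNotifier <-chan struct{}) (uint64, error) {
	for {
		idx, err := readIndexOnce(ctx)
		if err == nil {
			return idx, nil
		}
		if !errors.Is(err, errLeaderNotReady) {
			return 0, err
		}
		select {
		case <-firstCommitNotifier:
			// Fast path: the leader just committed the first entry of its term, retry now.
		case <-time.After(500 * time.Millisecond):
			// Fallback so a missed notification cannot stall the read forever.
		case <-ctx.Done():
			return 0, ctx.Err()
		}
	}
}
```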
Is this something we can prevent at the etcd server layer, and just expose the progress from raft? Just like we do for "error too many requests"? And let etcd handle the retry? I agree with @bdarnell that we should never OOM the etcd server.
I was not precise. I don't think that the change actually increases the probability of OOM. See Line 42 in 63c5170
and Line 62 in 63c5170:
the size of that queue is already unbounded. As long as the leader is operating and able to broadcast messages, the queue keeps being truncated. BTW: interesting idea with resending all ReadIndex requests on leadership change as well (although it adds more 'internal dependencies' on the etcd side).
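To illustrate that point, a toy model of my own (not the raft code; all names are made up): the outgoing-message slice has no hard cap, but it is emptied every time the application drains it, so it only grows while the leader cannot broadcast.

```go
package outbox

// message is a stand-in for an outgoing raft message such as a ReadIndex response.
type message struct{ kind string }

// node models only the outgoing-message slice discussed above.
type node struct{ msgs []message }

// step appends an outgoing message; there is no cap, so the slice is unbounded in principle.
func (n *node) step(m message) { n.msgs = append(n.msgs, m) }

// ready hands the accumulated messages to the application and truncates the queue,
// which is why it stays small as long as the leader keeps broadcasting.
func (n *node) ready() []message {
	out := n.msgs
	n.msgs = nil
	return out
}
```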
Indeed. If we trace the execution of ReadIndex, we would find that all of its messages end up appended to that same unbounded queue.
@ptabor I believe that you meant retry on "first commit in current term", not "leadership change", as we currently retry on leadership change in etcd/server/etcdserver/v3_server.go Lines 717 to 718 in 3ead91c
and etcd/server/etcdserver/v3_server.go Lines 770 to 774 in 3ead91c
Correct. So potentially the trigger could come from: etcd/server/etcdserver/server.go Lines 2069 to 2074 in 8469108
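A sketch of how that server-side retry trigger could be broadened to cover both cases discussed above; both notifier channels here are hypothetical stand-ins, not the actual etcd fields or signatures.

```go
package readwait

import (
	"context"
	"time"
)

// waitForRetryTrigger blocks until any retry trigger fires: a leadership change
// (the existing trigger), the first commit in the current term (the proposed
// additional trigger), a fallback timeout, or cancellation.
func waitForRetryTrigger(
	ctx context.Context,
	leaderChangedNotifier <-chan struct{},
	firstCommitInTermNotifier <-chan struct{},
	fallback time.Duration,
) error {
	select {
	case <-leaderChangedNotifier:
	case <-firstCommitInTermNotifier:
	case <-time.After(fallback):
	case <-ctx.Done():
		return ctx.Err()
	}
	return nil
}
```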
If it's already going into a queue I'm not as worried. A second queue feels inelegant but I'm not sure how to restructure it, so LGTM.
// Reject read only request when this leader has not committed any log entry at its term
I assume this time window in practice is pretty small. So, LGTM. /cc @xiang90
Requesting some clarification and docs. Otherwise LGTM.
Codecov Report
@@            Coverage Diff             @@
##           master   #12762       +/-   ##
===========================================
- Coverage   67.68%   53.15%   -14.53%
===========================================
  Files         421      389       -32
  Lines       32973    31379     -1594
===========================================
- Hits        22318    16681     -5637
- Misses       8602    12965     +4363
+ Partials     2053     1733      -320
Thank you for the change. It's a big step towards seamless leadership change.
qq @ptabor: is there any PR that reduced leader election downtime that could be backported to v3.4?
Backport etcd-io#12762 to 3.4 Signed-off-by: Benjamin Wang <wachao@vmware.com>
This PR tries to address #12680. Instead of dropping requests while the leader has not committed the first log entry in its term, it records them and processes them after that commit (1).
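To make the idea behind (1) concrete, here is a minimal, self-contained sketch of the "record and release on first commit of the term" behaviour; the type and field names are illustrative, not the actual raft implementation:

```go
package main

import "fmt"

// readRequest is a stand-in for an incoming MsgReadIndex.
type readRequest struct{ ctx string }

// leaderState models only the pieces needed for the postponement idea.
type leaderState struct {
	term              uint64
	lastCommittedTerm uint64        // term of the most recently committed entry
	pendingReads      []readRequest // recorded (postponed) read-index requests
}

// committedEntryInCurrentTerm reports whether the leader has committed an entry
// that was proposed in its own term.
func (l *leaderState) committedEntryInCurrentTerm() bool {
	return l.lastCommittedTerm == l.term
}

// handleReadIndex postpones the request instead of dropping it while the leader
// has not yet committed anything in its term.
func (l *leaderState) handleReadIndex(r readRequest) {
	if !l.committedEntryInCurrentTerm() {
		l.pendingReads = append(l.pendingReads, r)
		return
	}
	l.serve(r)
}

// onCommit releases all postponed requests once the first entry of the leader's
// own term has been committed.
func (l *leaderState) onCommit(entryTerm uint64) {
	l.lastCommittedTerm = entryTerm
	if !l.committedEntryInCurrentTerm() {
		return
	}
	for _, r := range l.pendingReads {
		l.serve(r)
	}
	l.pendingReads = nil
}

func (l *leaderState) serve(r readRequest) { fmt.Println("serving ReadIndex for", r.ctx) }

func main() {
	l := &leaderState{term: 5, lastCommittedTerm: 4}
	l.handleReadIndex(readRequest{ctx: "req-1"}) // postponed: nothing committed in term 5 yet
	l.onCommit(5)                                // first commit in term 5 releases the request
}
```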
#12680 could also be fixed by another approach: a retry mechanism in the following code (2):
etcd/server/etcdserver/v3_server.go
Lines 752 to 783 in 3ead91c
While this PR is a proposal for (1), I'd like to introduce a PR for (2) as well. In my opinion, implementing (1) alone, (2) alone, or both seems reasonable, and I'd love to hear what you think.
Edit:
An implementation of (2) can be found in #12780.