
Fix watch event loss #17555

Merged · 1 commit · Mar 16, 2024
Conversation

@chaochn47 chaochn47 force-pushed the fix-watch-event-loss branch from d9a3dc8 to 4025f36 Compare March 8, 2024 07:02
@chaochn47
Member Author

chaochn47 commented Mar 8, 2024

Filed flaky test report #17556, providing the root cause and how to fix it.

Is there any way to re-run this specific failed check, E2E / test (linux-386)? @jmhbnz

@jmhbnz
Member

jmhbnz commented Mar 8, 2024

/retest

(Will only retest workflows that have failed by default)

@fuweid (Member) left a comment

LGTM

The change is small and it's easy to backport. Thanks

Review thread on tests/e2e/watch_delay_test.go (outdated, resolved)
@ahrtr
Member

ahrtr commented Mar 8, 2024

Thanks for the fix.

The fix is simple and makes sense to me, but the 230+ lines of e2e test are a little over-complicated to me. Are you able to create a simple unit test to simulate the case where w.ch is full instead?

select {
case w.ch <- WatchResponse{WatchID: w.id, CompactRevision: compactRev}:
	w.compacted = true
	wg.delete(w)
default:
	// retry next time
}
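
A standalone sketch of the state such a unit test needs to create (not etcd code; watchResponse below is a stand-in for mvcc.WatchResponse): a full buffered channel forces the select above into its default branch, so the watcher stays in the group and is retried on the next sync.

package main

import "fmt"

// watchResponse stands in for mvcc.WatchResponse in this sketch.
type watchResponse struct{ compactRev int64 }

func main() {
	// A buffer of 1 that is already full: the next send would block,
	// so the select below must take the default branch.
	ch := make(chan watchResponse, 1)
	ch <- watchResponse{}

	delivered := false
	select {
	case ch <- watchResponse{compactRev: 5}:
		delivered = true // only possible once the receiver drains ch
	default:
		// channel full: keep the watcher and retry next time
	}
	fmt.Println("compaction response delivered:", delivered) // prints: false
}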

Also, when I read the source code, I found there may be another slightly related issue. As the code above shows, when w.minRev < compactRev, etcd will send out a WatchResponse with CompactRevision and also remove the watcher (wg.delete(w)), but when the watcher count > maxWatchersPerSync (512), it only removes the watcher from a temporary watcherGroup (see below). So it doesn't really remove the watcher? Could you double confirm this, e.g. by writing a unit test? Thanks.

ret := newWatcherGroup()
for w := range wg.watchers {
	if maxWatchers <= 0 {
		break
	}
	maxWatchers--
	ret.add(w)
}
return &ret, ret.chooseAll(curRev, compactRev)

@chaochn47
Member Author

chaochn47 commented Mar 9, 2024

@ahrtr @serathius

it only removes the watcher from a temporary watcherGroup (see below). So it doesn't really remove the watcher? Could you double confirm this, e.g. by writing a unit test? Thanks.

The .choose function is written poorly: it sometimes mutates the original watcherGroup, sometimes returns the same watcher group, and sometimes returns a copy. It's not great.

Yeah, I have confirmed with the following unit test that a compacted watcher still exists in the original watcher group after wg.choose.

I also observed that the slow watchers metric is 829 even though I only opened 801 watchers. I haven't dug into that yet, but it could be related.

func TestWatchGroupUpdate(t *testing.T) {
	ch := make(chan WatchResponse, chanBufLen)
	compactRev := int64(5)
	curRev := int64(12)
	wg := newWatcherGroup()
	wg.add(&watcher{
		key:    []byte("foo/"),
		end:    []byte("foo0"),
		minRev: 2,
		id:     0,
		ch:     ch,
	})
	wg.add(&watcher{
		key:    []byte("foo/"),
		end:    []byte("foo0"),
		minRev: 3,
		id:     1,
		ch:     ch,
	})

	wg.choose(1 /* maxWatchers */, curRev, compactRev)
	// we would expect whatever the picked compacted watcher should be deleted from the watch group
	for w := range wg.watchers {
		t.Logf("compactRev is: %d; watcher %d with minRev %d still in the watcher group", compactRev, w.id, w.minRev)
	}
	require.Equal(t, 1, len(wg.watchers))
}
dev-dsk-chaochn-2c-a26acd76 % GOWORK=off go test -v -run TestWatchGroupUpdate
=== RUN   TestWatchGroupUpdate
    watcher_group_test.go:32: compactRev is: 5; watcher 0 with minRev 2 still in the watcher group
    watcher_group_test.go:32: compactRev is: 5; watcher 1 with minRev 3 still in the watcher group
    watcher_group_test.go:34:
        	Error Trace:	/home/chaochn/workplace/EKS-etcd/src/EKS-etcd/server/storage/mvcc/watcher_group_test.go:34
        	Error:      	Not equal:
        	            	expected: 1
        	            	actual  : 2
        	Test:       	TestWatchGroupUpdate
--- FAIL: TestWatchGroupUpdate (0.00s)
FAIL
exit status 1
FAIL	go.etcd.io/etcd/server/v3/storage/mvcc	0.011s

I can submit another separate PR to fix it.
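
One possible direction for that follow-up, sketched purely as an illustration: it assumes watchers are shared by pointer between the original and temporary groups and that chooseAll sets w.compacted when the compaction response is delivered (as in the snippet above). This is not the actual follow-up PR.

// Hypothetical adjustment to (wg *watcherGroup) choose; the field and method
// names come from the snippets in this thread, everything else is illustrative.
func (wg *watcherGroup) choose(maxWatchers int, curRev, compactRev int64) (*watcherGroup, int64) {
	if len(wg.watchers) < maxWatchers {
		return wg, wg.chooseAll(curRev, compactRev)
	}
	ret := newWatcherGroup()
	for w := range wg.watchers {
		if maxWatchers <= 0 {
			break
		}
		maxWatchers--
		ret.add(w)
	}
	minRev := ret.chooseAll(curRev, compactRev)
	// chooseAll only deletes compacted watchers from the temporary group ret,
	// so drop them from the original group as well to avoid leaving them behind.
	for w := range wg.watchers {
		if w.compacted {
			wg.delete(w)
		}
	}
	return &ret, minRev
}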

@chaochn47
Member Author

Are you able to create a simple unit test to simulate the case where w.ch is full instead?

I would like to keep the existing e2e test to ensure there is no regression in etcd watch behavior after the fix gets in. I will simplify it by removing unnecessary code paths.

A unit test is good to have, and I am working on it.

@ahrtr
Member

ahrtr commented Mar 10, 2024

Please ping me back when you resolve all comments.

@ahrtr
Member

ahrtr commented Mar 10, 2024

I would like to keep the existing e2e test to ensure there is no regression in etcd watch behavior after the fix gets in. I will simplify it by removing unnecessary code paths.

A unit test is good to have, and I am working on it.

OK

@serathius
Member

I think we might need to make more changes in the watch code soon.

Let's go with the minimal fix proposed in the original PR to address the issue.

@chaochn47 can you simplify the e2e test? I should be able to help with it if needed.

@chaochn47 chaochn47 force-pushed the fix-watch-event-loss branch from 4025f36 to 7ff1da9 Compare March 11, 2024 19:03
Two review threads on tests/e2e/watch_delay_test.go (outdated, resolved)
@chaochn47
Member Author

Ping @ahrtr @serathius

@serathius
Member

Have you considered using a gofail failpoint to simulate the watch stream being clogged?

I managed to reproduce the issue using the following test:

func TestV3NoEventsLostOnCompact(t *testing.T) {
	if integration.ThroughProxy {
		t.Skip("grpc proxy currently does not support requesting progress notifications")
	}
	integration.BeforeTest(t)

	clus := integration.NewCluster(t, &integration.ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	client := clus.RandClient()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	writeCount := mvcc.WatchStreamResponseBufferLen * 11 / 10

	wch := client.Watch(ctx, "foo")
	require.NoError(t, gofail.Enable("watchResponseSend", `sleep(1000)`))
	var rev int64 = 0
	for i := 0; i < writeCount; i++ {
		resp, err := client.Put(ctx, "foo", "bar")
		require.NoError(t, err)
		rev = resp.Header.Revision
	}
	_, err := client.Compact(ctx, rev)
	require.NoError(t, err)
	time.Sleep(time.Second)
	require.NoError(t, gofail.Disable("watchResponseSend"))

	event_count := 0
	compacted := false
	for resp := range wch {
		err = resp.Err()
		if err != nil {
			if !strings.Contains(err.Error(), "required revision has been compacted") {
				t.Fatal(err)
			}
			compacted = true
			break
		}
		event_count += len(resp.Events)
		if event_count == writeCount {
			break
		}
	}
	assert.Truef(t, compacted, "Expected stream to get compacted, instead we got %d events out of %d events", event_count, writeCount)
}

Here watchResponseSend is a new failpoint placed just before the gRPC Send call.
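
For reference, a gofail failpoint like this is declared as a comment in the source and compiled in by the gofail preprocessor. A minimal sketch of how it might sit in the watch server's send path; the exact placement and surrounding code are assumptions, not the PR's actual diff:

// Somewhere in serverWatchStream.sendLoop (illustrative placement):

// gofail: var watchResponseSend struct{}
if err := sws.gRPCStream.Send(wr); err != nil {
	return // rely on the send loop's existing error handling
}

The integration test then activates it with gofail.Enable("watchResponseSend", `sleep(1000)`), stalling each send long enough for the watcher's channel to fill up and for the compaction to overtake the watcher's progress.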

@serathius serathius added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 13, 2024
@chaochn47
Member Author

chaochn47 commented Mar 14, 2024

Have you considered using a gofail failpoint to simulate the watch stream being clogged?

I managed to reproduce the issue using the following test:

Thanks! I have updated the integration test based on the one you provided!

Added the unit test, the integration test with the failpoint, and the e2e test together, and will let you pick which one is best.

@chaochn47 chaochn47 force-pushed the fix-watch-event-loss branch 3 times, most recently from 1712f23 to 12514b9 Compare March 14, 2024 06:42
Two review threads on server/storage/mvcc/watchable_store.go (outdated, resolved)
Review thread on server/etcdserver/api/v3rpc/watch.go (outdated, resolved)
Review thread on tests/robustness/makefile.mk (resolved)
Review thread on server/storage/mvcc/watchable_store_test.go (outdated, resolved)
Review thread on tests/e2e/watch_delay_test.go (outdated, resolved)
@ahrtr
Member

ahrtr commented Mar 14, 2024

The e2e test is still a little over-complicated to me. It will definitely be painful for other contributors to update/maintain such an e2e test in the future. I think it should be part of the robustness test, which already supports generating traffic, watching, compaction, and verifying results (including watchResponses); why do you have to write a similarly complex test in the e2e test suite? But I won't insist on that.

@chaochn47
Member Author

chaochn47 commented Mar 14, 2024

The e2e test is still a little over-complicated to me. It will definitely be painful for other contributors to update/maintain such an e2e test in the future. I think it should be part of the robustness test, which already supports generating traffic, watching, compaction, and verifying results (including watchResponses); why do you have to write a similarly complex test in the e2e test suite? But I won't insist on that.

Okay. The e2e test has been removed from this PR and can be added to the robustness or performance test suite later. It was originally created to simulate Kubernetes traffic. The unit test and integration test are enough to capture regressions.

@chaochn47 chaochn47 force-pushed the fix-watch-event-loss branch from 4ec7409 to f4421bb Compare March 14, 2024 23:47
@chaochn47
Member Author

/retest

@ahrtr (Member) left a comment

LGTM with two very minor comments, which can be resolved separately.

We can let this PR in for now if there is no other comment. cc @serathius

Two review threads on server/storage/mvcc/watchable_store_test.go (outdated, resolved)
Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47 chaochn47 force-pushed the fix-watch-event-loss branch from f4421bb to 405862e Compare March 15, 2024 21:22
@@ -250,6 +251,63 @@ func TestWatchCompacted(t *testing.T) {
}
}

func TestWatchNoEventLossOnCompact(t *testing.T) {
oldChanBufLen, oldMaxWatchersPerSync := chanBufLen, maxWatchersPerSync

Not sure how relevant maxWatchersPerSync is to the issue and this test, as len(watchers) < 4.

What about the case where len(watchers) > maxWatchersPerSync, as pointed out in #17555 (comment)? I haven't verified it, but I expect that the unremoved watcher in s.unsynced will cause syncWatchers to return != 0, causing the function to be called earlier.


Not sure how relevant maxWatchersPerSync is to the issue and this test, as len(watchers) < 4.

It seems to be true. I confirmed that it always runs into the if branch (see below), and that it has no impact on the test case.

But it isn't a big deal, and from another perspective it should be OK to explicitly set a value to ensure len(wg.watchers) < maxWatchers, even though that's already true by default.

if len(wg.watchers) < maxWatchers {
	return wg, wg.chooseAll(curRev, compactRev)
}

What about the case where len(watchers) > maxWatchersPerSync, as pointed out in #17555 (comment)?

I suggest discussing and fixing it separately. We may want to do a minor local code refactor. @chaochn47 are you able to continue working on this?


confirmed that it has no impact on the test case.

I mean that even if I don't change maxWatchersPerSync in the test case, the test case can still reproduce the issue without the fix, and the issue disappears after applying the patch in this PR.


Sounds good

// if its next revision of events are compacted and no lost events sent to client.
func TestV3NoEventsLostOnCompact(t *testing.T) {
	if integration.ThroughProxy {
		t.Skip("grpc proxy currently does not support requesting progress notifications")

Don't think this comment is relevant.

@ahrtr
Member

ahrtr commented Mar 18, 2024

@chaochn47 do you have bandwidth to backport this to 3.5 and 3.4? We may need to release a 3.4 and 3.5 patch soon.

@jmhbnz mentioned this pull request Mar 18, 2024
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Mar 19, 2024
Signed-off-by: Chao Chen <chaochn@amazon.com>
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Mar 19, 2024
Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47
Member Author

@chaochn47 do you have bandwidth to backport this to 3.5 and 3.4? We may need to release a 3.4 and 3.5 patch soon.

@ahrtr @jmhbnz sure. Just provided the backport PRs. Could you please take a look?

Labels
backport/v3.4 backport/v3.5 priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Development

Successfully merging this pull request may close these issues.

etcd watch events starvation / lost multiplexed on a single watch stream
5 participants