etcdserver: add watchdog to detect stalled writes #15440

ahrtr · 2023-03-10T02:30:58Z

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

ahrtr · 2023-03-10T02:52:14Z

This is the draft PR per our discussion in the doc

@mitake @ptabor @serathius @chaochn47 @fuweid Please let me know if you have any immediate comment or concern, before I continue to add tests.

server/embed/etcd.go

server/watchdog/watchdog.go

server/storage/backend/batch_tx.go

lavacat · 2023-03-13T05:53:45Z

server/watchdog/watchdog.go

+				v.inactiveElapsed++
+				if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
+					elapsedTime := time.Duration(v.inactiveElapsed*tickMs) * time.Millisecond
+					wd.lg.Warn("Slow activity detected", zap.String("activity", v.name), zap.Duration("duration", elapsedTime))


nit: please add a gauge metric for slow activities count with activity name as label. I think this will be useful for folks who rely on metrics to actively monitor etcd cluster.

serathius · 2023-03-13T11:28:30Z

I would like to clarify what is the proposed solution and how it differs from the design proposed by @ptabor and me. Can you please cleanup the design document before people jump on reviewing the code?

server/embed/config.go

server/etcdserver/api/snap/snapshotter.go

ptabor · 2023-03-13T11:37:35Z

server/etcdserver/api/snap/snapshotter.go

@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
 	spath := filepath.Join(s.dir, fname)

 	fsyncStart := time.Now()
+	cancel := watchdog.Register("save v2 snapshot")


I think we need to decide the naming better.

a) ```
endActivity := wd.StartActivity("saving v2 snapshot")
endActivity()

b) ``` markAsDone := wd.Start("saving v2 snapshot") markAsDone()

c) ```
markAsDone := wd.Notify("saving v2 snapshot")
markAsDone()

d) ``` endScope := wd.StartScope("saving v2 snapshot") endScope()

We might also have a synthactic sugar:

or watchdog.executeInScope("saving v2 snapshot", func() { ...} );

I don't like 'cancel' as it suggests an interruption of an existing process.. and it's not.

I hoped we will settle this (API / naming) in a document rather than keep updating the PR, that is relatively expensive.

I like the suggestion of synthactic sugar. I imagine it's similar to bboltDB's View and Update method

StartXXX seems not good to me, it doesn't start any activity, instead essentially it just registers the activity to watchdog. So how about Register and unregister?

unregister := wd.Register("saving v2 snapshot") unregister()

Users can also call the syntactic sugar as you proposed below,

wd.Execute("saving v2 snapshot", fn)

Please also see the updated doc https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.3oeryohw1c9o

ptabor · 2023-03-13T11:46:04Z

server/etcdserver/api/snap/snapshotter.go

@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
 	spath := filepath.Join(s.dir, fname)

 	fsyncStart := time.Now()
+	cancel := watchdog.Register("save v2 snapshot")


I wonder what do you think about pre-registering type of activities that are tracked in the watchdog:

{ at the package level: var SAVING_V2_SNAPSHOTS = watchdog.RegisterActivity("saving v2 snapshots"); }``` endActivity := wd.StartActivity(SAVING_V2_SNAPSHOTS) endActivity()

The main benefit is that it drives the API usage to small number of strings rather than patterns like:

wd.StartActivity(string.format("Writing log entry: %d", entryId));

thus the API can be e.g. used for monitoring how long classes of activities take in scope of a watchdog. i.e. watchdog sees that 17% of wall-time it was in the writing-logs routine. It's a stretch and over design. but in theory such registration allows to build also a hierarchy of 'activities/scope'.

ptabor · 2023-03-13T11:52:39Z

server/watchdog/watchdog.go

+			wd.mu.Lock()
+			for _, v := range wd.activities {
+				v.inactiveElapsed++
+				if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {


The fundamental difference between this design and watchdog pattern in general is that you propose to monitor a specific registered activities.

I assumed the goal is to monitor watchdog as a whole thing....

Watchdog is unhealthy when for the whole 'timeout' it got no 'Start' nor 'Stop/Done' notifications.
Any such notifications does reset the watchdog. Lack of new activities is especially an alart.

We do register specific activities only to:
a) have actionable error messages:
- the watchdog is stuck with following active activities...
- the watchdog is stuck and the last started / DONE activity is X
b) [stretch] We might piggy back on it and collect metrics what percent of time specifica activities are active state

server/watchdog/watchdog.go

ahrtr · 2023-03-14T04:21:27Z

Thanks all for the feedback, which basically makes sense to me. The overall idea is basically coming from one of @ptabor 's comments in the doc. Will update & cleanup the doc later. I will also keep this PR (just a PoC) in sync with the doc so as to avoid any misunderstanding.

One thing I'd like to clarify. Previously I thought it should be an etcd-raft-loop watchdog, and gets notified each time on receiving a ready data struct. The watchdog isn't healthy if it doesn't receive any new notification in the given max-duration. But on second thought, it seems not good, because,

The etcd raft loop will not be triggered at all for one-node cluster when there is no any client requests.
Our purpose for now is to monitor stalled write, why should we care about the etcd-raft-loop?

So eventually I changed to monitor only registered activities (e.g. sync WAL log, commit boltDB, etc). Any concern?

Use watchdog to monitor all storage read/write operations Signed-off-by: Benjamin Wang <wachao@vmware.com>

ahrtr · 2023-03-19T22:56:10Z

Please also see the updated doc https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.3oeryohw1c9o

ahrtr · 2023-04-22T11:53:19Z

#15477 (comment)

stale · 2023-08-12T06:49:37Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

k8s-ci-robot · 2024-01-16T20:35:54Z

@ahrtr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-etcd-verify	`93f6dcc`	link	true	`/test pull-etcd-verify`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ruiming-lu · 2024-03-09T04:17:27Z

Hello developers! I am wondering what is the status of this issue right now? Will it be merged to the main branch?

serathius · 2024-03-09T09:16:08Z

We decided to go with different design. Etcd will detect stalled writes by checking raft loop execution and expose a /livez endpoint that supervisor can use to restart etcd if needed.

ahrtr marked this pull request as draft March 10, 2023 02:31

ahrtr force-pushed the watchdog_20230309 branch from b5782da to 39abdbe Compare March 10, 2023 02:40

ahrtr force-pushed the watchdog_20230309 branch from 39abdbe to 8a50ca5 Compare March 10, 2023 04:37

chaochn47 reviewed Mar 11, 2023

View reviewed changes

server/embed/etcd.go Show resolved Hide resolved

server/embed/etcd.go Outdated Show resolved Hide resolved

server/embed/etcd.go Show resolved Hide resolved

server/watchdog/watchdog.go Outdated Show resolved Hide resolved

ahrtr force-pushed the watchdog_20230309 branch from 8a50ca5 to 69e29bf Compare March 11, 2023 02:58

fuweid reviewed Mar 12, 2023

View reviewed changes

server/storage/backend/batch_tx.go Outdated Show resolved Hide resolved

ahrtr force-pushed the watchdog_20230309 branch 2 times, most recently from b8e2988 to 989cce8 Compare March 13, 2023 00:58

lavacat reviewed Mar 13, 2023

View reviewed changes

ptabor suggested changes Mar 13, 2023

View reviewed changes

server/watchdog/watchdog.go Outdated Show resolved Hide resolved

server/watchdog/watchdog.go Show resolved Hide resolved

ahrtr force-pushed the watchdog_20230309 branch 3 times, most recently from 865764b to 9b6a4a3 Compare March 17, 2023 05:34

etcdserver: add watchdog to monitor inactive activities

93f6dcc

Use watchdog to monitor all storage read/write operations Signed-off-by: Benjamin Wang <wachao@vmware.com>

ahrtr force-pushed the watchdog_20230309 branch from 9b6a4a3 to 93f6dcc Compare March 17, 2023 06:45

ahrtr mentioned this pull request Mar 22, 2023

Propose project roadmap #15499

Closed

ahrtr mentioned this pull request Apr 22, 2023

Script for defragmentation #15477

Closed

ahrtr mentioned this pull request Jul 24, 2023

Livez/Readyz #16007

Open

chaochn47 mentioned this pull request Jul 28, 2023

disk write failed and network partitioned leader was not able to step down to follower #13527

Open

stale bot added the stale label Aug 12, 2023

ahrtr added stage/tracked and removed stale labels Aug 12, 2023

serathius closed this Mar 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcdserver: add watchdog to detect stalled writes #15440

etcdserver: add watchdog to detect stalled writes #15440

ahrtr commented Mar 10, 2023

ahrtr commented Mar 10, 2023

lavacat Mar 13, 2023

serathius commented Mar 13, 2023

ptabor Mar 13, 2023

ahrtr Mar 16, 2023

ahrtr Mar 17, 2023

ptabor Mar 13, 2023

ptabor Mar 13, 2023

ahrtr commented Mar 14, 2023

ahrtr commented Mar 19, 2023

ahrtr commented Apr 22, 2023

stale bot commented Aug 12, 2023

k8s-ci-robot commented Jan 16, 2024

ruiming-lu commented Mar 9, 2024 •

edited

Loading

serathius commented Mar 9, 2024

etcdserver: add watchdog to detect stalled writes #15440

etcdserver: add watchdog to detect stalled writes #15440

Conversation

ahrtr commented Mar 10, 2023

ahrtr commented Mar 10, 2023

lavacat Mar 13, 2023

Choose a reason for hiding this comment

serathius commented Mar 13, 2023

ptabor Mar 13, 2023

Choose a reason for hiding this comment

ahrtr Mar 16, 2023

Choose a reason for hiding this comment

ahrtr Mar 17, 2023

Choose a reason for hiding this comment

ptabor Mar 13, 2023

Choose a reason for hiding this comment

ptabor Mar 13, 2023

Choose a reason for hiding this comment

ahrtr commented Mar 14, 2023

ahrtr commented Mar 19, 2023

ahrtr commented Apr 22, 2023

stale bot commented Aug 12, 2023

k8s-ci-robot commented Jan 16, 2024

ruiming-lu commented Mar 9, 2024 • edited Loading

serathius commented Mar 9, 2024

ruiming-lu commented Mar 9, 2024 •

edited

Loading