
tests/linearizability: added sleep fail point injection #14796

Closed
wants to merge 1 commit

Conversation

ramil600
Contributor

tests/linearizability: added sleep failpoint injection functionality

Injected sleep fail point inside *raftNode.start() to test functionality

Signed-off-by: Ramil Mirhasanov ramil600@yahoo.com

@ramil600
Contributor Author

fix: #14729

@@ -157,7 +157,7 @@ func (r *raftNode) tick() {
// to modify the fields after it has been started.
func (r *raftNode) start(rh *raftReadyHandler) {
internalTimeout := time.Second

// gofail: var raftBeforeStart struct{}
Member

This is not needed.

Member

I mean that this is not a useful gofail point so let's not add it. If you want to test sleep failpoint you can just use other ones like backend/defragBeforeCopy.

Path: failpoint,
}
// check whether sleep was enabled
r, err := http.NewRequest("GET", failpointsUrl.String(), nil)
Member

Don't think this is needed. If setupGoFailpoint succeeded then we can assume that sleep was enabled.

@serathius
Member

I think there was a misunderstanding about #14729.

Main change that needs to be implemented is detecting that failpoint was executed

Based on the HTTP response success we can assume that the failpoint was enabled. What we don't know is whether it really was executed.
For example, the defragBeforeCopy failpoint will not trigger automatically, as that code path is not normally used. After its setup we need to send a Defrag request to trigger it.

After we set up a sleep failpoint we need a way to verify that it was really triggered, to make sure that tests do what they claim to do. I incorrectly assumed that a failpoint is cleared after it was executed. This is true for triggering panic (as etcd crashes and forgets what failpoints were set up), however sleep will stay forever. Having failpoints stay is also not good, as we run multiple failure injection scenarios on one cluster and we don't want one injection to influence another.

What I propose is for sleep failpoint to work like:

  • Setup the failpoint
  • Wait for failpoint to be executed (could be multiple times if the event is happening frequently)
  • Clear the failpoint

Question is how to know that failpoint was executed. It could be either done by:

  • Having failpoint write specific logs and tests read them
  • tracking this number in gofail and making it available in API

I prefer the second approach. What do you think?
cc @ahrtr

@ahrtr
Member

ahrtr commented Nov 18, 2022

This is true for triggering panic (as etcd crashes and forgets what failpoints were set up), however sleep will stay forever. Having failpoints stay is also not good, as we run multiple failure injection scenarios on one cluster and we don't want one injection to influence another.

The linearizability test is based on the e2e test framework; each case creates a new cluster and starts new processes, so there is no need to disable the failpoints in this case.

Question is how to know that failpoint was executed.

The test case needs to explicitly trigger the related code path.

@ahrtr
Member

ahrtr commented Nov 18, 2022

The random failpoint injection will be triggered multiple times in one test case, and each time it might select a different failpoint, so we need to disable each failpoint after it's triggered.

@ramil600
Contributor Author

> What I propose is for sleep failpoint to work like: setup the failpoint; wait for failpoint to be executed; clear the failpoint. [...] I prefer the second approach. What do you think? cc @ahrtr

One suggestion I have is resetting the sleep duration value to 0 and deleting the description in the terms field of the failpoint once sleep is executed.
Then when you check it through the http endpoint, it still exists (HTTP status 200), but it won't execute sleep anymore unless randomly activated again. Please let me know if this is ok with you and I will send the PR to modify the gofail library.

@serathius
Member

serathius commented Nov 21, 2022

One suggestion I have is resetting sleep duration value to 0 and deleting description in the terms field of the failpoint, once sleep is executed.

Not sure I understand why this is needed. We don't need to ensure an exact number of sleep triggers. A sleep failpoint for linearizability tests should ensure that the order of goroutine execution doesn't influence the result, so running it multiple times doesn't matter as long as we can recover the state before we run other tests.

The length of the sleep also matters: for rarely run code (compact loop) we might set a long sleep (100ms) and wait for one execution. On the other hand, for short loops (like raft/apply code) we cannot set a long sleep as it could trigger probe failures. It might be better to set a short sleep (1-10ms) and wait until we have 10-100 runs. This is why I would prefer to implement a run counter for failpoints.

Proposed flow:

  • Create failpoint by sending an HTTP PUT request
  • Wait for failpoint to be executed enough times by checking counter via HTTP GET request
  • Clear failpoint for next test by sending HTTP DELETE request

This requires implementing a counter for each failpoint. Please let me know if you need help with that.

@ramil600
Contributor Author

ramil600 commented Nov 21, 2022

> Proposed flow: create failpoint by sending HTTP PUT request; wait for failpoint to be executed enough times by checking counter via HTTP GET request; clear failpoint for next test by sending HTTP DELETE request. This requires implementing a counter for each failpoint.

Added a PR in gofail to implement the counter: runtime: added counter to failpoint.terms field
https://github.com/etcd-io/gofail/pull/37 (link is broken for some reason)
Please review, so that we can test it locally.

@serathius
Member

Please review, so that we can test it locally.

Thinking that maybe we should add tests to gofail to avoid merging broken PRs, as happened with gofail-go.

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@stale stale bot closed this May 21, 2023