
tests/linearizability: added sleep fail point injection #14796

Closed
wants to merge 1 commit

Conversation

ramil600
Contributor

tests/linearizability: added sleep failpoint injection functionality

Injected sleep fail point inside *raftNode.start() to test functionality

Signed-off-by: Ramil Mirhasanov ramil600@yahoo.com

@ramil600
Contributor Author

fix: #14729

@@ -157,7 +157,7 @@ func (r *raftNode) tick() {
// to modify the fields after it has been started.
func (r *raftNode) start(rh *raftReadyHandler) {
internalTimeout := time.Second

// gofail: var raftBeforeStart struct{}
Member

This is not needed.

Member

I mean that this is not a useful gofail point so let's not add it. If you want to test sleep failpoint you can just use other ones like backend/defragBeforeCopy.

Path: failpoint,
}
// check whether sleep was enabled
r, err := http.NewRequest("GET", failpointsUrl.String(), nil)
Member

Don't think this is needed. If setupGoFailpoint succeeded then we can assume that sleep was enabled.

@serathius
Member

I think there was a misunderstanding about #14729.

Main change that needs to be implemented is detecting that failpoint was executed

Based on the HTTP response success we can assume that the failpoint was enabled. What we don't know is whether it really was executed.
For example, the defragBeforeCopy failpoint will not trigger automatically, as that code path is not normally used. After its setup we need to send a Defrag request to trigger it.

After we set up a sleep failpoint we need a way to verify that it was really triggered, to make sure that tests do what they claim to do. I incorrectly assumed that a failpoint is cleared after it was executed. This is true for triggering panic (as etcd crashes and forgets what failpoints were set up), however sleep will stay forever. Having failpoints stay is also not good, as we run multiple failure injection scenarios on one cluster and we don't want one injection to influence another.

What I propose is for sleep failpoint to work like:

  • Setup the failpoint
  • Wait for failpoint to be executed (could be multiple times if the event is happening frequently)
  • Clear the failpoint

Question is how to know that failpoint was executed. It could be either done by:

  • Having failpoint write specific logs and tests read them
  • tracking this number in gofail and making it available in API

I prefer the second approach. What do you think?
cc @ahrtr

@ahrtr
Member

ahrtr commented Nov 18, 2022

This is true for triggering panic (as etcd crashes and forgets what failpoints were set up), however sleep will stay forever. Having failpoints stay is also not good, as we run multiple failure injection scenarios on one cluster and we don't want one injection to influence another.

The linearizability test is based on the e2e test framework; each case creates a new cluster and starts new processes, so there is no need to disable the failpoints in this case.

Question is how to know that failpoint was executed.

The test case needs to explicitly trigger the related code path.

@ahrtr
Member

ahrtr commented Nov 18, 2022

The random failpoint injection will be triggered multiple times in one test case, and each time it might select a different failpoint, so we need to disable each failpoint after it's triggered.

@ramil600
Contributor Author

> What I propose is for sleep failpoint to work like: setup the failpoint; wait for failpoint to be executed; clear the failpoint. [...] I prefer the second approach. What do you think? cc @ahrtr

One suggestion I have is resetting the sleep duration value to 0 and deleting the description in the terms field of the failpoint once sleep is executed.
Then when you check it through the http endpoint, it still exists (HTTP status 200), but it won't execute sleep anymore unless randomly activated again. Please let me know if this is ok with you and I will send the PR to modify the gofail library.

@serathius
Member

serathius commented Nov 21, 2022

One suggestion I have is resetting sleep duration value to 0 and deleting description in the terms field of the failpoint, once sleep is executed.

Not sure I understand why this is needed. We don't need to ensure an exact number of sleep triggers. A sleep failpoint for linearizability tests should ensure that the order of goroutine execution doesn't influence the result, so running it multiple times doesn't matter as long as we can recover the state before we run other tests.

The length of the sleep also matters: for rarely run code (compact loop) we might set a long sleep (100ms) and wait for one execution. On the other hand, for short loops (like raft/apply code) we cannot set a long sleep as it could trigger probe failures. It might be better to set a short sleep (1-10ms) and wait until we have 10-100 runs. This is why I would prefer to implement a run counter for failpoints.

Proposed flow:

  • Create failpoint by sending an HTTP PUT request
  • Wait for failpoint to be executed enough times by checking counter via HTTP GET request
  • Clear failpoint for next test by sending HTTP DELETE request

This requires implementing a counter for each failpoint. Please let me know if you need help with that.

@ramil600
Contributor Author

ramil600 commented Nov 21, 2022

> Proposed flow: create failpoint by sending HTTP PUT request; wait for failpoint to be executed enough times by checking counter via HTTP GET request; clear failpoint for next test by sending HTTP DELETE request. This requires implementing a counter for each failpoint.

Added a PR in gofail to implement the counter: runtime: added counter to failpoint.terms field
https://github.com/etcd-io/gofail/pull/37 (link is broken for some reason)
Please review, so that we can test it locally.

@serathius
Member

Please review, so that we can test it locally.

Thinking that maybe we should add tests to gofail to avoid merging broken PRs, as happened with gofail-go.

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@stale stale bot closed this May 21, 2023