
Improve support for failed requests in linearizability tests #14880

Merged
serathius merged 10 commits into etcd-io:main from linearizability-failed
Dec 6, 2022

Conversation

serathius
Member

@serathius serathius commented Dec 1, 2022

Based on recommendations in porcupine issues, failed requests should be marked as if they returned at the end of the test. This lets us stop aggregating errors within the model, making it:

  • Much simpler to implement. Handling all the edge cases turned out to be very hard as we added new methods like delete.
  • Much more accurate. We can now validate that the revision is continuous, as a failed request can no longer just pop up.

The only downside is that the history visualization is much less readable.
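The approach described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's code: the `Operation` struct mirrors porcupine's fields, and `patchFailedOperations` is a hypothetical helper name.

```go
package main

import "fmt"

// Operation mirrors the fields of porcupine.Operation.
type Operation struct {
	ClientId int
	Input    string
	Call     int64 // invocation time
	Output   string
	Return   int64 // response time
}

// patchFailedOperations (hypothetical name) appends requests that never
// received a response, treating them as if they returned just after the
// last observed response. The model can then treat every operation as
// completed instead of aggregating error cases in its state.
func patchFailedOperations(successful, failed []Operation) []Operation {
	maxTime := int64(0)
	for _, op := range successful {
		if op.Return > maxTime {
			maxTime = op.Return
		}
	}
	operations := append([]Operation{}, successful...)
	for _, op := range failed {
		if op.Call > maxTime {
			// A call issued after the last observed response cannot
			// have taken effect within the checked history.
			continue
		}
		op.Return = maxTime + 1
		operations = append(operations, op)
	}
	return operations
}

func main() {
	ok := []Operation{{ClientId: 0, Input: "put a", Call: 1, Output: "ok", Return: 5}}
	failed := []Operation{{ClientId: 1, Input: "put b", Call: 2}}
	patched := patchFailedOperations(ok, failed)
	fmt.Println(len(patched), patched[1].Return) // 2 6
}
```

The trade-off mentioned above follows from this: the patched operations stay "pending" for the whole tail of the history, which is what makes the visualization harder to read.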

@serathius serathius marked this pull request as draft December 1, 2022 22:00
@codecov-commenter

codecov-commenter commented Dec 1, 2022

Codecov Report

Merging #14880 (5ff9202) into main (42bb543) will decrease coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14880      +/-   ##
==========================================
- Coverage   74.53%   74.46%   -0.07%     
==========================================
  Files         415      415              
  Lines       34335    34335              
==========================================
- Hits        25590    25569      -21     
- Misses       7096     7120      +24     
+ Partials     1649     1646       -3     
Flag Coverage Δ
all 74.46% <ø> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/etcdserver/api/v3rpc/util.go 67.74% <0.00%> (-9.68%) ⬇️
server/storage/mvcc/watchable_store.go 84.42% <0.00%> (-4.72%) ⬇️
client/pkg/v3/testutil/recorder.go 76.27% <0.00%> (-3.39%) ⬇️
pkg/traceutil/trace.go 96.15% <0.00%> (-1.93%) ⬇️
server/proxy/grpcproxy/watch.go 92.48% <0.00%> (-1.16%) ⬇️
server/auth/store.go 84.23% <0.00%> (-1.15%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 76.56% <0.00%> (-1.05%) ⬇️
pkg/proxy/server.go 60.67% <0.00%> (-0.68%) ⬇️
client/v3/watch.go 93.43% <0.00%> (-0.39%) ⬇️
server/etcdserver/api/v3rpc/watch.go 85.07% <0.00%> (-0.32%) ⬇️
... and 6 more


@serathius serathius force-pushed the linearizability-failed branch 4 times, most recently from e15357e to 0d2778b Compare December 1, 2022 23:42
@serathius serathius marked this pull request as ready for review December 1, 2022 23:51
@serathius
Member Author

cc @ahrtr @geetasg @ptabor

@ahrtr ahrtr self-requested a review December 2, 2022 08:51
@serathius
Member Author

Did some testing and found out that the new model is faster on linearizable histories, but can be much, much slower on non-linearizable ones. I expect this is due to the fact that we have multiple failure injections in one history. This results in a lot of timed-out requests, which leads to exponential cost. Before merging this PR we would need to rewrite the tests to trigger only one failure per history.

@tjungblu
Contributor

tjungblu commented Dec 2, 2022

When you want to implement multiple keys at some point, don't forget to implement the partition function:
https://github.com/anishathalye/porcupine/blob/master/porcupine_test.go#L1142-L1158

porcupine can parallelize over each partition, making it even quicker.
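For reference, a partition function along those lines could look roughly like the sketch below. It is self-contained: porcupine's real `Partition` field takes and returns slices of `porcupine.Operation`, which is stubbed here with a local struct, and extracting the key depends on the actual request type.

```go
package main

import "fmt"

// Operation is a stand-in for porcupine.Operation; in real code the key
// would be extracted from the request stored in the Input field.
type Operation struct {
	Key string
}

// partitionByKey splits the history into independent sub-histories, one
// per key. The checker can then verify (and parallelize over) each
// partition separately, which is much cheaper than checking the
// combined history.
func partitionByKey(history []Operation) [][]Operation {
	byKey := map[string][]Operation{}
	for _, op := range history {
		byKey[op.Key] = append(byKey[op.Key], op)
	}
	partitions := make([][]Operation, 0, len(byKey))
	for _, ops := range byKey {
		partitions = append(partitions, ops)
	}
	return partitions
}

func main() {
	h := []Operation{{"a"}, {"b"}, {"a"}}
	fmt.Println(len(partitionByKey(h))) // 2
}
```

Partitioning is sound only when operations on different keys are independent, which is why it helps with per-key state but not with the time-based blowup discussed below.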

@serathius
Member Author

serathius commented Dec 2, 2022

when you want to implement multiple keys at some point, don't forget to implement the partition function:

This is a good suggestion, but it will not help us in this case (partitioning works over state, not over time as here). Based on different usage patterns, I was considering adding a separate key when testing leases. cc @geetasg

@serathius
Member Author

Ok, fixed the issue with linearizability verification going into exponential time. The solution was to shorten test runs to only one failpoint injection. We still do 60 iterations; however, each of them recreates the cluster.

This is a little costly for a 3-node cluster, but it was the quickest way to shorten the request history and guarantee correctness. We might consider reusing the cluster in the future.

@serathius
Member Author

Re-running the linearizability tests to check for flakes.

@serathius serathius force-pushed the linearizability-failed branch 2 times, most recently from 64e37a1 to 85f72e9 Compare December 2, 2022 19:55
@serathius serathius marked this pull request as draft December 3, 2022 10:02
@serathius
Member Author

Recreating the cluster on every run makes the tests unstable. It is required for this change to work, so I decided to split it out into #14885 to be able to iterate on it more quickly.

@serathius serathius force-pushed the linearizability-failed branch 3 times, most recently from 150ef6c to 4409a95 Compare December 5, 2022 11:19
@serathius serathius marked this pull request as ready for review December 5, 2022 11:21
@tjungblu
Contributor

tjungblu commented Dec 5, 2022

/lgtm (non-binding) overall, just a couple of nits around comments and naming.

@serathius serathius force-pushed the linearizability-failed branch 2 times, most recently from 9d03a46 to 48002c7 Compare December 5, 2022 14:52
@serathius
Member Author

Rewrote the PR based on comments from @tjungblu. Thanks for the review!
Please take another look @ahrtr @tjungblu

tests/linearizability/model.go (resolved review threads)
state.LastRevision = response.revision
delete(state.FailedWrites, response.getData)
return true, state
if state.FailedWrite != nil && state.LastRevision < response.Revision {
Member

Suggested change
if state.FailedWrite != nil && state.LastRevision < response.Revision {
if state.FailedWrite != nil && state.LastRevision <= response.Revision {

Member Author

This is not true, as this code is responsible for restoring a failed write. state.LastRevision < response.Revision is correct, as writes always increase the revision.

Member

@ahrtr ahrtr Dec 6, 2022

writes always increase revision.

With the addition of the delete operation, that no longer holds. Deleting a non-existent key will not increase the revision.

Member Author

Yes, but that should be changed by the PR that introduced delete.
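The revision semantics being debated in this thread can be illustrated with a toy key-value store (a sketch for illustration only, not etcd's mvcc code): a put always increases the revision, while deleting a non-existent key leaves it unchanged, which is why strict `<` holds for writes but needs revisiting once delete is modeled.

```go
package main

import "fmt"

// store is a toy single-revision key-value store, not etcd's mvcc.
type store struct {
	data     map[string]string
	revision int64
}

func (s *store) put(k, v string) int64 {
	s.data[k] = v
	s.revision++ // every successful put increases the revision
	return s.revision
}

func (s *store) delete(k string) int64 {
	if _, ok := s.data[k]; ok {
		delete(s.data, k)
		s.revision++ // delete only bumps the revision if the key existed
	}
	return s.revision
}

func main() {
	s := &store{data: map[string]string{}, revision: 1}
	fmt.Println(s.put("a", "1")) // 2
	fmt.Println(s.delete("b"))   // 2: deleting a missing key leaves revision unchanged
	fmt.Println(s.delete("a"))   // 3
}
```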

Comment on lines +111 to +146
for _, op := range h.failed {
if op.Call > maxTime {
continue
}
op.Return = maxTime + 1
operations = append(operations, op)
}
Member

If I understand correctly, you followed the suggestion in anishathalye/porcupine#10? If yes, please add a comment here; otherwise it's hard for other contributors to understand.

Member Author

Added the comment

Member

@ahrtr ahrtr left a comment

Overall looks good to me.

I expect to merge #14802 first, then update and merge this PR afterwards.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius
Member Author

Rebased on delete changes.

@serathius serathius merged commit a4c6d1b into etcd-io:main Dec 6, 2022
@serathius serathius deleted the linearizability-failed branch June 15, 2023 20:39