Reproduce issue 13766 in linearizability tests #14682
Conversation
Force-pushed from ce4dfe9 to 0f06e47
Codecov Report
@@ Coverage Diff @@
## main #14682 +/- ##
==========================================
- Coverage 74.77% 74.61% -0.17%
==========================================
Files 415 415
Lines 34335 34328 -7
==========================================
- Hits 25674 25613 -61
- Misses 7025 7081 +56
+ Partials 1636 1634 -2
Force-pushed from 4ad9cb1 to 9bd84c9
ClusterSize: 3,
InitialCorruptCheck: true,
},
skipValidation: true,
If you skip the validation, then it isn't a linearizability test anymore, so it doesn't make sense to include the test in the linearizability test suite.
Also, shouldn't we check HashKV after the test? The existing functional test already covers this case. The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.
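For illustration, a minimal sketch of the suggested post-test HashKV comparison using clientv3 (the endpoints, probe key, and timeouts are assumptions, not this PR's code):

```go
// Sketch: compare HashKV across all members at a pinned revision; any
// mismatch indicates data inconsistency between members.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func checkHashKV(endpoints []string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Pin a revision so every member hashes the same keyspace state.
	get, err := cli.Get(ctx, "probe")
	if err != nil {
		return err
	}
	rev := get.Header.Revision

	var first uint32
	for i, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			return err
		}
		if i == 0 {
			first = resp.Hash
		} else if resp.Hash != first {
			return fmt.Errorf("member %s: hash %d != %d", ep, resp.Hash, first)
		}
	}
	return nil
}

func main() {
	// Hypothetical 3-member local cluster endpoints.
	if err := checkHashKV([]string{"localhost:2379", "localhost:22379", "localhost:32379"}); err != nil {
		fmt.Println("inconsistency:", err)
	}
}
```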
> The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.

As long as the functional tests are flaky, they are useless. I see using the initial hash check in linearizability tests as a shortcut that I hope to remove in the future; however, I don't see the same possibility in the functional tests.
We are already not benefiting from running the functional tests, so once we reproduce all outstanding inconsistency issues and add network blackholing to the linearizability tests, I will propose removing them.
What are the machine requirements to reliably run this test and reproduce issue #13766? Just in case someone wants to try it out locally.
Force-pushed from 3d2693a to 7e5ab32
Working on making the test not depend on the data inconsistency check.
The problem is that linearization of a history is an NP-hard problem. When we increase the number of clients and the QPS, the compute and space required explode. I was not able to get results from linearization, as I usually stop it after 5 minutes, by which point it has allocated over 100 GB of RAM.
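For context, etcd's linearizability tests use the porcupine checker. Below is a rough, self-contained sketch of the call shape (the single-register model and two-operation history are illustrative assumptions, not etcd's actual model) showing why a timeout is the practical escape hatch when the search space explodes:

```go
// Sketch: a trivial string-register model checked with porcupine. The
// timeout bounds the search so a pathological history reports Unknown
// instead of consuming all available RAM.
package main

import (
	"fmt"
	"time"

	"github.com/anishathalye/porcupine"
)

// req is a hypothetical request type: a write of value, or a read.
type req struct {
	write bool
	value string
}

func main() {
	model := porcupine.Model{
		Init: func() interface{} { return "" },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			in := input.(req)
			if in.write {
				return true, in.value // writes always succeed; state becomes the value
			}
			return output.(string) == state.(string), state // reads must observe state
		},
	}

	// Two overlapping operations: client 0 writes "a", client 1 reads "a".
	history := []porcupine.Operation{
		{ClientId: 0, Input: req{write: true, value: "a"}, Call: 0, Return: 10},
		{ClientId: 1, Input: req{write: false}, Output: "a", Call: 5, Return: 15},
	}

	// With many clients and crash-interrupted requests, the number of
	// candidate linearizations grows exponentially, hence the timeout.
	res := porcupine.CheckOperationsTimeout(model, history, 5*time.Minute)
	fmt.Println(res) // Ok, Illegal, or Unknown
}
```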
Force-pushed from 7e5ab32 to 8bcd4f2
Force-pushed from 3f22b0a to 45afb3b
Marking as draft as I'm still working on making the tests execute within a reasonable time.
Got first success! cc @ahrtr @chaochn47 I managed to get past the issue of too many clients by looking up persisted requests via watch. Fixing this requires a lot of changes, so I will do it one by one, starting from #15044.

Reproducing #13766 requires a pretty high QPS load (>1000 qps). To achieve that we need to increase the number of clients from 8 to 20. This caused a problem for linearizability verification, as its complexity is exponential in the number of clients. When we crash etcd, the ongoing request from each client is interrupted, meaning that for 8 clients we lose 8 requests and for 20 clients we lose 20 requests. The client doesn't know whether a lost request was persisted or not. If a request was lost on invocation (before it reached etcd), it will not be persisted. However, if a request was lost on return (after etcd processed it), it will be persisted without notifying the client. Consecutive lost requests cause a complexity explosion, because the current etcd model will try to determine both whether they were persisted and the exact order of their execution, so for n lost requests there is an exponential number of possible histories to check.

The solution was to minimize the number of lost requests by checking whether they were really persisted. There is an easy source of this information in etcd: watch. By collecting all the events in a separately running watch, I can verify whether a particular request was persisted (all Put requests have a unique id) and also get the revision at which it was persisted. This eliminates both problems: request ordering and checking persistence. A sketch of the idea follows.

I have draft code, I just need to split it into PRs for review. First two: #15045 #15044
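A minimal sketch of that watch-based lookup (the key prefix, id-in-value encoding, and endpoint are assumptions, not the PR's actual implementation):

```go
// Sketch: watch the test prefix from revision 1 and index every Put event by
// the unique id stored in the value. Absence from the map means the request
// never persisted; presence also yields the revision it was committed at.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func collectPersisted(ctx context.Context, cli *clientv3.Client) map[string]int64 {
	persisted := map[string]int64{}
	for resp := range cli.Watch(ctx, "key", clientv3.WithPrefix(), clientv3.WithRev(1)) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypePut {
				// Each Put value carries a unique id, so one lookup answers
				// both "was it persisted?" and "at which revision?".
				persisted[string(ev.Kv.Value)] = ev.Kv.ModRevision
			}
		}
	}
	return persisted
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Cancelling the context closes the watch channel, so collectPersisted
	// returns once traffic generation is done.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	persisted := collectPersisted(ctx, cli)
	fmt.Printf("observed %d persisted puts\n", len(persisted))
}
```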
Makefile
Outdated
@@ -129,6 +129,60 @@ gofail-disable: install-gofail
install-gofail:
	cd tools/mod; go install go.etcd.io/gofail@${GOFAIL_VERSION}

# Reproduce historical issues
How about moving this to a separate Makefile within the './tests/repros' directory?
I think it's too detailed for a top-level Makefile (and it has big potential to grow over time).
Oops, didn't want this included in the PR.
Force-pushed from a52b29f to 287e1b6
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Decided to use the original approach to reproducing the issue. I was not able to reliably reproduce it via SIGKILL at lower QPS or with fewer clients. That approach has the downside that validating linearizability for so many clients is too costly: memory usage for porcupine just exploded and would definitely not fit in a GitHub Actions worker.
Decided to use the initial data inconsistency validation for reproduction.