Reproduce issue 13766 in linearizability tests #14682
Conversation
Force-pushed from ce4dfe9 to 0f06e47
Codecov Report
@@ Coverage Diff @@
## main #14682 +/- ##
==========================================
- Coverage 74.77% 74.61% -0.17%
==========================================
Files 415 415
Lines 34335 34328 -7
==========================================
- Hits 25674 25613 -61
- Misses 7025 7081 +56
+ Partials 1636 1634 -2
Force-pushed from 4ad9cb1 to 9bd84c9
ClusterSize: 3,
InitialCorruptCheck: true,
},
skipValidation: true,
If you skip the validation, then it isn't a linearizability test anymore, so it doesn't make sense to include the test in the linearizability test suite.
Also, shouldn't we check HashKV after the test? The existing functional test already covers this case. The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.
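For illustration, a minimal sketch of the suggested post-test HashKV comparison using clientv3 (the endpoints, probe key, and timeouts are assumptions, not this PR's code):

```go
// Sketch: compare HashKV across all members at a pinned revision; any
// mismatch indicates data inconsistency between members.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func checkHashKV(endpoints []string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Pin a revision so every member hashes the same keyspace state.
	get, err := cli.Get(ctx, "probe")
	if err != nil {
		return err
	}
	rev := get.Header.Revision

	var first uint32
	for i, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			return err
		}
		if i == 0 {
			first = resp.Hash
		} else if resp.Hash != first {
			return fmt.Errorf("member %s: hash %d != %d", ep, resp.Hash, first)
		}
	}
	return nil
}

func main() {
	// Hypothetical 3-member local cluster endpoints.
	if err := checkHashKV([]string{"localhost:2379", "localhost:22379", "localhost:32379"}); err != nil {
		fmt.Println("inconsistency:", err)
	}
}
```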
> The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.

As long as the functional tests are flaky, they are useless. I see using the initial hash check in linearizability tests as a shortcut that I hope to remove in the future; however, I don't see the same possibility in the functional tests.
We are already not benefiting from running the functional tests, so once we reproduce all outstanding inconsistency issues and add network blackholing to the linearizability tests, I will propose removing them.
What are the machine requirements to reliably run this test and reproduce issue #13766? Just in case someone wants to try it out locally.
Force-pushed from 3d2693a to 7e5ab32
Working on making the test not depend on the data inconsistency check.
The problem is that linearization of a history is an NP-hard problem. When we increase the number of clients and the QPS, the compute and space required explode. I was not able to get results from linearization, as I usually stop it after 5 minutes, by which point it has allocated over 100 GB of RAM.
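For context, etcd's linearizability tests use the porcupine checker. Below is a rough, self-contained sketch of the call shape (the single-register model and two-operation history are illustrative assumptions, not etcd's actual model) showing why a timeout is the practical escape hatch when the search space explodes:

```go
// Sketch: a trivial string-register model checked with porcupine. The
// timeout bounds the search so a pathological history reports Unknown
// instead of consuming all available RAM.
package main

import (
	"fmt"
	"time"

	"github.com/anishathalye/porcupine"
)

// req is a hypothetical request type: a write of value, or a read.
type req struct {
	write bool
	value string
}

func main() {
	model := porcupine.Model{
		Init: func() interface{} { return "" },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			in := input.(req)
			if in.write {
				return true, in.value // writes always succeed; state becomes the value
			}
			return output.(string) == state.(string), state // reads must observe state
		},
	}

	// Two overlapping operations: client 0 writes "a", client 1 reads "a".
	history := []porcupine.Operation{
		{ClientId: 0, Input: req{write: true, value: "a"}, Call: 0, Return: 10},
		{ClientId: 1, Input: req{write: false}, Output: "a", Call: 5, Return: 15},
	}

	// With many clients and crash-interrupted requests, the number of
	// candidate linearizations grows exponentially, hence the timeout.
	res := porcupine.CheckOperationsTimeout(model, history, 5*time.Minute)
	fmt.Println(res) // Ok, Illegal, or Unknown
}
```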
Force-pushed from 7e5ab32 to 8bcd4f2
Force-pushed from 3f22b0a to 45afb3b
Marking as draft as I'm still working on making the tests execute within a reasonable time.
Got first success! cc @ahrtr @chaochn47 I managed to get past the issue of too many clients by looking up persisted requests via watch. Fixing this requires a lot of changes, so I will do it one by one, starting from #15044.

Reproducing #13766 requires a pretty high QPS load (>1000 qps). To achieve that we need to increase the number of clients from 8 to 20. This caused a problem for linearizability verification, as its complexity is exponential in the number of clients. When we crash etcd, the ongoing request from each client is interrupted, meaning that for 8 clients we lose 8 requests and for 20 clients we lose 20 requests. The client doesn't know whether a lost request was persisted or not. If a request was lost on invocation (before it reached etcd), it will not be persisted. However, if a request was lost on return (after etcd processed it), it will be persisted without notifying the client. Consecutive lost requests cause a complexity explosion, because the current etcd model will try to determine both whether they were persisted and the exact order of their execution, so for n lost requests there is an exponential number of possible histories to check.

The solution was to minimize the number of lost requests by checking whether they were really persisted. There is an easy source of this information in etcd: watch. By collecting all the events in a separately running watch, I can verify whether a particular request was persisted (all Put requests have a unique id) and also get the revision at which it was persisted. This eliminates both problems: request ordering and checking persistence. A sketch of the idea follows.

I have draft code, I just need to split it into PRs for review. First two: #15045 #15044
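A minimal sketch of that watch-based lookup (the key prefix, id-in-value encoding, and endpoint are assumptions, not the PR's actual implementation):

```go
// Sketch: watch the test prefix from revision 1 and index every Put event by
// the unique id stored in the value. Absence from the map means the request
// never persisted; presence also yields the revision it was committed at.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func collectPersisted(ctx context.Context, cli *clientv3.Client) map[string]int64 {
	persisted := map[string]int64{}
	for resp := range cli.Watch(ctx, "key", clientv3.WithPrefix(), clientv3.WithRev(1)) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypePut {
				// Each Put value carries a unique id, so one lookup answers
				// both "was it persisted?" and "at which revision?".
				persisted[string(ev.Kv.Value)] = ev.Kv.ModRevision
			}
		}
	}
	return persisted
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Cancelling the context closes the watch channel, so collectPersisted
	// returns once traffic generation is done.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	persisted := collectPersisted(ctx, cli)
	fmt.Printf("observed %d persisted puts\n", len(persisted))
}
```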
Makefile
Outdated
@@ -129,6 +129,60 @@ gofail-disable: install-gofail
install-gofail:
	cd tools/mod; go install go.etcd.io/gofail@${GOFAIL_VERSION}

# Reproduce historical issues
How about moving this to a separate Makefile within the './tests/repros' directory?
I think it's too detailed for a top-level Makefile (and it has big potential to grow over time).
Oops, didn't want this included in the PR.
Force-pushed from a52b29f to 287e1b6
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Decided to use the original approach to reproducing the issue. I was not able to reliably reproduce it via SIGKILL at lower QPS or with fewer clients. That approach has the downside that validating linearizability for so many clients is too costly: memory usage for porcupine just exploded and would definitely not fit in a GitHub Actions worker.
Decided to use the initial data inconsistency validation for reproduction.