Jepsen transient failures under network partition conditions #7549

Closed
pilvitaneli opened this issue Sep 3, 2014 · 5 comments

Comments

@pilvitaneli

Hi! The Jepsen tests include five nemeses (test scenarios) that introduce different types of network partitions (see here). The tests add documents to an index before, during, and after these partitions, and verify that the documents that were acknowledged during the partitions are retrievable afterwards. Sometimes the tests indicate that a number of documents were indexed but are not retrievable; however, this does not happen on every run of the same scenario. For example, running each scenario 20 times (against 598854d) reported the following :lost-frac values:

isolate-self-primaries-nemesis: 244/361, 2/733, 1/607, 1/603, 1/213, 65/216 (and 14 times 0)
nemesis/partition-random-halves: 1/355, 1/226, 4/733, 1/433 (and 16 times 0)
nemesis/partition-halves: 1/65, 1/438, 4/715, 2/457, 6/731, 1/435, 9/433 (and 13 times 0)
nemesis/partitioner nemesis/bridge: 2/415, 3/253, 2/383, 7/754, 1/786, 1/767 (and 14 times 0)
nemesis/partition-random-node does not report any lost documents.

In total, 23 out of 100 runs failed.
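For readers unfamiliar with the check, here is a minimal Python sketch of how a lost fraction like the ones above could be computed. It is not Jepsen's actual checker (which is written in Clojure). The function name `lost_fraction`, the sample data, and the choice of attempted writes as the denominator are all assumptions made for illustration.

```python
def lost_fraction(attempted, acknowledged, retrieved):
    """Return (lost_ids, lost_count, attempt_count) for one test run.

    A document counts as lost when its write was acknowledged
    but it is absent from the final read.
    Assumption: the denominator is the number of attempted writes.
    """
    lost = set(acknowledged) - set(retrieved)
    return sorted(lost), len(lost), len(set(attempted))


# Hypothetical run: 10 writes attempted, 8 acknowledged,
# and document 6 missing from the final read.
attempted = range(10)
acknowledged = [0, 1, 2, 3, 4, 5, 6, 7]
retrieved = [0, 1, 2, 3, 4, 5, 7, 9]  # 9 was never acknowledged but still landed

lost_ids, lost, total = lost_fraction(attempted, acknowledged, retrieved)
print(f"lost-frac {lost}/{total}, lost ids: {lost_ids}")
```

Unacknowledged writes that still show up (document 9 in the sample) are not counted as lost, since the client was never told they succeeded.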

@dakrone
Member

dakrone commented Sep 3, 2014

Hi @pilvitaneli, thanks for the testing results!

We're actively investigating the Jepsen tests on top of our own tests, which resulted in #7572. The Jepsen tests helped verify that we fixed the split brain issue (it no longer happens). In all of our runs, though, we couldn't reproduce a result similar to your first run (the isolate-self-primaries-nemesis run where you lost 244/361). We're still trying, but I might circle back with you to figure out how you ended up with those results. We do manage to reproduce the smaller-scale data loss that we believe relates to #7572, but this is also still under investigation.

I'll let you know how our continued testing with Jepsen goes, thanks again for your results!

@pilvitaneli
Author

Running just isolate-self-primaries-nemesis 50 times in succession results in 22 failures:
1/403
404/653
1/583
6/667
287/395
4/583
16/655
3/1037
8/807
1/565
1/555
5/638
1/626
3/784
3/653
2/621
3/632
1/254
1/610
3/307
11/668
1/446
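To separate the rare large losses from the more common small ones, the reported fractions can be summarized with a short Python sketch. The 10% cutoff used to call a loss "large" is an arbitrary assumption chosen for illustration, as are the variable names.

```python
# The 22 failing runs reported above, as lost/total pairs.
REPORTED = """
1/403 404/653 1/583 6/667 287/395 4/583 16/655 3/1037 8/807 1/565 1/555
5/638 1/626 3/784 3/653 2/621 3/632 1/254 1/610 3/307 11/668 1/446
""".split()

TOTAL_RUNS = 50          # runs of isolate-self-primaries-nemesis
LARGE_THRESHOLD = 0.10   # assumption: arbitrary cutoff for a "large" loss

runs = [tuple(int(part) for part in item.split("/")) for item in REPORTED]

# Runs where more than the cutoff share of documents went missing.
large = [(lost, total) for lost, total in runs if lost / total > LARGE_THRESHOLD]
# Runs where exactly one document went missing.
single = sum(1 for lost, _ in runs if lost == 1)

print(f"failing runs: {len(runs)}/{TOTAL_RUNS}")
print(f"large losses (> {LARGE_THRESHOLD:.0%}): {large}")
print(f"runs losing exactly one document: {single}")
```

With that cutoff, only the 404/653 and 287/395 runs count as large; the remaining failures each lose a small number of documents.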

@dakrone
Member

dakrone commented Oct 21, 2014

@pilvitaneli circling back to this after a while: do you happen to have the commit SHA of Jepsen that you are using for running your tests? I'd like to make sure we run the same tests.

@pilvitaneli
Author

I haven't run the tests in a while, but the last run was with jepsen-io/jepsen@761693b. It does not appear that there have been considerable changes since then, but I could try to re-run with current master.

@dakrone dakrone removed their assignment Feb 21, 2016
@dakrone
Member

dakrone commented Sep 27, 2016

Going to close this, as it's been almost 2 years and we have a different issue where we are tracking things for the 5.0 release: #20031

@dakrone dakrone closed this as completed Sep 27, 2016