Introduce etcd linearizability tests #14045

Closed
28 of 33 tasks
serathius opened this issue May 16, 2022 · 20 comments
Labels
area/testing help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@serathius
Member

serathius commented May 16, 2022

P0 action item proposed in https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md

This issue proposes implementing linearizability tests within the etcd codebase. Tracking work:

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 16, 2022
@xiaods

xiaods commented May 18, 2022

Suggestion: build a testing team to guide more contributors in running tests in their own environments.

@serathius
Member Author

@aphyr Based on the recent issues you posted, I expect that you are working on improving etcd coverage with Jepsen. Could you shed some light on your plans so we can improve etcd's official testing and qualification?

In https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md you can see that the current state of etcd release qualification needs a lot of work. I would like to collaborate on how we can ensure that there are no other remaining consistency issues in the v3.5 release.

@aphyr

aphyr commented Jun 14, 2022

I'm afraid I can't talk too much about the present work, but I can say that the Jepsen etcd test suite has been updated to the latest jetcd and etcd, and has found at least one safety issue I'd consider serious (#14092) and one spurious crash (#14110), as well as results that suggest etcd may not be performing its fsync/write dance properly (#14098). Given these findings, it might be worth running Jepsen on etcd releases regularly, or adapting some of its nemeses (e.g. kill, process, partition, disk corruption, losing un-fsynced writes) and checkers (e.g. https://github.com/jepsen-io/elle) to work with your current test suite. :-)

@lavacat

lavacat commented Jun 14, 2022

CC @endocrimes

#13939

@endocrimes
Contributor

👀 i'll update my jepsen fork and try to get those reproing to validate the setup.

@serathius
Member Author

Thanks @aphyr for your recommendation. Etcd has its own tests and framework (called functional tests) for failure injection testing (kill, partition etcd). However, as the latest release has shown, it doesn't do its job. I think it's underinvested and not maintained.

I don't think the etcd community should maintain its own framework, as it distracts from the main goal of the project. Instead of investing more in outdated, closed-off solutions, we should be able to use the current state of the art that was built to solve similar problems. I have some catching up to do in this area, so please feel free to give me suggestions about tools. I will start by reading more about Jepsen and https://github.com/jepsen-io/elle.

@endocrimes Thanks for the help. I think it would be great if we could automate running Jepsen tests. For example, we could run them periodically, once a week or once a day.

@serathius serathius self-assigned this Jul 12, 2022
@serathius
Member Author

serathius commented Aug 22, 2022

In its current state, etcd is not able to reliably prevent and reproduce data inconsistency issues, which have become a common occurrence. The v3.5 release was totally broken due to #13514 and #13766, and required an official announcement and direct user action to prevent disaster for etcd users. Since then we have received another 3 reports that we are unable to reproduce (#14140, #14143, #14211).

Two of the latest issues were discovered by the Jepsen author. However, we don't think we can depend on infrequent sponsored qualification, nor can we use Jepsen ourselves (non-AWS setups are not well maintained, and it is implemented in a different programming language).

I'm building a prototype using porcupine (anishathalye/porcupine, a fast linearizability checker written in Go) to start continuously testing the linearizability of etcd.
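
For readers unfamiliar with what a linearizability checker does, here is a minimal sketch. It is not porcupine's API or algorithm and not the etcd prototype; it is a brute-force search over a hypothetical `Op` history for a single register, and all names in it are assumptions made for illustration. A history passes if the completed operations can be placed in some total order that respects real-time order (an operation that returned before another was called must come first) and in which every read observes the latest write.

```go
package main

import "fmt"

// Op is one completed client operation (hypothetical type for illustration).
type Op struct {
	Call, Return int64  // logical timestamps of invocation and response
	IsWrite      bool   // true for a write, false for a read
	Value        string // value written, or value observed by the read
}

// Linearizable reports whether ops can be ordered so that real-time order
// and single-register semantics both hold, starting from the value init.
func Linearizable(ops []Op, init string) bool {
	used := make([]bool, len(ops))
	return search(ops, used, len(ops), init)
}

// search tries every operation that may legally come next (backtracking).
func search(ops []Op, used []bool, remaining int, state string) bool {
	if remaining == 0 {
		return true
	}
	for i, op := range ops {
		if used[i] || !mayGoNext(ops, used, i) {
			continue
		}
		next := state
		if op.IsWrite {
			next = op.Value
		} else if op.Value != state {
			continue // the read saw a value the register did not hold
		}
		used[i] = true
		if search(ops, used, remaining-1, next) {
			return true
		}
		used[i] = false
	}
	return false
}

// mayGoNext is false if some unplaced operation returned before op i was
// called, because that operation would then have to be ordered first.
func mayGoNext(ops []Op, used []bool, i int) bool {
	for j, other := range ops {
		if j != i && !used[j] && other.Return < ops[i].Call {
			return false
		}
	}
	return true
}

func main() {
	// A read that overlaps the write may observe the written value.
	concurrent := []Op{
		{Call: 0, Return: 10, IsWrite: true, Value: "a"},
		{Call: 1, Return: 5, IsWrite: false, Value: "a"},
	}
	// A read of "a" after "b" was fully written is a stale read.
	stale := []Op{
		{Call: 0, Return: 1, IsWrite: true, Value: "a"},
		{Call: 2, Return: 3, IsWrite: true, Value: "b"},
		{Call: 4, Return: 5, IsWrite: false, Value: "a"},
	}
	fmt.Println(Linearizable(concurrent, "")) // true
	fmt.Println(Linearizable(stale, ""))      // false
}
```

The search is exponential in the worst case, which is why a dedicated checker is used on real histories; the sketch only shows the property being tested.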

@aphyr

aphyr commented Aug 22, 2022

> not well maintained

:-/

@serathius
Member Author

serathius commented Aug 22, 2022

> > not well maintained
>
> :-/

Sorry for the miscommunication. To clarify: anything except the AWS deployment is not well maintained. It took me and @endocrimes over 2 days to get the Docker setup running at all, and much longer to get any sensible results. Overall, it would be impossible for etcd maintainers to expand, integrate and maintain high quality testing via Jepsen.

@aphyr

aphyr commented Aug 22, 2022

I've been maintaining Jepsen for use with physical nodes and LXC containers since 2013. I do apologize for the Docker situation--it's one of those things that people keep contributing to Jepsen, but Docker's platform differences and own breaking changes render it flaky. That's noted in the README, but people keep trying to use it anyway. See https://github.com/jepsen-io/jepsen/blob/main/README.md#setting-up-a-jepsen-environment for a list of deployment options.

@endocrimes
Contributor

endocrimes commented Aug 22, 2022

FWIW I have container stuff working fairly reliably now - I'm just not quite sure how best to deploy constant jepsen testing (lack of infra availability mostly) - and lack of time on my part (etcd gets ~5% of my time rn, so I'm hesitant to do things that would depend on me to maintain)

@serathius serathius changed the title from "etcd testing can reproduce historical data inconsistency issues" to "Introduce etcd linearizability tests" Oct 27, 2022
@serathius
Member Author

Bittersweet news: the proposed approach has already borne fruit. We found a data inconsistency in etcd revision, #14685.

@serathius
Member Author

serathius commented Nov 10, 2022

With the first success of the linearizability tests, we can start scaling up the investment. I have added subtasks in the top comment so other contributors can pick up the work. Feel free to pick an issue and leave a comment so I can reserve it for you.

@geetasg

geetasg commented Nov 14, 2022

I am looking into adding the Delete api.

@serathius serathius removed their assignment Nov 15, 2022
@tjungblu
Contributor

I spent some time earlier this week moving porcupine to generics. It also clones the state before invoking the step func, in case you want to get rid of the JSON parsing in the model.

example in: https://github.com/tjungblu/porcupine/blob/main/_example/map_model.go
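
To illustrate the idea in general terms, here is a rough sketch of a generic model whose state is a typed Go map that is cloned before each step, rather than a JSON-encoded value. This is not the API of porcupine or of the fork linked above; `Model`, `Apply`, `kvInput`, `kvOutput` and `newKVModel` are hypothetical names made up for this example.

```go
package main

import "fmt"

// Model is a hypothetical generic sequential specification: S is the state
// type, I the operation input and O the observed output.
type Model[S, I, O any] struct {
	Init  func() S
	Clone func(S) S
	// Step returns whether the output is legal in this state, and the new state.
	Step func(state S, in I, out O) (bool, S)
}

// Apply clones the state before calling Step, so Step may mutate its
// argument without affecting states kept by the caller.
func (m Model[S, I, O]) Apply(state S, in I, out O) (bool, S) {
	return m.Step(m.Clone(state), in, out)
}

type kvInput struct{ Op, Key, Value string } // Op is "put" or "get"
type kvOutput struct{ Value string }         // value returned by a get

// newKVModel builds a key-value model with a typed map as its state.
func newKVModel() Model[map[string]string, kvInput, kvOutput] {
	return Model[map[string]string, kvInput, kvOutput]{
		Init: func() map[string]string { return map[string]string{} },
		Clone: func(s map[string]string) map[string]string {
			c := make(map[string]string, len(s))
			for k, v := range s {
				c[k] = v
			}
			return c
		},
		Step: func(s map[string]string, in kvInput, out kvOutput) (bool, map[string]string) {
			if in.Op == "put" {
				s[in.Key] = in.Value // safe: s is already a private copy
				return true, s
			}
			return s[in.Key] == out.Value, s
		},
	}
}

func main() {
	m := newKVModel()
	s0 := m.Init()
	ok, s1 := m.Apply(s0, kvInput{Op: "put", Key: "k", Value: "v"}, kvOutput{})
	fmt.Println(ok, len(s0), len(s1)) // true 0 1
	ok, _ = m.Apply(s1, kvInput{Op: "get", Key: "k"}, kvOutput{Value: "v"})
	fmt.Println(ok) // true
}
```

Because the state is cloned, the checker can keep earlier states around while backtracking, and the step function works on typed values with no encode or decode round trip.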

@serathius
Member Author

serathius commented Nov 18, 2022

@tjungblu Very interesting work. I was also thinking about it, as using JSON for encoding state is not very nice. However, using generics is not our top priority.

@geetasg

geetasg commented Dec 1, 2022

I can take the task to add support for lease requests.

@geetasg

geetasg commented Jan 12, 2023

The PR linked in task list above was not used. Lease support is introduced by #15080.

@serathius
Member Author

Thanks for spotting this, updated the current status of the effort.

@serathius
Member Author

As the most critical goals of the linearizability tests have been achieved, we can close this issue and track the remaining work in follow-ups.

7 participants