Introduce etcd linearizability tests #14045
Comments
Suggestion: build a testing team to guide more contributors in running tests on their own environments. |
@aphyr Based on the recent issues you posted, I expect that you are working on improving etcd coverage with Jepsen. Could you maybe shed some light on your plans so we can improve etcd's official testing and qualification? In https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md you can see that the current state of etcd release qualification needs a lot of work. I would like to collaborate on how we can ensure that there are no other remaining consistency issues in the v3.5 release. |
I'm afraid I can't talk too much about the present work, but I can say that the Jepsen etcd test suite has been updated to the latest jetcd and etcd, and has found at least one safety issue I'd consider serious (#14092) and one spurious crash (#14110), as well as results that suggest etcd may not be performing its fsync/write dance properly (#14098). Given these findings, it might be worth running Jepsen on etcd releases regularly, or adapting some of its nemeses (e.g. kill, process, partition, disk corruption, losing un-fsynced writes) and checkers (e.g. https://github.com/jepsen-io/elle) to work with your current test suite. :-) |
👀 i'll update my jepsen fork and try to get those reproing to validate the setup. |
Thanks @aphyr for your recommendation. etcd has its own tests and framework (called …). I don't think the etcd community should maintain its own framework, as it distracts from the main goal of the project. Instead of investing more in outdated, closed-off solutions, we should be able to use the current state of the art that was built to solve similar problems. I have some catching up to do in this area, so please feel free to give me suggestions about tools. I will start by reading more about Jepsen and https://github.com/jepsen-io/elle. @endocrimes Thanks for the help. I think it would be great if we could automate running Jepsen tests. For example, we could run them periodically, once a week or once a day. |
In its current state, etcd is not able to reliably prevent and reproduce data inconsistency issues, which have become a common occurrence. The v3.5 release was totally broken due to #13514 and #13766, and required an official announcement and direct user action to prevent disaster for etcd users. Since then we have received another 3 reports that we are unable to reproduce (#14140, #14143, #14211). Two of the latest issues were discovered by the Jepsen author; however, we don't think we can depend on infrequent sponsored qualification, nor can we use Jepsen ourselves (non-AWS setups are not well maintained, and it is implemented in a different programming language). I'm building a prototype using porcupine (anishathalye/porcupine), a fast linearizability checker written in Go, to start continuously testing the linearizability of etcd. |
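For readers unfamiliar with what a linearizability checker does, here is a minimal brute-force sketch for a single register. It uses only the Go standard library and does not use porcupine's API; the `Op` type and the `linearizable`, `search`, and `minimal` functions are hypothetical names chosen for illustration. It tries every ordering of the recorded operations that respects real-time order and accepts the history if at least one ordering is consistent with a sequential register. Real checkers such as porcupine use far more efficient search than this exponential one.

```go
package main

import "fmt"

// Op is one completed client operation with its call and return timestamps.
// (Hypothetical type for illustration; not porcupine's Operation.)
type Op struct {
	Call, Return int64 // invocation and response times
	IsWrite      bool
	Value        int // value written, or value observed by a read
}

// linearizable reports whether the ops can be totally ordered so that
// real-time order is respected and every read sees the latest write.
func linearizable(ops []Op, initial int) bool {
	done := make([]bool, len(ops))
	return search(ops, done, len(ops), initial)
}

// search tries each pending op as the next linearization point and backtracks.
func search(ops []Op, done []bool, remaining int, state int) bool {
	if remaining == 0 {
		return true
	}
	for i, op := range ops {
		if done[i] || !minimal(ops, done, i) {
			continue
		}
		next := state
		if op.IsWrite {
			next = op.Value
		} else if op.Value != state {
			continue // the read observed a value the register did not hold
		}
		done[i] = true
		if search(ops, done, remaining-1, next) {
			return true
		}
		done[i] = false
	}
	return false
}

// minimal reports whether op i may go next: no other pending op
// returned before op i was called.
func minimal(ops []Op, done []bool, i int) bool {
	for j, other := range ops {
		if j != i && !done[j] && other.Return < ops[i].Call {
			return false
		}
	}
	return true
}

func main() {
	ok := []Op{
		{Call: 0, Return: 10, IsWrite: true, Value: 1},
		{Call: 5, Return: 15, Value: 0}, // overlaps the write, may see the old value
		{Call: 20, Return: 25, Value: 1},
	}
	stale := []Op{
		{Call: 0, Return: 10, IsWrite: true, Value: 1},
		{Call: 20, Return: 25, Value: 0}, // stale read after the write completed
	}
	fmt.Println(linearizable(ok, 0), linearizable(stale, 0)) // prints "true false"
}
```

The second history is the kind of result a continuous test would flag: a read that starts after a write has completed but still returns the old value.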
:-/ |
Sorry for the miscommunication. To clarify: anything except the AWS deployment is not well maintained. It took me and @endocrimes over 2 days to get the Docker setup running at all, and much longer to get any sensible results. Overall, it would be impossible for etcd maintainers to expand, integrate and maintain high-quality testing via Jepsen. |
I've been maintaining Jepsen for use with physical nodes and LXC containers since 2013. I do apologize for the Docker situation--it's one of those things that people keep contributing to Jepsen, but Docker's platform differences and own breaking changes render it flaky. That's noted in the README, but people keep trying to use it anyway. See https://github.com/jepsen-io/jepsen/blob/main/README.md#setting-up-a-jepsen-environment for a list of deployment options. |
FWIW I have container stuff working fairly reliably now - I'm just not quite sure how best to deploy constant jepsen testing (lack of infra availability mostly) - and lack of time on my part (etcd gets ~5% of my time rn, so I'm hesitant to do things that would depend on me to maintain) |
Bittersweet news: the proposed approach has already borne fruit. We found a data inconsistency in the etcd revision: #14685 |
With the first success of the linearizability tests, we can start scaling up the investment. I have added subtasks in the top comment so other contributors can pick up the work. Feel free to pick an issue and leave a comment so I can reserve it for you. |
I am looking into adding the Delete API. |
I spent some time earlier this week moving porcupine to generics. It also clones the state before invoking the step func, in case you want to get rid of the JSON parsing in the model. Example: https://github.com/tjungblu/porcupine/blob/main/_example/map_model.go |
@tjungblu very interesting work. I was also thinking about this, since using JSON to encode state is not very nice; however, using generics is not our top priority. |
I can take the task of adding support for lease requests. |
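To illustrate the idea of a generic model that clones state before each step, here is a small sketch. It does not reproduce the linked fork or porcupine's API: the `Model` type, `apply`, `mapModel`, and `KVInput` are hypothetical names, and the map-backed key-value state is an assumption made for the example. The point is that when the checker hands the step function a private copy of typed state, the step can mutate it directly, with no need to serialize the state to JSON and parse it back.

```go
package main

import "fmt"

// KVInput is a hypothetical request against a map-backed key-value model.
type KVInput struct {
	Put   bool
	Key   string
	Value string
}

// Model is a hypothetical generic model type, not porcupine's own.
type Model[S, I, O any] struct {
	Init  func() S
	Clone func(S) S
	Step  func(state S, in I, out O) (bool, S)
}

// apply clones the state before calling Step, so Step may mutate its argument.
func apply[S, I, O any](m Model[S, I, O], state S, in I, out O) (bool, S) {
	return m.Step(m.Clone(state), in, out)
}

// mapModel builds a model whose state is a plain Go map instead of JSON.
func mapModel() Model[map[string]string, KVInput, string] {
	return Model[map[string]string, KVInput, string]{
		Init: func() map[string]string { return map[string]string{} },
		Clone: func(s map[string]string) map[string]string {
			c := make(map[string]string, len(s))
			for k, v := range s {
				c[k] = v
			}
			return c
		},
		Step: func(s map[string]string, in KVInput, out string) (bool, map[string]string) {
			if in.Put {
				s[in.Key] = in.Value // safe: s is already a private copy
				return true, s
			}
			return s[in.Key] == out, s // a get is valid only if it matches the state
		},
	}
}

func main() {
	m := mapModel()
	s0 := m.Init()
	ok, s1 := apply(m, s0, KVInput{Put: true, Key: "a", Value: "1"}, "")
	fmt.Println(ok, len(s0), s1["a"]) // prints "true 0 1": s0 is untouched
}
```

Because the original state is never modified, a backtracking checker can keep earlier states around and retry other orderings from them.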
The PR linked in task list above was not used. Lease support is introduced by #15080. |
Thanks for spotting this, updated the current status of the effort. |
As the most critical goals of linearizability tests were achieved, we can close this issue and track remaining work in followups. |
P0 action item proposed in https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md
This issue proposes to implement linearizability tests within etcd codebase. Tracking work: