Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safety/correctness checker for OpenSearch software #3866

Open
Bukhtawar opened this issue Jul 12, 2022 · 5 comments
Open

Safety/correctness checker for OpenSearch software #3866

Bukhtawar opened this issue Jul 12, 2022 · 5 comments
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep

Comments

@Bukhtawar
Copy link
Collaborator

Bukhtawar commented Jul 12, 2022

Is your feature request related to a problem? Please describe.

High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that our customers depend on. We should also have formal methods to ensure the correctness of our replication protocols. TLA+ model for verification is an industry standard and maybe with segrep we could probably introduce one

With deep changes being made to core espl in the Engine with the new segment based replication strategy and it's integration with remote segment store and translog, it becomes imperative that we re-evaluate safety, correctness and data integrity properties of the system.

Describe the solution you'd like
Mechanism to inject failures and verify system behaviour for correctness, safety and data integrity

Describe alternatives you've considered
Jepsen is one of the various softwares(not AL2.0) that provides similar capabilities.

Additional context
Add any other context or screenshots about the feature request here.

@Bukhtawar Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 12, 2022
@Bukhtawar Bukhtawar added discuss Issues intended to help drive brainstorming and decision making and removed untriaged labels Jul 13, 2022
@Bukhtawar
Copy link
Collaborator Author

cc: @mch2 @sachinpkale @ashking94 @kartg @andrross @reta @nknize
Looking for thoughts on adding this framework.

@reta
Copy link
Collaborator

reta commented Jul 14, 2022

+1 to look into Jepsen, Elastic has formal TLA+ models published (AL 2.0) [1] which could be quite helpful, although no updates since 2019.

[1] https://github.com/elastic/elasticsearch-formal-models

@andrross
Copy link
Member

I think this is great idea @Bukhtawar. What do you think the best way to get started is? We could start from the TLA+ models referenced above and modify as necessary (if anything is out of date from 2019). Or we could start fresh and build on top of that.

@sachinpkale
Copy link
Member

I completely agree with the proposal to add correctness checker. We started listing down failure scenarios of remote segment store and realized that there are many corner cases and almost impossible to cover them manually.

@Bukhtawar
Copy link
Collaborator Author

@andrross I guess we could start from the existing models that are there and see if they need an update or need more comprehensive cases for remote store and recent engine changes that we are making

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep
Projects
None yet
Development

No branches or pull requests

6 participants