-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Safety/correctness checker for OpenSearch software #3866
Comments
cc: @mch2 @sachinpkale @ashking94 @kartg @andrross @reta @nknize |
+1 to look into Jepsen, Elastic has formal TLA+ models published (AL 2.0) [1] which could be quite helpful, although no updates since 2019. [1] https://github.com/elastic/elasticsearch-formal-models |
I think this is great idea @Bukhtawar. What do you think the best way to get started is? We could start from the TLA+ models referenced above and modify as necessary (if anything is out of date from 2019). Or we could start fresh and build on top of that. |
I completely agree with the proposal to add correctness checker. We started listing down failure scenarios of remote segment store and realized that there are many corner cases and almost impossible to cover them manually. |
@andrross I guess we could start from the existing models that are there and see if they need an update or need more comprehensive cases for remote store and recent engine changes that we are making |
Is your feature request related to a problem? Please describe.
High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that our customers depend on. We should also have formal methods to ensure the correctness of our replication protocols. TLA+ model for verification is an industry standard and maybe with segrep we could probably introduce one
With deep changes being made to core espl in the Engine with the new segment based replication strategy and it's integration with remote segment store and translog, it becomes imperative that we re-evaluate safety, correctness and data integrity properties of the system.
Describe the solution you'd like
Mechanism to inject failures and verify system behaviour for correctness, safety and data integrity
Describe alternatives you've considered
Jepsen is one of the various softwares(not AL2.0) that provides similar capabilities.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: