[BUG] Data corruption on an actively written primary shard can lead to failing all replicas. #803
Comments
Relevant output of the test:
Is there not a reproducible test?
The issue reproduces only when a pending replica write is in progress at the moment the corruption happens. The way I achieved that is a bit hacky: I added a few sleeps and parallelized the writes to increase the probability of it happening. This is why you may sometimes need to run the test multiple times to hit the issue.
Without looking at the code: maybe corrupted data can be prepared ahead of time, or the stored data can be forcefully corrupted by writing some random bytes, so that the test becomes reliable?
That's what I did: I forcefully corrupted the data. The hard part is reproducing the race where writes are still pending on the shard at exactly the moment the primary fails. Deterministically reproducing that race is not impossible; adding sleeps was just faster. :)
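For context, here is a minimal sketch of how stored shard data could be forcefully corrupted for a test like this, assuming direct filesystem access to the shard's index directory. The `CorruptSegmentFile` helper and the choice to skip `segments_*` files are illustrative assumptions, not the code from the linked commit:

```java
// Hypothetical helper, not from the issue's test: flips a few bytes in an
// on-disk segment file to simulate corruption of an actively written shard.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class CorruptSegmentFile {

    /** Picks one segment file under the shard's index directory and overwrites a few bytes in it. */
    public static void corruptOneFile(Path shardIndexDir, Random random) throws IOException {
        List<Path> candidates;
        try (Stream<Path> files = Files.list(shardIndexDir)) {
            candidates = files
                .filter(p -> !p.getFileName().toString().startsWith("segments_")) // leave commit metadata intact
                .filter(p -> { try { return Files.size(p) > 0; } catch (IOException e) { return false; } })
                .collect(Collectors.toList());
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no segment files to corrupt in " + shardIndexDir);
        }
        Path victim = candidates.get(random.nextInt(candidates.size()));
        try (FileChannel channel = FileChannel.open(victim, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long offset = (long) (random.nextDouble() * channel.size());
            byte[] garbage = new byte[4];
            random.nextBytes(garbage);
            channel.write(ByteBuffer.wrap(garbage), offset); // corrupt the file in place
        }
    }

    private CorruptSegmentFile() {}
}
```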
@itiyamas I was unable to reproduce the issue in a deterministic manner. Do you have any specific ideas on how to do so?
Hi @itiyamas, I am closing the issue for now, since there are no clear steps to reproduce it. Feel free to re-open if you have more details to add.
@anasalkouz I'm not sure whether this got fixed as a side effect of some other change. I was able to trace through the code and identify what was causing the bug. Are those code paths (the ones shared in the description) no longer relevant? Did you check that?
@anasalkouz Please re-open this issue. This is an ongoing issue in one of our clusters. If you know of a fix, I would like to backport it to older versions as well. Also, if you are unable to reproduce it, I would like to pursue this myself.
I'm looking into this now.
I have cherry-picked the modified integration test onto the latest main branch and verified that this bug is fairly reliably reproducible. It usually takes me between 3 and 6 attempts to see a failure, and it has never taken more than 10 minutes of running to fail. I'm going to keep digging to see if I can come up with any ideas about how to fix this.
One of the ways I can think of to solve this:
What do you think?
Thanks @itiyama! Yes, the approach you've described is exactly what I'm looking into.
@itiyama Please take a look at my draft PR. I still need to make the test deterministic, and some of the various sleeps are still in place. The approach I took, creating a specific exception type for primary closure and then branching the logic based on that, doesn't feel like the most general-purpose way to solve this, but it does appear to fix the issue. Please let me know what you think. I am worried about unintended side effects here; the worst case would be a replica shard that should have been failed not being failed, then getting promoted to primary, and data being lost. Going with a very targeted fix should make that unlikely.
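To make the branching idea concrete, here is a minimal sketch under stated assumptions: `PrimaryClosedException`, `ReplicaFailureHandler`, and `ShardStateAction` are invented names for illustration and are not the actual OpenSearch replication classes touched by the draft PR:

```java
// Illustrative sketch only; names are invented, not OpenSearch internals.

/** Thrown when an in-flight replica request is cancelled because the local primary was closed. */
class PrimaryClosedException extends RuntimeException {
    PrimaryClosedException(String message) {
        super(message);
    }
}

class ReplicaFailureHandler {

    interface ShardStateAction {
        void markReplicaAsStale(String allocationId, String reason);
    }

    private final ShardStateAction shardStateAction;

    ReplicaFailureHandler(ShardStateAction shardStateAction) {
        this.shardStateAction = shardStateAction;
    }

    /**
     * Decides what to do when a replica write fails. The key change proposed in the
     * thread: if the failure is only the primary being closed (e.g. after its engine
     * failed on corruption), do NOT mark the replica as stale; let it be promoted instead.
     */
    void onReplicaWriteFailure(String replicaAllocationId, Exception failure) {
        if (failure instanceof PrimaryClosedException) {
            // The replica did nothing wrong; the operation was cancelled because the
            // primary shard was closed. Dropping the operation is enough.
            return;
        }
        // Genuine replica-side failure or unreachable replica: keep the existing behavior.
        shardStateAction.markReplicaAsStale(replicaAllocationId, failure.getMessage());
    }
}
```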
Describe the bug
When data corruption is detected on the primary, the engine fails. This in turn fails the primary shard and closes it. Closing the shard also cancels any ongoing replication operations on the primary, one of which is the shard bulk action on the replica. Cancelling that operation checks the reason for the failure, and because the reason is shard unavailability, the primary marks the replica as stale.
Both the primary failure and the replica failure are sent to the master as shard-failed actions, and the master serializes them only after they have been queued on the cluster state update task queue.
As a result, corruption on the primary shard can, in certain cases, cause the replica shards to be failed as well, especially under high write throughput.
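To make the cascade easier to follow, here is a minimal self-contained sketch of the sequence described above; all names are invented stand-ins, not the actual OpenSearch classes or code path:

```java
// Sketch of the shape of the bug: closing a corrupted primary cancels pending
// replica writes, and the generic cancellation handling marks the replica stale.
import java.util.ArrayList;
import java.util.List;

class CorruptionCascadeSketch {

    /** Stand-in for an in-flight primary-to-replica bulk write. */
    static class PendingReplicaWrite {
        final String replicaAllocationId;
        PendingReplicaWrite(String replicaAllocationId) {
            this.replicaAllocationId = replicaAllocationId;
        }
    }

    final List<PendingReplicaWrite> pendingWrites = new ArrayList<>();
    final List<String> shardFailuresSentToMaster = new ArrayList<>();

    /** Engine detects corruption on the primary: fail the shard, which closes it. */
    void onCorruptionDetected() {
        shardFailuresSentToMaster.add("fail primary: corrupted");
        closePrimaryShard();
    }

    /** Closing the primary cancels every ongoing replication operation. */
    void closePrimaryShard() {
        for (PendingReplicaWrite write : pendingWrites) {
            // The cancellation surfaces as a failed replica operation; the handler
            // cannot tell "primary went away" apart from "replica is unavailable",
            // so it marks the replica as stale; this is the problematic step.
            shardFailuresSentToMaster.add("mark replica stale: " + write.replicaAllocationId);
        }
        pendingWrites.clear();
    }

    public static void main(String[] args) {
        CorruptionCascadeSketch sketch = new CorruptionCascadeSketch();
        sketch.pendingWrites.add(new PendingReplicaWrite("replica-0"));
        sketch.onCorruptionDetected();
        // Both failures now race on the master's cluster state update queue: if the
        // replica is failed too, there is nothing left to promote and the index goes red.
        sketch.shardFailuresSentToMaster.forEach(System.out::println);
    }
}
```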
To Reproduce
Steps to reproduce the behavior:
Please check this commit for the test. You may need to run it 2-3 times to reproduce the behavior.
Expected behavior
Corruption on the primary should not result in the replica being lost and the cluster turning red. Instead, the replica should take over as primary and a new replica should then be assigned, turning the cluster green after some time.