Fix the data inconsistency by moving the SetConsistentIndex into the transaction lock #13844
Conversation
Codecov Report

Coverage Diff (main vs #13844):
             main     #13844     +/-
Coverage     72.57%   72.37%     -0.21%
Files        468      468
Lines        38222    38238      +16
Hits         27741    27673      -68
Misses       8699     8771       +72
Partials     1782     1794       +12
The SetConsistentIndex and the applyV3.Apply are supposed to be an atomic operation, but obviously they aren't. If a periodic commit happens in between, and the etcd server crashes right after that periodic commit, then the etcd server will be in a bad state when it starts again. This looks like a real issue. However, I can still reproduce the data inconsistency issue with this PR, so this PR isn't a complete fix.
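To make the interleaving concrete, here is a minimal sketch assuming a simplified in-memory model; backend, applyEntry, and the field names are illustrative stand-ins, not etcd's actual types or call paths.

```go
package sketch

import "sync"

// backend is a toy stand-in for a store that stages writes in memory and
// flushes them to disk as one batch on commit.
type backend struct {
	mu          sync.Mutex
	stagedIndex uint64 // consistent index recorded for the last applied entry
	stagedData  []byte // the entry's actual key/value changes
}

// commit flushes whatever happens to be staged at this moment.
func (b *backend) commit() {
	b.mu.Lock()
	defer b.mu.Unlock()
	// ... persist stagedIndex and stagedData together as one batch ...
}

// applyEntry mirrors the buggy shape: the index and the data are staged in
// two separate critical sections, so they are not persisted atomically.
func applyEntry(b *backend, index uint64, data []byte) {
	b.mu.Lock()
	b.stagedIndex = index // step 1: record "entry <index> has been applied"
	b.mu.Unlock()

	// A periodic commit running right here persists the NEW index with the
	// OLD data. If the server crashes before step 2 is committed, it
	// restarts believing entry <index> was already applied and skips it.

	b.mu.Lock()
	b.stagedData = data // step 2: the entry's writes
	b.mu.Unlock()
}
```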
@ahrtr Thanks for looking into this. The logs you linked show a bug in the functional test framework, not a proof of reproduction.
It means that the previous state from the last run was not properly cleaned up. I recommend running:
Please rerun the tests and look for the hash inconsistency log in one of the members. I will also try to verify this PR.
Thanks for the info. I haven't had time to read through the functional test's source code so far. Which log would prove that the issue is reproduced? I will continue to work on this tomorrow.
Please read #13838 (comment).
I also recommend qualifying on the commit that first introduced the bug, 5005167, as it will reduce the chance of any other issues arising.
Benjamin: Thank you for the code. It's not clear to me that it solves the root problem. I commented in #13766 (comment) that my current thinking is about looking for sources of mid-apply commits, i.e. premature unlocks (one such source is sketched below). They can be either:
It seems that Compaction is really being executed by a goroutine concurrent with the apply itself. My current thinking:
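One concrete source of such mid-apply commits, sketched under the same simplified model as above: a background goroutine that commits the batch on a timer, regardless of whether an apply is half-way through. The committer interface and runPeriodicCommit are hypothetical names, not etcd's real API.

```go
package sketch

import "time"

// committer is anything whose staged state can be flushed to disk.
type committer interface {
	commit()
}

// runPeriodicCommit flushes the staged state on every tick until stop is
// closed. Because it runs concurrently with the apply loop, every moment at
// which the apply is not holding the backend lock is a window in which a
// half-finished apply can be persisted.
func runPeriodicCommit(c committer, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.commit()
		case <-stop:
			return
		}
	}
}
```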
Thanks @serathius & @ptabor for the feedback, which makes sense.
This PR is just a draft; I was only planning to quickly verify my idea.
Exactly. I will try to get all of this sorted out, provide a more robust solution, and send it out for review.
I just tried this PR on top of your PR 13838, and could not reproduce the data inconsistency issue. I could not reproduce the issue without this PR either. Probably that's because my local VM isn't powerful enough: it has 4 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) and 8GB of memory. I will try to get a more powerful machine, change the writing directory, and try again. @serathius Have you tried this PR on your workstation?
Actually, the root cause of this issue isn't the commit 5005167; it just increases the probability of reproducing the issue. The key point is to make sure we either support an atomic operation (just one transaction / one Lock & Unlock) or a reentrant transaction.
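Under the same toy model as earlier, a hedged sketch of the "one transaction / one Lock & Unlock" direction: stage the consistent index and the entry's writes inside a single critical section, so a concurrent commit can only observe both or neither. This reuses the hypothetical backend type from the first sketch, not etcd's actual code.

```go
// applyEntryAtomic stages the index and the data under one Lock/Unlock,
// so a periodic commit can run before or after, but never in between.
func applyEntryAtomic(b *backend, index uint64, data []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.stagedIndex = index
	b.stagedData = data
}
```

This is the same idea the PR title describes as moving SetConsistentIndex into the transaction lock.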
Yes, I was not able to reproduce the issue with this fix. I have run tests for over an hour, so I think we are ok.
This would be very worrying, because I was not able to reproduce the issue on previous commits at all. If the issue is not caused by 5005167, then we don't have a good way to qualify our results. If the root cause is too rare to be detected, then we need to improve our testing. @ptabor @ahrtr if 5005167 is not the root cause, do you have any intuition as to when the real issue was introduced? Any ideas on how to improve our testing approach to detect the issue?
Thanks @serathius for the feedback, that's really good news. I will think about a formal solution/fix as the next step.
No need to worry :). I haven't read through the functional test's source code so far, but I believe the reason why we did not see such an issue before the commit …
The root cause comes from server.go#L1774 and server.go#L1804. And of course the addition of …
Yes. And more precisely, SetConsistentIndex() potentially started to be harmful once any 'Unlock' could prematurely submit the transaction. I'm not sure whether it's worth testing the issue on a commit in the middle of the PR vs. the whole PR boundary, i.e. f1123d6.
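A standalone, hypothetical illustration of the "Unlock can prematurely submit the transaction" behaviour mentioned above: if Unlock auto-commits once enough writes are pending, then even well-paired Lock/Unlock calls become potential mid-apply flush points. The batchTx type and its threshold are made up for this sketch.

```go
package sketch

import "sync"

// batchTx is a toy batched transaction whose Unlock may flush early.
type batchTx struct {
	mu         sync.Mutex
	pendingOps int
	batchLimit int // flush threshold; the value is illustrative
}

func (t *batchTx) Lock() { t.mu.Lock() }

// Unlock flushes the batch if it has grown past the limit before releasing
// the lock, so state staged by an apply that spans several Lock/Unlock
// pairs can hit disk half-way through.
func (t *batchTx) Unlock() {
	if t.pendingOps >= t.batchLimit {
		t.commit() // premature, mid-apply flush
	}
	t.mu.Unlock()
}

func (t *batchTx) commit() {
	// ... persist everything staged so far ...
	t.pendingOps = 0
}
```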
I just submitted another PR, 13854, so I'm closing this one.
Try to fix #13766.