Fix the data inconsistency by moving the SetConsistentIndex into the transaction lock #13844
Conversation
Codecov Report

Coverage Diff (main vs #13844):
             main     #13844     +/-
Coverage     72.57%   72.37%     -0.21%
Files        468      468
Lines        38222    38238      +16
Hits         27741    27673      -68
Misses       8699     8771       +72
Partials     1782     1794       +12
The SetConsistentIndex and the applyV3.Apply are supposed to be an atomic operation, but obviously they aren't. If a periodic commit happens in between, and the etcd server crashes right after that periodic commit, then the etcd server will be in a bad state when it starts again. This looks like a real issue. However, I can still reproduce the data inconsistency issue with this PR, so this PR isn't a complete fix.
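To make the interleaving concrete, here is a minimal sketch assuming a simplified in-memory model; backend, applyEntry, and the field names are illustrative stand-ins, not etcd's actual types or call paths.

```go
package sketch

import "sync"

// backend is a toy stand-in for a store that stages writes in memory and
// flushes them to disk as one batch on commit.
type backend struct {
	mu          sync.Mutex
	stagedIndex uint64 // consistent index recorded for the last applied entry
	stagedData  []byte // the entry's actual key/value changes
}

// commit flushes whatever happens to be staged at this moment.
func (b *backend) commit() {
	b.mu.Lock()
	defer b.mu.Unlock()
	// ... persist stagedIndex and stagedData together as one batch ...
}

// applyEntry mirrors the buggy shape: the index and the data are staged in
// two separate critical sections, so they are not persisted atomically.
func applyEntry(b *backend, index uint64, data []byte) {
	b.mu.Lock()
	b.stagedIndex = index // step 1: record "entry <index> has been applied"
	b.mu.Unlock()

	// A periodic commit running right here persists the NEW index with the
	// OLD data. If the server crashes before step 2 is committed, it
	// restarts believing entry <index> was already applied and skips it.

	b.mu.Lock()
	b.stagedData = data // step 2: the entry's writes
	b.mu.Unlock()
}
```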
@ahrtr Thanks for looking into this. The logs you linked show a bug in the functional test framework, not a proof of reproduction.
It means that the previous state from the last run was not properly cleaned up. I recommend running:
Please rerun the tests and look for the hash inconsistency log in one of the members. I will also try to verify this PR.
Thanks for the info. I haven't had time to read through the functional test's source code so far. Which log would prove that the issue is reproduced? I will continue to work on this tomorrow.
Please read #13838 (comment).
I also recommend qualifying on the commit that first introduced the bug, 5005167, as it will reduce the chance of any other issues arising.
Benjamin: Thank you for the code. It's not clear to me that it solves the root problem. I commented in #13766 (comment) that my current thinking is about looking for sources of mid-apply commits, i.e. premature unlocks (one such source is sketched below). They can be either:
It seems that Compaction is really being executed by a goroutine concurrent with the apply itself. My current thinking:
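One concrete source of such mid-apply commits, sketched under the same simplified model as above: a background goroutine that commits the batch on a timer, regardless of whether an apply is half-way through. The committer interface and runPeriodicCommit are hypothetical names, not etcd's real API.

```go
package sketch

import "time"

// committer is anything whose staged state can be flushed to disk.
type committer interface {
	commit()
}

// runPeriodicCommit flushes the staged state on every tick until stop is
// closed. Because it runs concurrently with the apply loop, every moment at
// which the apply is not holding the backend lock is a window in which a
// half-finished apply can be persisted.
func runPeriodicCommit(c committer, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.commit()
		case <-stop:
			return
		}
	}
}
```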
Thanks @serathius & @ptabor for the feedback, which makes sense.
This PR is just a draft; I was only planning to quickly verify my idea.
Exactly. I will try to get all of this sorted out, provide a more robust solution, and send it out for review.
I just tried this PR on top of your PR 13838, and could not reproduce the data inconsistency issue. I could not reproduce the issue without this PR either. Probably that's because my local VM isn't powerful enough: it has 4 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) and 8GB of memory. I will try to get a more powerful machine, change the writing directory, and try again. @serathius Have you tried this PR on your workstation?
Actually, the root cause of this issue isn't the commit 5005167; it just increases the probability of reproducing the issue. The key point is to make sure we either support an atomic operation (just one transaction / one Lock & Unlock) or a reentrant transaction.
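Under the same toy model as earlier, a hedged sketch of the "one transaction / one Lock & Unlock" direction: stage the consistent index and the entry's writes inside a single critical section, so a concurrent commit can only observe both or neither. This reuses the hypothetical backend type from the first sketch, not etcd's actual code.

```go
// applyEntryAtomic stages the index and the data under one Lock/Unlock,
// so a periodic commit can run before or after, but never in between.
func applyEntryAtomic(b *backend, index uint64, data []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.stagedIndex = index
	b.stagedData = data
}
```

This is the same idea the PR title describes as moving SetConsistentIndex into the transaction lock.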
Yes, I was not able to reproduce the issue with this fix. I have run tests for over an hour, so I think we are ok.
This would be very worrying, because I was not able to reproduce the issue on previous commits at all. If the issue is not caused by 5005167, then we don't have a good way to qualify our results. If the root cause is too rare to be detected, then we need to improve our testing. @ptabor @ahrtr if 5005167 is not the root cause, do you have any intuition as to when the real issue was introduced? Any ideas on how to improve our testing approach to detect the issue?
Thanks @serathius for the feedback, that's really good news. I will think about a formal solution/fix as the next step.
No need to worry :). I haven't read through the functional test's source code so far, but I believe the reason why we did not see such an issue before the commit …
The root cause comes from server.go#L1774 and server.go#L1804. And of course the addition of …
Yes. And more precisely, SetConsistentIndex() potentially started to be harmful once any 'Unlock' could prematurely submit the transaction. I'm not sure whether it's worth testing the issue on a commit in the middle of the PR vs. the whole PR boundary, i.e. f1123d6.
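A standalone, hypothetical illustration of the "Unlock can prematurely submit the transaction" behaviour mentioned above: if Unlock auto-commits once enough writes are pending, then even well-paired Lock/Unlock calls become potential mid-apply flush points. The batchTx type and its threshold are made up for this sketch.

```go
package sketch

import "sync"

// batchTx is a toy batched transaction whose Unlock may flush early.
type batchTx struct {
	mu         sync.Mutex
	pendingOps int
	batchLimit int // flush threshold; the value is illustrative
}

func (t *batchTx) Lock() { t.mu.Lock() }

// Unlock flushes the batch if it has grown past the limit before releasing
// the lock, so state staged by an apply that spans several Lock/Unlock
// pairs can hit disk half-way through.
func (t *batchTx) Unlock() {
	if t.pendingOps >= t.batchLimit {
		t.commit() // premature, mid-apply flush
	}
	t.mu.Unlock()
}

func (t *batchTx) commit() {
	// ... persist everything staged so far ...
	t.pendingOps = 0
}
```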
I just submitted another PR, 13854, so I'm closing this one.
Try to fix #13766.