
Raft node failing to recover from non-graceful shutdown #807

Closed
jimthematrix opened this issue Aug 21, 2019 · 4 comments · Fixed by #860

@jimthematrix (Contributor)

We'd like to advocate for this fix to be ported to Quorum:
ethereum/go-ethereum#19862

Reason: if a Quorum node experiences a non-graceful shutdown (the equivalent of kill -9), the persisted chain gets corrupted because the head has not been properly flushed from memory; that flush is performed as part of the graceful shutdown procedure.
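To illustrate the difference, here is a minimal sketch of the two shutdown paths. The paths and process-matching pattern are hypothetical, not taken from the issue; only the signal semantics are the point.

```shell
# Hypothetical example: the datadir path and pgrep pattern are placeholders.
GETH_PID=$(pgrep -f "geth --datadir /data/quorum")

# Graceful: geth traps SIGINT/SIGTERM and flushes the in-memory head state
# to disk before exiting, so the chain reopens cleanly on restart.
kill -SIGINT "$GETH_PID"

# Non-graceful (what this issue describes): SIGKILL cannot be trapped, so the
# recent head state is never flushed. On restart the node logs
# "Head state missing, repairing chain" and rewinds to an older block.
# kill -9 "$GETH_PID"
```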

Geth/v1.8.18-stable-ef256cb2(quorum-v2.2.3)/linux-amd64/go1.10.1

When the node came back up, it saw the corrupt head and reverted to the last validated block:

WARN [07-31|21:20:27.192] Head state missing, repairing chain      number=537490 hash=60d9c9…302abf
INFO [07-31|21:22:33.580] Rewound blockchain to past state         number=467692 hash=62b814…1d78dd
INFO [07-31|21:22:33.612] Loaded most recent local header          number=537490 hash=60d9c9…302abf td=70449889280 age=0
INFO [07-31|21:22:33.612] Loaded most recent local full block      number=467692 hash=62b814…1d78dd td=61301325824 age=0
INFO [07-31|21:22:33.612] Loaded most recent local fast block      number=537490 hash=60d9c9…302abf td=70449889280 age=0

Then it failed to rebuild the chain from that past block. (I tried the same non-graceful shutdown with an IBFT chain; it also had to revert to an earlier block, but it was able to reconcile and catch up to the head.)

INFO [07-31|21:22:52.922] Non-extending block                      block=ae9fe0…044311 parent=60d9c9…302abf head=62b814…1d78dd
INFO [07-31|21:22:52.922] persisted the latest applied index       index=537535
INFO [07-31|21:22:52.922] Handling InvalidRaftOrdering             invalid block=ae9fe0…044311 current head=62b814…1d78dd
INFO [07-31|21:22:52.931] Someone else mined invalid block; ignoring block=ae9fe0…044311
INFO [07-31|21:22:53.409] Non-extending block                      block=dbd87e…15032b parent=ae9fe0…044311 head=62b814…1d78dd
INFO [07-31|21:22:53.409] persisted the latest applied index       index=537536
INFO [07-31|21:22:53.409] start snapshot                           applied index=537536 last snapshot index=537286
INFO [07-31|21:22:53.409] Handling InvalidRaftOrdering             invalid block=dbd87e…15032b current head=62b814…1d78dd
INFO [07-31|21:22:53.418] Someone else mined invalid block; ignoring block=dbd87e…15032b
INFO [07-31|21:22:53.464] compacted log                            index=537536
DEBUG[07-31|21:22:53.698] Recalculated downloader QoS values       rtt=20s confidence=0.750 ttl=1m0s
2019-07-31 21:22:54.920643 W | rafthttp: health check for peer 2 could not connect: <nil>
DEBUG[07-31|21:22:57.690] Transaction pool status report           executable=0 queued=2 stales=0
2019-07-31 21:22:59.920803 W | rafthttp: health check for peer 2 could not connect: <nil>
2019-07-31 21:23:04.920956 W | rafthttp: health check for peer 2 could not connect: <nil>
2019-07-31 21:23:09.924400 W | rafthttp: health check for peer 2 could not connect: <nil>
DEBUG[07-31|21:23:13.699] Recalculated downloader QoS values       rtt=20s confidence=0.875 ttl=1m0s
2019-07-31 21:23:14.927622 W | rafthttp: health check for peer 2 could not connect: <nil>
2019-07-31 21:23:19.927761 W | rafthttp: health check for peer 2 could not connect: <nil>
2019-07-31 21:23:24.930762 W | rafthttp: health check for peer 2 could not connect: <nil>
2019-07-31 21:23:29.933597 W | rafthttp: health check for peer 2 could not connect: <nil>
INFO [07-31|21:23:33.540] Non-extending block                      block=7c5057…f40b82 parent=dbd87e…15032b head=62b814…1d78dd
INFO [07-31|21:23:33.543] persisted the latest applied index       index=537537
INFO [07-31|21:23:33.543] Handling InvalidRaftOrdering             invalid block=7c5057…f40b82 current head=62b814…1d78dd
INFO [07-31|21:23:33.551] Someone else mined invalid block; ignoring block=7c5057…f40b82
DEBUG[07-31|21:23:33.711] Recalculated downloader QoS values       rtt=20s confidence=0.938 ttl=1m0s
INFO [07-31|21:23:34.068] Non-extending block                      block=deaeaf…3b309c parent=7c5057…f40b82 head=62b814…1d78dd
INFO [07-31|21:23:34.068] persisted the latest applied index       index=537538
INFO [07-31|21:23:34.068] Handling InvalidRaftOrdering             invalid block=deaeaf…3b309c current head=62b814…1d78dd
jpmsam (Contributor) commented Aug 21, 2019

Thanks @jimthematrix for the reference. We'll look into pulling it from upstream.

@jimthematrix jimthematrix changed the title Node failing to recover from non-graceful shutdown Raft node failing to recover from non-graceful shutdown Aug 22, 2019
vsmk98 (Contributor) commented Sep 4, 2019

Hi @jimthematrix, the raft node non-graceful shutdown issue is a separate one, and this fix will not solve it. We are looking at a different solution for the raft node failing to recover from a non-graceful shutdown. I am raising a separate issue for this.

Since the upstream issue was fixed for Clique consensus, I have tried to reproduce the issue with Clique and so far have not been able to. Have you tested with Clique consensus as well, and did you observe this issue there?

@jimmy-dg

Hi @vsmk98, is there any workaround for raft consensus in the meantime?

vsmk98 (Contributor) commented Sep 25, 2019

Hi @jimmy-dg, if you are using Quorum version 2.2.5, this issue should not happen. If you are using an earlier version, please bring up Geth with --gcmode archive.
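For example, the workaround above might look like the following. The datadir and raft flags here are illustrative placeholders; only --gcmode archive is the workaround itself. In archive mode geth persists every state trie to disk instead of keeping recent tries only in memory, so a crash cannot lose the head state.

```shell
# Illustrative command line; --datadir and the raft flags are placeholders.
geth --datadir /data/quorum \
     --raft --raftport 50400 \
     --gcmode archive   # write all trie nodes to disk on each block
```

Note that archive mode trades disk usage for crash safety, since no state is ever pruned.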
