
Consul write API hangs for half an hour before returning an RPC error #3738

Open
hehailong5 opened this issue Dec 12, 2017 · 5 comments
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected
Milestone
Unplanned

Comments

hehailong5 commented Dec 12, 2017

version: 0.8.4

When this happens, the leader repeatedly prints:

2017/12/05 11:10:21 [WARN] raft: AppendEntries to {Voter 10.0.0.9:8300 10.0.0.9:8300} rejected, sending older logs (next: 14336)
2017/12/05 11:10:21 [ERR] raft: Failed to get log at index 14335: log not found

and the follower prints:

2017/12/05 11:10:11 [ERR] consul: failed to reconcile member: {server-1 10.0.0.10 8301 map[build:0.8.4:f436077 vsn_max:3 raft_vsn:2 wan_join_port:8302 dc:dc1 id:c29aeab8-3731-17d0-8796-b7af5d3953ea port:8300 role:consul vsn:2 vsn_min:2] alive 1 5 2 2 5 4}: leadership lost while committing log
2017/12/05 11:10:11 [ERR] consul: failed to reconcile: leadership lost while committing log
2017/12/05 11:10:11 [INFO] consul: cluster leadership lost
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] consul: New leader elected: server-1
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218803.tmp

2017/12/05 11:10:18 [INFO] snapshot: reaping snapshot /opt/application/consul-works/data-dir/raft/snapshots/3-12783-1512459076524
2017/12/05 11:10:18 [INFO] raft: Copied 625724 bytes to local snapshot
2017/12/05 11:10:18 [INFO] raft: Compacting logs from 14333 to 4113
2017/12/05 11:10:18 [INFO] raft: Installed remote snapshot
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218916.tmp
slackpad (Contributor) commented:

Hi @hehailong5, can you reproduce this, or was it a one-time event? Also, there have been several Raft-related fixes since 0.8.4, so I'd definitely recommend running a newer version of Consul.

hehailong5 (Author) commented:

We have only seen this once since we upgraded to 0.8.4.
I have two questions though:

  1. Is there any timeout option for the Consul APIs?
  2. We only detected this after half an hour, because we monitor the health of the Consul cluster via /v1/status/leader, and in this case that URL was working as expected. How do we reliably measure the cluster's health, then?
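(Note on question 1, as a sketch rather than anything from this thread: independent of whatever server-side options exist, a caller can always bound a Consul write with its own HTTP timeout, so a stuck leader surfaces as a fast error instead of a half-hour hang. The sketch below goes against the plain HTTP KV API; the agent address, key name, and 5-second limit are illustrative assumptions, not values from this issue.)

// Sketch only: bound a Consul KV write with a client-side timeout so a stuck
// leader shows up as a quick error. Address, key, and timeout are assumptions.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second} // fail fast instead of hanging

	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:8500/v1/kv/health-probe", // hypothetical key
		bytes.NewBufferString("ok"))
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("write failed or timed out:", err) // a stuck write path lands here
		return
	}
	defer resp.Body.Close()
	fmt.Println("write status:", resp.Status)
}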


slackpad commented Jan 9, 2018

Hi @hehailong5, we've fixed two issues in 1.0.0 and later that are probably related to this: #3545 and #3700. There's normally a timeout in Raft itself: if a leader loses contact with its followers, it will step down. With those issues, Raft itself was working OK, but there was a problem on the leader preventing it from taking writes, so things could get stuck.

What's weird though is that you are seeing "log not found" errors (similar to #2837) that aren't consistent with that, so I think this needs a deeper look.
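(Note on monitoring, a sketch under assumptions rather than maintainer guidance: because /v1/status/leader can keep answering while the leader is unable to commit writes, one additional probe on a 0.8+ cluster, where the operator autopilot health endpoint should be available, is to poll /v1/operator/autopilot/health with a short timeout and alert when it reports unhealthy. The agent address and 2-second limit below are assumptions.)

// Sketch only: poll the autopilot health endpoint with a short timeout and
// report the cluster's own view of server health.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// autopilotHealth holds the subset of the response fields used here.
type autopilotHealth struct {
	Healthy          bool
	FailureTolerance int
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get("http://127.0.0.1:8500/v1/operator/autopilot/health")
	if err != nil {
		fmt.Println("health check failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	var health autopilotHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		fmt.Println("could not decode health response:", err)
		return
	}
	fmt.Printf("healthy=%v failureTolerance=%d\n", health.Healthy, health.FailureTolerance)
}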

@slackpad slackpad added type/bug Feature does not function as expected theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization labels Jan 9, 2018
@slackpad slackpad added this to the Unplanned milestone Jan 9, 2018

chymy commented Aug 27, 2018

Hi @slackpad, I also encountered the problem described in #3852. When will it be solved?

alitvak69 commented:

Sorry to piggyback on this issue, but as recently as 10/01 we experienced a system-wide outage because of lockups similar to those described in #3852. With the TTL set to 3 seconds for a number of services, we saw a wave of timeouts in which TTL check updates were taking 3-5 seconds. Since this is a fairly small cluster, updates normally take milliseconds or even microseconds for us. Can you please review the issue and let us know what may be done to prevent it from happening again?

Any help is greatly appreciated.
