
Consul write API hangs for half an hour before returning an RPC error #3738

Open
hehailong5 opened this issue Dec 12, 2017 · 5 comments
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected
Milestone
Unplanned

Comments

hehailong5 commented Dec 12, 2017

version: 0.8.4

When this happens, the leader repeatedly prints:

2017/12/05 11:10:21 [WARN] raft: AppendEntries to {Voter 10.0.0.9:8300 10.0.0.9:8300} rejected, sending older logs (next: 14336)
2017/12/05 11:10:21 [ERR] raft: Failed to get log at index 14335: log not found

and the follower prints:

2017/12/05 11:10:11 [ERR] consul: failed to reconcile member: {server-1 10.0.0.10 8301 map[build:0.8.4:f436077 vsn_max:3 raft_vsn:2 wan_join_port:8302 dc:dc1 id:c29aeab8-3731-17d0-8796-b7af5d3953ea port:8300 role:consul vsn:2 vsn_min:2] alive 1 5 2 2 5 4}: leadership lost while committing log
2017/12/05 11:10:11 [ERR] consul: failed to reconcile: leadership lost while committing log
2017/12/05 11:10:11 [INFO] consul: cluster leadership lost
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] consul: New leader elected: server-1
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218803.tmp

2017/12/05 11:10:18 [INFO] snapshot: reaping snapshot /opt/application/consul-works/data-dir/raft/snapshots/3-12783-1512459076524
2017/12/05 11:10:18 [INFO] raft: Copied 625724 bytes to local snapshot
2017/12/05 11:10:18 [INFO] raft: Compacting logs from 14333 to 4113
2017/12/05 11:10:18 [INFO] raft: Installed remote snapshot
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218916.tmp
slackpad (Contributor) commented:

Hi @hehailong5, can you reproduce this, or was it a one-time event? Also, there have been several Raft-related fixes since 0.8.4, so I'd definitely recommend running a newer version of Consul.

hehailong5 (Author) commented:

We have only seen this once since we upgraded to 0.8.4.
I have two questions though:

  1. Is there any timeout option for the Consul APIs?
  2. We only detected this after half an hour, because we monitor the health of the Consul cluster via /v1/status/leader, and in this case that URL was working as expected. How do we reliably measure the cluster's health, then?
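(Note on question 1, as a sketch rather than anything from this thread: independent of whatever server-side options exist, a caller can always bound a Consul write with its own HTTP timeout, so a stuck leader surfaces as a fast error instead of a half-hour hang. The sketch below goes against the plain HTTP KV API; the agent address, key name, and 5-second limit are illustrative assumptions, not values from this issue.)

// Sketch only: bound a Consul KV write with a client-side timeout so a stuck
// leader shows up as a quick error. Address, key, and timeout are assumptions.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second} // fail fast instead of hanging

	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:8500/v1/kv/health-probe", // hypothetical key
		bytes.NewBufferString("ok"))
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("write failed or timed out:", err) // a stuck write path lands here
		return
	}
	defer resp.Body.Close()
	fmt.Println("write status:", resp.Status)
}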


slackpad commented Jan 9, 2018

Hi @hehailong5, we've fixed two issues in 1.0.0 and later that are probably related to this: #3545 and #3700. There's normally a timeout in Raft itself: if a leader loses contact with its followers, it will step down. With those issues, Raft itself was working OK, but there was a problem on the leader preventing it from taking writes, so things could get stuck.

What's weird though is that you are seeing "log not found" errors (similar to #2837) that aren't consistent with that, so I think this needs a deeper look.
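(Note on monitoring, a sketch under assumptions rather than maintainer guidance: because /v1/status/leader can keep answering while the leader is unable to commit writes, one additional probe on a 0.8+ cluster, where the operator autopilot health endpoint should be available, is to poll /v1/operator/autopilot/health with a short timeout and alert when it reports unhealthy. The agent address and 2-second limit below are assumptions.)

// Sketch only: poll the autopilot health endpoint with a short timeout and
// report the cluster's own view of server health.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// autopilotHealth holds the subset of the response fields used here.
type autopilotHealth struct {
	Healthy          bool
	FailureTolerance int
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get("http://127.0.0.1:8500/v1/operator/autopilot/health")
	if err != nil {
		fmt.Println("health check failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	var health autopilotHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		fmt.Println("could not decode health response:", err)
		return
	}
	fmt.Printf("healthy=%v failureTolerance=%d\n", health.Healthy, health.FailureTolerance)
}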

@slackpad slackpad added type/bug Feature does not function as expected theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization labels Jan 9, 2018
@slackpad slackpad added this to the Unplanned milestone Jan 9, 2018

chymy commented Aug 27, 2018

Hi @slackpad, I also encountered the problem described in #3852. When will it be solved?

alitvak69 commented:

Sorry to piggyback on this issue, but as recently as 10/01 we experienced a system-wide outage because of lockups similar to those described in #3852. With the TTL set to 3 seconds for a number of services, we saw a wave of timeouts in which TTL check updates were taking 3-5 seconds. Since this is a fairly small cluster, updates normally take milliseconds or even microseconds for us. Can you please review the issue and let us know what may be done to prevent it from happening again?

Any help is greatly appreciated.
