Healthchecks resynced every ~2mins, causes consul index to grow #3259
A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5 |
@shays10 thanks for the update. This kind of thing is usually related to some changing output, but the downgrade fixing it doesn't really fit in with that explanation. Do you have ACLs enabled? |
@slackpad Thanks for your reply! No, ACLs are currently disabled in our cluster |
Hi @shays10
Did you downgrade the clients, servers, or both? |
Hey again! Both.
|
I've just faced the same issue with the latest consul. client
server
|
No. https://github.com/hashicorp/consul/blob/master/agent/check.go#L400-L415 Have you got a timestamp in the body? |
Nope. |
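For context, the timestamp question matters because changing output is the usual trigger slackpad mentions above: if the response body changes on every probe, the agent eventually has to re-sync the check even though its status never changes. A minimal illustration, assuming a plain HTTP check endpoint (the paths and port here are made up, not anyone's actual check target):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Stable body: the output only changes when health actually changes.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "OK")
	})
	// Changing body: a fresh timestamp on every probe means new check
	// output on every run, which can force repeated syncs.
	http.HandleFunc("/health-with-timestamp", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "OK at %s\n", time.Now().Format(time.RFC3339))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```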
@ponimas Can you provide the check definition and a full check request/response for testing? |
As far as I understand, the request is a simple GET request
|
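For readers who want to reproduce this kind of setup, here is a hypothetical registration of a service with a simple HTTP GET check using the official Go client; the service name, port, and URL are placeholders, not ponimas's actual definition:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder service and check; the agent will GET the URL every 10s.
	reg := &api.AgentServiceRegistration{
		Name: "web",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://localhost:8080/health",
			Interval: "10s",
			Timeout:  "2s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```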
@magiconair is there any news on this issue? |
consul 0.9.2 - same behaviour. Is it possible that the Date header may trigger the index to grow?
|
I'm currently a bit sick and also preparing for HashiConf next week. Should be able to catch my breath after that. |
@magiconair is that gonna be fixed in 1.0? |
@ponimas If I can find the root cause then probably. However, usually this is triggered by the |
@ponimas does that |
@slackpad yup it changes. But it's a header. |
@ponimas ah ok - I thought you were actually running that |
btw our indexes are still growing insanely fast even without health check output. However, consul watchers and consul-template do not trigger on them and in general they trigger only on changes - there should be some undocumented magic here, as I always thought the trigger event for such blocking queries was an X-Consul-Index change. |
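For reference, a minimal sketch of the blocking-query pattern being discussed, using the official Go client against /v1/health/state/any; the only point it illustrates is that the loop is gated on the index the server returns (surfaced to HTTP clients as the X-Consul-Index header):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	var lastIndex uint64
	for {
		// Blocks until the index advances past lastIndex (or the wait times out).
		checks, meta, err := client.Health().State("any", &api.QueryOptions{
			WaitIndex: lastIndex,
		})
		if err != nil {
			log.Fatal(err)
		}
		if meta.LastIndex != lastIndex {
			log.Printf("index moved %d -> %d (%d checks)", lastIndex, meta.LastIndex, len(checks))
			lastIndex = meta.LastIndex
		}
	}
}
```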
Might be related as well: #3712 |
Since commit 9685bdc, service tags are added to the health checks. Otherwise, when adding a service, tags are not added to its check. In updateSyncState, we compare the checks of the local agent with the checks of the catalog. It appears that the service tags are different (missing in one case), and so the check is synchronized. That increases the ModifyIndex periodically when nothing changes. Fixed it by adding serviceTags to the check. Note that the issue appeared in version 0.8.2. Looks related to hashicorp#3259.
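A simplified, non-Consul sketch of the comparison failure that commit message describes; the struct and field names are illustrative only. When one side carries ServiceTags and the other does not, a field-by-field comparison never matches, so the agent keeps re-registering the check and its ModifyIndex keeps climbing even though nothing meaningful changed:

```go
package main

import (
	"fmt"
	"reflect"
)

// Trimmed-down, hypothetical check representation for illustration.
type healthCheck struct {
	CheckID     string
	Status      string
	Output      string
	ServiceTags []string
}

func main() {
	local := healthCheck{CheckID: "service:web", Status: "passing", ServiceTags: []string{"v1"}}
	// The other copy is missing ServiceTags, so the two never compare equal.
	catalog := healthCheck{CheckID: "service:web", Status: "passing"}

	if !reflect.DeepEqual(local, catalog) {
		// In the real agent this mismatch triggers a sync, bumping
		// the check's ModifyIndex on every anti-entropy pass.
		fmt.Println("checks differ -> re-sync")
	}
}
```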
Closing this as fixed by #3642 which looks to be the root cause of this. |
We've upgraded to consul 1.0.2 but we're seeing this issue with the nomad health check. Almost every 2 minutes it does a synced check of the
|
After a lot of debugging I found the cause (of my issue) and submitted a PR
Hey,
This is very much related to #1244
I'm using blocking HTTP queries and I encountered a strange behavior where the consul index is incremented even when healthchecks didn't change their status (or their output).
So I looked at the value of X-Consul-Index, searched for an offending healthcheck that had that exact same number (via /health/checks/any), and presumed I would find that the check state had changed.
But all I could find in the log is:
https://gist.github.com/shays10/d0bdb43e0ae3664af1686aa139f18965
All checks are synced again, and this happens every 90-120 seconds. I'm guessing that it happens when antiEntropy kicks in and it finds that the service is synced but the checks are not.
I managed to find a correlation between the timestamp of the sync (according to the log) and the time where the ModifyIndex was bumped in all the checks.
Any idea how to find the root cause? I have lots of checks on a lot of nodes, and this causes the index to go wild.
I tried to make sure, as much as I could, that there isn't a flaky health check that fails randomly or returns different output.
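For anyone hunting a similar problem, a sketch of the kind of polling that helps spot which check's ModifyIndex keeps moving; "web" is a placeholder service name and the struct below is only a minimal view of the /v1/health/checks/<service> response:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Minimal view of the fields we care about in the health-checks response.
type check struct {
	CheckID     string
	Status      string
	ModifyIndex uint64
}

func main() {
	last := map[string]uint64{}
	for {
		resp, err := http.Get("http://127.0.0.1:8500/v1/health/checks/web")
		if err != nil {
			log.Fatal(err)
		}
		var checks []check
		if err := json.NewDecoder(resp.Body).Decode(&checks); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		for _, c := range checks {
			if prev, ok := last[c.CheckID]; ok && prev != c.ModifyIndex {
				log.Printf("%s: ModifyIndex %d -> %d", c.CheckID, prev, c.ModifyIndex)
			}
			last[c.CheckID] = c.ModifyIndex
		}
		time.Sleep(30 * time.Second)
	}
}
```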
Client: 0.8.3
Server: 0.8.3
Client:
Server: