Blocking calls on /v1/health/service/<service> never block, resulting in request flooding #1244
Hi @agemooij, what is the motivation for selecting such a large index? Consul is going to keep returning until it hits that index. Your first request should always use index 0. Then the response will include a response index that you add to the next query. You can see the Go version of this here: https://github.com/hashicorp/consul-template/blob/a020700c8ed8ea70450460f0eedef21197738ef2/watch/view.go#L120-L170. You need to add some logic to check the return values, specifically the stale and last index values, and then store those results for comparison in the next request. What version of Consul are you running? Can you restart your cluster with debug logging and send us a gist? |
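As a sketch of the loop described above, assuming a local agent and a placeholder service name (error handling and back-off are simplified), a blocking-query client in Go might look like this:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

func main() {
	// Placeholder agent address and service name.
	const base = "http://127.0.0.1:8500/v1/health/service/my-service"
	var index uint64 // the first request always uses index=0

	for {
		url := fmt.Sprintf("%s?index=%d&wait=60s", base, index)
		resp, err := http.Get(url)
		if err != nil {
			time.Sleep(5 * time.Second) // back off on transport errors
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// The next blocking query must use the index Consul returned for this one.
		newIndex, perr := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)
		if perr != nil {
			newIndex = 0 // missing or garbled header: start over from 0
		}
		if newIndex != index {
			fmt.Printf("index %d -> %d (%d bytes of health data)\n", index, newIndex, len(body))
			index = newIndex
		}
	}
}
```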
The index in that example request is taken from the `X-Consul-Index` header of the previous response. So the question is therefore why that index is incremented so often. Can this be caused by incorrect usage or incorrect configuration? Example output:
|
@sethvargo just to be clear, what we do for all requests except for the first one is to set the `index` query parameter to the value of the `X-Consul-Index` header returned by the previous response. |
We are using Consul 0.5.2. We can restart the cluster in debug mode but it might have to wait until tomorrow because our mutual acquaintance Bas Tichelaar just left for a meeting and he's the one with access. |
I just found this mailing list discussion: https://groups.google.com/forum/#!topic/consul-tool/EP10Wdb1QKQ It seems to indicate that health checks can cause idempotent index increments, i.e. nothing has changed but the index still increments:
I've compared the output of the sample calls I've posted above and the output is always the same for that service while the index keeps incrementing. I would expect the index for a certain service/endpoint not to be influenced by writes to other, unrelated endpoints (i.e. health checks for other services or unrelated kv writes, etc.). Is that indeed how it works? If not, what is the scope for these indices? Are there other common causes for idempotent writes that might be causing our indices to be incremented so often? |
I did some more tests and found that sometimes the output of a health check is indeed "different", but those differences are purely in the ordering of objects in JSON arrays, i.e. the order of the nodes in the response is sometimes different. Would that cause the index for that service to be incremented? If so, that would explain the behaviour we are seeing, although I would argue that it would also mean the Consul change detection system is broken by the way Consul itself generates JSON (i.e. with non-deterministic JSON array ordering). |
Hi @agemooij - the Consul health checker isn't aware of any format for the check outputs so it does a simple string compare to see if they have changed; this would definitely cause write and index churn even if the health status is stable. Go intentionally randomizes map iteration order which I think causes the non-determinism in our JSON output. If you can filter your health check outputs to make them deterministic before Consul sees them, then they should be truly idempotent: https://github.com/hashicorp/consul/blob/master/command/agent/local.go#L241-L244 |
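As a sketch of the filtering suggested here, assuming the check output is a JSON array whose element order is the only thing that varies between runs, a small wrapper could canonicalize the output before Consul sees it; the `name` field used for sorting is a hypothetical example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// Reads a JSON array of objects on stdin and re-emits it with the elements
// sorted by a (hypothetical) "name" field, so the bytes Consul string-compares
// stay identical as long as the underlying health is the same.
func main() {
	var items []map[string]interface{}
	if err := json.NewDecoder(os.Stdin).Decode(&items); err != nil {
		fmt.Fprintln(os.Stderr, "bad check output:", err)
		os.Exit(2)
	}
	sort.Slice(items, func(i, j int) bool {
		a, _ := items[i]["name"].(string)
		b, _ := items[j]["name"].(string)
		return a < b
	})
	out, _ := json.Marshal(items) // json.Marshal emits map keys in sorted order, so objects are stable too
	fmt.Println(string(out))
}
```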
We are returning JSON arrays in our health check outputs, so this might very well be the cause of our index churning. I'll investigate immediately and report back. If this is really the cause of our problems it might be good to add some kind of "health check best practices" note to the Consul docs to prevent other people from making the same mistake. If such a section already exists, consider me eating humble pie and shamefully sneaking off to RTFM 😄 |
Hi @slackpad and @sethvargo, we ran some tests and this is the current situation:
The big question is therefore: what could be triggering index increments for those two endpoints for every single service we have running, especially if we can verify that their individual health checks do in fact return idempotent responses? Just to be sure, does Consul guarantee that each individual service has its own index and that changes to one service don't trigger an index increment for another service? |
Snippet from the |
Hi guys - it looks like the indexes are not scoped per-service for the health checks; Consul tracks one index for the whole table, so if any check changes it will wake up your watches (even for services whose check output doesn't change). If there's one check someplace that's non-deterministic, then it could be causing the wake-up problem you are seeing. Looking in the agent code, there's a feature that only sends the output every 5 minutes if the service status hasn't changed, to throttle these down, but if you have a ton of nodes this could still be pretty frequent. Alternatively, if something was registering and unregistering a service over and over again (one that had checks associated with it), I think this would ratchet up the check and service indexes as well. Do any of these sound like they could be what you are seeing? |
We might very well have one service with a flaky health check and that is certainly something we are going to investigate ASAP. AFAIK we do not have any flapping services since this is something we monitor pretty well; but we will double-check to be sure. But can I ask the obvious question?
This is extremely surprising, and from your language (i.e. "it looks like") it seems you didn't expect this either. Is this as designed, or did we just find a major bug in Consul? What do you mean by "the table"? Which exact endpoints share the same index? From our output it looks like at least all health and catalog endpoints for our services share one. The docs definitely imply that indexes are per resource/service instead of shared globally. For example, look at the "Blocking Queries" section of the HTTP API docs, where it says:
Are the docs wrong or is this a bug? |
Hi @agemooij, it's complicated, which is why I had to go look in the code :-)

Each entity (node, service, check, kv) has a separate table, and Consul tracks an index per table, so there's no inherent coupling between these tables. If you delete a node that has a service and a check, though, it will bump all three of those indexes because it hit all three tables. For the KV table we store extra index information associated with each key, so that watches can be smarter about determining whether a write to the table (which always bumps the table's index) applied to the key or subtree you were watching. This hasn't been implemented for the other entity types yet, though we intend to do that soon, so right now any check update will wake up all the watches on health checks.

I traced through the path where the agents push check updates to the server. This definitely looks like there's a check with non-deterministic output spamming up your Raft log and waking up your watches. We are actively working on improving the granularity of these queries so you won't get this behavior (only the churn-y service would be waking). Sorry for the confusion on this! Also let me know if you don't find the culprit or think it's something else, and I can help you track it down.

I'll leave this open as a reminder to improve the docs. We should definitely add some notes about how check output is handled. |
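To make the granularity point concrete, here is a toy model (not Consul's actual implementation) of the difference between a single per-table modify index and KV-style per-key indexes:

```go
package main

import "fmt"

// toyTable mimics the behaviour described above: one modify index for the
// whole table (what health watches block on today), plus per-key indexes of
// the kind the KV store keeps, which would allow finer-grained watches.
type toyTable struct {
	tableIndex uint64            // bumped on every write, wakes every watcher
	keyIndex   map[string]uint64 // per-entry index, only the touched entry moves
}

func (t *toyTable) write(key string) {
	t.tableIndex++
	t.keyIndex[key] = t.tableIndex
}

func main() {
	t := &toyTable{keyIndex: map[string]uint64{}}

	// A noisy check on service-a writes three times...
	t.write("service-a")
	t.write("service-a")
	t.write("service-a")

	// ...which moves the table-wide index even though service-b's own
	// index never changed.
	fmt.Println("table index:", t.tableIndex)                 // 3
	fmt.Println("service-b index:", t.keyIndex["service-b"])  // 0
}
```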
Thanks @slackpad for the update. A fix to make the granularity of updates match expectations and the documentation would be very much appreciated.

We are actively looking for all sources of non-identical health check responses, and we have already found some obvious ones, mostly health checks running against software we don't control, like Cassandra and Prometheus. Those services have timestamped health check responses. In my experience this is quite common behavior, so finding out that Consul does not handle such situations well is troubling. I think a lot of people using consul-template or other forms of watches might be running into this same problem, probably without even noticing, except for some mysteriously high load on their Consul clusters.

It seems some subtler, more configurable behavior is probably called for here. Perhaps some way of either ignoring or semi-interpreting the health check responses? |
We managed to find the cause of this issue. First we made sure that all health check responses are idempotent, but the index was still increasing. So we started removing containers until we found the issue: registrator. Registrator has a resync option that periodically re-registers every service, which keeps bumping the index even though nothing has actually changed.

Having an index per service would not have solved this, but it still makes sense from our perspective: a change in service A should not trigger a watch on service B. We would still like to know if and when this behavior will be changed in Consul. I will also create an issue for registrator and will contribute this to the documentation of registrator. Anyone who is using the resync option should be aware of this behaviour. |
Thanks for the updates and I appreciate the ping on registrator. @agemooij - I'd hesitate to make Consul aware of the format of any output; wrapping a check in a script that preps and filters is probably better for keeping Consul complexity low. We can make the documentation way better, though, and it might be worth a periodic warning if a healthy service is churning its output so it's more obvious what's going on (and easy to find and fix without having to run through all your checks manually). |
So are you saying the tables are shared across everything that is a service, or just across everything of a given service on a given node? For example, can having multiple checks on the same service trigger a change to that service's index, while a change to a different service on the same machine will not trigger a change for all services on the machine? Right now I have a service that basically cats a file on all of the machines in the cluster. The file's contents do not change, but the index number keeps incrementing. Is this intended, or is it some other service (some contain timestamps in their output) that is causing it to increment? |
@mbroome As far as I understand it, the granularity of writes is cross-service at the moment. I'm not exactly sure where the boundaries of these "tables" are, but in practice for us this means that any write for service X on any Consul endpoint (catalog, health, etc.) will also trigger all watches for all other services on all other endpoints. @slackpad It would be great if we could get better insight into the current granularity of writes/watches. An initial high-level generalization would already be very helpful. |
@slackpad I agree that interpreting health check responses would be the last possible choice I would expect you guys to make; I simply threw it into the discussion as an extreme alternative. If the granularity of writes were increased, the side effects of a chatty/flappy service would be greatly reduced. I also agree that "writes per minute" or something similar would be a very helpful metric to have per service, so you can quickly debug problems like this. |
@mbroome Yeah, @agemooij summarized it correctly: "any write for service X on any Consul endpoint (catalog, health, etc.) will also trigger all watches for all other services". It doesn't sound like your check is the culprit; more likely some other check in the cluster with non-deterministic output is waking your watch. @agemooij We will see what we can do to make this better - improved granularity of watches plus some better visibility for churn. |
So are there any plans to split up the indexes for services so that one service does not affect another? Right now I'm getting so much churn on indexes that it's not really practical to set up a watch for something that does not change often. Even if I filter the output of health checks to clean up things like timestamps, the next time a dev produces some output that I'm not aware of, everything will start to get reconfigured constantly, even if it's not related to the service the health check was for. In its current implementation, I can't really use watches. |
@mbroome - this is on our list. I'm not sure of the exact time frame but hopefully shortly after Consul 0.6 ships; we are trying for near the end of the month with that. |
@mbroome For us watches do still "work", but only because both consul-template and our own watches always compare the new response to the old response and only trigger reconfiguration when something has actually changed. The main impact we are seeing is the load produced by all these watches constantly being triggered and then resubscribing. This is not practical unless you have an extremely stable environment that hardly ever changes. @slackpad Great news! Glad to hear it is on your list. |
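As a sketch of the client-side mitigation described above, assuming the response body has already been normalized (as noted earlier in the thread, ordering in Consul's output can vary), a watch handler could hash each body and skip reconfiguration when nothing actually changed:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// watcher remembers a hash of the last response it acted on.
type watcher struct {
	lastHash [32]byte
}

// onWatchFired only triggers reconfiguration when the response content has
// changed, ignoring wake-ups where only the index moved. The body passed in
// is assumed to be normalized (e.g. node list sorted) before hashing.
func (w *watcher) onWatchFired(body []byte, reconfigure func()) {
	h := sha256.Sum256(body)
	if h == w.lastHash {
		return // index moved but content didn't: ignore
	}
	w.lastHash = h
	reconfigure()
}

func main() {
	w := &watcher{}
	w.onWatchFired([]byte(`[{"service":"a"}]`), func() { fmt.Println("reconfigured") })
	w.onWatchFired([]byte(`[{"service":"a"}]`), func() { fmt.Println("reconfigured") }) // identical content: no-op
}
```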
Hi. Any news on this? We still see it in 0.6.1. We'd like to watch selected services for changes, but any modification to the health of any service releases all watches. |
Hi @rytutis we are still tracking this and hoping to get it into Consul 0.7. |
Thanks |
Is this still on track for 0.7? Any rough idea when that will be? Q2 or H2? |
@mrwilby this may not make it into 0.7 given our current scope, but I think Q2 seems pretty likely. Will update this when I have a better estimate. |
Joining the dogpile on this one. This is particularly problematic when combined with consul-template. We are seeing long quiescence periods in environments with a lot of churn that would be helped significantly if these indices were actually per-resource rather than shared. |
We've been seeing the same issue. One observation we've made is that an always-failing health check will cause index churn. I would have expected the index to increment only when a service's health state transitions, not while it's constantly failing. Is this behavior by design? |
@Loushu when you see that, is the output of the failing check changing run to run? |
Having any service anywhere change state and thereby cycle all watchers everywhere seems like a significant impediment to operating at large scale; I am anxious to see improved granularity here. |
@jhmartin: We've tracked down our index churn to two distinct causes. The first one is related to failed health checks for non-running services on Windows, where the text of the failed check varies between two messages: one is a timeout message and the other is a connection-refused message. |
@slackpad Any updates on delivery? Which version? The problem exists on 0.7, but less often than on 0.6.4. I wonder what has changed. |
Hi, back to the question above. When is this granularity to be provided? |
Same problem for us. We can take a look at what users put in their health checks, but we don't control them at the source. |
We are trying to watch for changes in the healthy nodes for a certain service in order to adjust the configuration of our API gateway. We do this in a way very similar to consul-template, i.e. by doing a blocking call on `/v1/health/service/<service>`. Here's a typical request:

The problem is that it seems these calls never actually block, since the `X-Consul-Index` is different for every single request. This results in these watches flooding our servers with continuous requests, since we expect them to only return if something has changed or when the requests time out. We have a lot of services, so the resulting load on our servers is enormous. Our services are not flapping on their health and the actual health status in the responses is identical, so we don't see a reason for these indexes to be different on every request.

It appears from our initial testing that `consul watch` and consul-template are running into the same problem, always returning immediately even if nothing has changed.

Is this a bug in the way Consul determines the index of the `/v1/health/service/<service>` endpoint, or are we making incorrect assumptions?
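As an illustration (not the original example from this report), a single blocking call of the kind described above, plus a quick check of whether it actually blocked, might look like this in Go; the agent address, service name, index, and wait time are all placeholder assumptions:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// One-shot diagnostic: issue a single blocking query with the index from a
// previous response and report how long Consul actually held the request.
func main() {
	url := "http://127.0.0.1:8500/v1/health/service/my-service?wait=60s&index=42"

	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	elapsed := time.Since(start)
	fmt.Printf("blocked for %s, X-Consul-Index=%s\n", elapsed, resp.Header.Get("X-Consul-Index"))
	// If this prints a few milliseconds instead of roughly the wait time while
	// nothing has changed, you are seeing the behaviour reported in this issue.
}
```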