X-Consul-Index reflects global service churn when watching a single service, causing issues with sleeping between blocking calls #3890
Hi Pierre,

The docs on blocking queries say:

So it's not necessarily a bug that you get the same result, although not blocking at all is unexpected. That said, I can't reproduce that behaviour using your script locally; even with 4-second sleeps it still blocks as expected: (note I changed the
Ah, the indexes increasing every time in your second output are the clue we need, I think. When you hit a blocking endpoint with an index that is older than the current one, the result is returned immediately.

So the reason your script works with no sleep is that you are starting the next watch immediately from where you left off. In the case of 4 seconds, you clearly have enough write activity on your cluster that if you wait a few seconds the index will have changed, so you are always calling with a stale index and always getting an immediate return. In my case it was a local dev cluster with no other writes occurring, so even if I waited 4 seconds the index I used wasn't stale and blocking happened as expected.

I don't know that anything else can work - if you want to use blocking queries to stay in sync with state you kind of have to "keep up" with the current index, since Consul has no way to know, in between blocking calls, that you are still interested in that thing, who you are, or what index you last saw that thing at. In other words, having a delay between blocking calls is working as expected and is generally not what you want to do on a real cluster with any amount of activity.

Let me know if I've missed something and that explanation doesn't make sense to you!
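For illustration, a minimal sketch of that "keep up" pattern in Go against the plain HTTP API (the service name, agent address and wait time below are arbitrary choices, not from the thread): the next call reuses the X-Consul-Index returned by the previous response and is issued immediately, with no sleep in between.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical endpoint: a single service on a local agent.
	base := "http://localhost:8500/v1/catalog/service/redis"
	index := "0" // the first call returns immediately and gives us a starting index

	for {
		// Long poll: blocks until the watched entity changes or the wait expires.
		resp, err := http.Get(fmt.Sprintf("%s?index=%s&wait=5m", base, index))
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// Reuse the returned index for the next call, immediately, with no sleep.
		index = resp.Header.Get("X-Consul-Index")
		fmt.Printf("woke up at index %s, %d bytes of payload\n", index, len(body))
		// ...diff/process the payload here...
	}
}
```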
Ok, I think I understand the problem. Basically, we have very frequent updates (since our cluster is large enough), so X-Consul-Index is updated every 2-3 seconds. We have an issue because when we watch an endpoint (for instance /v1/catalog/services) we are continuously downloading data: as soon as you receive data, another update comes and you re-download the whole catalog. So the idea is to add a minimum delay (see for instance my PR in consul-template: hashicorp/consul-template#1066).

The problem is that we expected the following behavior:

/catalog/services/redis index=2

Someone watches /services/redis with X-Consul-Index = 2 => waits until /services/redis is updated... fine.

But if in the meantime another service is updated, for instance a new service my_new_service, we have:

/catalog/services/redis index=2

Then, at that time, if a watch is performed on /catalog/services/redis with index=2, the data is immediately returned (while the first watcher is still waiting for /catalog/services/redis).

This is a bit of a problem for us because:
It also means that on a cluster with enough updates per second, all watching clients are continuously re-downloading the watched data => a problem for us.

Here is a full test you can run on your laptop:

Reproduce the issue locally

Launch a Consul agent in dev mode:

consul agent -dev

Register a service named redis:

curl --request PUT -d '{"name": "redis", "port":8000}' localhost:8500/v1/agent/service/register

Watch the changes with the script, let's call this output OUTPUT1:

export CONSUL_SLEEP_DELAY=1; consul_check_services_changes.sh localhost:8500/v1/catalog/service/redis
X-Consul-Index: 2277 at Thu 15 Feb 2018 16:43:03 CET

In another window, launch this script:

while true; do curl --request PUT -d '{"name": "hello-'"$(date +%s)"'", "port":8000}' localhost:8500/v1/agent/service/register; sleep 1; done

=> The output of OUTPUT1 is not changing at all => exactly what we expect.

Ok, now launch in parallel the exact same script as OUTPUT1 in a new window, but let's call this output OUTPUT2 (OUTPUT1 is still running). Here is the output:

export CONSUL_SLEEP_DELAY=1; consul_check_services_changes.sh localhost:8500/v1/catalog/service/redis
X-Consul-Index: 2351 at Thu 15 Feb 2018 16:44:50 CET diff: No Differences found in service
X-Consul-Index: 2352 at Thu 15 Feb 2018 16:44:51 CET diff: No Differences found in service
X-Consul-Index: 2353 at Thu 15 Feb 2018 16:44:52 CET diff: No Differences found in service
X-Consul-Index: 2354 at Thu 15 Feb 2018 16:44:53 CET diff: No Differences found in service

=> Oh no! While we have the same parameters as OUTPUT1, we are continuously downloading content.

Conclusion

So it means that:

- The exact same program does not output the same data depending on whether it is run while the cluster is loaded or not.
- It is not possible to efficiently watch changes in a large Consul cluster. That's a big deal for us.

How to fix this Consul behavior?

To me, the behavior should be: for each endpoint, e.g. /services/my_service, Consul stores the index of its last update, let's call it my_service_index. When someone watches /services/my_service: if the X-Consul-Index provided is < my_service_index, return immediately; if not, watch as usual.
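A tiny sketch of that decision rule, assuming a hypothetical per-service index table (the names are made up and this is not Consul's actual state store; it only illustrates the proposed contract):

```go
package main

import "fmt"

// lastModifyIndex is a hypothetical per-endpoint bookkeeping table: for each
// service, the index at which it last changed.
var lastModifyIndex = map[string]uint64{
	"redis": 42,
}

// shouldReturnImmediately: if the client's index is older than the service's
// own last-change index, answer right away; otherwise block until a change
// that actually touches this service.
func shouldReturnImmediately(service string, clientIndex uint64) bool {
	return clientIndex < lastModifyIndex[service]
}

func main() {
	fmt.Println(shouldReturnImmediately("redis", 2))  // true: the client is behind
	fmt.Println(shouldReturnImmediately("redis", 42)) // false: block until redis itself changes
}
```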
@banks I think it might make sense to re-open this issue.
@banks Do you think we might re-open this?
Your suggestion could work, but doing it robustly amounts to a pretty major re-architecture of how watches work, I think - for one thing it would make the HTTP API stateful, or require explicitly registering "persistent" watches so Consul can separately log changes that affect only the subject of that watch.

I'm not familiar with the other issue you linked, but I'll read through properly and talk to the team. It may be, as @slackpad said in the other issue, that there are a few people who have run into these effects and we should consider this case and the options open (even if they are just to document this behaviour better for now). I'll reopen or merge as appropriate.

That said, it seems to me that the sleep between blocking calls is the main cause of the issue for you. With no sleep you might still get unlucky due to concurrent updates within the RTT of the previous call, but unless you are doing hundreds of updates a second or have extremely poor latencies, most of the time you will get the efficient blocking behaviour you want. Note that once a blocking query is made with a non-stale index, it won't return on any write except one that actually touches the entity you are watching. But if you sleep externally for longer than your average update interval, you never give your blocking client the chance to "catch up".

I think you mention "rate limiting" as the reason you want to sleep, but that seems fundamentally at odds with how stateless long polls work. Is there a real issue with not sleeping between blocking calls?

Thanks for your help and patience :)
I re-read and looked at your linked PR. It seems the rate limiting is needed because you have some endpoints that legitimately change really often (like watching all service definitions with lots of service churn) and downloading the full response in a busy loop is too expensive. In that case then yes, an HTTP long poll simply can't be an efficient solution, and your best option is to explicitly switch to regular polls that don't block (which work fine and let you trade staleness against resource usage). Even if you pass …

Where I'm not so clear is why you still want to use blocking queries and sleeps for endpoints that don't change very often. Could you elaborate on that part please? I'm trying to understand your use-case fully because I fear the "real" solution is pretty involved :)
Hello Paul, thank you very much for your response.

Basically, I am not convinced the sleep is the reason; the reason is simply that as long as X-Consul-Index changes fast enough, a watcher cannot submit its index before it changes, so the watch won't be effective and will return data immediately. So basically, this would also happen with a non-sleeping client that is slow enough (or one not using the local Consul agent but a remote one with enough latency).

On the API side, all the people I have been talking to (several of our engineers have implemented watches in various languages) interpret X-Consul-Index as: "the last time I saw this resource, it was at this version number; notify me when the version of this endpoint changes." Which looks like a reasonable contract. But, as my tests show, that is not the contract; the contract is:
Why is all of this a concern? For several endpoints, X-Consul-Index is updated very often: /v1/catalog/services for instance, but also any endpoint that does not exist (for instance see #3712), is updated every second or so in our large cluster. When watching such an endpoint, we then spend our time making requests to the Consul agent, which forwards them to the Consul servers => on our clusters, several hundred Mbit/s between the Consul agent and the Consul server for a single client watching for changes (see hashicorp/consul-template#1065 and the corresponding PR: hashicorp/consul-template#1066).

So, basically, the idea is: the client does not care about seeing a change every few milliseconds; we want changes within a reasonable time, for instance 5 seconds, so there is no need to re-watch an endpoint as soon as we receive the data. Let's wait 5 seconds and watch again. BUT the behavior of Consul as described in this bug defeats that, since data will be returned immediately. It means we are basically not doing watches, we are doing active polling (thus X-Consul-Index has no value at all for large clusters).

In order to build libraries using those mechanisms, as a developer you want some reasonable expectations:
This cannot easily be achieved with the current Consul implementation, thus I would suggest the following:
It would fix the issue efficiently and be a clearer contract; it would fix lots of bugs and would avoid DOSing large Consul clusters.

Just to share some experience: in our company, due to this kind of issue, we have been DOSed by:
All these apps have been written by reasonable, clever, and different people, so the current implementation of Consul does not look satisfying.

I am OK to try implementing this if you want.

Regards
Hello @banks,

First POC

I've been working on a patch with very good results. Basically, when retrieving the index of changes for a service, the current implementation does this in catalog.go:

Which basically retrieves the max Consul index for both nodes and services. I have been working on a patch that computes the maximum index of the services themselves when retrieving the list of services, by replacing line 777 with the following:
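Purely as an illustration of the idea (the actual catalog.go snippets are not shown here; the types and index numbers below are made up), the difference between the two computations is roughly:

```go
package main

import "fmt"

// ServiceNode is a made-up stand-in for a catalog entry; only the field
// relevant to the illustration is kept.
type ServiceNode struct {
	Name        string
	ModifyIndex uint64
}

// globalIndex mimics the old behaviour: the returned index is the max over the
// whole nodes and services tables, regardless of which service was queried.
func globalIndex(maxNodesIndex, maxServicesIndex uint64) uint64 {
	if maxNodesIndex > maxServicesIndex {
		return maxNodesIndex
	}
	return maxServicesIndex
}

// serviceIndex mimics the patched behaviour: the returned index is the max
// ModifyIndex over the entries returned for that one service.
func serviceIndex(entries []ServiceNode) uint64 {
	var highest uint64
	for _, e := range entries {
		if e.ModifyIndex > highest {
			highest = e.ModifyIndex
		}
	}
	return highest
}

func main() {
	redis := []ServiceNode{{"redis-1", 50}, {"redis-2", 42}}
	fmt.Println(globalIndex(120, 200)) // 200: moves on every unrelated write
	fmt.Println(serviceIndex(redis))   // 50: only moves when redis itself changes
}
```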
This works quite well (I don't have the issue anymore); it suffers from a drawback, however: since I am computing the X-Consul-Index from the existing services only, there can be a race condition when removing services. Consider this example:

Max Index := 79 (aka global X-Consul-Index is 79)

A user retrieves the result and the X-Consul-Index (whether we return 50 or 79 in the HTTP headers does not change anything).

At index=80, instance2 is removed.

At index=85 the user comes back with its value X-Consul-Index = 79. The computation of the index is now:

So the user will block and won't be notified of the deletion of instance2. My approach does not work for removal of services, so this is a dead end.

Possible solution

So, what I propose is the following:
I think such a patch is not all that intrusive and would really help all tools performing watches. If I submit such a patch, would you consider its inclusion upstream?
Hi Pierre - thanks for the follow-up. I've been trying to fully understand the issue here. From your last message I think you fully understand this, but I'm going to spell out the subtleties and my own bad assumptions here for others who are following too.
So for the vast majority of users, blocking semantics work perfectly well and actually match the docs and your original expectations for the index parameter's semantics. The issue you have is a combination of factors:
Possible Work-arounds

I'm sure you already looked into this thoroughly, but as far as I can see it still works fine currently even at your registration rate, provided you don't sleep between blocking calls. I understand that for results with a large payload and regular change (e.g. all services in the DC) you might want to limit bandwidth with a sleep. It seems that is easy to do by switching to short polls with a delay of your choice though. So a solution with no changes:
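A rough sketch of that non-blocking short-poll pattern in Go against the plain HTTP API (the endpoint and the 5-second delay are arbitrary choices for illustration):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoint: the full service catalog of the datacenter.
	url := "http://localhost:8500/v1/catalog/services"

	for {
		// No ?index= parameter: a plain, non-blocking read that returns at once.
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("fetched %d bytes at %s\n", len(body), time.Now().Format(time.RFC3339))
			// ...diff/process the payload here...
		}

		// The delay is now an explicit client-side choice: bandwidth vs. staleness.
		time.Sleep(5 * time.Second)
	}
}
```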
Is there some specific reason that you can't take this approach? I realise it would be nice if you never had to think about this and blocking just worked for you every time, but I'm trying to work out if that's a "nice to have" or if there is some fundamental limitation that means you can't take this approach and really do need single-service watches that sleep between calls?

Your Solution

Your proposed solution is reasonable, although it's not a totally trivial change. I initially assumed there was a simpler option just like you suggested, but as you found, de-registrations would not be caught.
I don't think it's super intrusive, but it's not a no-brainer either. As far as I can see, the vast majority of users don't run into this problem for the reasons I mentioned, so while it would be a good addition I'm not sure it would be immediately useful to a lot of Consul users. If you think it's important enough to contribute despite the workarounds I mentioned (or because they really won't work for you for some reason), then I think your approach sounds reasonable :)

Thanks for all your effort chasing this down. I've changed the issue title to reflect the scope/feature request a little more specifically.
@banks Thank you very much for your comments and explanations.

Actually, we already have the issue without sleeping. In some of our code, the watchers perform operations on the services; since we have some services with many, many machines, the time needed to process the changes plus the number of changes per second (around 3 X-Consul-Index updates per second on the biggest cluster) creates this issue (while mitigated by the patch by my colleague, #3642): some of our apps spend their time garbage collecting, since they download around 50 Mbyte/sec for a single service.

I suspect it will also greatly improve the performance of tools such as consul-template (we have around 1.3k different kinds of services with around 5k machines per cluster). I'll try to give you measurements soon if I can.

Thank you

Regards
Thanks @pierresouchay for your work getting this improvement in! |
@banks For the record, here is the effect of the patch on our cluster: this graph shows the number of blocking queries the servers are receiving per second.
The blue and red servers were restarted around the same time. The inflexion point of the red line is when we applied the patch on one of our 3 servers. On the network bandwidth side, traffic went from an average of 720 Mbps (peaks around 1.5 Gbps) to 361 Mbps (peaks around 450 Mbps)... This patch really changes the performance on our side. Thank you for merging it!
Linked to hashicorp/consul#3890. Having a duration of more than 3 seconds might actually increase the bandwidth. Change-Id: Iff9104af5f1ce52cadc449a10de657d5c1b2a968
Just a note: this change probably should be labeled as a breaking change in the changelog. There's a possibility that code relies on the old behavior to reduce the number of open connections to the Consul agent (e.g. when monitoring 10 services, it opens a single connection and, when X-Consul-Index changes, checks the status of the other 9 services).
@takeda this kind of usage is unlikely: when the timing was right, the call would still block (see the description here: #3890 (comment)), so relying on this pattern required a very high write rate in the cluster AND delaying the call. But you had no guarantee whether it would block or not. It means that for small clusters it does not change anything; for large ones it might, but it is very hard to get right, so it is unlikely someone relied on this behavior on purpose.

What we are seeing in our clusters is a huge increase in apps waiting on watches and a very significant decrease of load, CPU and bandwidth on all Consul agents/servers.
@pierresouchay thanks for the follow-up info - really good to see.

@takeda we are not considering it a "breaking change", which would imply holding it until a major version release. You are right that there are potential edge-case clients that might rely on undocumented side-effects of the old implementation, and we've been careful to word that adequately in the upgrade guide as a clear notice. Hopefully that is adequate. See the changes in the PR: https://github.com/pierresouchay/consul/blob/7b81e2c3ad133ef5c460dd642ab44dd07bf08c8a/website/source/docs/upgrading.html.md#upgrade-from-version-106-to-higher

I've also made the CHANGELOG entry more explicit (good catch, thanks!): https://github.com/hashicorp/consul/blob/master/CHANGELOG.md
The reason I mentioned it is that my code will be broken by this change, and if I did it, chances are other people did it too. I personally think this type of change shouldn't be included when only the patch-level number is increased (i.e. the last number of the version). If you have to include upgrade instructions like these: https://github.com/hashicorp/consul/blob/master/website/source/docs/upgrading.html.md#upgrade-from-version-106-to-higher that's a sign the change should come with at least a minor version increase (i.e. the second digit). Most projects follow this convention.

Having said that, I already know that I need to make changes in my code before upgrading to 1.0.7, so I will be careful around that version. Thanks also for adding the extra note.
Oh, ok, sorry about that :( Just curious, how did it break the code? What assumption was broken?
The assumption that X-Consul-Index is global for all health checks - undocumented, but still. Basically your issue is that with a large cluster the servers are bombarded with requests, so this was a measure to mitigate that. Since previously all health checks shared the same index, my code simply tried to minimize requests: if there are 10 services to check, the code only checks the state of the first service, and if the version changed it then queries the other 9 services for changes.

I think overall this change is for the better, but I wouldn't call it a small change that warrants being in 1.0.7, unless there won't be a 1.0.7 and 1.1.0 is released instead.

To be frank, I really wish Consul came with a protocol similar to ZooKeeper's, where clients use long-standing single connections. Upon connecting they just specify all the paths they watch, and ZK only sends a notification about what changed. No need to make multiple connections, no reconnecting after the wait timer expires, and no X-Consul-Index to track.
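For illustration, a sketch of the kind of client pattern described here (hypothetical code, not the actual implementation; service names and the endpoint are placeholders): block only on a "canary" service and re-check the others whenever its index moves, which only worked while the returned index reflected cluster-wide churn.

```go
package main

import (
	"fmt"
	"net/http"
)

// Ten services being monitored; only the first one is long-polled.
var services = []string{"svc-0", "svc-1", "svc-2" /* ... up to svc-9 ... */}

func healthURL(name string) string {
	return "http://localhost:8500/v1/health/service/" + name
}

func main() {
	index := "0"
	for {
		// Block on the "canary" service only. With the old behaviour, the
		// X-Consul-Index it returned reflected global churn, so it could be
		// treated as a cluster-wide "something changed" signal.
		resp, err := http.Get(healthURL(services[0]) + "?index=" + index + "&wait=5m")
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
		index = resp.Header.Get("X-Consul-Index")

		// Something changed somewhere: re-check the remaining services once each.
		for _, s := range services[1:] {
			r, err := http.Get(healthURL(s))
			if err != nil {
				continue
			}
			r.Body.Close()
			fmt.Println("re-checked", s)
		}
		// With a per-service index, the canary no longer reflects unrelated
		// writes, which is what breaks this pattern.
	}
}
```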
Thanks @takeda. It's a fine line to walk - we were originally extremely cautious about this patch for reasons like this, and requested those upgrade docs even though it wasn't changing a documented guarantee of the API. We also verified that the official tools will all be OK with it. But we'll review again as we decide on the 1.0.7 cut if the risk seems too great. The fact that it will affect at least one person - who managed to spot it in time - is interesting.
After #3712 (which is not really a big issue for us), we found another weird issue with X-Consul-Index while trying to improve Consul discovery within Prometheus: prometheus/prometheus#3814

When watching a service using X-Consul-Index, if we wait more than 3 seconds before watching the same path again, the next call won't block and will immediately return the same data (which has not changed).

Is there a reason for that, or is it a bug?
In order to reproduce, here is a script:
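The script itself is not shown here; a rough Go equivalent of such a watch-and-compare loop with a configurable sleep (assuming, as elsewhere in this thread, that the CONSUL_SLEEP_DELAY environment variable controls the delay in seconds; everything else below is illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: watch <url>, e.g. watch http://localhost:8500/v1/catalog/service/redis")
		return
	}
	url := os.Args[1]
	// Seconds to sleep between two blocking calls (0 if unset).
	delay, _ := strconv.Atoi(os.Getenv("CONSUL_SLEEP_DELAY"))

	index := "0"
	var previous []byte
	for {
		resp, err := http.Get(url + "?index=" + index + "&wait=10m")
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		index = resp.Header.Get("X-Consul-Index")

		if bytes.Equal(body, previous) {
			fmt.Printf("X-Consul-Index: %s at %s diff: No Differences found in service\n", index, time.Now().Format(time.UnixDate))
		} else {
			fmt.Printf("X-Consul-Index: %s at %s service changed\n", index, time.Now().Format(time.UnixDate))
		}
		previous = body

		// The sleep between blocking calls that triggers the behaviour discussed in this issue.
		time.Sleep(time.Duration(delay) * time.Second)
	}
}
```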
When launching the same script on our cluster at the same time with different delays between watches, here is what we observe:

- With delay < 3: the calls do block
- With delay > 3: the calls do not block and return immediately