-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve blocking queries on services that do not exist #4810
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This looks like a great enhancement to reduce furthermore the load on a cluster where clients are watching a non-existing service without any kind of rate limit. It also ensure that any client observing a soon-to-exist service will be notified quite less and will, because the full disappearance of a service happens far less than updates on services.
We had example (such as prometheus with Consul support) of implementation literally DOSing our clusters with this kind of behaviour. On another topic, the cache implementation will benefit from it as well.
Did you have time to take a look at this ? During our last Consul incident several of our services died, resulting in lots of requests because the blocking queries watching them did not block anymore. This put an additional load on the servers which were in bad shape already and has a worrying snowballing effect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Aestek.
This is simple and I understand the case it solves thanks!
My one concern is that although it might be manageable in practice, this does result in potentially unbounded server memory growth over time.
In a workload for example where services are registered with fairly dynamic names so they come and go a lot, this will leave behind index records for every service name that ever existed in the cluster with no way to purge those.
It would have to be a really pathological case to cause a major problem, but it's still not ideal to be adding something which by-design "leaks" state in this way with no bounds or way to recover the space.
In the near future we are likely to add a concept of global config for a service rather than only the registrations of individual nodes. This feature is pre-design so not a firm commitment and no details decided yet, but it's likely to appear in the next months. With that, there is a simple answer - you allow explicitly registering the (logical) service as "persistent" in it's global config regardless of if any actual instances are up and then we can enable the storage of these "deregistration" indexes only for "persistent" services. If the service goes away for good, the operator can remove it's global config or set persistent to false, and at that point servers can go delete all the indexes.
So I'm not quite sure whether to leave this a while until we have a firmer idea on that stuff and do it "properly", or build some other mechanism that does the same but will potentially be replaced soon? I'd be reluctant to make any public config or API changes to support this that we then have to keep forever even though they'll possibly be obsolete soon.
We could though, if you think it's worth the effort for the interim, add a fixed TTL like a week and then introduce a reaper goroutine on the leader that checks every hour for deregistered_index
entries that have not had any service registered for a week and remove them?
Do you have opinions on that?
agent/consul/state/catalog.go
Outdated
@@ -12,7 +12,8 @@ import ( | |||
) | |||
|
|||
const ( | |||
servicesTableName = "services" | |||
servicesTableName = "services" | |||
serviceMissingIndexName = "service_missing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe naming it "service_deregistered" would be clearer?
Maybe I'm missing something but this should not cause a memory leak : we only keep track of the last index any service was deregistered at. This is a single entry in the radix regardless of the number of the number of service. |
You are right - my apologies I assumed this was per-service like the other index in use in the same function. That does remove that concern and is a lot simpler which is great. I guess that does make a pathalogical case where services are being registered and de-registered (so that zero instances of the service any more) every second no better than now though since that index will be incrementing all the time, but that seems pretty rare in practice so I think it's a good solution for now. Could we rename that index and references in this PR to be more reflective of it's actual semantic, I suggested something before but given that its global maybe Then I think this would be great to merge as it is! Thanks for all the thought here. |
24fa561
to
a8c9ac3
Compare
@banks done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
agent/consul/state/catalog.go
Outdated
} | ||
// If the index is not set for the service, it will return the missing | ||
// service index. | ||
// The service_last_extinction is set to the last raf index when a service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The service_last_extinction is set to the last raf index when a service | |
// The service_last_extinction is set to the last raft index when a service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this typo is fixed I will merge this.
When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated. On clusters with frequent updates these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests. To remedy that we track the last index when a service was last unregistered and instead block on it when a query targets a missing service. This reduces the number of unnecessary updates and still correctly notify client when a change do happen.
a8c9ac3
to
0f4a0a9
Compare
@mkeeler woops, fixed |
Thanks @Aestek |
Background
When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.
Implementation
While reducing the number of unnecessary updates we still want :
To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :
client1
makes a query for servicefoo
, gets back a node and X-Consul-Index 42foo
is unregisteredclient1
makes a query forfoo
withindex=42
->foo
does not exist, the query blocks andclient1
is not notified of the change onfoo
We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly.
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist.
When a service
srv
is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients whilesrv
was up, allowing us to immediatly notify them.client1
makes a query for servicefoo
, gets back a node andX-Consul-Index: 42
foo
is unregistered, we set the "missing service index" to 43client1
makes a blocking query forfoo
withindex=42
->foo
does not exist, we check against the "missing service index" and return immediatly withX-Consul-Index: 43
client1
makes a blocking query forfoo
withindex=43
-> we blockfoo
is registered again on index 62 ->foo
exists and its index is greater than 43, we unblock the query