-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Connect: health API blocking doesn't notice changes to proxy health checks #5506
Labels
theme/connect
Anything related to Consul Connect, Service Mesh, Side Car Proxies
type/bug
Feature does not function as expected
Milestone
Comments
banks
added
type/bug
Feature does not function as expected
theme/connect
Anything related to Consul Connect, Service Mesh, Side Car Proxies
labels
Mar 18, 2019
This is the implementation of option 2 and seems to work from ad-hoc testing. Adding it here just because I've not written tests yet and want to unblock myself on other things. diff --git a/agent/consul/state/catalog.go b/agent/consul/state/catalog.go
index 4452298b7..ac28624c8 100644
--- a/agent/consul/state/catalog.go
+++ b/agent/consul/state/catalog.go
@@ -1896,8 +1896,18 @@ func (s *Store) checkServiceNodes(ws memdb.WatchSet, serviceName string, connect
// Return the results.
var results structs.ServiceNodes
+ // For connect queries we need a list of any proxy service names in the result
+ // set
+ var serviceNames map[string]struct{}
+ if connect {
+ serviceNames = make(map[string]struct{})
+ }
for service := iter.Next(); service != nil; service = iter.Next() {
- results = append(results, service.(*structs.ServiceNode))
+ sn := service.(*structs.ServiceNode)
+ results = append(results, sn)
+ if serviceNames != nil && sn.ServiceName != serviceName {
+ serviceNames[sn.ServiceName] = struct{}{}
+ }
}
// Get the table index.
@@ -1923,6 +1933,34 @@ func (s *Store) checkServiceNodes(ws memdb.WatchSet, serviceName string, connect
// when watchLimit is hit (~682 service instances). See
// https://github.com/hashicorp/consul/issues/4984
ws.Add(ch)
+
+ if connect {
+ // If this is a connect query then there is a subtlety to watch out for.
+ // In addition to watching the proxy service indexes for changes below, we
+ // need to still keep an eye on the connect service index in case a new
+ // proxy with a new name registers - we are only watching proxy service
+ // names we know about below so we'd miss that otherwise. Thankfully this
+ // is only ever one extra chan to watch and will catch any changes to
+ // proxy registrations for this target service.
+ ws.Add(iter.WatchCh())
+ }
+ }
+
+ if len(serviceNames) > 0 {
+ // Also need to fetch the indexes and watch chans for the proxy service
+ // names in the result set.
+ for proxySvc := range serviceNames {
+ proxyIdx, proxyCh := maxIndexAndWatchChForService(tx, proxySvc, true, true)
+ if idx < proxyIdx {
+ idx = proxyIdx
+ }
+ // Only need to watch this index if we are using the index watch
+ // optimization above if not we'll be watching all the radix nodes that
+ // matter anyway.
+ if ch != nil {
+ ws.Add(proxyCh)
+ }
+ }
}
return s.parseCheckServiceNodes(tx, fallbackWS, idx, serviceName, results, err) |
banks
added a commit
that referenced
this issue
Mar 19, 2019
…ptimal number of watch chans. Fixes #5506.
This was referenced Mar 19, 2019
banks
added a commit
that referenced
this issue
Mar 21, 2019
* Make Connect health queryies unblock correctly in all cases and use optimal number of watch chans. Fixes #5506. * Node check test cases and clearer bug test doc * Comment update
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
theme/connect
Anything related to Consul Connect, Service Mesh, Side Car Proxies
type/bug
Feature does not function as expected
While investigating #5499 and #5332 I discovered that
/health/connect/:service
doesn't unblock watchers if a health check on a proxy instance changes state. TheX-Consul-Index
also doesn't update.At first I thought it was related to my optimization in #5449, but even before that the same bug existed for connect discovery with proxies.
The problem seems to be this:
checkServiceNodes
, we switch the index for connect queries, but then we continue to use the target service name when we choose the consul index:consul/agent/consul/state/catalog.go
Line 1904 in 69ac5d0
WatchSet
would have unblocked watchers because it would have been watching the proxy checks loaded inparseCheckServiceNodes
explicitly too, but since the index returned from the re-query was the same, the query would still not return. So after Optimize health watching to single chan/goroutine. #5449 we have identical (broken) behaviour but at least we no longer waste CPU cycles re-querying when we won't actually return the result!We should:
Option 1
Add a new index type called
service-connect
which is updated all the same times that the per-service index is updated, and also updated any time a proxy instance that would target that service is updated.This is "simple" but it requires more storage and crucially plumbing through all the same places that our current service index update happens.
Even worse, it requires plumbing context about whether the service is a connect proxy (or looking it up again) into places like
ensureCheckTxn
where it's not needed now.Option 2
A less invasive option that requires no addition updates or storage and only a single additional lookup/watch chan in the common case would be:
This is almost ideal, the one edge case it doesn't catch is if a proxy is registered with a new proxy service name that no other proxy is using (maybe it's the first proxy for that service or has a snowflake naming convention). In that case, watchers would miss the addition if it didn't also happen at the same time as a change to the actual target service registration.
We can solve for this one edge case by:
services.connect
index iterator which will be invalidated by any changes to the proxies that are registered.This keeps the number of watch chans small - typically 3 - without any invasive updates or negative impact on other query types triggering more often than they need etc.
Unless I'm missing something still Option 2 seems the clear winner.
The text was updated successfully, but these errors were encountered: