Envoy cluster creation issue with conflicting service names #4354

Closed
cidrick opened this issue Jul 26, 2022 · 4 comments · Fixed by #4366

cidrick commented Jul 26, 2022

Describe the bug
When emissary creates envoy clusters, it is evidently constrained by a limit on the length of the cluster name and must therefore handle potential naming conflicts. If identically named mappings and services exist in different namespaces, and the service name exceeds a certain length, there appears to be a condition under which envoy cluster creation fails for one of those mappings.

To Reproduce

  1. Create two services with the same name, roughly 30 characters or longer, in two different namespaces. I've reproduced this issue in minikube with the following manifest:
apiVersion: v1
kind: Service
metadata:
  name: hello-this-name-is-too-long-for-emissary
  namespace: alpha
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: hello-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: hello-this-name-is-too-long-for-emissary
  namespace: beta
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: hello-beta

I have a simple nginx pod in each namespace, labeled app: hello-alpha and app: hello-beta respectively, so the selectors match.

  2. Create a mapping in the beta namespace:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: hello
  namespace: beta
spec:
  host: "*"
  prefix: /hello-beta
  service: http://hello-this-name-is-too-long-for-emissary

The following route will be created in envoy, according to the debug GUI:

{
  "match": {
    "case_sensitive": true,
    "prefix": "/hello-beta",
    "runtime_fraction": {
      "default_value": {
        "denominator": "HUNDRED",
        "numerator": 100
      },
      "runtime_key": "routing.traffic_shift.cluster_http___hello_this_name_is_too_lo-0"
    }
  },
  "route": {
    "cluster": "cluster_http___hello_this_name_is_too_lo-0",
    "prefix_rewrite": "/",
    "timeout": "3.000s"
  }
}

  3. Create a similar mapping, but in the alpha namespace:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: hello
  namespace: alpha
spec:
  host: "*"
  prefix: /hello-alpha
  service: http://hello-this-name-is-too-long-for-emissary

Once the mapping is created, it will now "take over" the envoy cluster with the -0 suffix, and the endpoints for this cluster will be updated to point to the alpha pod.

{
  "match": {
    "case_sensitive": true,
    "prefix": "/hello-alpha",
    "runtime_fraction": {
      "default_value": {
        "denominator": "HUNDRED",
        "numerator": 100
      },
      "runtime_key": "routing.traffic_shift.cluster_http___hello_this_name_is_too_lo-0"
    }
  },
  "route": {
    "cluster": "cluster_http___hello_this_name_is_too_lo-0",
    "prefix_rewrite": "/",
    "priority": null,
    "timeout": "3.000s"
  }
}

However, the mapping for beta will still be pointing to the envoy cluster named cluster_http___hello_this_name_is_too_lo-0. This causes traffic destined for the beta instance to be routed to the alpha instance instead.

Before creating the alpha mapping, if I hit the ClusterIP of the emissary-ingress service, requests for /hello-beta are routed to the beta pod correctly.

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-alpha | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>
 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-beta | grep name
<p><span>Server&nbsp;name:</span> <span>hello-beta</span></p>

After creating the alpha mapping, requests for /hello-beta are sent to the hello-alpha pod:

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-alpha | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-beta | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>

Additionally, the emissary pod logs the following error:

2022-07-26 16:05:18 diagd 3.0.0 [P17TAEW] ERROR: {
...
<trim long JSON dump of envoy config>
...
could not validate the envoy configuration above after 5 retries, failed with error
[2022-07-26 16:05:18.276][140][info][main] [source/server/server.cc:786] runtime: layers:
  - name: static_layer
    static_layer:
      re2.max_program_size.error_level: 200
      envoy.reloadable_features.no_extension_lookup_by_name: false

[2022-07-26 16:05:18.280][140][critical][main] [source/server/config_validation/server.cc:63] error initializing configuration '/ambassador/snapshots/econf-tmp.json': route: unknown cluster 'cluster_http___hello_this_name_is_too_lo-1'

(exit code 1)
Aborting update...

This would suggest to me that the beta mapping should get a route that points to a new envoy cluster named cluster_http___hello_this_name_is_too_lo-1, but that cluster is never actually created. I haven't been able to spot anything in the logs that would suggest why.

Expected behavior
The new envoy cluster should be created and the "beta" route should be correctly updated to point to the new cluster.

Versions:

  • emissary-ingress 3.0.0 (can reproduce on 2.3.1 as well)
  • Kubernetes 1.21, minikube 1.26.0

cidrick commented Jul 26, 2022

@kflynn has requested debug logs via Slack, so I'm attaching them here. I began capturing these logs just prior to applying the mapping in the alpha namespace.

alphamapping.txt


kflynn commented Jul 26, 2022

Here's the smoking gun from the logs:

2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: compress cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha to cluster_http___hello_this_name_is_too_lo
2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: compress cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta to cluster_http___hello_this_name_is_too_lo
2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: mangle cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha => cluster_http___hello_this_name_is_too_lo-0

The service name hello-this-name-is-too-long-for-emissary is long enough all by itself that the namespace name is not being taken into account when name mangling happens. 🤦‍♂️
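
To make the failure mode concrete, here's the compress-and-mangle behavior from those debug lines as a minimal standalone Python sketch. The 60- and 40-character thresholds appear in the patch below; the grouping and the sorted numbering order are my reconstruction from the log output, not emissary's actual IR code:

from collections import defaultdict

# Fully-qualified cluster names copied from the COLLISION debug lines above.
names = [
    "cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha",
    "cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta",
]

# Names longer than 60 characters are compressed to their first 40
# characters, so both names collapse to the same short name and the
# namespace part is lost.
groups = defaultdict(list)
for name in names:
    groups[name[0:40]].append(name)

# Colliding names are then "mangled" apart with numeric suffixes.
# Assumption: members are numbered in sorted order, which matches the
# alpha cluster getting -0 in the log above.
for short, members in groups.items():
    for i, name in enumerate(sorted(members)):
        print(f"{name} => {short}-{i}")

# Output:
#   ..._alpha_alpha => cluster_http___hello_this_name_is_too_lo-0
#   ..._beta_beta   => cluster_http___hello_this_name_is_too_lo-1

The -1 name here is exactly the "unknown cluster" from the validation error in the original report, which fits cidrick's observation that the beta route gets renumbered while the matching cluster never gets created.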

There are a few possible fixes, but I think I would start by trying this patch:

diff --git a/python/ambassador/ir/ir.py b/python/ambassador/ir/ir.py
index abd6f4bee..7993a52d7 100644
--- a/python/ambassador/ir/ir.py
+++ b/python/ambassador/ir/ir.py
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License
-import json
+import hashlib
 import logging
 import os
 from ipaddress import ip_address
@@ -480,7 +480,11 @@ class IR:
             if len(name) > 60:
                 # Too long. Gather this cluster by name prefix and normalize
                 # its name below.
-                short_name = name[0:40]
+                h = hashlib.new("sha1")
+                h.update(name.encode("utf-8"))
+                hd = h.hexdigest()[0:16].upper()
+
+                short_name = name[0:40] + "-" + hd

                 cluster = self.clusters[name]
                 self.logger.debug(f"COLLISION: compress {name} to {short_name}")

Unfortunately I won't be able to test that until at least tomorrow, but I think it should work: it makes sure the short name keeps some human-readable context for matching clusters to Mappings, while still taking the whole name into account.
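
Pulled out of the diff, the patched scheme looks like this as a standalone sketch (the helper name compress is mine; emissary does this inline in ir.py):

import hashlib

def compress(name: str) -> str:
    # Per the patch: keep a human-readable 40-character prefix, then
    # append 16 hex digits of the SHA-1 of the *full* name, so two
    # distinct long names can no longer compress to the same string.
    h = hashlib.new("sha1")
    h.update(name.encode("utf-8"))
    return name[0:40] + "-" + h.hexdigest()[0:16].upper()

alpha = "cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha"
beta = "cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta"

print(compress(alpha))
print(compress(beta))
print(compress(alpha) == compress(beta))  # False: the collision is gone

Since the hash only needs to de-duplicate names rather than resist an attacker, 16 hex digits of SHA-1 are plenty to make an accidental collision vanishingly unlikely.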


cidrick commented Jul 27, 2022

I can confirm that the above patch has resolved our issues in our dev cluster. We'll attempt to deploy a custom Docker image with these changes to production in the coming days.

Thanks a ton for the quick turnaround!

kflynn added a commit that referenced this issue Jul 31, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>

vashibhavin commented

We deployed a locally built patched version of emissary to our production cluster 2 weeks ago and are happy to report that the patch resolved the issue for us. Thanks again for the quick turnaround on this.

kflynn added a commit that referenced this issue Aug 25, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>
kflynn added a commit that referenced this issue Aug 26, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>