Envoy cluster creation issue with conflicting service names #4354

Closed
cidrick opened this issue Jul 26, 2022 · 4 comments · Fixed by #4366

cidrick commented Jul 26, 2022

Describe the bug
When emissary creates envoy clusters, it is evidently constrained by a limit on the length of the cluster name and must therefore handle potential naming conflicts. If identically named mappings and services exist in different namespaces, and the service name exceeds a certain length, there appears to be a condition under which envoy cluster creation fails for one of those mappings.

To Reproduce

  1. Create two services with the same name, roughly 30 characters or longer, in two different namespaces. I've reproduced this issue in minikube with the following manifest:
apiVersion: v1
kind: Service
metadata:
  name: hello-this-name-is-too-long-for-emissary
  namespace: alpha
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: hello-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: hello-this-name-is-too-long-for-emissary
  namespace: beta
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: hello-beta

I have a simple nginx pod in each namespace, labeled app: hello-alpha and app: hello-beta respectively, so the selectors match.

  2. Create a mapping in the beta namespace:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: hello
  namespace: beta
spec:
  host: "*"
  prefix: /hello-beta
  service: http://hello-this-name-is-too-long-for-emissary

The following route will be created in envoy, according to the debug GUI:

{
  "match": {
    "case_sensitive": true,
    "prefix": "/hello-beta",
    "runtime_fraction": {
      "default_value": {
        "denominator": "HUNDRED",
        "numerator": 100
      },
      "runtime_key": "routing.traffic_shift.cluster_http___hello_this_name_is_too_lo-0"
    }
  },
  "route": {
    "cluster": "cluster_http___hello_this_name_is_too_lo-0",
    "prefix_rewrite": "/",
    "timeout": "3.000s"
  }
}

  3. Create a similar mapping, but in the alpha namespace:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: hello
  namespace: alpha
spec:
  host: "*"
  prefix: /hello-alpha
  service: http://hello-this-name-is-too-long-for-emissary

Once the mapping is created, it will now "take over" the envoy cluster with the -0 suffix, and the endpoints for this cluster will be updated to point to the alpha pod.

{
  "match": {
    "case_sensitive": true,
    "prefix": "/hello-alpha",
    "runtime_fraction": {
      "default_value": {
        "denominator": "HUNDRED",
        "numerator": 100
      },
      "runtime_key": "routing.traffic_shift.cluster_http___hello_this_name_is_too_lo-0"
    }
  },
  "route": {
    "cluster": "cluster_http___hello_this_name_is_too_lo-0",
    "prefix_rewrite": "/",
    "priority": null,
    "timeout": "3.000s"
  }
}

However, the mapping for beta will still be pointing to the envoy cluster named cluster_http___hello_this_name_is_too_lo-0. This causes traffic destined for the beta instance to be routed to the alpha instance instead.

Before creating the alpha mapping, if I hit the ClusterIP of the emissary-ingress service, requests for /hello-beta are routed to the beta pod correctly.

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-alpha | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>
 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-beta | grep name
<p><span>Server&nbsp;name:</span> <span>hello-beta</span></p>

After creating the alpha mapping, requests for /hello-beta are sent to the hello-alpha pod:

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-alpha | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>

 testpod ~ $ curl -s emissary-ingress.emissary.svc.cluster.local/hello-beta | grep name
<p><span>Server&nbsp;name:</span> <span>hello-alpha</span></p>

Additionally, the emissary pod logs the following error:

2022-07-26 16:05:18 diagd 3.0.0 [P17TAEW] ERROR: {
...
<trim long JSON dump of envoy config>
...
could not validate the envoy configuration above after 5 retries, failed with error
[2022-07-26 16:05:18.276][140][info][main] [source/server/server.cc:786] runtime: layers:
  - name: static_layer
    static_layer:
      re2.max_program_size.error_level: 200
      envoy.reloadable_features.no_extension_lookup_by_name: false

[2022-07-26 16:05:18.280][140][critical][main] [source/server/config_validation/server.cc:63] error initializing configuration '/ambassador/snapshots/econf-tmp.json': route: unknown cluster 'cluster_http___hello_this_name_is_too_lo-1'

(exit code 1)
Aborting update...

This would suggest to me that the beta mapping should get a route that points to a new envoy cluster named cluster_http___hello_this_name_is_too_lo-1, but that cluster is never actually created. I haven't been able to spot anything in the logs that would suggest why.

Expected behavior
The new envoy cluster should be created and the "beta" route should be correctly updated to point to the new cluster.

Versions:

  • emissary-ingress 3.0.0 (can reproduce on 2.3.1 as well)
  • Kubernetes 1.21, minikube 1.26.0

cidrick commented Jul 26, 2022

@kflynn has requested debug logs via Slack, so I'm attaching them here. I began capturing these logs just prior to applying the mapping in the alpha namespace.

alphamapping.txt


kflynn commented Jul 26, 2022

Here's the smoking gun from the logs:

2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: compress cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha to cluster_http___hello_this_name_is_too_lo
2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: compress cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta to cluster_http___hello_this_name_is_too_lo
2022-07-26 21:13:04 diagd 3.0.0 [P17TAEW] DEBUG: COLLISION: mangle cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha => cluster_http___hello_this_name_is_too_lo-0

The service name hello-this-name-is-too-long-for-emissary is long enough all by itself that the namespace name is not being taken into account when name mangling happens. 🤦‍♂️
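
To make the failure mode concrete, here's the compress-and-mangle behavior from those debug lines as a minimal standalone Python sketch. The 60- and 40-character thresholds appear in the patch below; the grouping and the sorted numbering order are my reconstruction from the log output, not emissary's actual IR code:

from collections import defaultdict

# Fully-qualified cluster names copied from the COLLISION debug lines above.
names = [
    "cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha",
    "cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta",
]

# Names longer than 60 characters are compressed to their first 40
# characters, so both names collapse to the same short name and the
# namespace part is lost.
groups = defaultdict(list)
for name in names:
    groups[name[0:40]].append(name)

# Colliding names are then "mangled" apart with numeric suffixes.
# Assumption: members are numbered in sorted order, which matches the
# alpha cluster getting -0 in the log above.
for short, members in groups.items():
    for i, name in enumerate(sorted(members)):
        print(f"{name} => {short}-{i}")

# Output:
#   ..._alpha_alpha => cluster_http___hello_this_name_is_too_lo-0
#   ..._beta_beta   => cluster_http___hello_this_name_is_too_lo-1

The -1 name here is exactly the "unknown cluster" from the validation error in the original report, which fits cidrick's observation that the beta route gets renumbered while the matching cluster never gets created.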

There are a few possible fixes, but I think I would start by trying this patch:

diff --git a/python/ambassador/ir/ir.py b/python/ambassador/ir/ir.py
index abd6f4bee..7993a52d7 100644
--- a/python/ambassador/ir/ir.py
+++ b/python/ambassador/ir/ir.py
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License
-import json
+import hashlib
 import logging
 import os
 from ipaddress import ip_address
@@ -480,7 +480,11 @@ class IR:
             if len(name) > 60:
                 # Too long. Gather this cluster by name prefix and normalize
                 # its name below.
-                short_name = name[0:40]
+                h = hashlib.new("sha1")
+                h.update(name.encode("utf-8"))
+                hd = h.hexdigest()[0:16].upper()
+
+                short_name = name[0:40] + "-" + hd

                 cluster = self.clusters[name]
                 self.logger.debug(f"COLLISION: compress {name} to {short_name}")

Unfortunately I won't be able to test that until at least tomorrow, but I think it should work: it makes sure the short name keeps some human-readable context for matching clusters to Mappings, while still taking the whole name into account.
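
Pulled out of the diff, the patched scheme looks like this as a standalone sketch (the helper name compress is mine; emissary does this inline in ir.py):

import hashlib

def compress(name: str) -> str:
    # Per the patch: keep a human-readable 40-character prefix, then
    # append 16 hex digits of the SHA-1 of the *full* name, so two
    # distinct long names can no longer compress to the same string.
    h = hashlib.new("sha1")
    h.update(name.encode("utf-8"))
    return name[0:40] + "-" + h.hexdigest()[0:16].upper()

alpha = "cluster_http___hello_this_name_is_too_long_for_emissary_alpha_alpha"
beta = "cluster_http___hello_this_name_is_too_long_for_emissary_beta_beta"

print(compress(alpha))
print(compress(beta))
print(compress(alpha) == compress(beta))  # False: the collision is gone

Since the hash only needs to de-duplicate names rather than resist an attacker, 16 hex digits of SHA-1 are plenty to make an accidental collision vanishingly unlikely.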


cidrick commented Jul 27, 2022

I can confirm that the above patch has resolved our issues in our dev cluster. We'll attempt to deploy a custom Docker image with these changes to production in the coming days.

Thanks a ton for the quick turnaround!

kflynn added a commit that referenced this issue Jul 31, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>

vashibhavin commented

We deployed a locally built patched version of emissary to our production cluster 2 weeks ago and are happy to report that the patch resolved the issue for us. Thanks again for the quick turnaround on this.

kflynn added a commit that referenced this issue Aug 25, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>
kflynn added a commit that referenced this issue Aug 26, 2022
…hat don't collide.

Fixes #4354.

Signed-off-by: Flynn <emissary@flynn.kodachi.com>