Better documentation on the expectations for `networkValidator.connectAddr` for linkerd-control-plane helm chart values #12797
Comments
So, I was wrong; the default value is fine. In my case, the root cause was that I had Cilium as my primary CNI and had also set an option on it that interfered with the Linkerd CNI plugin. However, I still think a case can be made to either remove this variable from this helm chart (in what cases is it needed?), or improve the documentation so id-ten-t's like me don't get lost.
I'm happy you managed to troubleshoot your CNI plugin set-up and resolve this :) We've tried to succinctly describe the behaviour of the validator in the docs. I appreciate that some of its internals are opaque and it may be hard to reason about the output you're getting. Some of that is intentional: I think users shouldn't need to know how the validator works under the hood to operate Linkerd (and there is only so much space we can use to document how the tool works).
There are two possible points of error: (1) we can't connect, or (2) we connect, but to the wrong host. I think (1) doesn't automatically imply that the iptables rules are not working; it could be that the connection to our own server failed. If the environment is air-gapped and you can't normally reach 1.1.1.1:20001, that alone shouldn't matter: with working iptables rules the connection is redirected to the validator's own server before it ever leaves the pod.

We're always happy to improve the documentation to make it more obvious to users how these things should work. If you have any concrete suggestions, please let us know. I'm also happy to go into more detail on how the validator works if you'd like to open up a PR yourself for the docs contribution.
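To make those two failure modes concrete, here is a rough manual approximation of the check described above. This is a sketch only: the listening port (4140) and the token exchange are assumptions for illustration, not the validator's actual implementation.

```sh
# Terminal 1: a stand-in for the validator's own server, which answers with a
# known token. (On GNU netcat, use `nc -l -p 4140` instead.)
echo "expected-token" | nc -l 4140

# Terminal 2: dial the configured connectAddr. With working CNI iptables rules
# the connection never leaves the pod; it is redirected to the server above and
# prints "expected-token". A timeout corresponds to failure mode (1); reaching
# some other host (no token, or a different banner) corresponds to (2).
nc 1.1.1.1 20001
```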
@mateiidavid thanks for the reply. As mentioned, it turns out the endpoint isn't really relevant and it was something of a red herring. That said, I think I will open a PR on the chart repo, because I have a suggestion on how to document this just a touch better (while still maintaining succinctness and the desired "opaqueness").
@djryanj of course! Happy to help. I know the endpoint turned out to be a red herring, but I thought I'd explain how it all works to demystify it for you (or anyone else who might run into this issue at some point in the future). Thanks for taking this on. Let me know if you have any questions.
The `networkValidator.connectAddr` helm chart parameter is ambiguously documented, which leads to confusion. This PR amends the documentation to improve comprehension.

Fixes #12797

Signed-off-by: Ryan Jacobs <ryan@ryanjjacobs.com>
Fixes #12864

The linkerd-cni network-validator container was binding to the IPv4 wildcard and connecting to an IPv4 address. This wasn't breaking things in IPv6 clusters, but it meant only the iptables rules were being validated, not the ip6tables ones. This change introduces logic to choose addresses according to the value of `disableIPv6`: if IPv6 is enabled, the ip6tables rules get exercised. Note that a more complete change would exercise both iptables and ip6tables, but for now we're defaulting to ip6tables. This implied changing the helm value `networkValidator.connectAddr` to `connectPort`. @mateiidavid could you please validate whether this entry with its simplified doc still makes sense, in light of #12797?

The same was true of the repair-controller, except that because the IPv4 wildcard was used for the admin server, in IPv6 clusters the kubelet couldn't reach the probe endpoints and the container was failing. In this case the fix is simply to have the admin server bind to `[::]`, which works for both IPv4 and IPv6 clusters.
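For reference, the reworked values described above would look roughly like the following sketch. The key names come from the PR description; the default port shown is an assumption.

```yaml
disableIPv6: false      # false: the validator binds/dials IPv6 addresses,
                        # so the ip6tables rules get exercised
networkValidator:
  connectPort: 20001    # replaces connectAddr; the host part is now derived
                        # from disableIPv6 (the value shown is an assumption)
```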
What is the issue?
When deploying Linkerd into an air-gapped or otherwise restrictive environment using Helm, the default value of `1.1.1.1:20001` for `networkValidator.connectAddr` (see: https://github.com/linkerd/linkerd2/blob/main/charts/linkerd-control-plane/values.yaml#L327) causes the init container to fail. The chart carries a short note about this value, but there is no documentation anywhere that I can find on what it is expected to be rewritten to, or what the expectations of that endpoint are.
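For anyone else hitting this, overriding the value in your Helm values looks like the following; the address shown is an arbitrary placeholder (TEST-NET-3), not a recommendation:

```yaml
networkValidator:
  # Any IP:port can go here; see the discussion above on what the
  # validator actually expects from this endpoint.
  connectAddr: "203.0.113.1:20001"
```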
E.g., is it a simple TCP connection that completes a 3-way handshake (this is presumed to be the case)? Does an HTTPS web server fit the bill?
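A quick way to test exactly that kind of plain TCP handshake against the endpoint, using standard netcat flags:

```sh
# -z: connect, then close without sending data; -v: verbose; -w 3: 3s timeout.
nc -vz -w 3 1.1.1.1 20001
```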
As an aside, the fact that 1.1.1.1:20001 actually works is weird, because documentation on what's running there is, again, difficult to find.
Also, this is sort of (but not really) related to https://linkerd.buoyant.io/t/error-linkerd-network-validator-in-air-gapped-installation/302 and was also a factor in my troubleshooting of #7945.
How can it be reproduced?
Set `cniEnabled: true` on a cluster where outbound connections to 1.1.1.1:20001 are blocked; observe the failed pod.

Logs, error output, etc
Observe firewall logs blocking this.
output of `linkerd check -o short`
N/A
Environment
Possible solution
Additional context
No response
Would you like to work on fixing this bug?
None