
Better documentation on the expectations for networkValidator.connectAddr for linkerd-control-plane helm chart values #12797

Closed
djryanj opened this issue Jul 2, 2024 · 4 comments · Fixed by #12827

djryanj (Contributor) commented Jul 2, 2024

What is the issue?

When deploying linkerd into an air-gapped or more restrictive environment using helm, the default value of 1.1.1.1:20001 for networkValidator.connectAddr (see: https://github.com/linkerd/linkerd2/blob/main/charts/linkerd-control-plane/values.yaml#L327) causes the init container to fail. The note in the chart is

we expect this to be rewritten

But I cannot find any documentation on what it is expected to be rewritten to, or on what is expected of that endpoint.

E.g., is it enough for the endpoint to complete a simple TCP three-way handshake (this is presumed to be the case)? Does an HTTPS web server fit the bill?

As an aside, the fact that 1.1.1.1:20001 actually works is weird, because documentation on what's running there is, again, difficult to find.

Also, this is sort of (but not really) related to https://linkerd.buoyant.io/t/error-linkerd-network-validator-in-air-gapped-installation/302 and was also a factor in my troubleshooting of #7945.

How can it be reproduced?

  • Install linkerd using helm with cniEnabled: true on a cluster where outbound connections to 1.1.1.1:20001 are blocked; observe the failed pod.
  • Try to find documentation on that value; fail.
  • Open the network port.
  • Observe that the pods now succeed.

Logs, error output, etc

2024-07-02T20:03:06.514474Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-07-02T20:03:06.514493Z DEBUG linkerd_network_validator: token="<redacted>\n"
2024-07-02T20:03:06.514500Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2024-07-02T20:03:06.514929Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.244.1.51:34290
2024-07-02T20:03:16.515844Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

Firewall logs show this connection being blocked.

output of linkerd check -o short

N/A

Environment

  • Kubernetes version: 1.30.0
  • Host OS: Talos 1.7
  • Linkerd version: Helm chart 2024.6.4, version edge-24.6.4

Possible solution

  • Better documentation on what's needed for this value, and a clearer indication when installing linkerd-cni that it's probably necessary to change the value.

Additional context

No response

Would you like to work on fixing this bug?

None

djryanj added the bug label Jul 2, 2024
djryanj (Contributor, Author) commented Jul 4, 2024

So, I was wrong; the default value is fine. If linkerd-cni is working correctly, it will intercept the connection to that address and essentially reflect it back to the container, which indicates that iptables is set up correctly (in other words, the "Please ensure iptables rules are rewriting traffic as expected." error message is accurate).

In my case, the root cause was that I had Cilium as my primary CNI and had left cni.exclusive at its default of true in the Cilium helm chart, which deletes all additional CNI configurations (e.g., linkerd-cni), so linkerd-cni wasn't working at all.

However, I still think a case can be made to either remove this variable from this helm chart (in what cases is it needed?), or improve the documentation so id-ten-t's like me don't get lost.

mateiidavid (Member) commented Jul 9, 2024

I'm happy you managed to troubleshoot your CNI plugin set-up and resolve this :) We've tried to describe the behaviour of the validator succinctly in the docs. I appreciate that some of its internals are opaque and that it may be hard to reason about the output you're getting. Some of that is intentional: users shouldn't need to know how the validator works under the hood in order to operate Linkerd (and there is only so much space we can use to document how the tool works).

As an aside, the fact that 1.1.1.1:20001 actually works is weird, because documentation on what's running there is, again, difficult to find.

1.1.1.1 is Cloudflare's public DNS address. It's probably better for the configuration to remain exposed as a chart value. The validator works in two steps:

  1. It attempts to connect to a remote endpoint (1.1.1.1 but it could be anything else) and starts up a local service impersonating the proxy (i.e. binds to the same port).
  2. It reads a few bytes off the socket. If the bytes are not what it expects, or if it cannot read, it fails.

There are two possible points of error: (1) we can't connect, or (2) we connect but to the wrong host. I think (1) doesn't automatically imply that the iptables rules are not working; it could be that the connection to our own server failed. If the environment is air-gapped and you can't connect to 1.1.1.1, then the address should be rewritten to some service you can connect to (e.g. kubernetes).
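
To make those two steps concrete, here is a minimal sketch of the check in Rust. This is not the actual linkerd-network-validator code; the listen port (4140), the connect address, the 10s timeout, and the fixed token are assumptions taken from the log output and defaults mentioned above.

  use std::io::{Read, Write};
  use std::net::{TcpListener, TcpStream};
  use std::thread;
  use std::time::Duration;

  fn main() -> std::io::Result<()> {
      // Illustrative values only: the proxy's outbound port, the default
      // connect address, and a fixed token instead of a random one.
      let token = b"example-token\n";
      let connect_addr = "1.1.1.1:20001"; // networkValidator.connectAddr

      // Step 1: impersonate the proxy by listening on its outbound port and
      // echo the token to any connection that iptables redirects here.
      let listener = TcpListener::bind("0.0.0.0:4140")?;
      thread::spawn(move || {
          while let Ok((mut conn, _)) = listener.accept() {
              let _ = conn.write_all(token);
          }
      });

      // Dial the remote address. With working linkerd-cni iptables rules,
      // this connection is rewritten to the local listener above.
      let mut stream = TcpStream::connect(connect_addr)?;
      stream.set_read_timeout(Some(Duration::from_secs(10)))?;

      // Step 2: read a few bytes back and compare them with the token.
      let mut buf = [0u8; 64];
      let n = stream.read(&mut buf)?;
      if buf[..n] == token[..] {
          println!("traffic was redirected; iptables rules look correct");
      } else {
          eprintln!("connected, but not to ourselves; validation failed");
      }
      Ok(())
  }

If the iptables redirect is not in place, the connection either reaches the real remote host (and the bytes read back don't match the token) or times out, which is the failure shown in the logs above.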

We're always happy to improve the documentation to make it more obvious to users how these things work. If you have any concrete suggestions, please let us know. I'm also happy to go into more detail on how the validator works if you'd like to open a PR yourself for the docs contribution.

djryanj (Contributor, Author) commented Jul 9, 2024

@mateiidavid thanks for the reply.

As mentioned, it turns out that the endpoint isn't really relevant and was something of a red herring. That said, I think I will open a PR on the chart repo, because I have a suggestion on how to document this just a touch better (while still maintaining succinctness and the desired "opaqueness").

mateiidavid (Member) commented
@djryanj of course! Happy to help.

I know the endpoint turned out to be a red herring, but I thought I'd explain how it all works to demystify it for you (or anyone else who might run into this issue in the future).

Thanks for taking this on. Let me know if you have any questions.

mateiidavid pushed a commit that referenced this issue Jul 15, 2024
The networkValidator.connectAddr helm chart parameter is ambiguously documented, which leads to confusion.

This PR amends the documentation to improve comprehension.

Fixes #12797

Signed-off-by: Ryan Jacobs <ryan@ryanjjacobs.com>
alpeb added a commit that referenced this issue Jul 22, 2024
Fixes #12864

The linkerd-cni network-validator container was binding to the IPv4
wildcard and connecting to an IPv4 address. This wasn't breaking things
in IPv6 clusters but it was only validating the iptables rules and not
the ip6tables ones. This change introduces logic to use addresses
according to the value of `disableIPv6`. If IPv6 is enabled, then the
ip6tables rules would get exercised. Note that a more complete change
would also exercise both iptables and ip6tables, but for now we're
defaulting to ip6tables.

This implied changing the helm value `networkValidator.connectAddr` to
`connectPort`. @mateiidavid could you please validate if this entry with
its simplified doc still makes sense, in light of #12797?

The case was similar with the repair-controller, but because the IPv4
wildcard was used for the admin server, in IPv6 clusters the kubelet
wasn't able to reach the probe endpoints and the container was failing.
In this case the fix is simply to have the admin server bind to `[::]`,
which works for both IPv4 and IPv6 clusters.
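
As an illustration of that last point, here is a minimal sketch (not the actual repair-controller code; the port is purely hypothetical) of why binding to the IPv6 wildcard covers both address families on Linux:

  use std::net::TcpListener;

  fn main() -> std::io::Result<()> {
      // Bind to the IPv6 wildcard. On Linux, with the default
      // net.ipv6.bindv6only=0 sysctl, this socket also accepts IPv4
      // connections, so a single listener serves both address families.
      let listener = TcpListener::bind("[::]:9990")?; // port is illustrative
      println!("admin server listening on {}", listener.local_addr()?);
      Ok(())
  }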
alpeb added a commit that referenced this issue Jul 23, 2024
github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2024