Better documentation on the expectations for `networkValidator.connectAddr` for linkerd-control-plane helm chart values #12797
Comments
So, I was wrong; the default value is fine. In my case, the root cause was that I had Cilium as my primary CNI and had also set an option on it that interfered with the Linkerd CNI plugin. However, I still think a case can be made to either remove this variable from this helm chart (in what cases is it needed?), or improve the documentation so id-ten-t's like me don't get lost.
I'm happy you managed to troubleshoot your CNI plugin set-up and resolve this :) We've tried to succinctly describe the behaviour of the validator in the docs. I appreciate that some of its internals are opaque and it may be hard to reason about the output you're getting. Some of that is intentional: I think users shouldn't need to know how the validator works under the hood to operate Linkerd (and there is only so much space we can use to document how the tool works).
There are two possible points of error: (1) we can't connect, or (2) we connect, but to the wrong host. I think (1) doesn't automatically imply that the iptables rules are not working; it could be that the connection to our own server failed. If the environment is air-gapped and you can't normally reach 1.1.1.1:20001, that alone shouldn't matter: with working iptables rules the connection is redirected to the validator's own server before it ever leaves the pod.

We're always happy to improve the documentation to make it more obvious to users how these things should work. If you have any concrete suggestions, please let us know. I'm also happy to go into more detail on how the validator works if you'd like to open up a PR yourself for the docs contribution.
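To make those two failure modes concrete, here is a rough manual approximation of the check described above. This is a sketch only: the listening port (4140) and the token exchange are assumptions for illustration, not the validator's actual implementation.

```sh
# Terminal 1: a stand-in for the validator's own server, which answers with a
# known token. (On GNU netcat, use `nc -l -p 4140` instead.)
echo "expected-token" | nc -l 4140

# Terminal 2: dial the configured connectAddr. With working CNI iptables rules
# the connection never leaves the pod; it is redirected to the server above and
# prints "expected-token". A timeout corresponds to failure mode (1); reaching
# some other host (no token, or a different banner) corresponds to (2).
nc 1.1.1.1 20001
```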
@mateiidavid thanks for the reply. As mentioned, it turns out the endpoint isn't really relevant and it was something of a red herring. That said, I think I will open a PR on the chart repo, because I have a suggestion on how to document this just a touch better (while still maintaining succinctness and the desired "opaqueness").
@djryanj of course! Happy to help. I know the endpoint turned out to be a red herring, but I thought I'd explain how it all works to demystify it for you (or anyone else who might run into this issue at some point in the future). Thanks for taking this on. Let me know if you have any questions.
The `networkValidator.connectAddr` helm chart parameter is ambiguously documented, which leads to confusion. This PR amends the documentation to improve comprehension.

Fixes #12797

Signed-off-by: Ryan Jacobs <ryan@ryanjjacobs.com>
Fixes #12864

The linkerd-cni network-validator container was binding to the IPv4 wildcard and connecting to an IPv4 address. This wasn't breaking things in IPv6 clusters, but it meant only the iptables rules were being validated, not the ip6tables ones. This change introduces logic to choose addresses according to the value of `disableIPv6`: if IPv6 is enabled, the ip6tables rules get exercised. Note that a more complete change would exercise both iptables and ip6tables, but for now we're defaulting to ip6tables. This implied changing the helm value `networkValidator.connectAddr` to `connectPort`. @mateiidavid could you please validate whether this entry with its simplified doc still makes sense, in light of #12797?

The same was true of the repair-controller, except that because the IPv4 wildcard was used for the admin server, in IPv6 clusters the kubelet couldn't reach the probe endpoints and the container was failing. In this case the fix is simply to have the admin server bind to `[::]`, which works for both IPv4 and IPv6 clusters.
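For reference, the reworked values described above would look roughly like the following sketch. The key names come from the PR description; the default port shown is an assumption.

```yaml
disableIPv6: false      # false: the validator binds/dials IPv6 addresses,
                        # so the ip6tables rules get exercised
networkValidator:
  connectPort: 20001    # replaces connectAddr; the host part is now derived
                        # from disableIPv6 (the value shown is an assumption)
```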
What is the issue?
When deploying Linkerd into an air-gapped or otherwise restrictive environment using Helm, the default value of `1.1.1.1:20001` for `networkValidator.connectAddr` (see: https://github.com/linkerd/linkerd2/blob/main/charts/linkerd-control-plane/values.yaml#L327) causes the init container to fail. The chart carries a short note about this value, but there is no documentation anywhere that I can find on what it is expected to be rewritten to, or what the expectations of that endpoint are.
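For anyone else hitting this, overriding the value in your Helm values looks like the following; the address shown is an arbitrary placeholder (TEST-NET-3), not a recommendation:

```yaml
networkValidator:
  # Any IP:port can go here; see the discussion above on what the
  # validator actually expects from this endpoint.
  connectAddr: "203.0.113.1:20001"
```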
E.g., is it a simple TCP connection that completes a 3-way handshake (this is presumed to be the case)? Does an HTTPS web server fit the bill?
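A quick way to test exactly that kind of plain TCP handshake against the endpoint, using standard netcat flags:

```sh
# -z: connect, then close without sending data; -v: verbose; -w 3: 3s timeout.
nc -vz -w 3 1.1.1.1 20001
```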
As an aside, the fact that 1.1.1.1:20001 actually works is weird, because documentation on what's running there is, again, difficult to find.
Also, this is sort of (but not really) related to https://linkerd.buoyant.io/t/error-linkerd-network-validator-in-air-gapped-installation/302 and was also a factor in my troubleshooting of #7945.
How can it be reproduced?
Set `cniEnabled: true` on a cluster where outbound connections to 1.1.1.1:20001 are blocked; observe the failed pod.

Logs, error output, etc
Observe firewall logs blocking this.
output of `linkerd check -o short`
N/A
Environment
Possible solution
Additional context
No response
Would you like to work on fixing this bug?
None